
Spinlocks - trouble with speed


Thomas Richter

Jul 30, 2009, 12:17:56 PM
Hi folks,

to protect a shared resource which needs to be locked only very shortly,
I'm using a very simple spinlock technique:


static inline void FutexObtain(volatile int *fu)
{
    while(__sync_fetch_and_add(fu,-1) <= 0) {
        Locked_Add(fu,1);
        while(*fu <= 0) {};
    }
}
static inline void FutexRelease(volatile int *fu)
{
    __sync_fetch_and_add(fu,1);
}

where __sync_fetch_and_add is the GNU gcc built-in for a locked add
instruction on the x86.

Interestingly, while this works on x86 as intended, it is pretty fast on
some systems, while performance on other systems is slower than the
corresponding OS mutex implementations. I'm only speaking about dual-
(or multi-) core x86 implementations here, and the (current) target is
Linux.

I wonder whether there are known defects in this simple spinlocking
technique - and remember, the typical number of spins of a core on this
lock has been verified to be very low on average.

Thanks,

Thomas

Loïc Domaigné

Jul 30, 2009, 5:31:59 PM
Hi Thomas,

I'm a noob in lock-free stuff, but don't you need a membar for the
while (*fu<=0) {};

Second idea: what about using pthread_spin_lock() family
synchronization device.

Cheers,
Loïc
--
My Blog: http://www.domaigne.com/blog

"Should array indices start at 0 or 1? My compromise of 0.5 was
rejected without, I thought, proper consideration." -- Stan Kelly-
Bootle

David Schwartz

Jul 30, 2009, 6:03:17 PM
On Jul 30, 9:17 am, Thomas Richter <t...@math.tu-berlin.de> wrote:
> Hi folks,
>
> to protect a shared resource which needs to be locked only very shortly,
> I'm using a very simple spinlock technique:
>
>    static inline void FutexObtain(volatile int *fu)
>    {
>      while(__sync_fetch_and_add(fu,-1) <= 0) {
>        Locked_Add(fu,1);
>        while(*fu <=0) {};
>      }
>    }
>    static inline void FutexRelease(volatile int *fu)
>    {
>      __sync_fetch_and_add(fu,1);
>    }

Wow, you made every possible mistake!

> I wonder whether there are known defects about this simple spinlocking
> technique - and remember, the typical number of spins of a core on this
> lock has to be verified to be very low on average.

Are there any known defects?! Your naive spinlock has every defect
known to man!

Here's the biggest. Consider an x86 machine with four cores. Core A
holds the spinlock. Cores B, C, and D are blocked on the spinlock. How
is core A supposed to make any forward progress if cores B, C, D are
saturating the FSB fighting over ownership of the spinlock integer?
(NEVER spin on an operation that writes or can write!!)

Here's the second biggest: When a CPU does acquire the spinlock after
spinning on it, it takes the mother of all branch mispredictions,
likely flushing all pipelines and resulting in a huge performance
penalty. So, now you have a CPU that's acquired a contested spinlock
-- it's absolutely vital that it perform at the very optimum to
preserve performance -- and you've slowed it to a crawl.

Here's the third biggest: Consider a hyper-threaded CPU. Now imagine
one virtual core holds the spinlock and the other is blocked on it.
The "blocked" CPU will spin at full speed, depriving the other virtual
core of execution resources. So, again, the CPU that holds the
spinlock will run at a massively reduced speed at precisely the time
it most needs to hurry.

There are more, of course. You've almost literally made every possible
mistake. Spinlocks are phenomenally complex. You have to have a very
deep understanding of the CPU internals to get them right.

DS

David Schwartz

Jul 30, 2009, 6:09:34 PM
On Jul 30, 3:03 pm, David Schwartz <dav...@webmaster.com> wrote:

> Are there any known defects?! Your naive spinlock has every defect
> known to man!

Sorry, I misread your code. You don't have this defect:

> Here's the biggest. Consider an x86 machine with four cores. Core A
> holds the spinlock. Cores B, C, and D are blocked on the spinlock. How
> is core A supposed to make any forward progress if cores B, C, D are
> saturating the FSB fighting over ownership of the spinlock integer?
> (NEVER spin on an operation that writes or can write!!)

But you do have the branch misprediction and the execution unit
deprivation.

DS

Thomas Richter

Jul 30, 2009, 6:41:32 PM
David Schwartz wrote:
> On Jul 30, 9:17 am, Thomas Richter <t...@math.tu-berlin.de> wrote:
>> Hi folks,
>>
>> to protect a shared resource which needs to be locked only very shortly,
>> I'm using a very simple spinlock technique:
>>
>> static inline void FutexObtain(volatile int *fu)
>> {
>> while(__sync_fetch_and_add(fu,-1) <= 0) {
>> Locked_Add(fu,1);
>> while(*fu <=0) {};
>> }
>> }
>> static inline void FutexRelease(volatile int *fu)
>> {
>> __sync_fetch_and_add(fu,1);
>> }

As the "biggest" problem is apparently not present, let's move on to the others:

> Here's the second biggest: When a CPU does acquire the spinlock after
> spinning on it, it takes the mother of all branch mispredictions,
> likely flushing all pipelines and resulting in a huge performance
> penalty. So, now you have a CPU that's acquired a contested spinlock
> -- it's absolutely vital that it perform at the very optimum to
> preserve performance -- and you've slowed it to a crawl.

So how does one avoid this? I.e., instead of a negative critique, provide
an improvement. If the CPU leaves the spinlock, the branch cache is of
course no longer up to date - but how could it be, given that it leaves the
spin? Why is the pipeline flushed, and how can one avoid the pipeline stall?

> Here's the third biggest: Consider a hyper-threaded CPU. Now imagine
> one virtual core holds the spinlock and the other is blocked on it.
> The "blocked" CPU will spin at full speed, depriving the other virtual
> core of execution resources. So, again, the CPU that holds the
> spinlock will run at a massively reduced speed at precisely the time
> it most needs to hurry.

I don't care much about hyperthreading CPUs right now, but I could
throw a couple of "nops" into the loop, or a bogus computation. Rest
assured that I tried this; it made no difference, but then this isn't a
hyperthreading CPU in the first place.

> There are more, of course. You've almost literally made every possible
> mistake. Spinlocks are phenomenally complex. You have to have a very
> deep understanding of the CPU internals to get them right.

Ok, so get them right, please.

Thanks,

Thomas

Thomas Richter

Jul 30, 2009, 6:44:09 PM
Loïc Domaigné wrote:

>> I wonder whether there are known defects about this simple spinlocking
>> technique - and remember, the typical number of spins of a core on this
>> lock has to be verified to be very low on average.
>
> I'm a noob in lock-free stuff, but don't you need a membar for the
> while (*fu<=0) {};

Actually, you would think so. But spinning on the plain read rather
than on the "locked add" keeps the bus available for the working processor.

> Second idea: what about using pthread_spin_lock() family
> synchronization device.

Yes, of course, if I'd had them available - which I don't.

So long,
Thomas

Chris M. Thomasson

Jul 30, 2009, 6:46:42 PM

"Loïc Domaigné" <loic.d...@googlemail.com> wrote in message
news:df4ef88b-41e0-4115...@r2g2000yqm.googlegroups.com...
Hi Thomas,

> to protect a shared resource which needs to be locked only very shortly,
> I'm using a very simple spinlock technique:
>
> static inline void FutexObtain(volatile int *fu)
> {
> while(__sync_fetch_and_add(fu,-1) <= 0) {
> Locked_Add(fu,1);
> while(*fu <=0) {};
> }
> }
> static inline void FutexRelease(volatile int *fu)
> {
> __sync_fetch_and_add(fu,1);
> }

[...]

> I'm a noob in lock-free stuff, but don't you need a membar for the
> while (*fu<=0) {};

No. When that while loop exits it loops back around to the
`__sync_fetch_and_add()' which is a membar.


> Second idea: what about using pthread_spin_lock() family
> synchronization device.

Good idea. :^)

Chris M. Thomasson

Jul 30, 2009, 6:56:14 PM
"Thomas Richter" <th...@math.tu-berlin.de> wrote in message
news:h4t7eq$bf0$1...@infosun2.rus.uni-stuttgart.de...
[...]

> I would currently not care enough about hyperthreading CPUs right now, but
> I can throw in a couple of "nops" in the loop, or a bogus computation.
> Feel ensured that I tried this, made no difference, but this isn't a
> hyperthreading CPU in first place.

Use the PAUSE instruction. Here is a simple spinlock (in MSVC++) you can try
out:
________________________________________________________________________
typedef __int32 atomicword_ia32;
typedef atomicword_ia32 spinlock_ia32;


#define SPINLOCK_IA32_INIT 0


#define spinlock_ia32_init(mp_self) ( \
    *(mp_self) = 0 \
)


__declspec(naked) void
spinlock_ia32_acquire(
    spinlock_ia32* const self
) {
    _asm {
        MOV ECX, [ESP + 4]
        MOV EAX, 1

    spinlock_ia32_acquire_retry:
        XCHG [ECX], EAX
        TEST EAX, EAX
        JNZ spinlock_ia32_acquire_failed
        RET

    spinlock_ia32_acquire_failed:
        PAUSE
        JMP spinlock_ia32_acquire_retry
    }
}


__declspec(naked) void
spinlock_ia32_release(
    spinlock_ia32* const self
) {
    _asm {
        MOV EAX, [ESP + 4]
        MOV DWORD PTR [EAX], 0
        RET
    }
}
________________________________________________________________________


You can easily convert this to AT&T syntax and use GAS to assemble it.
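For instance, a hypothetical sketch of the same idea using GCC's __sync built-ins rather than a hand-converted assembly routine might look like this (the names are made up for this example; __sync_lock_test_and_set compiles to a locked XCHG on x86, and the PAUSE is emitted via inline asm, so this assumes GCC on x86):

```c
/* Sketch of the XCHG/PAUSE spinlock above, expressed with GCC
   built-ins. All names here are invented for the example. */
typedef volatile int gnu_spinlock;

static void gnu_spinlock_acquire(gnu_spinlock *self)
{
    /* atomic exchange; returns the previous value of *self */
    while (__sync_lock_test_and_set(self, 1)) {
        __asm__ __volatile__("pause"); /* be polite to HT siblings */
    }
}

static void gnu_spinlock_release(gnu_spinlock *self)
{
    __sync_lock_release(self); /* stores 0 with release semantics */
}
```

Note this version, like the one above, spins directly on the exchange; the test-and-test-and-set refinement is a separate step.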

Chris M. Thomasson

Jul 30, 2009, 7:00:50 PM

"Thomas Richter" <th...@math.tu-berlin.de> wrote in message
news:h4sgvg$kd$1...@infosun2.rus.uni-stuttgart.de...

> Hi folks,
>
> to protect a shared resource which needs to be locked only very shortly,
> I'm using a very simple spinlock technique:
>
>
> static inline void FutexObtain(volatile int *fu)
> {
> while(__sync_fetch_and_add(fu,-1) <= 0) {
> Locked_Add(fu,1);
> while(*fu <=0) {};
> }
> }
> static inline void FutexRelease(volatile int *fu)
> {
> __sync_fetch_and_add(fu,1);
> }
[...]

Why are you using an atomic RMW instruction to release the lock on an x86?
That is unnecessary overhead as you only need a plain MOV instruction:


http://www.intel.com/Assets/PDF/manual/253668.pdf
(read section 8.2)


Also, why are you using FAA? It's clearly not needed. BTW, what's the point
of `Locked_Add()'?
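In C terms, the plain-store release might be sketched as follows (the function name is made up for this example; the empty asm with a "memory" clobber is only a compiler barrier - per the manual's section 8.2, x86 stores are not reordered with older stores, so no LOCK-prefixed instruction is required):

```c
/* Hypothetical x86 release using a plain store instead of an
   atomic RMW. The name is invented for this sketch. */
static void futex_release_plain(volatile int *fu)
{
    __asm__ __volatile__("" ::: "memory"); /* compiler barrier only */
    *fu = 0;                               /* plain MOV, no LOCK prefix */
}
```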

Chris M. Thomasson

Jul 30, 2009, 7:07:13 PM

"Chris M. Thomasson" <n...@spam.invalid> wrote in message
news:h4t8m2$10f9$1...@news.ett.com.ua...

>
> "Thomas Richter" <th...@math.tu-berlin.de> wrote in message
> news:h4sgvg$kd$1...@infosun2.rus.uni-stuttgart.de...
>> Hi folks,
>>
>> to protect a shared resource which needs to be locked only very shortly,
>> I'm using a very simple spinlock technique:
[...]
> Why are you using an atomic RMW instruction to release the lock on an x86?
> That is unnecessary overhead as you only need a plain MOV instruction:
>
>
> http://www.intel.com/Assets/PDF/manual/253668.pdf
> (read section 8.2)
>
>
> Also, why are you using FAA? Its clearly not needed. BTW, what's the point
> of `Locked_Add()'?

One other really important question... Are you sure that you have eliminated
any chance of false sharing between the lock word and any state within the
critical section? Or between the lock word and any other lock word? You need
to ensure that the lock word is padded up to and aligned on an L2 cache line.

Chris M. Thomasson

Jul 30, 2009, 7:24:51 PM

"Thomas Richter" <th...@math.tu-berlin.de> wrote in message
news:h4sgvg$kd$1...@infosun2.rus.uni-stuttgart.de...
> Hi folks,
>
> to protect a shared resource which needs to be locked only very shortly,
> I'm using a very simple spinlock technique:
[...]

I think I notice a potential problem... Think about this scenario:


static inline void FutexObtain(volatile int *fu)
{
O1:     while(__sync_fetch_and_add(fu,-1) <= 0) {
O2:         Locked_Add(fu,1);
O3:         while(*fu <= 0) {};
        }
}

static inline void FutexRelease(volatile int *fu)
{
R1:     __sync_fetch_and_add(fu,1);
}


These operations will be assigned to threads A-F...


assuming lock word is init to 1


A1: O1 // lock word == 0 ** OWNER **
B1: O1 // lock word == -1
C1: O1 // lock word == -2
D1: O1 // lock word == -3
E1: O1 // lock word == -4
A2: R1 // lock word == -3 ** LOCK IS FREE **
F1: O1 // lock word == -2 ** OOPS!!! **


Notice how thread A released the lock at A2, which means the next thread
that comes along should be able to acquire the lock (e.g., thread F).
However, it cannot acquire it because the lock word was -3, and it
decremented it to -2, which forces a slow path. That is absolutely horrible.
It seems as if you're trying to hand off ownership, but even then it doesn't
seem like it would work. Let's continue the sequence to see just how long it
would take for another thread to acquire the lock:


B1: O2 // lock word -1
B1: O3 // spin on -1
C1: O2 // lock word 0
C1: O3 // spin on 0
D1: O2 // lock word 1
C1: O1 // lock word 0 ** OWNER **

Wow! That's a hardcore fuc%ed up nightmare. Also, I think I might be able to
come up with a sequence in which no thread can get the lock, and you
livelock on the acquire portion! ;^(...

Why not just do something simple like this first:

// init lock word to 0

static inline void FutexObtain(volatile int *fu)
{
    while (__sync_lock_test_and_set(fu, 1)) {
        asm("pause"); // backoff
    }
}

static inline void FutexRelease(volatile int *fu)
{
    __sync_lock_release(fu);
}

Were you looking for lock fairness? If so, just use a standard bakery
algorithm and be done with it.
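For reference, a ticket lock - a fixed-size relative of Lamport's bakery algorithm - is one way to get that FIFO fairness. A sketch using the same GCC built-ins (all names are invented for this example, and the PAUSE again assumes x86):

```c
/* FIFO-fair ticket lock: each thread takes a ticket, then waits
   until the "now serving" counter reaches its ticket number. */
typedef struct {
    volatile unsigned next;  /* next ticket to hand out */
    volatile unsigned owner; /* ticket currently being served */
} ticketlock;

#define TICKETLOCK_INIT { 0, 0 }

static void ticketlock_acquire(ticketlock *l)
{
    unsigned me = __sync_fetch_and_add(&l->next, 1);
    while (l->owner != me)
        __asm__ __volatile__("pause"); /* read-only spin */
}

static void ticketlock_release(ticketlock *l)
{
    __sync_synchronize(); /* full barrier; overkill on x86, but portable */
    ++l->owner;           /* hand the lock to the next ticket holder */
}
```

Each waiter spins on a plain read and acquires in arrival order, which also removes the thundering XCHG herd on release.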

Chris M. Thomasson

Jul 30, 2009, 8:52:31 PM
First of all, sorry for all the posts.

"Chris M. Thomasson" <n...@spam.invalid> wrote in message

news:h4ta33$115h$1...@news.ett.com.ua...
[...]


> B1: O2 // lock word -1
> B1: O3 // spin on -1
> C1: O2 // lock word 0
> C1: O3 // spin on 0
> D1: O2 // lock word 1
> C1: O1 // lock word 0 ** OWNER **

[...]

I forgot to increment the threads' sequence numbers in the latter part; here
is the entire corrected sequence:


A1: O1 // lock word == 0 ** OWNER **
B1: O1 // lock word == -1
C1: O1 // lock word == -2
D1: O1 // lock word == -3
E1: O1 // lock word == -4
A2: R1 // lock word == -3 ** LOCK IS FREE **
F1: O1 // lock word == -2 ** OOPS!!! **

B2: O2 // lock word == -1
B3: O3 // spin on -1
C2: O2 // lock word == 0
C3: O3 // spin on 0
D2: O2 // lock word == 1
C4: O1 // lock word == 0 ** OWNER... **


Sorry for any confusion!


;^(...


Anyway, this is a very bad performance problem indeed. I suggest that you
look into it and apply proper corrections to the algorithm.

Loïc Domaigné

Jul 30, 2009, 10:27:54 PM
Good Morning,

> > Second idea: what about using pthread_spin_lock() family
> > synchronization device.
>
> Yes, of course, if I'd had them available - which I don't.

I guess, you mean that the following grep returns nothing, right?
$ grep "pthread_spin" /usr/include/pthread.h

Otherwise you may perhaps just need to tell the compiler to turn the
SUSv3 feature on (by setting _XOPEN_SOURCE=600 or
_POSIX_C_SOURCE=200112L)
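For example, assuming a glibc system, a quick round-trip check that the spinlock family is actually usable once the macro is defined before the first system header (the helper name is made up for this sketch):

```c
#define _XOPEN_SOURCE 600 /* must come before any #include */
#include <pthread.h>

/* spin_roundtrip() is an invented name: it returns 0 if one
   lock/unlock cycle on a process-private spinlock succeeds. */
static int spin_roundtrip(void)
{
    pthread_spinlock_t lock;
    if (pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE)) return -1;
    if (pthread_spin_lock(&lock))    return -2;
    if (pthread_spin_unlock(&lock))  return -3;
    if (pthread_spin_destroy(&lock)) return -4;
    return 0;
}
```

On older glibc you would also need to link with -lpthread.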

Cheers,
LD

"If debugging is the process of removing bugs, then programming must
be the process of putting them in." -- Edsger W. Dijkstra.

David Schwartz

Jul 31, 2009, 2:06:47 AM
On Jul 30, 3:41 pm, Thomas Richter <t...@math.tu-berlin.de> wrote:

> > Here's the second biggest: When a CPU does acquire the spinlock after
> > spinning on it, it takes the mother of all branch mispredictions,
> > likely flushing all pipelines and resulting in a huge performance
> > penalty. So, now you have a CPU that's acquired a contested spinlock
> > -- it's absolutely vital that it perform at the very optimum to
> > preserve performance -- and you've slowed it to a crawl.

> So how to avoid this, i.e. instead of a negative critique, provide an
> improvement. If the CPU leaves the spinlock, of course the branch cache
> is no longer up to date, but how can it be if it leaves the spin. Why is
> the pipeline flushed? How, how can one avoid the pipeline stall?

His implementation is not worth fixing. It's just too broken. You
avoid it by implementing a proper spinlock from top to bottom.

> > Here's the third biggest: Consider a hyper-threaded CPU. Now imagine
> > one virtual core holds the spinlock and the other is blocked on it.
> > The "blocked" CPU will spin at full speed, depriving the other virtual
> > core of execution resources. So, again, the CPU that holds the
> > spinlock will run at a massively reduced speed at precisely the time
> > it most needs to hurry.

> I would currently not care enough about hyperthreading CPUs right now,
> but I can throw in a couple of "nops" in the loop, or a bogus
> computation. Feel ensured that I tried this, made no difference, but
> this isn't a hyperthreading CPU in first place.

You can't fix this kind of code by trying things and seeing how they
work. You're supporting CPUs past, present, and future. (Unless this is
some kind of niche code.) You have to go by *guaranteed* behavior, not
observed behavior. (Unless you're doing CPU-specific tuning, but I
don't see any CPU selection code in there.)

> > There are more, of course. You've almost literally made every possible
> > mistake. Spinlocks are phenomenally complex. You have to have a very
> > deep understanding of the CPU internals to get them right.

> Ok, so get them right, please.

You get them right by writing a proper spinlock, not by fixing a
completely broken spinlock.

You want a jet plane and you present a jalopy. You don't "fix" the
jalopy to make a jet plane, you get a jet plane.

DS

Thomas Richter

Jul 31, 2009, 1:54:23 PM
David Schwartz wrote:

> You want a jet plane and you present a jalopy. You don't "fix" the
> jalopy to make a jet plane, you get a jet plane.

Since you're not providing any hints how to get it right, I assume you
don't know either. In the latter case, please stop wasting bandwidth.

Thank you.

Thomas Richter

Jul 31, 2009, 1:57:04 PM

I would hope so - the data around the counter remains constant
throughout the remaining program and is no longer touched or required. A
cache line is around 16 bytes, is that right?

So long,
Thomas

Chris Friesen

Jul 31, 2009, 3:57:33 PM
Thomas Richter wrote:
> Chris M. Thomasson wrote:

>> One other really important question... Are you sure that you have
>> eliminated any chance of false-sharing between the lock word and any
>> state within the critical section? Or between the lock work and any
>> other lock word? You need to ensure that the lock word is padded up to
>> and aligned on a L2 cache line.
>
> I would hope so - the data around the counter remains constant
> throughout the remaining program and is no longer touched or required. A
> cache line is around 16 bytes, is that right?

I think core2 has a line size of 64 bytes.

Chris

Chris M. Thomasson

Jul 31, 2009, 4:04:42 PM
"Thomas Richter" <th...@math.tu-berlin.de> wrote in message
news:h4vb59$8tp$2...@infosun2.rus.uni-stuttgart.de...

More like 64 bytes, or 128 bytes. It's all platform dependent. You can
enforce this by doing something like:
_____________________________________________________________________
#include <stdio.h>
#include <assert.h>


#if ! defined (UINTPTR_TYPE)
# define UINTPTR_TYPE size_t
#endif


typedef UINTPTR_TYPE uintptr_type;


typedef char static_assert[
    sizeof(uintptr_type) == sizeof(void*) ? 1 : -1
];


#define ALIGN_UP(mp_ptr, mp_align) \
    ((void*)( \
        (((uintptr_type)(mp_ptr)) + ((mp_align) - 1)) \
        & ~(((mp_align) - 1)) \
    ))


#define L2CACHE 128U


union lockword_pad {
    int lockword;
    char pad[L2CACHE];
};


static unsigned char g_lockword_raw[
    sizeof(union lockword_pad) + L2CACHE - 1
];


static int* g_lockword = NULL;


int main(void) {
    g_lockword = ALIGN_UP(g_lockword_raw, L2CACHE);

    assert(! ((uintptr_type)(g_lockword) % L2CACHE));

    return 0;
}
_____________________________________________________________________


Now, `g_lockword' points to a lockword that is all alone on its very own
cache line. This gets rid of false-sharing between it and any data in a
critical-section or other lockwords for that matter.
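As a side note, with GCC one can also request the alignment directly from the compiler and skip the manual ALIGN_UP step (a sketch assuming GCC/Clang; the 128 here is an assumed L2 line size, not a queried one):

```c
#define L2CACHE 128

/* One lock word alone on (at least) its own cache line: the
   union pads it out to L2CACHE bytes and the attribute aligns
   the object on an L2CACHE-byte boundary. */
union lockword_pad {
    int lockword;
    char pad[L2CACHE];
} __attribute__((aligned(L2CACHE)));

static union lockword_pad g_lock = { 0 };
```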

Thomas Richter

Jul 31, 2009, 8:03:07 PM
Loïc Domaigné wrote:
> Good Morning,
>
>>> Second idea: what about using pthread_spin_lock() family
>>> synchronization device.
>> Yes, of course, if I'd had them available - which I don't.
>
> I guess, you mean that the following grep returns nothing, right?
> $ grep "pthread_spin" /usr/include/pthread.h

It returns "file not found" in some sense. This is not supposed to work
on WinXP, Linux or any specific operating system.

> Otherwise you may perhaps just need to tell the compiler to turn the
> SuSv6 feature on (by setting _XOPEN_SOURCE=600 or
> _C_POSIX_SOURCE=200112L)

Sure, if I had that available, I agree it would be the better option. But
it's not available.

So long,
Thomas

David Schwartz

Jul 31, 2009, 8:15:35 PM
On Jul 31, 10:54 am, Thomas Richter <t...@math.tu-berlin.de> wrote:

> David Schwartz wrote:

> > You want a jet plane and you present a jalopy. You don't "fix" the
> > jalopy to make a jet plane, you get a jet plane.

> Since you're not providing any hints how to get it right, I assume you
> don't know either.

Right. I immediately spotted two complex and subtle deficiencies in
your code and so you naturally assume I have no idea how to fix them.

In fact, pointing out the biggest problems is the best hint for how to
get it right I could give you. Other than to say, "go find a sensible
spinlock written by someone who knows what they're doing and learn
from it".

> In the latter case, please stop wasting bandwidth.

Huh? What was the former case?

Are you upset that I didn't give you a course in how to write proper
synchronization primitives? They're *phenomenally* complex, and people
spend their entire lives developing them.

DS

David Schwartz

Jul 31, 2009, 11:03:40 PM
On Jul 31, 5:03 pm, Thomas Richter <t...@math.tu-berlin.de> wrote:

> > I guess, you mean that the following grep returns nothing, right?
> > $ grep "pthread_spin" /usr/include/pthread.h

> It returns "file not found" in some sense. This is not supposed to work
> on WinXP, Linux or any specific operating system.

You can't write "generic spinlocks" and expect decent performance.
Synchronization primitives get their performance from details about
how the hardware is implemented. You can, at least to some extent,
write generic x86 spinlocks.

> > Otherwise you may perhaps just need to tell the compiler to turn the
> > SuSv6 feature on (by setting _XOPEN_SOURCE=600 or
> > _C_POSIX_SOURCE=200112L)

> Sure, having that available I agree that this is the better option. But
> its not available.

I think you are proceeding in a direction that doesn't make any sense.
Your platform has mutexes, but you want spinlocks. Presumably this is
for performance purposes. But you're demanding a spinlock
implementation with no knowledge of the hardware and operating system
details. It is virtually impossible for such a spinlock to outperform
the platform's native mutexes.

I would suggest using native spinlocks where they exist (Windows has
them, most modern POSIX systems have them) and using mutexes where
they don't.

DS

Amine

Aug 1, 2009, 4:40:23 AM

David Schwartz wrote:
> [...]

> Spinlocks are phenomenally complex. You have to have a very
> deep understanding of the CPU internals to get them right.

You have said complex?

Is it just a false impression?

Or is it the truth?

Does the complex or difficult really exist?

In my opinion:

Even if we don't comprehend the things,

*NOTHING* is complex or difficult.

To say something is complex or difficult is just a FALSE impression.

Why?

Because as soon as you COMPREHEND the thing *COMPLETELY*,

the thing will become SIMPLE,

and this final impression IS the REAL truth.

Hence, *NOTHING* is complex or difficult.

Think about it...


Regards,
Amine Moulay Ramdane.

Chris M. Thomasson

Aug 1, 2009, 6:30:58 AM
"Chris M. Thomasson" <n...@spam.invalid> wrote in message
news:h4t8de$105j$1...@news.ett.com.ua...

> "Thomas Richter" <th...@math.tu-berlin.de> wrote in message
> news:h4t7eq$bf0$1...@infosun2.rus.uni-stuttgart.de...
> [...]
>> I would currently not care enough about hyperthreading CPUs right now,
>> but I can throw in a couple of "nops" in the loop, or a bogus
>> computation. Feel ensured that I tried this, made no difference, but this
>> isn't a hyperthreading CPU in first place.
>
> Use the PAUSE instruction. Here is a simple spinlock (in MSVC++) you can
> try out:
> ________________________________________________________________________
[...]

> ________________________________________________________________________
>
>
>
>
> You can easily convert this to AT&T syntax and use GAS to assemble it.


Here is another version which does not spin on XCHG:


________________________________________________________________________
typedef __int32 atomicword_ia32;
typedef atomicword_ia32 spinlock_ia32;


#define SPINLOCK_IA32_INIT 0


#define spinlock_ia32_init(mp_self) ( \
    *(mp_self) = 0 \
)


__declspec(naked) void
spinlock_ia32_acquire(
    spinlock_ia32* const self
) {
    _asm {
        MOV ECX, [ESP + 4]
        MOV EAX, 1

    spinlock_ia32_acquire_retry:
        XCHG [ECX], EAX
        TEST EAX, EAX
        JNZ spinlock_ia32_acquire_failed
        RET

    spinlock_ia32_acquire_failed:
        PAUSE
        MOV EDX, [ECX]
        TEST EDX, EDX
        JNZ spinlock_ia32_acquire_failed
        JMP spinlock_ia32_acquire_retry
    }
}


__declspec(naked) void
spinlock_ia32_release(
    spinlock_ia32* const self
) {
    _asm {
        MOV EAX, [ESP + 4]
        MOV DWORD PTR [EAX], 0
        RET
    }
}
________________________________________________________________________


I suggest that you try both versions out and see what you come up with in
terms of performance characteristics.

David Schwartz

Aug 1, 2009, 9:35:39 AM
On Aug 1, 3:30 am, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> __declspec(naked) void
> spinlock_ia32_release(
>  spinlock_ia32* const self
> ) {
>   _asm {
>     MOV EAX, [ESP + 4]
>     MOV DWORD PTR [EAX], 0
>     RET
>   }}

This will deadlock on SMP PentiumPro systems due to an erratum. If the
second MOV occurs just as the other CPU releases the cache line, it
won't get invalidated and the next spinlock acquisition, if it takes
place on that CPU, may erroneously spin. The deadlock won't break
until the other CPU happens to try to acquire the spinlock.

I'm not sure anyone cares about SMP PPro systems anymore though.

DS

Thomas Richter

Aug 1, 2009, 12:30:20 PM
David Schwartz wrote:
> On Jul 31, 5:03 pm, Thomas Richter <t...@math.tu-berlin.de> wrote:
>
>>> I guess, you mean that the following grep returns nothing, right?
>>> $ grep "pthread_spin" /usr/include/pthread.h
>
>> It returns "file not found" in some sense. This is not supposed to work
>> on WinXP, Linux or any specific operating system.
>
> You can't write "generic spinlocks" and expect decent performance.
> Synchronization primitives get their performance from details about
> how the hardware is implemented. You can, at least to some extent,
> write generic x86 spinlocks.

Now, guess what this is about? Writing a generic x86 spin lock, to some
extent.

>>> Otherwise you may perhaps just need to tell the compiler to turn the
>>> SuSv6 feature on (by setting _XOPEN_SOURCE=600 or
>>> _C_POSIX_SOURCE=200112L)
>
>> Sure, having that available I agree that this is the better option. But
>> its not available.
>
> I think you are proceeding in a direction that doesn't make any sense.
> You platform has mutexes, but you want spinlocks. Presumably this is
> for performance purposes. But you're demanding a spinlock
> implementation with no knowledge of the hardware and operating system
> details. It is virtually impossible for such a spinlock to outperform
> the platform's native mutexes.

Are you misreading my posts on purpose? Just to say this again: "My
platform does not have mutexes available". Is that clear by now?

> I would suggest using native spinlocks where they exist (Windows has
> them, most modern POSIX systems have them) and using mutexes where
> they don't.

This is *neither* windows *nor* POSIX. Got it?

So long,
Thomas

Amine

Aug 1, 2009, 12:50:49 PM


To make it CLEARER...

I wrote:

" You have said complex?

Is it just a false impression?

Or is it the truth?

Does the complex or difficult really exist?

In my opinion:

Even if we don't comprehend the things,

*NOTHING* is complex or difficult.

To say something is complex or difficult is just a FALSE impression.

Why?

Because as soon as you COMPREHEND the thing *COMPLETELY*,

the thing will become SIMPLE,

and this final impression IS the REAL truth.

Hence, *NOTHING* is complex or difficult.

Think about it..."

To make it simple, because it's in *REALITY* simple:

If you have noticed, I said:

"Because as soon as you COMPREHEND the thing *COMPLETELY*,

the thing will become SIMPLE,

and this final impression IS the REAL truth.

Hence, *NOTHING* is complex or difficult.
"

So:

If you don't comprehend a thing,

and

you have a false impression of it,

and

another person has comprehended the same thing *COMPLETELY*,

where does the truth lie now?

On the complete-understanding side?

Or on the not-complete-understanding side?

The truth is:

It is on the understanding side.

And since, as I said before:

As soon as you COMPREHEND the thing *COMPLETELY*,

the thing will become SIMPLE,

and this final impression IS the REAL truth.

Hence, *NOTHING* is complex or difficult.


Regards,
Amine Moulay Ramdane


Thomas Richter

Aug 1, 2009, 1:02:41 PM
Hi folks,

thanks to those who helped (and despite those who only flamed), the
overall performance of this simple spinlock has improved considerably now.

Here's the current solution for further thoughts and comments:

To obtain this futex, this is what we have currently:

static inline void FutexObtain(volatile int *fu)
{
    asm volatile (
        "\n"
        "0:\n"
        "\tmovl $0,%%eax\n"
        "\tmovl $1,%%ebx\n"
        "\tlock\n"
        "\tcmpxchgl %%ebx,(%0)\n"
        "\tje 1f\n"
        "2:\n"
        "\tpause\n"
        "\tmovl (%0),%%eax\n"
        "\ttest %%eax,%%eax\n"
        "\tjne 2b\n"
        "\tjmp 0b\n"
        "1:\n"
        :
        : "r"(fu)
        : "%eax", "%ebx" );
}

All this is pretty gnu-ish, the closest C approximation of this would be
(without the pause):

static inline void FutexObtain(volatile int *fu)
{
    while(_InterlockedCompareExchange(fu,1,0)) {
        while(*fu) { };
    }
}

To release a futex, I'm now simply doing:

static inline void FutexRelease(volatile int *fu)
{
    (*fu) = 0;
}

The futex variable itself is aligned to a 256 byte boundary, with the
next 256 bytes reserved from any other usage, and not touched by the
remaining program. This helped to improve the performance considerably.

So, thanks again to all those who did help instead of just flaming
around, thanks for the support, and if you see further chances to
improve this, please let me know.

So long,
Thomas

Chris M. Thomasson

Aug 1, 2009, 1:32:38 PM
"David Schwartz" <dav...@webmaster.com> wrote in message
news:c2aa7764-7da3-43ec...@g1g2000pra.googlegroups.com...

On Aug 1, 3:30 am, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> > __declspec(naked) void
> > spinlock_ia32_release(
> > spinlock_ia32* const self
> > ) {
> > _asm {
> > MOV EAX, [ESP + 4]
> > MOV DWORD PTR [EAX], 0
> > RET
> > }}

> This will deadlock on SMP PentiumPro systems due to an erratum. If the
> second MOV occurs just as the other CPU releases the cache line, it
> won't get invalidated and the next spinlock acquisition, if it takes
> place on that CPU, may erroneously spin. The deadlock won't break
> until the other CPU happens to try to acquire the spinlock.

Ah yes:

http://lkml.org/lkml/2008/1/21/366

I totally forgot about this nonsense. Thanks David for the reminder; I
appreciate it.


> I'm not sure anyone cares about SMP PPro systems anymore though.

Yeah. Well, I am not sure it's "worth" creating a workaround (e.g., using
XCHG to unlock)...

Chris M. Thomasson, Aug 1, 2009, 1:54:12 PM
"Amine" <ami...@colba.net> wrote in message
news:4c3af672-cef1-4110...@j32g2000yqh.googlegroups.com...

>
>
>
> To make it more CLEARER...
>
> I wrote:
>
> " You have said complex ?
>
> Is it just a false impression ?
>
> Or is it the truth ?
>
> Does the complex or difficul really exist ?
>
> In my opinion:
>
> Even if we don't comprehand the things
>
> *NOTHING* is complex or difficul.
>
> To say something is complex or difficul is just a FALSE impression
>
> Why ?
>
> Cause as soon as you COMPREHAND the thing *COMPLETLY*
>
> The thing will become SIMPLE
[...]

Sometimes, it seems as if some things are too complex and/or difficult for
some people to completely comprehend.

Chris M. Thomasson, Aug 1, 2009, 2:14:58 PM

"Thomas Richter" <th...@math.tu-berlin.de> wrote in message
news:h51sb6$aln$1...@infosun2.rus.uni-stuttgart.de...

> Hi folks,
>
> thanks to those who helped (and despite of those who only flamed) the
> overall performance of this simple spinlock improved considerably now.
>
> Here's the current solution for further thoughts and comments:
>
> To obtain this futex, this is what we have currently:
>
> static void inline FutexObtain(volatile int *fu)
> {
> asm volatile (
> "\n"
> "0:\n"
> "\tmovl $0,%%eax\n"
> "\tmovl $1,%%ebx\n"
> "\tlock\n"
> "\tcmpxchgl %%ebx,(%0)\n"
> "\tje 1f\n"
> "2:\n"
> "\tpause\n"
> "\tmovl (%0),%%eax\n"
> "\ttest %%eax,%%eax\n"
> "\tjne 2b\n"
> "\tjmp 0b\n"
> "1:\n"
> :
> : "r"(fu)
> : "%eax", "%ebx" );
> }
>

[...]

I changed it to this in order to get it to compile under GCC:
_________________________________________________________________
__attribute__((always_inline))
static __inline__ void
FutexObtain(
  volatile int *fu
) {
  __asm__ __volatile__ (
    "\n"
    "0:\n"
    "\tmovl $0,%%eax\n"
    "\tmovl $1,%%ebx\n"
    "\tlock\n"
    "\tcmpxchgl %%ebx,(%0)\n"
    "\tje 1f\n"
    "2:\n"
    "\tpause\n"
    "\tmovl (%0),%%eax\n"
    "\ttest %%eax,%%eax\n"
    "\tjne 2b\n"
    "\tjmp 0b\n"
    "1:\n"
    :
    : "r"(fu)
    : "cc", "memory", "%eax", "%ebx" );
}


__attribute__((always_inline))
static __inline__ void
FutexRelease(
  volatile int *fu
) {
  __asm__ __volatile__ ("" : : : "memory");
  *fu = 0;
}
_________________________________________________________________


I would recommend that you add "memory" to the clobbers list to get a
compiler barrier. Perhaps add a dummy compiler barrier to the
`FutexRelease()' procedure as well.


> The futex variable itself is aligned to a 256 byte boundary, with the next
> 256 bytes reserved from any other usage, and not touched by the remaining
> program. This helped to improve the performance considerably.

I bet! False-sharing is a performance nightmare.


> So, thanks again to all those who did help instead of just flaming around,
> thanks for the support, and if you see further chances to improve this,
> please let me know.

No problem. AFAICT, the algorithm is looking pretty good now. Much better
than the previous one.

:^)

David Schwartz, Aug 1, 2009, 7:01:22 PM
On Aug 1, 9:30 am, Thomas Richter <t...@math.tu-berlin.de> wrote:

> Are you misreading my posts on purpose? Just to say this again: "My
> platform does not have mutexes available". Is that clear by now?

You have an SMP x86 platform that has no mutexes?! Seriously?

My apologies if I missed where you said that, but that's an extremely
unusual situation.

Do you have to support all x86 CPUs or just some subset?

DS

David Schwartz, Aug 1, 2009, 7:10:11 PM
On Aug 1, 10:32 am, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> Yeah. Well, I am not sure it's "worth" creating a workaround (e.g., using
> XCHG to unlock)...

My code had such a workaround as soon as this erratum was discovered.
I removed the workaround for the erratum (and changed the code to
refuse to run on SMP PPro systems) in early 2008.

I'm looking forward to the day I can completely drop support for all
pre-P3 systems.

The ability to inline the release of a spinlock is a big win, so
making code conditional on CPU type is not really doable.

Though you could at least catch illegal instruction faults and
dynamically patch 'ud2' to 'lock' on SMP systems and 'nop' on UP
systems if you really wanted to. This means you can't use 'xchg' and
have to use something like 'cmpxchgl' instead though. Since most
systems are SMP these days, this is probably another obsolete
workaround/optimization.

DS

David Schwartz, Aug 1, 2009, 7:30:43 PM
On Aug 1, 10:02 am, Thomas Richter <t...@math.tu-berlin.de> wrote:

> thanks to those who helped (and despite of those who only flamed) the

I apologize for holding you back by pointing out what was wrong with
your code.

> static void inline FutexObtain(volatile int *fu)
> {
>    asm volatile (
>                 "\n"
>                 "0:\n"
>                 "\tmovl $0,%%eax\n"
>                 "\tmovl $1,%%ebx\n"
>                 "\tlock\n"
>                 "\tcmpxchgl %%ebx,(%0)\n"
>                 "\tje 1f\n"
>                 "2:\n"
>                 "\tpause\n"
>                 "\tmovl (%0),%%eax\n"
>                 "\ttest %%eax,%%eax\n"
>                 "\tjne 2b\n"
>                 "\tjmp 0b\n"
>                 "1:\n"
>                 :
>                 : "r"(fu)
>                 : "%eax", "%ebx" );
>
> }

This looks pretty good, but it's a highly-specialized spinlock. It
will likely outperform more typical spinlocks for the use cases it is
good for, but will fail horribly if misused. That's fine, so long as
you're careful not to misuse it.

Typical spinlocks, for example, spin for only a limited amount of
time. This spinlock will spin forever. Typical spinlocks have
safeguards against being used on UP machines. This doesn't. Typical
spinlocks have safeguards against scenarios where a thread is
descheduled while holding a spinlock and then other threads waste all
their timeslices on all CPUs spinning on a spinlock that can't be
released because the only thread that can release it can't get to the
CPU because other threads with higher dynamic priorities are spinning
on it.

So long as you understand these limitations and use this code only for
the cases in which it is suitable, it should do quite well. You should
use it in cases where it is not expected to ever be contested very
much -- where the lock is needed just on the off chance two threads
happen to try to do the same thing at the same time. You should not
use it for cases where more threads than there are CPUs are likely to
be competing for the same spinlock. You should not use it for locks
that are held for more than about a thousand instructions. You should
not use it among threads of different static priorities.

By not dealing with any of these cases, you can get more performance
in appropriate cases. However, this is not very user-friendly -- you
have to know what you're doing to use this, and changing conditions
can potentially bite you.

DS

Chris M. Thomasson, Aug 2, 2009, 2:04:37 PM

"David Schwartz" <dav...@webmaster.com> wrote in message
news:36cb500f-73d3-4222...@g1g2000pra.googlegroups.com...

On Aug 1, 10:02 am, Thomas Richter <t...@math.tu-berlin.de> wrote:

[...]

> This looks pretty good, but it's a highly-specialized spinlock. It
> will likely outperform more typical spinlocks for the use cases it is
> good for, but will fail horribly if misused. That's fine, so long as
> you're careful not to misuse it.

Agreed.


> Typical spinlocks, for example, spin for only a limited amount of
> time.

I need a quick clarification. Are you referring to adaptive mutexes?


> This spinlock will spin forever. Typical spinlocks have
> safeguards against being used on UP machines. This doesn't. Typical
> spinlocks have safeguards against scenarios where a thread is
> descheduled while holding a spinlock and then other threads waste all
> their timeslices on all CPUs spinning on a spinlock that can't be
> released because the only thread that can release it can't get to the
> CPU because other threads with higher dynamic priorities are spinning
> on it.

> So long as you understand these limitations and use this code only for
> the cases in which it is suitable, it should do quite well. You should
> use it in cases where it is not expected to ever be contested very
> much -- where the lock is needed just on the off chance two threads
> happen to try to do the same thing at the same time. You should not
> use it for cases where more threads than there are CPUs are likely to
> be competing for the same spinlock. You should not use it for locks
> that are held for more than about a thousand instructions. You should
> not use it among threads of different static priorities.

> By not dealing with any of these cases, you can get more performance
> in appropriate cases. However, this is not very user-friendly -- you
> have to know what you're doing to use this, and changing conditions
> can potentially bite you.


FWIW, I just wanted to point out that all of the above also applies to the
two spinlock examples I posted in this thread.

Francis Moreau, Aug 2, 2009, 4:49:49 PM
Hello,

On Jul 31, 12:03 am, David Schwartz <dav...@webmaster.com> wrote:
>
> Here's the second biggest: When a CPU does acquire the spinlock after
> spinning on it, it takes the mother of all branch mispredictions,
> likely flushing all pipelines and resulting in a huge performance
> penalty. So, now you have a CPU that's acquired a contested spinlock
> -- it's absolutely vital that it perform at the very optimum to
> preserve performance -- and you've slowed it to a crawl.

Could you expand on that point? I have no idea what you mean here.

For example, where does the previous code take all those branch
mispredictions?

Thanks

Thomas Richter, Aug 2, 2009, 8:46:32 PM
David Schwartz wrote:
> On Aug 1, 9:30 am, Thomas Richter <t...@math.tu-berlin.de> wrote:
>
>> Are you misreading my posts on purpose? Just to say this again: "My
>> platform does not have mutexes available". Is that clear by now?
>
> You have an SMP x86 platform that has no mutexes?! Seriously?

Seriously. Due to constraints that are beyond me, the final product has
no access to operating system services, and doesn't require them for
other parts of the operation (heavy number crunching, basically).
Specifically, I do not know which operating system it will run on.

> My apologies if I missed where you said that, but that's an extremely
> unusual situation.

> Do you have to support all x86 CPUs or just some subset?

Currently SMP machines with a uniform memory architecture are of
concern. Specifically, I'm looking into dual- and many-core machines
where all cores are on the same chip. NUMA is currently not of any
concern. Basically, this means P4 upwards. Hyperthreading may be of some
relevance, but I guess for the given algorithms performance benefits are
not worth it.

Anyhow. The spinlock is only required to protect access to a short code
sequence that maintains a common resource of multiple threads. That is,
the spinlock will never spin for a long time - only for the time the
blocking core runs the given code sequence - and this sequence is pretty
short. Basically, a list of jobs to perform by the threads.
Unfortunately, the structure of this job queue is too complex to allow
lock-free operations at least right now. This is something I need to
look into in the future.

So long,
Thomas


David Schwartz, Aug 2, 2009, 10:12:21 PM
On Aug 2, 5:46 pm, Thomas Richter <t...@math.tu-berlin.de> wrote:

> Anyhow. The spinlock is only required to protect access to a short code
> sequence that maintains a common resource of multiple threads. That is,
> the spinlock will never spin for a long time - only for the time the
> blocking core runs the given code sequence - and this sequence is pretty
> short. Basically, a list of jobs to perform by the threads.
> Unfortunately, the structure of this job queue is too complex to allow
> lock-free operations at least right now. This is something I need to
> look into in the future.

Then it sounds like your average case performance of this code should
be perfect. Its worst case performance could be pretty poor, but
you're very unlikely to see that consistently on the same job or the
same machine, so it shouldn't matter.

I would just be careful as the number of threads goes up (if it does).
You don't want every core on the machine spinning on the same spinlock
while the thread that holds the spinlock is unable to run.

I should point out one other issue with this spinlock -- spinning
threads will see their dynamic priority drop. So when they finally do
get the spinlock, they'll be easily pre-empted by other tasks. On some
operating systems, this could result on your application getting
effectively lower priority -- especially if other threads "get in
line" behind the thread that holds the spinlock and similarly see
their priorities drop.

Keep a close eye on how long you hold this spinlock. If it starts to
get over about 1,000 instruction cycles, I would suggest reconsidering
the design. (Unless the code is grab the spinlock, get a job, release
the spinlock, then do a job for many, many thousands of instructions.
Then it shouldn't much matter.)
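That last pattern ("grab the lock, get a job, release, then do the job") can be sketched like this. The queue/job types and function names here are stand-ins for illustration, not the OP's real queue code:

```c
#include <stddef.h>

/* Hypothetical job-queue types for the sketch. */
typedef struct job { struct job *next; int payload; } job;

typedef struct {
    volatile int lock;   /* 0 = free, 1 = held */
    job *head;
} job_queue;

static inline void queue_lock(volatile int *fu)
{
    while (__sync_lock_test_and_set(fu, 1)) {
        while (*fu) { }  /* read-only spin */
    }
}

static inline void queue_unlock(volatile int *fu)
{
    __sync_lock_release(fu);
}

/* Pop one job; the critical section is just a couple of loads/stores. */
static job *queue_pop(job_queue *q)
{
    queue_lock(&q->lock);
    job *j = q->head;
    if (j)
        q->head = j->next;
    queue_unlock(&q->lock);
    return j;   /* the (possibly long-running) job executes outside the lock */
}
```

The point is that the spinlock only covers the pointer update, so hold times stay in the tens of instructions regardless of how long each job runs.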

DS

David Schwartz, Aug 2, 2009, 10:18:29 PM
On Aug 2, 1:49 pm, Francis Moreau <francis.m...@gmail.com> wrote:

> Could you develop more on that point because I have no idea what
> you mean here ?

Sure.

> for example where does the previous code take all branch
> mispredictions ?

Here is his lock acquire function:

static inline void FutexObtain(volatile int *fu)
{
while(__sync_fetch_and_add(fu,-1) <= 0) {
Locked_Add(fu,1);
while(*fu <=0) {};
}
}

Now, consider a thread acquiring a contested spinlock. It will spin in
the inner 'while' loop until the other thread releases the spinlock.
At this point, the branch prediction engine will be assuming the
'while' loop will spin again, but it won't. So at the very most
critical instant -- when a thread has just acquired a contested
spinlock -- you will take a horrible branch misprediction.
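One commonly used mitigation (a sketch, not the OP's code) is to tell the compiler which way the branches usually go with __builtin_expect, so the uncontended acquire is the straight-line path and the spin loop is laid out out-of-line:

```c
/* Hypothetical acquire function: hint that the lock is usually free,
   so the fall-through path is the acquired path. */
static inline void spin_acquire_hinted(volatile int *fu)
{
    if (__builtin_expect(__sync_lock_test_and_set(fu, 1), 0)) {
        /* contended: spin read-only, then retry the atomic op */
        do {
            while (*fu) { }
        } while (__sync_lock_test_and_set(fu, 1));
    }
}
```

This only shapes static prediction and code layout; it can't eliminate the dynamic misprediction on loop exit, but it keeps the common uncontended case fast.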

What's sad about that is that Intel went out of their way to avoid
this problem. For example, see section 2.1 of this paper:
http://cache-www.intel.com/cd/00/00/01/76/17689_w_spinlock.pdf

DS

David Schwartz, Aug 2, 2009, 10:26:05 PM
On Aug 2, 11:04 am, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> > Typical spinlocks, for example, spin for only a limited amount of
> > time.

> I need a quick clarification. Are you referring to adaptive mutexs?

Almost. The terms are not used with 100% precision, but normally an
"adaptive mutex" refers to a mutex that spins only if the thread that
holds the lock is known to be running on another core. I was referring
to a spinlock that unconditionally spins (if the lock cannot be
immediately acquired), but spins only for a finite amount of time.
These are sometimes called "spin/sleep" locks.
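A spin/sleep lock of that kind might look roughly like this (a sketch under assumed constants; SPIN_LIMIT and the function name are made up, and sched_yield is the POSIX fallback here):

```c
#include <sched.h>

#define SPIN_LIMIT 1000  /* arbitrary bound for the example */

/* Spin a bounded number of times, then yield the CPU instead of
   burning the rest of the timeslice. */
static inline void spin_sleep_lock(volatile int *fu)
{
    for (;;) {
        if (!__sync_lock_test_and_set(fu, 1))
            return;                      /* got it */
        int spins = 0;
        while (*fu) {                    /* read-only wait */
            if (++spins > SPIN_LIMIT) {
                sched_yield();           /* let the lock holder run */
                spins = 0;
            }
        }
    }
}
```

Yielding bounds the damage when the holder has been descheduled, at the cost of a syscall on the contended path.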

The problem is that sometimes programmers are simply wrong about what
synchronization type is correct. This may be because they don't
understand synchronization or they don't understand their workload.
But it can also be just because code is used in a way they didn't
expect or on hardware or schedulers that don't act the way they're
used to. Because of this, typical synchronization primitives are
designed to "fail gracefully".

This spinlock will not fail gracefully. So it's important only to use
it for problems for which it is suited. (The obvious example -- it
will fail horribly on UP machines.)

> > By not dealing with any of these cases, you can get more performance
> > in appropriate cases. However, this is not very user-friendly -- you
> > have to know what you're doing to use this, and changing conditions
> > can potentially bite you.

> FWIW, I just wanted to point out that all of the above also applies to the
> two spinlock examples I posted in this thread.

Most platforms, by default, provide very user-friendly mutexes and
spinlocks. They will generally not hurt you too badly if you use a
spinlock where you shouldn't or use a mutex inappropriately. They do
have to sacrifice some best case performance to do this. For example,
a spinlock that can sleep has to check if a thread is sleeping when
the spinlock is released (so it can wake it). This reduces the best
case performance at least a bit.

DS

Francis Moreau, Aug 3, 2009, 3:57:19 AM
On Aug 3, 4:18 am, David Schwartz <dav...@webmaster.com> wrote:
> On Aug 2, 1:49 pm, Francis Moreau <francis.m...@gmail.com> wrote:
>
> > Could you develop more on that point because I have no idea what
> > you mean here ?
>
> Sure.

thanks !

>
> > for example where does the previous code take all branch
> > mispredictions ?
>
> Here is his lock acquire function:
>
>    static inline void FutexObtain(volatile int *fu)
>    {
>      while(__sync_fetch_and_add(fu,-1) <= 0) {
>        Locked_Add(fu,1);
>        while(*fu <=0) {};
>      }
>    }
>
> Now, consider a thread acquiring a contested spinlock. It will spin in
> the inner 'while' loop until the other thread releases the spinlock.
> At this point, the branch prediction engine will be assuming the
> 'while' loop will spin again, but it won't. So at the very most
> critical instant -- when a thread has just acquired a contested
> spinlock -- you will take a horrible branch misprediction.

Ah I see now.

>
> What's sad about that is that Intel went out of their way to avoid
> this problem. For example, see section 2.1 of this paper:
> http://cache-www.intel.com/cd/00/00/01/76/17689_w_spinlock.pdf
>

thanks for the pointer, it looks very interesting.

David Schwartz, Aug 3, 2009, 5:59:51 PM
On Jul 30, 4:24 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> Why not just do something simple like this first:
>
>    // init lock word to 0


>
>    static inline void FutexObtain(volatile int *fu)
>    {

>      while (__sync_lock_test_and_set(fu, 1)) {
>        asm("pause"); // backoff
>      }
>    }

You can't do that. On x86, this will act like a write on every spin
which unshares the cache line. If two or more cores spin on the same
spinlock, they will saturate the FSB, preventing the core that has the
lock from making forward progress at anything resembling a normal
rate.

A more typical loop (for a pure-spin lock) would be something more
like:

while(sync_exchange(fu, FU_LOCKED)!=FU_UNLOCKED)
{
    while(sync_get(fu)!=FU_UNLOCKED)
        cpu_relax(); // 'pause' on x86
}
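Written out as compilable GCC C (a sketch: sync_exchange maps onto __sync_lock_test_and_set, sync_get onto a plain volatile read, and cpu_relax onto the x86 pause instruction where available):

```c
#if defined(__i386__) || defined(__x86_64__)
#define cpu_relax() __asm__ __volatile__("pause")
#else
#define cpu_relax() ((void)0)   /* no-op on non-x86 for this sketch */
#endif

#define FU_UNLOCKED 0
#define FU_LOCKED   1

/* Test-and-test-and-set: only the outer atomic op writes the cache
   line; the inner loop spins on a plain read. */
static inline void spin_lock_ttas(volatile int *fu)
{
    while (__sync_lock_test_and_set(fu, FU_LOCKED) != FU_UNLOCKED) {
        while (*fu != FU_UNLOCKED)
            cpu_relax();
    }
}

static inline void spin_unlock_ttas(volatile int *fu)
{
    __sync_lock_release(fu);    /* release barrier + store FU_UNLOCKED */
}
```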

Synchronization primitives are tricky.

DS

Chris M. Thomasson, Aug 3, 2009, 6:19:26 PM
"David Schwartz" <dav...@webmaster.com> wrote in message
news:82df48de-4ef6-4a45...@v15g2000prn.googlegroups.com...

On Jul 30, 4:24 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> > Why not just do something simple like this first:
> >
> > // init lock word to 0
> >
> > static inline void FutexObtain(volatile int *fu)
> > {
> > while (__sync_lock_test_and_set(fu, 1)) {
> > asm("pause"); // backoff
> > }
> > }

> You can't do that.

;^)


> On x86, this will act like a write on every spin
> which unshares the cache line. If two or more cores spin on the same
> spinlock, they will saturate the FSB, preventing the core that has the
> lock from making forward progress at anything resembling a normal
> rate.

I still think it might perform a little bit better than the OP's original
algorithm, because of the following scenario:

http://groups.google.com/group/comp.programming.threads/msg/507af5f5e10e7bb4

It might be interesting to test it.


> A more typical loop (for a pure-spin lock) would be something more
> like:

> while(sync_exchange(fu, FU_LOCKED)!=FU_UNLOCKED)
> {
> while(sync_get(fu)!=FU_UNLOCKED)
> cpu_relax(); // 'pause' on x86
> }

Yes. That's exactly the same as the following code I posted:

http://groups.google.com/group/comp.programming.threads/msg/bc170b1e76dd1c70


> Synchronization primitives are tricky.

Indeed.

Chris M. Thomasson, Aug 7, 2009, 8:10:31 AM

"David Schwartz" <dav...@webmaster.com> wrote in message
news:064b127d-57ea-4717...@p36g2000prn.googlegroups.com...

On Aug 2, 11:04 am, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> > > Typical spinlocks, for example, spin for only a limited amount of
> > > time.
>
> > I need a quick clarification. Are you referring to adaptive mutexs?

> Almost.

;^)


> The terms are not used with 100% precision, but normally an
> "adaptive mutex" refers to a mutex that spins only if the thread that
> holds the lock is known to be running on another core. I was referring
> to a spinlock that unconditionally spins (if the lock cannot be
> immediately acquired), but spins only for a finite amount of time.
> These are sometimes called "spin/sleep" locks.

Thanks for the clarification: the slow path can possibly decay into an
actual call to, perhaps, `Sleep(N)'.


> The problem is that sometimes programmers are simply wrong about what
> synchronization type is correct. This may be because they don't
> understand synchronization or they don't understand their workload.
> But it can also be just because code is used in a way they didn't
> expect or on hardware or schedulers that don't act the way they're
> used to. Because of this, typical synchronization primitives are
> designed to "fail gracefully".

> This spinlock will not fail gracefully. So it's important only to use
> it for problems for which it is suited. (The obvious example -- it
> will fail horribly on UP machines.)


Agreed. Side bar: this is one of the reasons why I love eventcounts. They
allow non-blocking algorithms to "fail gracefully" in the sense of the
ability to _block_ on boundary conditions (e.g.,
empty/full/locked/pending/etc.). Non-blocking algorithms "need" the
ability to actually block! Sounds strange to some people, but it's actually
a very important ability, IMHO at least...


> > > By not dealing with any of these cases, you can get more performance
> > > in appropriate cases. However, this is not very user-friendly -- you
> > > have to know what you're doing to use this, and changing conditions
> > > can potentially bite you.

> > FWIW, I just wanted to point out that all of the above also applies to
> > the
> > two spinlock examples I posted in this thread.

> Most platforms, by default, provide very user-friendly mutexes and
> spinlocks. They will generally not hurt you too badly if you use a
> spinlock where you shouldn't or use a mutex inappropriately. They do
> have to sacrifice some best case performance to do this. For example,
> a spinlock that can sleep has to check if a thread is sleeping when
> the spinlock is released (so it can wake it). This reduces the best
> case performance at least a bit.

IMVVHO, it's perhaps "overkill?" when you have to contort a spinlock like
that; it's like a half-assed mutex...


Humm.

David Schwartz, Aug 7, 2009, 8:35:21 PM
On Aug 7, 5:10 am, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> IMVVHO, it's perhaps "overkill?" when you have to contort a spinlock like
> that; it's like a half-assed mutex...

Yeah. And you're right, you don't *have* to check if another thread is
sleeping, you can just let that other thread unblock itself. I guess I
was thinking of a spinlock that falls back to blocking in the kernel.
You can, however, just fall back to a yield or sleep loop.

It just comes down to how nice you want to be to someone who totally
misuses it. Everyone, sooner or later, will wind up misusing their own
spinlock, so it's worth thinking about.

One big problem with unlimited spins is that it means that your
dynamic priority will drop as you wait for a lock. In a bad case (say
another thread takes a page fault while holding the spinlock), by the
time you get the spinlock, every other thread in the process may have
higher dynamic priority than you. You get pre-empted over and over as
the other threads drop their dynamic priorities. Then the thread that
finally gets the lock does so with a reduced dynamic priority, and the
cycle repeats.

Here's a piece of sage advice: Avoid a pure spinlock if the process is
likely to have other work to do that doesn't require acquiring the
spinlock. It's better to de-schedule contending threads if there are
enough non-contending threads. (Even if they don't spin, they will
ping-pong data across the FSB, slowing the other cores in the system
too.)

Synchronization, good. Contention, bad.

DS
