
Lockfree or not to lockfree, that's the question!


Amine

Jan 2, 2008, 1:53:00 AM
 
Hello all,
 
If you look carefully at those lockfree algorithms, you will find that
they exhibit a pattern: they are spinning around, like:
 
===========================================
repeat

Until CAS(dword(TData64(self.data1^).next), dword(newNode.Next), dword(newNode));

or in another one we have

repeat

until newNodeWasAdded;

=============================================
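
In C, the same retry pattern might look roughly like this - a minimal
sketch using GCC's __sync builtins, where the stack/node names are only
illustrative, not taken from any particular algorithm:

_____________________________________________________________
/* minimal sketch of the CAS retry pattern; illustrative only */
typedef struct node { struct node* next; } node;
typedef struct stack { node* head; } stack;

void stack_push(stack* s, node* n) {
    node* cmp;
    do {
        cmp = s->head;   /* read the current head */
        n->next = cmp;   /* link the new node to it */
        /* retry if another thread changed the head in between */
    } while (! __sync_bool_compare_and_swap(&s->head, cmp, n));
}
_____________________________________________________________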

So, if we have a lot of threads, we will find ourselves wasting

a lot of resources - the bandwidth of the bus and the CPU -

hence, can we say that they don't scale well?

 

On one side we have the advantage that they avoid deadlock and priority inversion,

and on the other side, if we have a lot of threads contending, we will waste resources.

Does STM exhibit the same pattern?

 

 

Amine

Jan 2, 2008, 2:12:13 AM
 
 
The mutex does not exhibit this pattern, because the threads are blocked waiting (not spinning around).
 
 

Dmitriy Vyukov

Jan 2, 2008, 5:38:56 AM


First of all, you must have a lot of *processors*; a lot of non-running
threads is not a problem here.

Second, why do you think that the CAS will be executed many times?

Third, a CAS can fail only if another CAS succeeds - contrary to HTM,
where you can get livelock.


Dmitriy V'jukov

Joe Seigh

Jan 2, 2008, 6:15:31 AM
Amine wrote:
>
>
> Hello all,
>
> If you look carefully at those lockfree algorithms, you will find that
> they exhibit a pattern: they are spinning around, like:
>
> ===========================================
> repeat
>
> Until CAS(dword(TData64(self.data1^).next), dword(newNode.Next),
> dword(newNode));
>
> or in another one we have
>
> repeat
>
> until newNodeWasAdded;
>
> =============================================
>
> So, if we have a lot of threads, we will find ourselves wasting
>
> a lot of resources - the bandwidth of the bus and the CPU -
>
> hence, can we say that they don't scale well?

Research in that area is based on the assumption that they scale
better than lock-based solutions. Current usage is based on
lock-free solutions scaling better than lock-based solutions.

>
>
>
> we have in one side the advantage: that they avoid deadlock and priority
> inversion
>
> and in the other side, if we have a lot of threads contending, we will
> waste ressources.
>
> Does STM exhibit the same pattern?

Some STM solutions are only obstruction-free*, meaning that they will
perform well as long as contention isn't too high. So I suppose that
means they may not scale very well.

* I did an STM solution that is lock-free, so obviously STM doesn't have
to be obstruction-free.


--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.

Dmitriy Vyukov

Jan 2, 2008, 7:58:52 AM
On 2 Jan, 14:15, Joe Seigh <jseigh...@xemaps.com> wrote:


> > Does STM exhibit the same pattern?
>
> Some STM solutions are only obstruction-free*, meaning that they will
> perform well as long as contention isn't too high. So I suppose that
> means they may not scale very well.
>
> * I did an STM solution that is lock-free, so obviously STM doesn't have
> to be obstruction-free.


Stop. Obstruction-free or lock-free can only describe the
*implementation* of STM. But the optimistic-locking nature of STM
requires that transactions *can* abort and retry, no matter whether the
implementation is lock-free or even wait-free.

Assume transaction1 reads variable X1 and then writes variable X2. And
transaction2 reads variable X2 and then writes variable X1. With bad
luck you can get live-lock here with any STM implementation...
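
In C-like code, with a hypothetical stm_begin/stm_read/stm_write/stm_commit
interface (declared here only to make the conflict concrete):

_____________________________________________________________
/* hypothetical STM interface; illustration only */
typedef struct stm_tx stm_tx;
stm_tx* stm_begin(void);
int     stm_read(stm_tx* tx, int* addr);
void    stm_write(stm_tx* tx, int* addr, int val);
int     stm_commit(stm_tx* tx);   /* returns 0 on conflict */

int X1, X2;

void transaction1(void) {
    for (;;) {                        /* abort-and-retry loop */
        stm_tx* tx = stm_begin();
        int t = stm_read(tx, &X1);
        stm_write(tx, &X2, t + 1);
        if (stm_commit(tx)) break;    /* conflict -> retry */
    }
}

void transaction2(void) {             /* mirror image: reads X2, writes X1 */
    for (;;) {
        stm_tx* tx = stm_begin();
        int t = stm_read(tx, &X2);
        stm_write(tx, &X1, t + 1);
        if (stm_commit(tx)) break;
    }
}
/* each successful commit invalidates the other transaction's read set,
   so with bad timing the two loops keep aborting each other: live-lock */
_____________________________________________________________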


Dmitriy V'jukov

Joe Seigh

Jan 2, 2008, 7:50:11 PM

Lock-free is defined as: at least one thread makes forward progress; so
by definition not all threads can be "live-locked".

Dmitriy Vyukov

Jan 3, 2008, 3:47:32 AM
On 3 Jan, 03:50, Joe Seigh <jseigh...@xemaps.com> wrote:

> >> * I did an STM solution that is lock-free, so obviously STM doesn't have
> >> to be obstruction-free.
>
> > Stop. Obstruction-free or lock-free can only describe the
> > *implementation* of STM. But the optimistic-locking nature of STM
> > requires that transactions *can* abort and retry, no matter whether the
> > implementation is lock-free or even wait-free.
>
> > Assume transaction1 reads variable X1 and then writes variable X2, and
> > transaction2 reads variable X2 and then writes variable X1. With bad
> > luck you can get live-lock here with any STM implementation...
>
> Lock-free is defined as: at least one thread makes forward progress; so
> by definition not all threads can be "live-locked".


But how can you guarantee that some thread makes forward progress in
the example above?
IMVHO, with lock-free STM you can't achieve this. You can achieve it
only with lock-based STM.


Dmitriy V'jukov

Chris Thomasson

Jan 3, 2008, 4:12:48 AM
> "Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
> news:1565c6c5-e9b6-42ab...@r60g2000hsc.googlegroups.com...

Yeah, you can get live-lock in that scenario. This kind of reminds me of the
three-variable example here:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/b21b212dd175baab

Sometimes, I can't see how these obstruction-free algorithms could ever work
if they were hit with periods of "bad luck"... There is no guarantee, except
in the transaction manager, which can serialize everything, I suppose...

Joe Seigh

Jan 3, 2008, 5:57:42 AM
If some thread fails to make forward progress, it's because some other
thread did make forward progress. At least in my implementation it does.
Herlihy coined obstruction-free because in Sun's kCAS algorithm, if I understand
it correctly, all the compare locations, even the ones being read, are modify
operations, with threads backing out or overwriting operations by other threads,
and committing when all the operations are overwritten. Very prone to live-lock.
Plus, having reads be modifies just increases contention.

Chris Thomasson

Jan 3, 2008, 6:38:09 AM
> "Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
> news:511f16cc-9d26-4363...@e6g2000prf.googlegroups.com...

On 3 Jan, 03:50, Joe Seigh <jseigh...@xemaps.com> wrote:

> > >> * I did an STM solution that is lock-free, so obviously STM doesn't
> > >> have
> > >> to be obstruction-free.
> >
> > > Stop. Obstruction-free or lock-free can only describe the
> > > *implementation* of STM. But the optimistic-locking nature of STM
> > > requires that transactions *can* abort and retry, no matter whether
> > > the implementation is lock-free or even wait-free.
> >
> > > Assume transaction1 reads variable X1 and then writes variable X2, and
> > > transaction2 reads variable X2 and then writes variable X1. With bad
> > > luck you can get live-lock here with any STM implementation...
> >
> > Lock-free is defined as: at least one thread makes forward progress; so
> > by definition not all threads can be "live-locked".


> But how can you guarantee that some thread makes forward progress in
> the example above?

> IMVHO, with lock-free STM you can't achieve this. You can achieve it
> only with lock-based STM.

FWIW, here is some info on Joe's implementation:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/c50b14850c0fa0e5

David Schwartz

Jan 3, 2008, 7:33:51 PM
On Jan 1, 10:53 pm, "Amine" <ami...@colba.net> wrote:

> If you look carefully at those lockfree algorithms, you will find that
> they exhibit a pattern: they are spinning around, like:
>
> ===========================================
> repeat
> Until CAS(dword(TData64(self.data1^).next), dword(newNode.Next), dword(newNode));
>
> or in another one we have
>
> repeat
>
> until newNodeWasAdded;
>
> =============================================
>
> So, if we have a lot of threads, we will find ourselves wasting
>
> a lot of resources - the bandwidth of the bus and the CPU -
>
> hence, can we say that they don't scale well?

No. The only way this can spin is if something changed during the CAS
operation. So if a thread spins once, it can only be because some other
thread has made forward progress.

So if you have, say, 8 CPUs: worst case, you will on average spin 7
times. The worst case is all 8 CPUs trying to CAS the same location at
the same time, with 7 failing.

The problem with lockfree lies elsewhere. Lockfree designs maximize
contention while locks minimize contention. Contention is bad.

For example, suppose I have two CPUs and three threads that want to
run. Threads A and B access the same object a lot and thread C is
doing something else entirely.

If a lock protects the object, either thread A or thread B will find
itself blocked. This will cause threads A and B to alternate running
concurrently with thread C, and there will be very little contention.

If the object is handled in a lock free way, one-third of the time, on
average, threads A and B will be running at the same time. They will
contend hideously and the CPU will run at FSB speed rather than core
speed. Even if we have other CPUs, they will perform badly due to FSB
saturation from threads A and B.

Contention is bad. Locks deschedule contending threads so that non-
contending threads tend to run concurrently. Lock-free algorithms
allow contending threads to continue to contend expensively.

That doesn't mean they don't have their place. For example, in low-
level algorithms such as memory management, where you can expect the
majority of threads to find themselves rather frequently, and where
there may not be any non-contending threads, lock-free can be a win.

But it is frequently used because it is assumed to be more efficient
even in cases where it is truly sub-optimal. The justification is the
improved performance on unrealistic micro-benchmarks that don't
measure the cost to the rest of the system.

For example, in an 8 CPU system with two threads contending, an
algorithm that benchmarks slower may be better for the system as a
whole because it causes less load on the FSB. Typical micro-benchmarks
won't show this, leading you to think the algorithm is superior.

DS

Amine

Jan 3, 2008, 11:31:16 PM

"David Schwartz" <dav...@webmaster.com> wrote in message
news:d2c2967b-c215-42a3...@h11g2000prf.googlegroups.com...

> On Jan 1, 10:53 pm, "Amine" <ami...@colba.net> wrote:
>
> > If you look carefully at those lockfree algorithms, you will find that
> > they exhibit a pattern: they are spinning around, like:
> >
> > ===========================================
> > repeat
> > Until CAS(dword(TData64(self.data1^).next), dword(newNode.Next),
> > dword(newNode));
> >
> > or in another one we have
> >
> > repeat
> >
> > until newNodeWasAdded;
> >
> > =============================================
> >
> > So, if we have a lot of threads, we will find ourselves wasting
> >
> > a lot of resources - the bandwidth of the bus and the CPU -
> >
> > hence, can we say that they don't scale well?
>
> No. The only way this can spin is if something changed during the CAS
> operation. So if a thread spins once, it can only be because some other
> thread has made forward progress.
>
> So if you have, say, 8 CPUs: worst case, you will on average spin 7
> times. The worst case is all 8 CPUs trying to CAS the same location at
> the same time, with 7 failing.

You are right.

>
> The problem with lockfree lies elsewhere. Lockfree designs maximize
> contention while locks minimize contention. Contention is bad.
>
> For example, suppose I have two CPUs and three threads that want to
> run. Threads A and B access the same object a lot and thread C is
> doing something else entirely.
>
> If a lock protects the object, either thread A or thread B will find
> itself blocked. This will cause threads A and B to alternate running
> concurrently with thread C, and there will be very little contention.
>
> If the object is handled in a lock free way, one-third of the time, on
> average, threads A and B will be running at the same time. They will
> contend hideously and the CPU will run at FSB speed rather than core
> speed.
> Even if we have other CPUs, they will perform badly due to FSB
> saturation from threads A and B.
>
> Contention is bad. Locks deschedule contending threads so that non-
> contending threads tend to run concurrently. Lock-free algorithms
> allow contending threads to continue to contend expensively.


If the 'object' is a lockfree queue that is accessed a lot by producers
and consumers, then since the threads will not be descheduled, even with
100 CPUs (multicore) the FSB will become the bottleneck.
So, what I said before about the bus is correct: the bus will be
saturated. Hence the lockfree object will not scale well.


> That doesn't mean they don't have their place. For example, in low-
> level algorithms such as memory management, where you can expect the
> majority of threads to find themselves rather frequently, and where
> there may not be any non-contending threads, lock-free can be a win.
>
> But it is frequently used because it is assumed to be more efficient
> even in cases where it is truly sub-optimal. The justification is the
> improved performance on unrealistic micro-benchmarks that don't
> measure the cost to the rest of the system.
>
> For example, in an 8 CPU system with two threads contending, an
> algorithm that benchmarks slower may be better for the system as a
> whole because it causes less load on the FSB. Typical micro-benchmarks
> won't show this, leading you to think the algorithm is superior.

You are absolutely right.

>
> DS


Chris Thomasson

Jan 3, 2008, 11:51:46 PM

Chris Thomasson

Jan 3, 2008, 11:53:41 PM

Chris Thomasson

Jan 3, 2008, 11:55:31 PM
STM exhibits _all_ the live-lock characteristics you will ever need to worry
about...

Joe Seigh

Jan 4, 2008, 6:14:01 AM
Amine wrote:
> "David Schwartz" <dav...@webmaster.com> wrote in message
> news:d2c2967b-c215-42a3...@h11g2000prf.googlegroups.com...
>> On Jan 1, 10:53 pm, "Amine" <ami...@colba.net> wrote:

>> Contention is bad. Locks deschedule contending threads so that non-
>> contending threads tend to run concurrently. Lock-free algorithms
>> allow contending threads to continue to contend expensively.
>
>
> If the 'object' is a lockfree queue that is accessed a lot by producers
> and consumers, then since the threads will not be descheduled, even with
> 100 CPUs (multicore) the FSB will become the bottleneck.
> So, what I said before about the bus is correct: the bus will be
> saturated. Hence the lockfree object will not scale well.
>

[citation needed] Or are you making all of this up? All the lock-free papers
I've seen that specifically measure scalability show lock-free scaling better
than lock-based solutions. And there's the fact that Linus wouldn't allow
performance improvements like RCU into the Linux kernel unless they could show
provable and significant performance improvements. That goes pretty much for
other kernels as well.

Dave Butenhof

Jan 4, 2008, 9:24:52 AM
David Schwartz wrote:
> On Jan 1, 10:53 pm, "Amine" <ami...@colba.net> wrote:
>
>> If you look carefully at those lockfree algorithms, you will find that
>> they exhibit a pattern: they are spinning around, like:
>>
>> ===========================================
>> repeat
>> Until CAS(dword(TData64(self.data1^).next), dword(newNode.Next), dword(newNode));
>>
>> or in another one we have
>>
>> repeat
>>
>> until newNodeWasAdded;
>>
>> =============================================
>>
>> So, if we have a lot of threads, we will find ourselves wasting
>>
>> a lot of resources - the bandwidth of the bus and the CPU -
>>
>> hence, can we say that they don't scale well?
>
> No. The only way this can spin is if something changed during the CAS
> operation. So if a thread spins once, it can only be because some other
> thread has made forward progress.
>
> So if you have, say, 8 CPUs: worst case, you will on average spin 7
> times. The worst case is all 8 CPUs trying to CAS the same location at
> the same time, with 7 failing.
>
> The problem with lockfree lies elsewhere. Lockfree designs maximize
> contention while locks minimize contention. Contention is bad.

Contention is inevitable. "Good" or "bad" doesn't apply; we're talking
simple pragmatic performance analysis here, not moral decisions.

The way code responds (or should respond) to contention depends on the
nature of the contention, the length of contention, the average
frequency of contention, and the number of entities (threads or
processes) doing the contending.

Anyone who says "lock-free is bad" or "blocking locks are bad" is
oversimplifying to the point of deliberate misstatement.

We've been all through this numerous times before, our (mostly/often)
late and unlamented SenderX being an example of particular note, and I'm
kinda tired of the whole thing.

Yes, locks are better. Yes, lock-free is better. Do you walk to work or
take your lunch?

Nobody is well served by this over-simplification, or by the mad
religious wars that follow. Instead, let's stick with hard data for
SPECIFIC instances where one, the other, or in fact a combination, most
reliably provides the best performance. Chris has graduated to a
(mostly) objective and balanced outlook here, and has produced a lot of
good work and information; let's all try to emulate that evolution.

Amine

Jan 4, 2008, 12:41:30 PM
I wrote:
> If the 'object' is a lockfree queue that is accessed a lot by producers
> and consumers, then since the threads will not be descheduled, even with
> 100 CPUs (multicore) the FSB will become the bottleneck.
> So, what I said before about the bus is correct: the bus will be
> saturated. Hence the lockfree object will not scale well.


May i add another think ?

Am i corect to say that the mutex does schedule the threads to be executed
in something like a FIFO manner ( and this is good for realtime
applications).
Lockfree don't.

"David Schwartz" <dav...@webmaster.com> wrote in message
news:d2c2967b-c215-42a3...@h11g2000prf.googlegroups.com...

Amine

Jan 4, 2008, 1:28:10 PM

typo

I wrote:
> If the 'object' is a lockfree queue that is accessed a lot by producers
> and consumers, since the threads will not be descheduled even with
> 100 CPUs (multicore) the FSB will become the bottleneck.
> So, what i have said before about the bus is correct, the bus will be
> saturated. Hence the lockfree object will not scale well.


May i add another thing ?

Am I correct to say that the mutex does schedule the threads to be executed
in something like a FIFO manner, or priority scheduling, or soft realtime
scheduling, etc.
(and this is good for realtime applications)?

Lockfree doesn't.

Chris Thomasson

Jan 4, 2008, 2:23:44 PM
"Amine" <ami...@colba.net> wrote in message
news:13nsr5g...@corp.supernews.com...

What's the platform?

>
> Lockfree doesn't.

http://groups.google.com/group/comp.arch/msg/5dba6fee05aa07ea

http://groups.google.com/group/comp.sys.super/msg/642b6b608294e426
(I briefly mention lock-free wrt hard real-time near the end of the post)


I would personally go with loop-less (e.g., wait-free) algorithms; _not_
lock-free. Wait-free techniques can help you meet the deadlines demanded by
a hard real-time system...

Amine

Jan 4, 2008, 4:00:55 PM
Chris Thomasson wrote:
> http://groups.google.com/group/comp.arch/msg/5dba6fee05aa07ea
>
> http://groups.google.com/group/comp.sys.super/msg/642b6b608294e426
> (I briefly mention lock-free wrt hard real-time near the end of the post)
> I would personally go with loop-less (e.g., wait-free) algorithms; _not_
> lock-free. Wait-free techniques can help you meet the deadlines demanded
> by a hard real-time system...

I am curious: can you show me some loop-less wait-free 'implementations' on
the Internet?


"Chris Thomasson" <cri...@comcast.net> wrote in message

news:0d-dnUEfFv4jGOPa...@comcast.com...

Chris Thomasson

Jan 4, 2008, 8:30:26 PM
"Amine" <ami...@colba.net> wrote in message
news:13nt46a...@corp.supernews.com...

> Chris Thomasson wrote:
>> http://groups.google.com/group/comp.arch/msg/5dba6fee05aa07ea
>>
>> http://groups.google.com/group/comp.sys.super/msg/642b6b608294e426
>> (I briefly mention lock-free wrt hard real-time near the end of the post)
>> I would personally go with loop-less (e.g., wait-free) algorithms; _not_
>> lock-free. Wait-free techniques can help you meet the deadlines demanded
>> by a hard real-time system...
>
> I am curious: can you show me some loop-less wait-free 'implementations'
> on the Internet?
[...]

- Here are two single-producer/consumer FIFOs (one by Haron, and one by
me):

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/0844ba4f70387e80

- Here is another VERY NEAT trick by Joe Seigh:

http://groups.google.com/group/comp.lang.asm.x86/msg/8851fd98e7f056da

- Here is a multiple-producer/consumer LIFO/FIFO hybrid with a wait-free pop
function:

http://groups.google.com/group/comp.programming.threads/msg/a359bfb41c68b98b

- Here is a read-write mutex that has wait-free read-access on the fast-path
(e.g., uncontended case):

http://appcore.home.comcast.net/misc/winsync_v0001_rwmutex_hpp.html

- Here are two read-write spinlocks that have wait-free read/write-access on
the fast-path:

http://groups.google.com/group/comp.programming.threads/msg/9f3e76e60c2cf91a
(Joe's excellent algorithm...)

http://appcore.home.comcast.net/~appcore/appcore/include/ac_rwspinlock_algo1_h.html

- Here is my eventcount which has wait-free signal on the fast-path:

http://appcore.home.comcast.net/misc/winsync_v0001_evcount_hpp.html

- Here are some papers on wait-free:

http://portal.acm.org/citation.cfm?doid=114005.102808

http://portal.acm.org/citation.cfm?id=847109

http://citeseer.ist.psu.edu/attiya91are.html

Conclusion: Wait-free implementations on the Internet are few and far
between!
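
To give a flavor of what "loop-less" means, here is a minimal
single-producer/single-consumer ring buffer sketch of my own - an
illustration, not any of the implementations linked above. It assumes
exactly one producer thread, one consumer thread, and x86-style store
ordering; a portable version needs explicit memory barriers:

_____________________________________________________________
#define RING_SIZE 1024   /* must be a power of two */

typedef struct {
    volatile unsigned head;   /* written only by the consumer */
    volatile unsigned tail;   /* written only by the producer */
    void* slot[RING_SIZE];
} spsc_ring;

/* wait-free: a bounded number of instructions, no retry loop */
int spsc_push(spsc_ring* r, void* item) {
    unsigned t = r->tail;
    if (t - r->head == RING_SIZE) return 0;   /* full */
    r->slot[t & (RING_SIZE - 1)] = item;
    r->tail = t + 1;          /* publish the slot to the consumer */
    return 1;
}

int spsc_pop(spsc_ring* r, void** item) {
    unsigned h = r->head;
    if (h == r->tail) return 0;               /* empty */
    *item = r->slot[h & (RING_SIZE - 1)];
    r->head = h + 1;          /* release the slot to the producer */
    return 1;
}
_____________________________________________________________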

Chris Thomasson

Jan 4, 2008, 8:55:43 PM
"Amine" <ami...@colba.net> wrote in message
news:13nt46a...@corp.supernews.com...

> Chris Thomasson wrote:
>> http://groups.google.com/group/comp.arch/msg/5dba6fee05aa07ea
>>
>> http://groups.google.com/group/comp.sys.super/msg/642b6b608294e426
>> (I briefly mention lock-free wrt hard real-time near the end of the post)
>> I would personally go with loop-less (e.g., wait-free) algorithms; _not_
>> lock-free. Wait-free techniques can help you meet the deadlines demanded
>> by a hard real-time system...
>
> I am curious: can you show me some loop-less wait-free 'implementations'
> on the Internet?
[...]

I almost forgot my allocator algorithm that has wait-free allocations, and
wait-free _local_ deallocations:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/8245f4b48591fc69


David Schwartz

Jan 5, 2008, 3:10:37 AM
On Jan 4, 9:41 am, "Amine" <ami...@colba.net> wrote:

> Am i corect to say that the mutex does schedule the threads to be executed
> in something like a FIFO manner ( and this is good for realtime
> applications).

Some may, but IMO the ideal implementation would not. LIFO tends to be
far more efficient. Whichever thread most recently blocked is most
likely to have its data warmest in the cache. The one that has been
blocking the longest would likely run the slowest.

The scheduler should schedule the thread that can run the best.
Fairness is *not* a POSIX requirement, and IMO, performance should
almost always win out over fairness. (See my other post on why
'technical' notions of fairness to threads may not make applications
any fairer and definitely hurt performance.)

> Lockfree don't.

Lockfree algorithms tend to be less predictable than locks. I don't
see this as a particularly serious disadvantage, though I suppose it
could be in some cases. My understanding is that most modern CPUs do
try to make things like CAS fair, such that a single CPU won't "always
win".

They have to do this because even locks, when you get down to the
low-level implementation, are generally actually lock-free. Primitives
like CAS are all the cores give you, and it's hard to build even a sane
operating system if one CPU always wins the locks.
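
As a concrete illustration, here is what a lock's innermost fast-path can
look like when built from nothing but CAS - a bare test-and-set spinlock
sketch using GCC's __sync builtins; real mutexes layer kernel wait queues
on top of something like this:

_____________________________________________________________
/* bare spinlock built from CAS alone; illustration only */
typedef struct { volatile int locked; } spinlock;

void spin_lock(spinlock* l) {
    while (! __sync_bool_compare_and_swap(&l->locked, 0, 1))
        ;                             /* spin until our CAS wins */
}

void spin_unlock(spinlock* l) {
    __sync_lock_release(&l->locked);  /* store 0 with release semantics */
}
_____________________________________________________________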

DS

David Schwartz

Jan 5, 2008, 3:14:14 AM
On Jan 4, 6:24 am, Dave Butenhof <david.buten...@hp.com> wrote:

> Yes, locks are better. Yes, lock-free is better. Do you walk to work or
> take your lunch?

I agree.

> Instead, let's stick with hard data for
> SPECIFIC instances where one, the other, or in fact a combination, most
> reliably provides the best performance.

This has proven to be almost impossible to do as it seems that nobody
knows how to benchmark these kinds of things. What is a "typical
usage" and what is a "typical system load" to test a low-level
structure under?

How should you reconcile the performance of the things directly tested
against the cost to other things running on the system (for example,
if the threads using the structure run faster but other threads slow
down)?

How do you test under various combinations of FSB speed to core speed?
Can you compare results on one architecture to another?

DS

Amine

Jan 5, 2008, 3:51:39 PM

Thank you.

"Chris Thomasson" <cri...@comcast.net> wrote in message

news:qsOdnRjOQIAyRuPa...@comcast.com...

Dave Butenhof

Jan 5, 2008, 4:48:11 PM
to David Schwartz
David Schwartz wrote:
> On Jan 4, 6:24 am, Dave Butenhof <david.buten...@hp.com> wrote:
>
>> Yes, locks are better. Yes, lock-free is better. Do you walk to work or
>> take your lunch?
>
> I agree.
>
>> Instead, let's stick with hard data for
>> SPECIFIC instances where one, the other, or in fact a combination, most
>> reliably provides the best performance.
>
> This has proven to be almost impossible to do as it seems that nobody
> knows how to benchmark these kinds of things. What is a "typical
> usage" and what is a "typical system load" to test a low-level
> structure under?

Indeed; so then can we stick to concrete and objective advice for
archetypical workloads and tasks and avoid the religious
overgeneralizations entirely? ;-)

Dave Butenhof

Jan 5, 2008, 4:51:21 PM
David Schwartz wrote:
> On Jan 4, 9:41 am, "Amine" <ami...@colba.net> wrote:
>
>> Am i corect to say that the mutex does schedule the threads to be executed
>> in something like a FIFO manner ( and this is good for realtime
>> applications).
>
> Some may, but IMO the ideal implementation would not. LIFO tends to be
> far more efficient. Whichever thread most recently blocked is most
> likely to have its data warmest in the cache. The one that has been
> blocking the longest would likely run the slowest.
>
> The scheduler should schedule the thread that can run the best.
> Fairness is *not* a POSIX requirement, and IMO, performance should
> almost always win out over fairness. (See my other post on why
> 'technical' notions of fairness to threads may not make applications
> any fairer and definitely hurt performance.)

To take this a step further...

Fairness always depends on the situation -- there's no possible way to
objectively and universally define "fair". But one CAN objectively
define (and implement) "fastest" (well, OK, "fast") in specific
scheduling situations.

One can BUILD one's own synchronization layer on top of these primitives
that schedules application threads in a way that meets the module's
needs (whether you call that "fair", "throughput", "fast", or whatever),
and that layer will always work best when the primitives were designed
to be FAST rather than for some entirely different definition of "fair".

David Schwartz

Jan 5, 2008, 10:34:57 PM
On Jan 5, 1:51 pm, Dave Butenhof <david.buten...@hp.com> wrote:

> One can BUILD one's own synchronization layer on top of these primitives
> that schedules application threads in a way that meets the module's
> needs (whether you call that "fair", "throughput", "fast", or whatever),
> and that layer will always work best when the primitives were designed
> to be FAST rather than for some entirely different definition of "fair".

That's precisely what I was trying to say.

Some "fairness" in the primitives helps. You probably wouldn't like a
platform that always ran the lowest-numbered ready-to-run thread, even
though it would likely be a bit faster than one that was more fair.
But in the vast majority of cases, the primitives should definitely
prefer speed over "fairness". The primitives have no idea what's
"fair" to the application. The application almost definitely wants
fast though.

DS

David Schwartz

Jan 5, 2008, 10:37:09 PM
On Jan 5, 1:48 pm, Dave Butenhof <david.buten...@hp.com> wrote:

> Indeed; so then can we stick to concrete and objective advice for
> archetypical workloads and tasks and avoid the religious
> overgeneralizations entirely? ;-)

I think I've avoided religious overgeneralizations. I think it's fair
to say "X is bad because Y" when Y is a serious downside to X, even if
that doesn't make X objectively always bad.

If it's not those comments you are talking about, I'm not sure what
is. I simply presented a list of lock-free algorithms' downsides. My
post didn't really contain any advice. Its sole purpose was to make
sure people understood the downsides of lock-free algorithms.
Specifically, that contention is expensive and lock-free algorithms
don't schedule uncontending threads the way locks do. (Which is very
helpful if there are uncontending threads. If not, not so much.)

DS

Chris Thomasson

Jan 6, 2008, 12:00:49 AM
"David Schwartz" <dav...@webmaster.com> wrote in message
news:f2636424-ee26-435b...@l32g2000hse.googlegroups.com...

> On Jan 5, 1:48 pm, Dave Butenhof <david.buten...@hp.com> wrote:
>
>> Indeed; so then can we stick to concrete and objective advice for
>> archetypical workloads and tasks and avoid the religious
>> overgeneralizations entirely? ;-)
>
> I think I've avoided religious overgeneralizations. I think it's fair
> to say "X is bad because Y" when Y is a serious downside to X, even if
> that doesn't make X objectively always bad.
>
> If it's not those comments you are talking about, I'm not sure what
> is. I simply presented a list of lock-free algorithms' downsides. My
> post didn't really contain any advice. Its sole purpose was to make
> sure people understood the downsides of lock-free algorithms.
[...]

Explain in detail how spinning on an adaptive mutex's internal state
_fundamentally_ differs from spinning on a CAS failure wrt lock-free
algorithms in general? IMVHO, it's good to keep in mind that more than one
mutex implementation decides to spin a little before deferring to the
operating system... And so be it, as locks are essential to many lock-free
algorithms indeed. Think of the writer side of a low-overhead reader
pattern...

A little spin on the mutex is fine, as it can increase performance somewhat
by turning things into "temporary" spinlock-and-backoff semantics before
the "ultimate" slow-path is hit, thus delegating to the kernel waitset that
happens to be assigned to said mutex synchronization logic...
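
A minimal sketch of that adaptive idea, assuming POSIX threads (the spin
count is an arbitrary tuning knob, not a magic value):

_____________________________________________________________
#include <pthread.h>

/* spin on trylock a bounded number of times, then block */
void adaptive_lock(pthread_mutex_t* mtx, int spincount) {
    while (spincount-- > 0) {
        if (pthread_mutex_trylock(mtx) == 0) {
            return;               /* got it on the fast-path */
        }
    }
    pthread_mutex_lock(mtx);      /* slow-path: block in the kernel */
}
_____________________________________________________________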

Chris Thomasson

Jan 6, 2008, 12:11:54 AM
"Chris Thomasson" <cri...@comcast.net> wrote in message
news:KvqdnV4xELvnwx3a...@comcast.com...

Here is a "hint": The CAS wrt lock-free is spinning an _actual_application
_data_, while the CAS wrt mutex is spinning on its proxy private/internal
state... Sometimes you only need to atomically modify a word or two...

Basic fundamental difference; no?
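
Concretely, here is what "a word or two" can look like - a sketch using
GCC's __sync builtins, with an illustrative counter: the CAS operates
directly on the application's own word, with no separate lock word in
sight:

_____________________________________________________________
/* lock-free increment of an application counter; no lock word */
unsigned long counter;

void counter_inc(void) {
    unsigned long cmp;
    do {
        cmp = counter;
    } while (! __sync_bool_compare_and_swap(&counter, cmp, cmp + 1));
}
_____________________________________________________________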

David Schwartz

Jan 6, 2008, 1:11:09 AM
On Jan 5, 9:00 pm, "Chris Thomasson" <cris...@comcast.net> wrote:

> Explain in detail how spinning on an adaptive mutex's internal state
> _fundamentally_ differs from spinning on a CAS failure wrt lock-free
> algorithms in general?

The difference is that spinning on an adaptive mutex's internal state
is limited and if it fails, the result will be that the thread will be
descheduled (rather than eating up its whole time slice spinning). In
fact, I honestly don't see any similarity at all between these two
things, they're completely different.

In essence, lock free algorithms do not allow contending threads to be
descheduled. They allow contending threads to instead run a bit
slower. This is a win if the cost of scheduling is high and a loss if
the cost of the contention is high.

Sure, under a microscope an adaptive mutex may look a lot like a lock-
free queue. But performance disappears when you look at code under a
microscope.

> IMVHO, it's good to keep in mind that more than one
> mutex implementation decides to spin a little before deferring to the
> operating system... And so be it, as locks are essential to many lock-free
> algorithms indeed. Think of the writer side of a low-overhead reader
> pattern...

> A little spin on the mutex is fine, as it can increase performance somewhat
> by turning things into "temporary" spinlock-and-backoff semantics before
> the "ultimate" slow-path is hit, thus delegating to the kernel waitset that
> happens to be assigned to said mutex synchronization logic...

I agree. But even this can be a pessimization if there's another
thread ready-to-run that wouldn't contend. For example, consider two
threads, A and B, that are operating on the same data, protected by an
adaptive mutex. Assume neither thread holds the mutex for very long.
In this case, an adaptive mutex will allow A and B to run concurrently
for as long as their timeslices hold up. This will mean a lot of ping-
ponging on the FSB.

This is a win if A and B are the only threads on the system. This is a
win if the FSB ping-ponging isn't very significant. This is a huge
loss, though, if it means thread C is not scheduled, when thread C
wouldn't have touched this particular mutex at all.

As I said, lock-free algorithms tend to work well when there aren't
other threads that won't contend. They also work well when they are
ping-pong free, as for example reader lock optimizations can be. They
tend to be really good in low-level code, like memory management,
because most threads wind up in memory management. Obviously, they can
be great in OS code, since most threads wind up in OS code.

DS

David Schwartz

Jan 6, 2008, 1:11:37 AM
On Jan 5, 9:11 pm, "Chris Thomasson" <cris...@comcast.net> wrote:

> Here is a "hint": the CAS wrt lock-free is spinning on _actual_
> _application_ _data_, while the CAS wrt a mutex is spinning on its
> private/internal proxy state... Sometimes you only need to atomically
> modify a word or two...
>
> Basic fundamental difference; no?

The difference is that one will deschedule threads that contend too
much and one never will.

DS


Dave Butenhof

Jan 6, 2008, 1:38:19 AM
David Schwartz wrote:
> On Jan 5, 1:51 pm, Dave Butenhof <david.buten...@hp.com> wrote:
>
>> One can BUILD one's own synchronization layer on top of these primitives
>> that schedules application threads in a way that meets the module's
>> needs (whether you call that "fair", "throughput", "fast", or whatever),
>> and that layer will always work best when the primitives were designed
>> to be FAST rather than for some entirely different definition of "fair".
>
> That's precisely what I was trying to say.

Sorry; I was emphasizing, not arguing. ;-)

> Some "fairness" in the primitives helps. You probably wouldn't like a
> platform that always ran the lowest-numbered ready-to-run thread, even
> though it would likely be a bit faster than one that was more fair.
> But in the vast majority of cases, the primitives should definitely
> prefer speed over "fairness". The primitives have no idea what's
> "fair" to the application. The application almost definitely wants
> fast though.

That's perfectly fair to the lowest-numbered thread, though. (OK, using
a 2-year-old's definition of "fair", but then that's the whole point...
any definition is subjective, to varying degrees. ;-) )

But, yes; there's a continuum from "fast" to "fair", for some reasonable
definition of fast and fair; and where "fair" implies passing over the
most immediately obvious candidate for another, it's going to be less fast.

David Schwartz

Jan 6, 2008, 3:31:59 AM
On Jan 5, 10:38 pm, Dave Butenhof <david.buten...@hp.com> wrote:
> David Schwartz wrote:

> > On Jan 5, 1:51 pm, Dave Butenhof <david.buten...@hp.com> wrote:

> > That's precisely what I was trying to say.

> Sorry; I was emphasizing, not arguing. ;-)

I interpreted as such, I was agreeing.

> But, yes; there's a continuum from "fast" to "fair", for some reasonable
> definition of fast and fair; and where "fair" implies passing over the
> most immediately obvious candidate for another, it's going to be less
> fast.

The quintessential example is the choice of the system's timeslice.
The larger the timeslice, the faster. The smaller the timeslice, the
fairer. Within reason.

The usual way to make this tradeoff is as follows:

1) Calculate the smallest timeslice that will not significantly hurt
performance.

2) If it's smaller than any rational fairness benefit imaginable (for
the expected workload, desktop or server, and so on), raise it.

Once this is done, the other trade-offs fall in line. You want to keep
a thread running for its full timeslice if you can. Then you know you
can accept the cost of a context switch in the name of fairness
because of '1' above.

DS

Chris Thomasson

Jan 6, 2008, 4:15:01 AM
"David Schwartz" <dav...@webmaster.com> wrote in message
news:684c8ad7-bfe3-4b88...@x69g2000hsx.googlegroups.com...

What about the difference between contending on the lock proxy state -vs-
contention on _actual_ _application_ _data_? Humm, well, you can deschedule
threads in lock-free algorithms by using a mutex; something like:
_____________________________________________________________
static void
adaptive_lockfree_stack_push(
    anchor* const _this,
    pthread_mutex_t* const mtx,
    int spincount,
    node* const new_node
) {
    int backoff = 0;
    node* cmp;
    for (;;) {
        cmp = _this->next;
        if (CAS(&_this->next, cmp, new_node)) {
            break;
        } else if (--spincount < 1 && ! backoff) {
            pthread_mutex_lock(mtx);
            backoff = 1;
        }
    }
    if (backoff) {
        pthread_mutex_unlock(mtx);
    }
}
_____________________________________________________________

Would that be "somewhat" better?

Chris Thomasson

Jan 6, 2008, 8:14:40 AM
"David Schwartz" <dav...@webmaster.com> wrote in message
news:8dddb67e-3c08-4fe9...@v46g2000hsv.googlegroups.com...

> On Jan 5, 9:00 pm, "Chris Thomasson" <cris...@comcast.net> wrote:
>
>> Explain in detail how spinning on an adaptive mutex's internal state
>> _fundamentally_ differs from spinning on a CAS failure wrt lock-free
>> algorithms in general?
>
[...]

> Sure, under a microscope an adaptive mutex may look a lot like a lock-
> free queue.

Well, it depends on the queue implementation; agreed?

What about a loop-less wait-free single-producer/consumer queue that does
not use any interlocked RMW instructions whatsoever? It plays very nicely
with the FSB indeed! Here is a sample x86 implementation (from AppCore):

http://groups.google.com/group/comp.programming.threads/msg/6033226050b6a48c
_____________________________________________________
ac_i686_queue_spsc_push:
    MOVL 4(%ESP), %EAX        # EAX = queue
    MOVL 8(%ESP), %ECX        # ECX = new node
    MOVL 4(%EAX), %EDX        # EDX = queue->tail
    MOVL %ECX, (%EDX)         # tail->next = node
    MOVL %ECX, 4(%EAX)        # queue->tail = node
    RET

ac_i686_queue_spsc_pop:
    PUSHL %EBX
    MOVL 8(%ESP), %ECX        # ECX = queue
    MOVL (%ECX), %EAX         # EAX = queue->head
    CMPL 4(%ECX), %EAX        # head == tail -> queue is empty
    JE ac_i686_queue_spsc_pop_failed
    MOVL (%EAX), %EDX         # EDX = head->next
    MOVL 12(%EDX), %EBX       # grab the payload word of the next node
    MOVL %EDX, (%ECX)         # queue->head = head->next
    MOVL %EBX, 12(%EAX)       # move the payload into the dequeued node
    POPL %EBX
    RET                       # return the old head node in EAX

ac_i686_queue_spsc_pop_failed:
    XORL %EAX, %EAX           # return NULL: nothing to pop
    POPL %EBX
    RET
_____________________________________________________


> But performance disappears when you look at code under a
> microscope.

[...]

Humm, if you actually put the code above under a "microscope", you will find
that there are only 6 instructions in the push function and 14 for the pop,
making a grand total of 20. You should also notice that it's completely
devoid of LOCK prefixes; the MOV instruction is all that's needed. IMHO,
that's fairly "lightweight"; what do you think about all of this?

Is it crap?

;^)

Chris Thomasson

Jan 6, 2008, 6:33:05 PM
"Chris Thomasson" <cri...@comcast.net> wrote in message
news:atGdnQ3R0eyQBx3a...@comcast.com...

> "David Schwartz" <dav...@webmaster.com> wrote in message
> news:684c8ad7-bfe3-4b88...@x69g2000hsx.googlegroups.com...
>> On Jan 5, 9:11 pm, "Chris Thomasson" <cris...@comcast.net> wrote:
>>
>>> Here is a "hint": the CAS wrt lock-free is spinning on _actual_
>>> _application_ _data_, while the CAS wrt a mutex is spinning on its
>>> private/internal proxy state... Sometimes you only need to atomically
>>> modify a word or two...
>>>
>>> Basic fundamental difference; no?
>>
>> The difference is that one will deschedule threads that contend too
>> much and one never will.
>
> What about the difference between contending on the lock proxy state -vs-
> contention on _actual_ _application_ _data_? Humm, well, you can deschedule
> threads in lock-free algorithms by using a mutex; something like:


ARGH!

> _____________________________________________________________
> static void
> adaptive_lockfree_stack_push(
> anchor* const _this,
> pthread_mutex_t* const mtx,
> int spincount,
> node* const new_node
> ) {
> int backoff = 0;
> node* cmp;
> for (;;) {
> cmp = _this->next;

^^^^^^^^^^^^^^^^^^^^^^^^^

Of course it would be nice if I actually linked the node into the list!
Therefore, the following line needs to be added:

new_node->next = cmp;


> if (CAS(&_this->next, cmp, new_node)) {
> break;
> } else if (--spincount < 1 && ! backoff) {
> pthread_mutex_lock(mtx);
> backoff = 1;
> }
> }
> if (backoff) {
> pthread_mutex_unlock(mtx);
> }
> }
> _____________________________________________________________
>
> Would that be "somewhat" better?


I think this would work fine.
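
For reference, here is the push function again with that missing line
folded in (the same sketch as before, just with the node actually linked
before the CAS):

_____________________________________________________________
static void
adaptive_lockfree_stack_push(
    anchor* const _this,
    pthread_mutex_t* const mtx,
    int spincount,
    node* const new_node
) {
    int backoff = 0;
    node* cmp;
    for (;;) {
        cmp = _this->next;
        new_node->next = cmp;   /* the line that was missing */
        if (CAS(&_this->next, cmp, new_node)) {
            break;
        } else if (--spincount < 1 && ! backoff) {
            pthread_mutex_lock(mtx);
            backoff = 1;
        }
    }
    if (backoff) {
        pthread_mutex_unlock(mtx);
    }
}
_____________________________________________________________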


rfis...@gmail.com

Jan 7, 2008, 7:25:29 AM
On 7 Jan, 00:33, "Chris Thomasson" <cris...@comcast.net> wrote:
> "Chris Thomasson" <cris...@comcast.net> wrote in message

[...]

> > int backoff = 0;
> > node* cmp;
> > for (;;) {
> > cmp = _this->next;
>
> ^^^^^^^^^^^^^^^^^^^^^^^^^
>
> Of course it would be nice if I actually linked the node into the list!
> Therefore, the following line needs to be added:
>
> new_node->next = cmp;

Hey, Chris -
you see that little button with a disk on it in your text
editor? If you click that every so often you can keep your
code, even when your computer is off! It saves you time
and avoids errors because you don't have to rewrite your
program every time you want to run it.

Happy New Year,
RF.

David Schwartz

Jan 7, 2008, 3:01:04 PM
On Jan 6, 5:14 am, "Chris Thomasson" <cris...@comcast.net> wrote:

> What about a loop-less wait-free single-producer/consumer queue that does
> not use any interlocked RMW instructions whatsoever? It plays very nicely
> with the FSB indeed! Here is a sample x86 implementation (from AppCore):

It's hard to know without benchmarking, and it's very hard to
benchmark. Intuitively, it seems like it should be pretty good.

The limitation of a single producer and single consumer strikes me as
pretty significant. Using this just because you happen to have a
single producer and a single consumer strikes me as premature
optimization.

Obviously, there's no point if the queue is rarely-used. You might as
well use mutexes and keep it portable.

Note that if you have a busy producer and a busy consumer, this
algorithm will make it more likely that both producer and consumer run
concurrently. This may be a win and it may be a loss.

I can think of cases where it could be a significant loss. For
example, imagine if the objects on the queue are very large and only a
single CPU is available to run the producer and the consumer. This
algorithm will likely let the producer run to completion, pushing each
large object into and out of the CPU cache, forcing the consumer to
read them all in. A pre-emptive mutex would let the consumer run as
soon as the producer releases the lock, manipulating the large object
while it's still in the CPU cache.

So to use this generically would be premature optimization of the
worst kind. For one thing, things break horribly if you ever decide
you need a second producer or a second consumer. For another thing,
it's not clear if you're in the cases where this is a win or the cases
where this is a loss.

> Is it crap?

I don't know. Most likely, you don't either.

If you can prove it works better in a particular case where
performance matters, by all means use it. But if you reach for it
first when you need a single-producer, single-consumer queue, you are
probably at least a bit misguided.

DS

David Schwartz

Jan 7, 2008, 3:07:36 PM
On Jan 4, 3:14 am, Joe Seigh <jseigh...@xemaps.com> wrote:

> [citation needed] Or are you making all of this up? All the lock-free papers
> I've seen that specifically measure scalability show lock-free scaling better
> than lock-based solutions.

That's often because the particular lock-free implementation they're
showing is specifically optimized for the measure of scalability they
are using. How many of them test on, say, a four-CPU system where two
of the CPUs run threads that use the lock-free structure and the other
two run threads that don't?

For any reasonable class you can write, I can create a benchmark for
which your class is optimal.

> And there's the fact that Linus wouldn't allow performance
> improvements like RCU into the Linux kernel unless they could show provable and
> significant performance improvements. That goes pretty much for other kernels
> as well.

Kernels and threading libraries are two places where lock-free
algorithms of all kinds can work very well. Lock-free tends to get
ugly when the permitted contention slows down the rest of the system.
Kernels and threading libraries live at such a low level that there
typically isn't a "rest of the system" to care about.

Another class of lock-free algorithms that work very well are
algorithms that remove all interlocking from read-only cases
(sometimes by making cases that wouldn't otherwise be read-only be
that way). RCU and read-write lock optimizations fall into this
category.

The losers are the ones that are intended for general application use
that permit contending threads to "fight it out" rather than
scheduling uncontending threads. Clearly, you can only see this effect
if your test includes uncontending threads.

DS

Chris Thomasson

Jan 7, 2008, 8:35:08 PM
"David Schwartz" <dav...@webmaster.com> wrote in message
news:dfc70529-e541-4124...@p69g2000hsa.googlegroups.com...

> On Jan 6, 5:14 am, "Chris Thomasson" <cris...@comcast.net> wrote:
>
>> What about a loop-less wait-free single-producer/consumer queue that does
>> not use any interlocked RMW instructions whatsoever? It plays very nicely
>> with the FSB indeed! Here is a sample x86 implementation (from AppCore):
>
> It's hard to know without benchmarking, and it's very hard to
> benchmark. Intuitively, it seems like it should be pretty good.
>
> The limitation of a single producer and single consumer strikes me as
> pretty significant.

[...]

This is not a limitation at all. I use a distributed multiplexing algorithm
to get multi-consumer/producer out of a single-consumer/producer:

http://groups.google.com/group/comp.programming.threads/msg/4230a9d3cbc20ea9

http://groups.google.com/group/comp.arch/msg/a7501814f19731cd

http://groups.google.com/group/comp.arch/msg/ac41b1c179f0e7ca

That works, and scales extremely well. What do you think?
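
The general shape of such a multiplexing scheme - my sketch of the idea
here, not the actual algorithm from those links - is one SPSC queue per
producer, with the consumer draining them in turn:

_____________________________________________________________
/* sketch only: one SPSC ring per producer (see the spsc_ring
   sketch earlier in the thread); the consumer polls each ring */
#define MAX_PRODUCERS 8

typedef struct {
    spsc_ring ring[MAX_PRODUCERS];   /* producer i pushes to ring[i] */
} mpsc_mux;

int mpsc_pop(mpsc_mux* m, void** item) {
    int i;
    for (i = 0; i < MAX_PRODUCERS; ++i) {
        if (spsc_pop(&m->ring[i], item)) return 1;
    }
    return 0;   /* all per-producer rings were empty */
}
_____________________________________________________________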

Chris Thomasson

Jan 7, 2008, 8:38:53 PM

<rfis...@gmail.com> wrote in message
news:5b2ab508-6143-4dc3...@d4g2000prg.googlegroups.com...

On 7 Jan, 00:33, "Chris Thomasson" <cris...@comcast.net> wrote:
> > "Chris Thomasson" <cris...@comcast.net> wrote in message
[...]

> Hey, Chris -


> you see that little button with a disk on it in your text
> editor? If you click that every so often you can keep your
> code, even when your computer is off!

Humm. I was thinking about trying to consolidate all of the pseudo-code and
corrections I have posted to cpt over the years into a PDF or something...


> It saves you time
> and avoids errors because you don't have to rewrite your
> program every time you want to run it.

LOL! :^)

Chris Thomasson

Jan 7, 2008, 8:45:17 PM
"David Schwartz" <dav...@webmaster.com> wrote in message
news:52cdb264-8404-4b8b...@i72g2000hsd.googlegroups.com...

> On Jan 4, 3:14 am, Joe Seigh <jseigh...@xemaps.com> wrote:
>
>> [citation needed] Or are you making all of this up? All the lock-free
>> papers
>> I've seen that specifically measure scalability show lock-free scaling
>> better
>> than lock-based solutions.
>
> That's often because the particular lock-free implementation they're
> showing is specifically optimized for the measure of scalability they
> are using. How many of them test on, say, a four-CPU system where two
> of the CPUs run threads that use the lock-free structure and the other
> two run threads that don't?

What would the other threads be doing? In a sense, there are surely other
threads running on the system when their benchmarks are executing. It sounds
like you're saying that a program which makes use of lock-free algorithm(s) in
any way, shape or form will have a marked negative effect on the threads
of other processes running on the system?

[...]

David Schwartz

Jan 7, 2008, 10:18:08 PM
On Jan 7, 5:45 pm, "Chris Thomasson" <cris...@comcast.net> wrote:

> > That's often because the particular lock-free implementation they're
> > showing is specifically optimized for the measure of scalability they
> > are using. How many of them test on, say, a four-CPU system where two
> > of the CPUs run threads that use the lock-free structure and the other
> > two run threads that don't?

> What would the other threads be doing?

Anything else that puts reasonable load on CPU and memory/FSB
resources.

> In a sense, there are surely other
> threads running on the system when their benchmarks are executing.

Not likely. Most such benchmarks I've seen quiesce the rest of the
system for fear of "contaminating" the results. However, an algorithm
that makes the rest of the system slower may be inferior to one that
doesn't in a realistic application.

> It