Lock Free -- where to start

Imre Palik

Sep 24, 2005, 1:48:00 PM

Hi,

I read a few discussions on lock-free programming lately in various
newsgroups. As it sounds quite interesting, I tried to Google for it, to
find some introduction, uses, idioms, patterns, etc. to learn about it.

But I am pretty much lost in the results. Could anybody point me to a good
introduction, and some real-life (OpenSource?) examples of its use? (I
have read boost::shared_ptr already, but I have the feeling there is more
to it.)

Thx

ImRe

Chris Thomasson

Sep 24, 2005, 5:19:09 PM

Here is my "old and out-dated" experimental library:

http://appcore.home.comcast.net/

IMHO, there are three very useful APIs in here.


(1) The x86 lock-free support assembly library:

http://appcore.home.comcast.net/appcore/src/cpu/i686/ac_i686_gcc_asm.html


(2) The zero-overhead single-producer/consumer lock-free queue

http://appcore.home.comcast.net/appcore/include/ac_queue_spsc_h.html
http://appcore.home.comcast.net/appcore/src/ac_queue_spsc_c.html


(3) The zero-overhead eventcount

http://appcore.home.comcast.net/appcore/include/ac_eventcount_algo1_h.html
http://appcore.home.comcast.net/appcore/src/ac_eventcount_algo1_c.html


I am currently working on incorporating my new VZOOM library into some
in-house server software; however, I would be happy to discuss AppCore, if
you need any more information...


****VERY IMPORTANT****

SMR Hazard Pointers went PATENT PENDING in 2002!
--- You can't use them for anything commercial!

:O


Markus Elfring

Sep 25, 2005, 6:08:27 AM

> But I am pretty much lost in the results. Could anybody point me to a good
> introduction, and some real-life (OpenSource?) examples on its use?

What do you think about this article?
http://en.wikipedia.org/wiki/Lock-free_and_wait-free_algorithms

Regards,
Markus


Joe Seigh

Sep 25, 2005, 8:46:46 AM

Google on lock-free (http://scholar.google.com/ if it's up).

shared_ptr, though it's internally lock-free in some cases, isn't
really usable for lock-free programming since it depends on you owning the
shared_ptrs you copy.

Look at Java JSR-133 volatiles, which are atomic with acquire and release
semantics. Also look at the java.util.concurrent (JSR-166) packages. The
Java environment is probably a better one to play around in since the
memory model is better defined, and it has GC, atomicity, and interlocked
primitives.

If you're going to use anything commercially, take care to only use stuff
in the public domain. Just because a paper has been published on it doesn't
mean a patent application isn't in the works, and that covers pretty much
everything from the last 6 years or so. And a lot of stuff before that
already has patents.

--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.

David Schwartz

Sep 28, 2005, 12:13:12 AM

Let me just give you my general caveat about lock free code. First, lock
free code is generally non-portable and extremely complex. Such code can
easily contain bugs that are extremely difficult even for experts to spot.
The complexity is almost always unnecessary.

But worse, in most realistic cases, lock-free algorithms tremendously
*reduce* performance. The myth is that locks are expensive. The truth is that
*contention* is expensive. Lock-free algorithms allow code that is
contending to run in parallel. Locks allow other code to run that probably
won't contend.

For example, consider three threads, A, B, and C. A and B both want to
manipulate a queue, C wants to do something else entirely. Assume two CPUs.
With a lock, either A or B will grab the queue lock. A or B will run with C
and both threads will run at full speed with minimal contention. With a
lock-free queue, A can run concurrently with B. C will have to wait, while A
and B fight over the cache lines that contain the queue and its structures.

Well, you say, this is a carefully crafted example. What if there's
another processor? Guess what, you're still screwed. Because A and B running
in parallel will saturate the front side bus as they bounce the queue
structures back and forth. C will still run at a low speed and system
performance will likely be worse than if A ran by itself and then B ran by
itself.

Lock free algorithms work great in kernels and low level threading
libraries for a variety of reasons. Portability is not that important. The
probability that there are other things to do is lower in core code (because
everything passes through it). But for application code, or even most
libraries on top of threading libraries, it's a pure loss.

DS


Joe Seigh

Sep 28, 2005, 11:03:54 AM

David Schwartz wrote:
> Let me just give you my general caveat about lock free code. First, lock
> free code is generally non-portable and extremely complex. Such code can
> easily contain bugs that are extremely difficult even for experts to spot.
> The complexity is almost always unnecessary.
>
[...]

>
> Lock free algorithms work great in kernels and low level threading
> libraries for a variety of reasons. Portability is not that important. The
> probability that there are other things to do is lower in core code (because
> everything passes through it). But for application code, or even most
> libraries on top of threading libraries, it's a pure loss.
>

So Java isn't portable or they shouldn't have put support for lock-free in
Java or what?

David Butenhof

Sep 28, 2005, 1:21:12 PM

Joe Seigh wrote:
> David Schwartz wrote:
>> Let me just give you my general caveat about lock free code.
>> First, lock free code is generally non-portable and extremely complex.
>> Such code can easily contain bugs that are extremely difficult even
>> for experts to spot. The complexity is almost always unnecessary.
>>
> [...]
>>
>> Lock free algorithms work great in kernels and low level threading
>> libraries for a variety of reasons. Portability is not that important.
>> The probability that there are other things to do is lower in core
>> code (because everything passes through it). But for application code,
>> or even most libraries on top of threading libraries, it's a pure loss.
>>
>
> So Java isn't portable or they shouldn't have put support for lock-free in
> Java or what?

Java isn't just a language, or a compiler, or a runtime library. It's
also the MACHINE -- or at least the portable definition of a Virtual
Machine. It can specify its own rules, and the engineers who PORT the
Java VM have to do whatever is necessary (easy, hard, fast, slow,
whatever) to follow those rules for the target machine.

POSIX, on the other hand, can specify only "tweaks" on the language,
place loose requirements on the compiler, and provide a specification
for part of a runtime library... and can't say anything at all about the
machine.

The two cases are exactly the same except that everything is different.

--
Dave Butenhof, David.B...@hp.com
HP Utility Pricing software, POSIX thread consultant
Manageability Solutions Lab (MSL), Hewlett-Packard Company
110 Spit Brook Road, ZK2/3-Q18, Nashua, NH 03062

David Schwartz

Sep 28, 2005, 3:13:42 PM

"Joe Seigh" <jsei...@xemaps.com> wrote in message
news:oKadnXgTBfu...@comcast.com...

> David Schwartz wrote:

>> Let me just give you my general caveat about lock free code. First,
>> lock free code is generally non-portable and extremely complex. Such code
>> can easily contain bugs that are extremely difficult even for experts to
>> spot. The complexity is almost always unnecessary.

>> Lock free algorithms work great in kernels and low level threading

>> libraries for a variety of reasons. Portability is not that important.
>> The probability that there are other things to do is lower in core code
>> (because everything passes through it). But for application code, or even
>> most libraries on top of threading libraries, it's a pure loss.

> So Java isn't portable or they shouldn't have put support for lock-free in
> Java or what?

Java is quasi-portable. Java reflects a design decision that you'll take
all the performance penalties up front, and then you can use lots of clever
tricks to get the performance back. I find that an odd compromise.

If you care about performance, why would you force a memory model that
may or may not match the hardware at all? And if you don't care about
performance, why are you trying to mess with lock free algorithms?

In any event, I believe it is impossible to construct lock-free
algorithms in Java, the way the term "lock free" is ordinarily used. Java
does not let you get close enough to the hardware to know if there are locks
or not.

DS


Chris Thomasson

Sep 28, 2005, 4:15:01 PM

> Let me just give you my general caveat about lock free code. First,
> lock free code is generally non-portable and extremely complex. Such code
> can easily contain bugs that are extremely difficult even for experts to
> spot. The complexity is almost always unnecessary.

The complexity can be hidden behind a simple and portable API that anybody
can use, like the pthreads API. We all know that a particular pthread
implementation may be very complex, and may very well use some lock-free
algorithms. But does the average user actually care about this kind of
stuff? My point is, again: "The complexity can be hidden behind a simple
API". Really.

Would there be bugs... Well, maybe; it depends on the implementer. There may
be very subtle bugs in your pthread implementation as well...


> But worse, in most realistic cases, lock-free algorithms tremendously
> *reduce* performance. The myth is that locks are expensive.

Humm... Well, it really depends on the type of lock-free algorithm you
compare to locks. I would agree that "CAS-loop" based lock-free algorithms
can thrash the cache...

http://groups.google.com/group/comp.lang.c++.moderated/msg/6b2ccf76ba145c9c?hl=en
http://www.hpl.hp.com/techreports/2004/HPL-2004-105.html


however:

Contended locks can adversely affect performance and simply prevent
scalability in some situations:

Take a global read-write lock protecting a "read-mostly" data-structure.
Under heavy read contention, forward progress can be stalled waiting for the
updates to the internal lock state to be "shuttled around the CPU's caches".
If the size of the read-side critical section is moderate, the cost of
moving lock updates from one CPU's cache to another can actually rival that
of the critical section itself; this can prevent readers from truly
executing in parallel. The more processors you add, the worse it gets. This
is not true for some special high-performance reader/writer solutions...


> The truth is that *contention* is expensive. Lock free algorithms allow
> code that is contending to run in parallel. Locks allow other code to run
> that probably won't contend.

Think of a portable algorithm that can totally avoid contention for very
high numbers of "reader threads"; NO atomic operations and NO memory
barriers required...

How can a mutex compete with that?

Think frequent iteration over large database-like "read-mostly"
data-structures, before you answer.
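
One published family of techniques with exactly this property is RCU-style
deferred reclamation. A rough reader-side sketch with made-up names; the
writer side, which publishes new nodes behind a store barrier and defers
freeing removed ones until readers are done, is the hard part and is elided:

    /* Shared, read-mostly linked structure. Readers take no locks and
       execute no atomic operations or memory barriers; the pointer
       loads are data-dependent, which is ordering enough on virtually
       all CPUs (Alpha being the usual exception). */
    struct node { struct node *next; int key, value; };

    struct node *volatile head;

    int lookup(int key, int *value_out)
    {
        struct node *n;
        for (n = head; n != NULL; n = n->next) {
            if (n->key == key) {
                *value_out = n->value;
                return 1;
            }
        }
        return 0;
    }

With reads that cheap, even one atomic operation per lookup can cost more
than the lookup itself.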


> For example, consider three threads, A, B, and C. A and B both want to
> manipulate a queue, C wants to do something else entirely. Assume two
> CPUs. With a lock, either A or B will grab the queue lock. A or B will run
> with C and both threads will run at full speed with minimal contention.
> With a lock-free queue, A can run concurrently with B. C will have to
> wait, while A and B fight over the cache lines that contain the queue and
> its structures.

IMHO, this is an example of poor application engineering/design. I would
have C queue a request and wait for something else to do... Then A or B
could do the work; after all, they do sound like they are "worker threads"
of some sort...

;)


> Lock free algorithms work great in kernels and low level threading
> libraries for a variety of reasons. Portability is not that important.

Lock-free algorithms can "easily" be hidden behind a portable API.


> But for application code, or even most libraries on top of threading
> libraries, it's a pure loss.

Application code would not use raw lock-free structures directly. They would
use a library that exports a "portable API".


> , it's a pure loss.

Yeah...

The ability for read-only iterations (frequent database searches?) to
achieve a substantial increase in the number of reads per-second,
per-reader-thread, would most certainly be a loss...

;)


David Schwartz

Sep 28, 2005, 6:34:52 PM

"Chris Thomasson" <_no_spam_cristom@no_spam_comcast._net> wrote in message
news:_42dnY09Tc9...@comcast.com...

> The ability for read-only iterations (frequent database searches?) to
> achieve a substantial increase in the number of reads per-second,
> per-reader-thread, would most certainly be a loss...

I can't figure out what scenario you're talking about. If you have a
large chunk of read-mostly data, how can the overhead of bouncing around one
cache line prior to these large accesses possibly matter? If you're talking
about lots of small accesses, well, then your lock scope is badly chosen.

However, I do agree that you picked one of the best examples. Dealing
with large collections of read-mostly data is probably one of the best cases
for a lock free algorithm.

As for hiding non-portable code behind a portable API, that argument
misses the point. If you use an API to do something, you do not care whether
there's a lock free or lock based algorithm behind the API. The issue is
about *writing* code, not using code.

DS


Message has been deleted
Message has been deleted

Joe Seigh

Sep 28, 2005, 8:06:26 PM

Oliver S. wrote:
>>>But for application code, or even most libraries
>>>on top of threading libraries, it's a pure loss.
>
>
>>Application code would not use raw lock-free structures directly.
>>They would use a library that exports a "portable API".
>
>
> I think what David means is that lock-free is over-emphasized and I
> must agree on that. With producer/consumer patterns typical in server
> applications, intervals where queues are locked are extremely short
> (in the best case only a pointer is fetched from a queue) and the
> processing-intervals when a dequeued item is processed are at least
> some thousand times longer. So there's no real need for lock-free
> queues. Lock-free lifo-pools may make sense for memory-allocators, but
> good allocators manage thread-local pools so this need is minimized.
> And if you need maximum performance on memory-allocation, you manage
> your own pools.
> Lock-free processing may look fascinating, but is primarily superfluous.

So JSR-166 was a waste of time?

Message has been deleted

David Hopwood

Sep 28, 2005, 11:11:01 PM

Joe Seigh wrote:
> Oliver S. wrote:
>
>> I think what David [Schwartz] means is that lock-free is over-emphasized
>> and I must agree on that. With producer/consumer patterns typical in
>> server applications, intervals where queues are locked are extremely
>> short (in the best case only a pointer is fetched from a queue) and the
>> processing-intervals when a dequeued item is processed are at least some
>> thousand times longer. So there's no real need for lock-free queues.
>> Lock-free lifo-pools may make sense for memory-allocators, but good
>> allocators manage thread-local pools so this need is minimized. And
>> if you need maximum performance on memory-allocation, you manage your
>> own pools.
>> Lock-free processing may look fascinating, but is primarily superfluous.

>
> So JSR-166 was a waste of time?

Mostly, yes.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Chris Thomasson

Sep 28, 2005, 10:13:13 PM

> I think what David means is that lock-free is over-emphasized and I
> must agree on that. With producer/consumer patterns typical in server
> applications, intervals where queues are locked are extremely short
> (in the best case only a pointer is fetched from a queue) and the
> processing-intervals when a dequeued item is processed are at least
> some thousand times longer. So there's no real need for lock-free
> queues.

This would be true for the Michael-and-Scott queue algorithm; it's loaded
with cache-thrashing CAS-loops. However, "zero-overhead" loopless
single-producer/consumer queues are ideal for various distributed
producer/consumer systems. IMHO, a thread that is bound to a particular cpu
core, and wants to send a message to another thread that is bound to a
"nearby core", should use a method that has no atomic operations; just a
StoreStore for producers, and a data-dependent load for the consumers should
be enough. I have a feeling that scalable designs for the new multi-core
stuff that's coming out will be NUMA-like...
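
A rough sketch of such a loopless single-producer/single-consumer queue
(dummy-node style); STORESTORE() is a stand-in for whatever store-store
barrier the platform provides (a no-op on x86), and the names are made up:

    /* One producer thread, one consumer thread, no atomic RMW ops.
       Initialize with head == tail == a dummy node whose next is NULL. */
    struct node { struct node *volatile next; void *msg; };

    struct spsc_queue {
        struct node *head;   /* touched by the consumer only */
        struct node *tail;   /* touched by the producer only */
    };

    void spsc_push(struct spsc_queue *q, struct node *n, void *msg)
    {
        n->msg = msg;
        n->next = NULL;
        STORESTORE();        /* publish node contents before linking */
        q->tail->next = n;
        q->tail = n;
    }

    void *spsc_pop(struct spsc_queue *q)   /* returns NULL if empty */
    {
        struct node *dummy = q->head;
        struct node *next = dummy->next;   /* data-dependent load */
        void *msg;
        if (next == NULL)
            return NULL;
        msg = next->msg;
        q->head = next;      /* next becomes the new dummy; the old
                                dummy may be recycled by the caller */
        return msg;
    }

The data dependency on 'next' is what lets the consumer skip an explicit
load barrier on virtually every CPU (Alpha, again, being the exception).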


The ability to remove atomic operations and nasty memory barriers from a
distributed messaging system is a plus. As we all know, multi-processor
systems do not take kindly to these types of instructions...

;)


So, "certain types" of lock-free queues are useful for high-performance
cpu-to-cpu communications. Especially, when a core on one cpu has to send a
message to a core on another cpu; atomic operations and membars would
greatly reduce performance.


> Lock-free lifo-pools may make sense for memory-allocators, but good
> allocators manage thread-local pools so this need is minimized. And
> if you need maximum performance on memory-allocation, you manage your
> own pools.

> Lock-free processing may look fascinating, but is primarily superfluous.

They also work well as solutions to the reader/writer problem. They can be
used to remove the need for atomic operations and StoreLoad style memory
barriers; imho, that's not really superfluous.


Chris Thomasson

Sep 28, 2005, 10:16:00 PM

>>> Lock-free processing may look fascinating, but is primarily superfluous.
>>
>> So JSR-166 was a waste of time?
>
> Mostly, yes.

:O


David Butenhof

Sep 29, 2005, 8:21:11 AM

;-)

Lock-free, like threads, like Perl, like emacs, like news, like so many
other things, is an enormously powerful tool for certain things. And,
yes, absolutely useless (or even worse) for many other things.

Whether you're inclined to say that it's important (or relevant) for
"95%" or "5%" of "real work" depends entirely on your personal
definition of "real work"... which in turn depends on current position,
experience, and training.

As much as most programming today is pretty high level and abstract, it
IS still valuable to understand how transistors work, and chip logic
design, and memory busses and cache and pipelined instruction
processing. When you're programming a parallel application,
understanding what lock-free does and how it will affect your
application contributes to your understanding of the memory model and
implementation of your machine and application... and that can be
powerful even if you don't actually use lock-free techniques. (But then,
you probably do, even if you don't know it... and using them incorrectly
can be dangerous.)

David Hopwood

Sep 29, 2005, 11:04:07 AM

Chris Thomasson wrote:
>>I think what David means is that lock-free is over-emphasized and I
>>must agree on that. With producer/consumer patterns typical in server
>>applications, intervals where queues are locked are extremely short
>>(in the best case only a pointer is fetched from a queue) and the
>>processing-intervals when a dequeued item is processed are at least
>>some thousand times longer. So there's no real need for lock-free
>>queues.
>
> This would be true for the Michael-and-Scott queue algorithm; it's loaded
> with cache-thrashing CAS-loops. However, "zero-overhead" loopless
> single-producer/consumer queues are ideal for various distributed
> producer/consumer systems. IMHO, a thread that is bound to a particular cpu
> core, and wants to send a message to another thread that is bound to a
> "nearby core", should use a method that has no atomic operations; just a
> StoreStore for producers, and a data-dependent load for the consumers should
> be enough. I have a feeling that scalable designs for the new multi-core
> stuff that's coming out will be NUMA-like...

Right, but isn't this missing the point of David Schwartz's comments?

This newsgroup has had *many* threads with the following pattern:

- someone asks how to solve a problem. Typically it's clear from this
post that the poster has relatively little experience with concurrent
programming in general.

- one of the usual suspects suggests "using lock-free techniques".

- the thread wanders off into esoterica about how to implement stuff at
a completely disjoint level of abstraction to the original question.


To some extent this is just par for the Usenet course; OTOH, this group
does seem to have a particular collective obsession with lock-free at
the moment. Some of us are concerned that newbies are getting advice that
is at best unfollowable, and sometimes just plain bad, as a result.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Joe Seigh

Sep 29, 2005, 11:58:43 AM

David Hopwood wrote:
> Chris Thomasson wrote:
>
[...]

>
>
> Right, but isn't this missing the point of David Schwartz's comments?
>
> This newsgroup has had *many* threads with the following pattern:
>
> - someone asks how to solve a problem. Typically it's clear from this
> post that the poster has relatively little experience with concurrent
> programming in general.
>
> - one of the usual suspects suggests "using lock-free techniques".
>
> - the thread wanders off into esoterica about how to implement stuff at
> a completely disjoint level of abstraction to the original question.

Then both you and David Schwartz are missing the point. The OP specifically
asked about lock-free.

>
>
> To some extent this is just par for the Usenet course; OTOH, this group
> does seem to have a particular collective obsession with lock-free at
> the moment. Some of us are concerned that newbies are getting advice that
> is at best unfollowable, and sometimes just plain bad, as a result.

People get bad advice in this newsgroup all the time. Often from the
people who are allegedly concerned about people getting bad advice.
I've given up trying to correct these kinds of things, e.g. telling
people mutexes have forward progress guarantees about visibility,
anything about observable behavior from cache (apart from performance),
etc... The most blatant misinformation is that lock-free is too
complex to use when the typical way of dealing with complexity is to
encapsulate it with an easier to use api. It's like saying that
programming with locks is too complex because most programmers would
have a hard time implementing a mutex correctly.

Message has been deleted

Joe Seigh

Sep 29, 2005, 1:28:56 PM

Oliver S. wrote:
>>This would be true for the Michael-and-Scott queue algorithm;
>>its loaded with cache thrashing CAS-loops.
>
>
> You misunderstand me; I talked about the producer/consumer-pattern
> with the usual mutex-locks (which cause a kernel-intervention only on
> collisions). In this case, the lock is held for an extremely short
> interval and the processing of the data taken from the queue lasts at
> least some thousand times longer than the dequeue-process. So the
> likelihood of a collision is so low that there's no need for lock-free
> processing?
>

That would be for a small number of threads. If the probability of lock
contention for a single thread is .001 in your example, then for 100 threads
it's about .100, for 200 threads about .199, for 1000 threads about .630,
etc...

Lock-free is about scalability, not the actual overhead. If you know your
application will never experience high contention or you don't care about
scalability then you should stick with conventional lock based solutions.
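
Those figures assume each of the N other threads independently holds the
lock a fraction p = .001 of the time, so the chance of arriving at a held
lock is roughly 1 - (1 - p)^N. A throwaway C check (link with -lm):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        const double p = 0.001;             /* per-thread holding fraction */
        const int n[] = { 100, 200, 1000 };
        int i;
        for (i = 0; i < 3; i++)             /* 1 - (1 - p)^N */
            printf("%4d threads: %.3f\n", n[i], 1.0 - pow(1.0 - p, n[i]));
        return 0;
    }

which prints values in the neighborhood of the ones above.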

Michel

Sep 29, 2005, 1:53:35 PM

Joe Seigh wrote:
> That would be for a small number of threads. If the probability of lock
> contention for a single thread is .001 in your example, then for 100
> threads it's about .100, for 200 threads about .199, for 1000 threads
> about .630, etc...

>
> Lock-free is about scalability, not the actual overhead. If you know your
> application will never experience high contention or you don't care about
> scalability then you should stick with conventional lock based solutions.


Wouldn't you say there is a design or structural problem in having 100-1000
threads contending for the same lock in any somewhat ordinary
application, on the server or the client side? Wouldn't context switching
drain performance anyway?

/Michel

Joe Seigh

Sep 29, 2005, 2:11:34 PM

Depends on how many processors you have. There may be other things causing
context switching also but blocking on a lock certainly isn't going to help
performance.

David Schwartz

Sep 29, 2005, 2:28:46 PM

"Joe Seigh" <jsei...@xemaps.com> wrote in message
news:XfydndTAWZc...@comcast.com...

> Lock-free is about scalability, not the actual overhead. If you know
> your
> application will never experience high contention or you don't care about
> scalability then you should stick with conventional lock based solutions.

The point is, locks allow you to avoid high contention (by scheduling
tasks that don't contend). Lock free algorithms, in general, force you to
suffer through it (by allowing contending tasks to continue attempting to
run in parallel).

DS


David Schwartz

Sep 29, 2005, 2:30:38 PM

"Joe Seigh" <jsei...@xemaps.com> wrote in message
news:AridnXxvsKI...@comcast.com...

> Depends on how many processors you have. There may be other things
> causing
> context switching also but blocking on a lock certainly isn't going to
> help
> performance.

No, that's not true. Blocking on a lock is going to help performance
because useful work will be done until it's possible for the blocked thread
to run without contention. Blocking on a lock is good because it allows
contention to be avoided.

Lock free algorithms only help if the thing contended for is so critical
and important that you can't do much of anything else (on the entire system)
without accessing it. That's why they're very useful inside kernels, where
this is often true, and almost useless in applications, where it almost
never is.

DS


David Hopwood

Sep 29, 2005, 2:34:46 PM

Oliver S. wrote:
>>So, "certain types" of lock-free queues are useful
>>for high-performance cpu-to-cpu communications.
>
> I doubt that there's a noteworthy need for this kind of high-performance
> cpu-to-cpu communications because of the usual code patterns mentioned
> above. Please tell me only a single case in which the enqueuing or
> dequeuing with usual mutexes takes a noteworthy time in relation to the
> time taken to process the dequeued items afterwards.

In the implementation of a message passing language, where this overhead
applies to every message send/reception.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

David Hopwood

Sep 29, 2005, 2:41:17 PM

David Schwartz wrote:

> "Joe Seigh" <jsei...@xemaps.com> wrote:
>
>>Lock-free is about scalability, not the actual overhead. If you know
>>your application will never experience high contention or you don't care about
>>scalability then you should stick with conventional lock based solutions.
>
> The point is, locks allow you to avoid high contention (by scheduling
> tasks that don't contend). Lock free algorithms, in general, force you to
> suffer through it (by allowing contending tasks to continue attempting to
> run in parallel).

Note that message passing also allows you to avoid high contention by
scheduling tasks that don't contend. This is independent of whether the
synchronization needed for message passing between processes scheduled on
different processors/cores is implemented in terms of locks or by lock-free
techniques.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Joe Seigh

Sep 29, 2005, 3:22:46 PM

So you're saying lock based solutions will scale better than lock-free ones?

David Schwartz

Sep 29, 2005, 3:44:52 PM

"Joe Seigh" <jsei...@xemaps.com> wrote in message
news:QradnVuGiqT...@comcast.com...

> So you're saying lock based solutions will scale better than lock-free
> ones?

It depends upon which of the dozens of possible meanings of "scale" you
intend. In general, if by "scale" you mean show improved real world
performance as load, in general, increases, then yes, lock-based solutions
will scale better than lock-free ones.

Consider a case where you have four threads. Threads A and B want to
access the same collection and threads C and D want to access the same
collection. With only a single CPU, both a lock-free and lock-based solution
will run one thread at a time and it will run at full speed.

Now, let's scale this to two CPUs. The lock-based solution will never
run thread A and B at the same time. It will always run one of A/B and one
of C/D. The scaling is perfect and the performance doubles since contending
threads never run.

The lock-free solution will randomly match threads. Whenever it happens
to schedule A and B or C and D at the same time, each thread will run at
some small fraction of its maximum as the FSB bounces the shared/modified
data back and forth.

Granted, this is an extreme example, but the point is that contention is
bad and lock free algorithms tend to maximize contention.

DS


Joe Seigh

Sep 29, 2005, 5:00:54 PM

What happens when you scale to 4 cpu's?

David Schwartz

Sep 29, 2005, 5:11:28 PM

"Joe Seigh" <jsei...@xemaps.com> wrote in message
news:EfSdnUtuDOz...@comcast.com...

> What happens when you scale to 4 cpu's?

The lock-based solution works *much* better than the lock-free one. The
lock-free solution allows all four threads to run, totally trashing the FSB.
The locks allow only two threads to run at full speed, which not only get
more work done than all four fighting for the FSB and playing cache
ping-pong, but also allow the system to accomplish other useful work as
well.

DS


Joe Seigh

Sep 29, 2005, 7:20:12 PM

In your rather contrived example, you had the four threads partitioned
into two pairs of threads, each pair doing totally separate work that could
just as well be done on separate machines. Then you have one of the threads
in each pair totally lock out the other so only one thread runs at a time,
meaning you are only getting the throughput of a single thread. The
question is why have two threads in the first place when a single thread
would do as well or better? This is the best example you could come up
with?
Message has been deleted

David Schwartz

Sep 29, 2005, 11:03:38 PM

"Joe Seigh" <jsei...@xemaps.com> wrote in message
news:8Y6dnavVYui...@comcast.com...

> David Schwartz wrote:

>> The lock-based solution works *much* better than the lock-free one. The
>> lock-free solution allows all four threads to run, totally trashing the
>> FSB. The locks allow only two threads to run at full speed, which not only
>> get more work done than all four fighting for the FSB and playing cache
>> ping-pong, but also allow the system to accomplish other useful work as
>> well.

> In your rather contrived example, you had the four threads partitioned
> into two pairs of threads, each pair doing totally separate work that
> could just as well be done on separate machines.

That's actually pretty typical. Generally the threads in a process will
subdivide into logical groups of one or more threads each. Each thread in a
group does roughly similar tasks but tasks that are different from what the
threads in the other groups do. There will usually be common paths that all
the thread groups pass through, but they are usually in the OS and the
threading library. (Where lock free techniques *can* be very valuable.)

> Then you have one of the threads in each pair totally lock out the other
> so only one thread runs at a time, meaning you are only getting the
> throughput of a single thread.

No, that's not what I'm doing. I'm saying that's what happens in the
case where they block on the same lock. Regardless of what each thread is
going to do once it acquires the lock or after releasing it, if one thread
blocks on a lock the other holds, it will be totally locked out until the
other thread either releases the lock or uses up its timeslice.

> The
> question is why have two threads in the first place when a single thread
> would do as well or better?

The most common reason is that if you use a single thread, and that
thread blocks unexpectedly (say, on a page fault or file read), you are
totally screwed. Writing code so that you must be 100% certain that no tiny
bit of it (whether performance critical or not) ever blocks is just too
hard.

> This is the best example you could come up
> with?

Obviously I picked the example that best illustrated my point. The more
your situation varies from my example, the less applicable the example.

If you're going to fairly evaluate lock free algorithms, you should
understand very well the prototypical cases where they fail, if for no
other reason than to make a thoroughly informed decision that those
weaknesses aren't too important in your particular case.

DS


Chris Thomasson

Sep 29, 2005, 11:06:23 PM

> Ok, but in this case someone asked for resources on lock-free
> programming.

>
>> - one of the usual suspects suggests "using lock-free techniques".
>
> Yes, and it's pretty annoying when Chris Thomasson jumps into the
> scene just to emphasize his work

Well, the OP did ask for some open-source.


> (with an IMHO very ugly coding-style).

Sorry about that.

;)


> He suggests using lock-free processing although the cases where it is
> really necessary because of a notable performance advantage are rarer
> than the needle in a haystack.

The library's basic philosophy is to blend "lock-free with lock-based
programming". I also post a clear link to a discussion of some of the most
important "caveats" that come along with lock-free programming. I clearly
suggest that programmers "test" AppCore against "traditional lock-based"
solutions, that's all. Why not let application programmers try to
"experimentally improve" some of their existing designs? When the multi-core
stuff becomes available to "regular customers", it may require some
applications to "scale-up". In the next couple of years, you "might be"
receiving questions from customers like:


I bought one of those new dual-core systems with 4 processors, and your
application suite seems to execute critical tasks a little bit slower than
usual! Why is that? Do I have to set some special configuration options?
Please help...


> And in the current case, he suggests his nearly undocumented and hardly
> readable source as an origin for lock-free-processing knowledge.

It's only an example. I didn't really have time to document it, because I
was working hard on something else. AppCore is pretty much dead right now.
Anyway, I don't claim that AppCore is an "origin for lock-free-processing
knowledge"; after all, its experimental. If I am coming across that way, I'm
sorry.


Chris Thomasson

Sep 29, 2005, 11:34:00 PM

> You misunderstand me; I talked about the producer/consumer-pattern
> with the usual mutex-locks (which cause a kernel-intervention only on
> collisions).

Yes. However, they also execute an atomic operation and a nasty
StoreLoad-style membar for the lock and unlock, contention or not. IMO,
these types of operations are very expensive for modern processors to
execute, especially if they have to lock the bus...
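
To see where that cost lives, here is a minimal spinlock sketch;
atomic_xchg() is a hypothetical wrapper that compiles to x86 XCHG (which is
implicitly locked and acts as a full barrier):

    typedef struct { volatile int locked; } spinlock_t;

    void spin_lock(spinlock_t *l)
    {
        /* an atomic RMW is paid on every acquisition, contended or not */
        while (atomic_xchg(&l->locked, 1) != 0)
            ;  /* spin */
    }

    void spin_unlock(spinlock_t *l)
    {
        l->locked = 0;  /* a plain store suffices on x86's strong model */
    }

A typical mutex fast path does comparable work on lock and unlock, which is
exactly the atomic-plus-membar overhead in question.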


> In this case, the lock is held for an extremely short interval and
> the processing of the data taken from the queue lasts at least some
> thousand times longer than the dequeue-process. So the likelihood
> of a collision is so low:

Actually, to further enhance your point, there is an efficient lock-based
queue out there that is simply guaranteed to get absolutely no "collisions"
between a producer and consumer; a rough sketch follows below.
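
It is in the spirit of Michael & Scott's two-lock queue: a dummy node keeps
the producer on the tail lock and the consumer on the head lock, so a single
producer and a single consumer never touch the same lock. Error handling is
elided, and this is not necessarily the exact queue alluded to above:

    #include <pthread.h>
    #include <stdlib.h>

    struct node { struct node *next; void *data; };

    struct queue {
        struct node *head, *tail;     /* head always points at a dummy */
        pthread_mutex_t hlock, tlock;
    };

    void queue_init(struct queue *q)
    {
        struct node *dummy = malloc(sizeof *dummy);
        dummy->next = NULL;
        q->head = q->tail = dummy;
        pthread_mutex_init(&q->hlock, NULL);
        pthread_mutex_init(&q->tlock, NULL);
    }

    void enqueue(struct queue *q, void *data)
    {
        struct node *n = malloc(sizeof *n);
        n->next = NULL;
        n->data = data;
        pthread_mutex_lock(&q->tlock);   /* producers take tlock only */
        q->tail->next = n;
        q->tail = n;
        pthread_mutex_unlock(&q->tlock);
    }

    void *dequeue(struct queue *q)       /* returns NULL if empty */
    {
        struct node *dummy, *first;
        void *data;
        pthread_mutex_lock(&q->hlock);   /* consumers take hlock only */
        dummy = q->head;
        first = dummy->next;             /* relies on word-sized pointer
                                            loads/stores being atomic, as
                                            in the published algorithm */
        if (first == NULL) {
            pthread_mutex_unlock(&q->hlock);
            return NULL;
        }
        data = first->data;
        q->head = first;                 /* first becomes the new dummy */
        pthread_mutex_unlock(&q->hlock);
        free(dummy);
        return data;
    }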


> that there's no need for lock-free processing?

Well, the only thing that a lock-free queue would buy you here is reduced
overhead wrt atomic operations and membars.


>> Especially, when a core on one cpu has to send a message to a
>> core on another cpu; atomic operations and membars would greatly
>> reduce performance.
>

> In nearly every case this does not reduce the performance because
> enqueuing and dequeuing are only a very tiny part of the whole
> processing in almost any case.

Granted. However, imho, I still think that the ability to remove atomic
operations and membars from producer/consumer systems and database searches
is going to gradually become more and more important...


Chris Thomasson

Sep 30, 2005, 12:03:30 AM

>> The ability for read-only iterations (frequent database searches?) to
>> achieve a substantial increase in the number of reads per-second,
>> per-reader-thread, would most certainly be a loss...
>
> I can't figure out what scenario you're talking about.

Reader threads getting a vast increase in the number of reads per-second
they can perform on a shared database-like structure. IMHO, this can be
very important.


> If you have a large chunk of read-mostly data, how can the overhead of
> bouncing around one cache line prior to these large accesses possibly
> matter?

Lock-free or not, atomic operations and membars reduce shared reads
per-second in virtually all cases:

http://www.rdrop.com/users/paulmck/RCU/lockperf.2004.01.17a.pdf


In some cases, it can be the difference between tens-of-thousands of reads
per-second and a million+ reads per-second...

No kidding.


> If you're talking about lots of small accesses, well, then your lock scope
> is badly chosen.

Probably.


> However, I do agree that you picked one of the best examples. Dealing
> with large collections of read-mostly data is probably one of the best
> cases for a lock free algorithm.

Very true.


> As for hiding non-portable code behind a portable API, that argument
> misses the point. If you use an API to do something, you do not care
> whether there's a lock free or lock based algorithm behind the API.

I have a feeling that users are going to start to care about how well an API
can "scale up" to the new systems that are going to come out. Though I do
agree that the user won't really care about what a vendor uses, lock-free or
lock-based, to achieve that.


> The issue is about *writing* code, not using code.

Ahh. Yes, actually writing lock-free code can be very tedious, indeed:

http://groups.google.com/group/comp.programming.threads/msg/423df394a0370fa6


;)


Joe Seigh

Sep 30, 2005, 8:00:23 AM

David Schwartz wrote:
> "Joe Seigh" <jsei...@xemaps.com> wrote in message
> news:8Y6dnavVYui...@comcast.com...
>
>>In your rather contrived example, you had the four threads partitioned
>>into two pairs of threads, each pair doing totally separate work that
>>could just as well be done on separate machines.
>
>
> That's actually pretty typical. Generally the threads in a process will
> subdivide into logical groups of one or more threads each. Each thread in a
> group does roughly similar tasks but tasks that are different from what the
> threads in the other groups do. There will usually be common paths that all
> the thread groups pass through, but they are usually in the OS and the
> threading library. (Where lock free techniques *can* be very valuable.)
>
You're just making this up. Applications can't be parallelized to some
arbitrary degree. If they could, whoever figured out how to do it
would become very famous. That's why scalability *is* an issue.

[...]

>
>
>>This is the best example you could come up
>>with?
>
>
> Obviously I picked the example that best illustrated my point. The more
> your situation varies from my example, the less applicable the example.
>
> If you're going to fairly evaluate lock free algorithms, you should
> understand very well the prototypical cases where they fail, if for no
> other reason than to make a thoroughly informed decision that those
> weaknesses aren't too important in your particular case.
>

Well, it was a fairly bad example. We're talking about scalability here.
All the literature and research that I'm aware of show lock-free algorithms
scale better than lock based ones. You're perfectly welcome to put out
your own paper showing lock based ones scale better than lock-free ones.

Message has been deleted

Sean Kelly

Sep 30, 2005, 1:29:40 PM