
Lock Free -- where to start


Imre Palik

Sep 24, 2005, 1:48:00 PM
Hi,

I read a few discussions on lock-free programming lately in various
newsgroups. As it sounds quite interesting, I tried to google for it to
find some introductions, uses, idioms, patterns, etc. to learn about it.

But I am pretty much lost in the results. Could anybody point me to a good
introduction, and some real-life (OpenSource?) examples of its use? (I
read boost::shared_ptr already, but I have the feeling there is more to
it.)

Thx

ImRe

Chris Thomasson

Sep 24, 2005, 5:19:09 PM

Here is my "old and out-dated" experimental library:

http://appcore.home.comcast.net/

IMHO, there are three very useful APIs in here.


(1) The x86 lock-free support assembly library:

http://appcore.home.comcast.net/appcore/src/cpu/i686/ac_i686_gcc_asm.html


(2) The zero-overhead single-producer/consumer lock-free queue

http://appcore.home.comcast.net/appcore/include/ac_queue_spsc_h.html
http://appcore.home.comcast.net/appcore/src/ac_queue_spsc_c.html


(3) The zero-overhead eventcount

http://appcore.home.comcast.net/appcore/include/ac_eventcount_algo1_h.html
http://appcore.home.comcast.net/appcore/src/ac_eventcount_algo1_c.html


I am currently working on incorporating my new VZOOM library into some
in-house server software; however, I would be happy to discuss AppCore, if
you need any more information...


****VERY IMPORTANT****

SMR Hazard Pointers went PATENT PENDING in 2002!
--- You can't use them for anything commercial!

:O


Markus Elfring

Sep 25, 2005, 6:08:27 AM
> But I am pretty much lost in the results. Could anybody point me to a good
> introduction, and some real-life (OpenSource?) examples on its use?

What do you think about this article?
http://en.wikipedia.org/wiki/Lock-free_and_wait-free_algorithms

Regards,
Markus


Joe Seigh

Sep 25, 2005, 8:46:46 AM
Google on lock-free (http://scholar.google.com/ if it's up).

shared_ptr, though it's internally lock-free in some cases, isn't
really usable for lock-free programming, since it depends on you owning the
shared_ptrs you copy.

Look at Java JSR-133 volatiles, which are atomic with acquire and release
semantics. Also look at the java.util.concurrent (JSR-166) packages. The
Java environment is probably a better one to play around in since the
memory model is better defined, and it has GC, atomicity, and interlocked
primitives.

If you're going to use anything commercially, take care to only use stuff
in the public domain. Just because a paper has been published on it doesn't
mean a patent application isn't in the works; that covers pretty much
everything from the last 6 years or so. And a lot of stuff before that
already has patents.

--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.

David Schwartz

Sep 28, 2005, 12:13:12 AM

Let me just give you my general caveat about lock free code. First, lock
free code is generally non-portable and extremely complex. Such code can
easily contain bugs that are extremely difficult even for experts to spot.
The complexity is almost always unnecessary.

But worse, in most realistic cases, lock free code tremendously *reduces*
performance. The myth is that locks are expensive. The truth is that
*contention* is expensive. Lock free algorithms allow code that is
contending to run in parallel. Locks allow other code to run that probably
won't contend.

For example, consider three threads, A, B, and C. A and B both want to
manipulate a queue, C wants to do something else entirely. Assume two CPUs.
With a lock, either A or B will grab the queue lock. A or B will run with C
and both threads will run at full speed with minimal contention. With a
lock-free queue, A can run concurrently with B. C will have to wait, while A
and B fight over the cache lines that contain the queue and its structures.

Well, you say, this is a carefully crafted example. What if there's
another processor? Guess what, you're still screwed. Because A and B running
in parallel will saturate the front side bus as they bounce the queue
structures back and forth. C will still run at a low speed and system
performance will likely be worse than if A ran by itself and then B ran by
itself.

Lock free algorithms work great in kernels and low level threading
libraries for a variety of reasons. Portability is not that important. The
probability that there are other things to do is lower in core code (because
everything passes through it). But for application code, or even most
libraries on top of threading libraries, it's a pure loss.

DS


Joe Seigh

Sep 28, 2005, 11:03:54 AM
David Schwartz wrote:
> Let me just give you my general caveat about lock free code. First, lock
> free code is generally non-portable and extremely complex. Such code can
> easily contain bugs that are extremely difficult even for experts to spot.
> The complexity is almost always unnecessary.
>
[...]

>
> Lock free algorithms work great in kernels and low level threading
> libraries for a variety of reasons. Portability is not that important. The
> probability that there are other things to do is lower in core code (because
> everything passes through it). But for application code, or even most
> libraries on top of threading libraries, it's a pure loss.
>

So Java isn't portable or they shouldn't have put support for lock-free in
Java or what?

David Butenhof

Sep 28, 2005, 1:21:12 PM
Joe Seigh wrote:
> David Schwartz wrote:
>> Let me just give you my general caveat about lock free code.
>> First, lock free code is generally non-portable and extremely complex.
>> Such code can easily contain bugs that are extremely difficult even
>> for experts to spot. The complexity is almost always unnecessary.
>>
> [...]
>>
>> Lock free algorithms work great in kernels and low level threading
>> libraries for a variety of reasons. Portability is not that important.
>> The probability that there are other things to do is lower in core
>> code (because everything passes through it). But for application code,
>> or even most libraries on top of threading libraries, it's a pure loss.
>>
>
> So Java isn't portable or they shouldn't have put support for lock-free in
> Java or what?

Java isn't just a language, or a compiler, or a runtime library. It's
also the MACHINE -- or at least the portable definition of a Virtual
Machine. It can specify its own rules, and the engineers who PORT the
Java VM have to do whatever is necessary (easy, hard, fast, slow,
whatever) to follow those rules for the target machine.

POSIX, on the other hand, can specify only "tweaks" on the language,
place loose requirements on the compiler, and provide a specification
for part of a runtime library... and can't say anything at all about the
machine.

The two cases are exactly the same except that everything is different.

--
Dave Butenhof, David.B...@hp.com
HP Utility Pricing software, POSIX thread consultant
Manageability Solutions Lab (MSL), Hewlett-Packard Company
110 Spit Brook Road, ZK2/3-Q18, Nashua, NH 03062

David Schwartz

Sep 28, 2005, 3:13:42 PM

"Joe Seigh" <jsei...@xemaps.com> wrote in message
news:oKadnXgTBfu...@comcast.com...

> David Schwartz wrote:

>> Let me just give you my general caveat about lock free code. First,
>> lock free code is generally non-portable and extremely complex. Such code
>> can easily contain bugs that are extremely difficult even for experts to
>> spot. The complexity is almost always unnecessary.

>> Lock free algorithms work great in kernels and low level threading

>> libraries for a variety of reasons. Portability is not that important.
>> The probability that there are other things to do is lower in core code
>> (because everything passes through it). But for application code, or even
>> most libraries on top of threading libraries, it's a pure loss.

> So Java isn't portable or they shouldn't have put support for lock-free in
> Java or what?

Java is quasi-portable. Java reflects a design decision that you'll take
all the performance penalties up front, and then you can use lots of clever
tricks to get the performance back. I find that an odd compromise.

If you care about performance, why would you force a memory model that
may or may not match the hardware at all? And if you don't care about
performance, why are you trying to mess with lock free algorithms?

In any event, I believe it is impossible to construct lock-free
algorithms in Java, the way the term "lock free" is ordinarily used. Java
does not let you get close enough to the hardware to know if there are locks
or not.

DS


Chris Thomasson

Sep 28, 2005, 4:15:01 PM
> Let me just give you my general caveat about lock free code. First,
> lock free code is generally non-portable and extremely complex. Such code
> can easily contain bugs that are extremely difficult even for experts to
> spot. The complexity is almost always unnecessary.

The complexity can be hidden behind a simple and portable API that anybody
can use, like the pthreads API. We all know that a particular pthread
implementation may be very complex, and may very well use some lock-free
algorithms. But does the average user actually care about this kind of
stuff? My point is, again, "The complexity can be hidden behind a simple
API". Really.

Would there be bugs... Well, maybe; it depends on the implementer. There may
be very subtle bugs in your pthread implementation as well...
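To make the point concrete, here is a rough sketch (hypothetical names, modern
C++ used purely for brevity) of the kind of "simple API" being described. The
mutex-backed class below is just a stand-in; a vendor could put a lock-free
queue behind the exact same interface without the caller ever knowing:

#include <mutex>
#include <queue>

// The "simple and portable API" a caller sees. Nothing here says whether
// the implementation underneath is lock-based or lock-free.
struct MsgQueue {
    virtual ~MsgQueue() {}
    virtual void push(void* item) = 0;
    virtual void* try_pop() = 0;   // returns nullptr when empty
};

// One possible implementation. Swapping in a lock-free version would not
// change a single line of calling code.
class LockedMsgQueue : public MsgQueue {
    std::mutex mtx_;
    std::queue<void*> items_;
public:
    void push(void* item) override {
        std::lock_guard<std::mutex> guard(mtx_);
        items_.push(item);
    }
    void* try_pop() override {
        std::lock_guard<std::mutex> guard(mtx_);
        if (items_.empty()) return nullptr;
        void* item = items_.front();
        items_.pop();
        return item;
    }
};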


> But worse, in most realistic cases, lock free tremendously *reduce*
> performance. The myth is that locks are expensive.

Humm... Well, it really depends on the type of lock-free algorithm you
compare to locks. I would agree that "CAS-loop" based lock-free algorithms
can thrash the cache...

http://groups.google.com/group/comp.lang.c++.moderated/msg/6b2ccf76ba145c9c?hl=en
http://www.hpl.hp.com/techreports/2004/HPL-2004-105.html


however:

Contended locks can adversely affect performance and simply prevent
scalability in some situations:

Take a global read-write lock protecting a "read-mostly" data-structure.
Under heavy read contention, forward progress can be stalled waiting for the
updates to the internal lock state to be "shuttled around the CPUs' caches".
If the size of the read-side critical section is moderate, the cost of
moving lock updates from one CPU's cache to another can actually rival that
of the critical section itself; this can prevent readers from truly
executing in parallel. The more processors you add, the worse it gets. This
is not true for some special high-performance reader/writer solutions...


> The truth is that *contention* is expensive. Lock free algorithms allow
> code that is contending to run in parallel. Locks allow other code to run
> that probably won't contend.

Think of a portable algorithm that can totally avoid contention for very
high numbers of "reader threads"; NO atomic operations and NO memory
barriers required...

How can a mutex compete with that?

Think frequent iteration over large database-like "read-mostly"
data-structures, before you answer.
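As a rough illustration of what that looks like (a generic sketch only, not
the algorithm being hinted at; modern C++ atomics assumed, a single writer
assumed, and the genuinely hard part, safe reclamation of old copies, is
deliberately left out):

#include <atomic>
#include <vector>

// Read-mostly data published through a single pointer. Readers do one
// acquire load and then walk private, immutable data: no locks, no atomic
// read-modify-write, no StoreLoad barrier on the read side.
struct Snapshot { std::vector<int> rows; };

std::atomic<Snapshot*> g_current{ new Snapshot{} };

int reader_sum() {                        // any number of concurrent readers
    Snapshot* snap = g_current.load(std::memory_order_acquire);
    int sum = 0;
    for (int v : snap->rows) sum += v;
    return sum;
}

void writer_append(int v) {               // single writer assumed
    Snapshot* old_snap = g_current.load(std::memory_order_relaxed);
    Snapshot* new_snap = new Snapshot(*old_snap);   // copy
    new_snap->rows.push_back(v);                    // modify
    g_current.store(new_snap, std::memory_order_release);   // publish
    // delete old_snap;  // NOT safe without RCU/SMR/proxy-GC; intentionally
    //                   // leaked here because reclamation is the hard part.
}

The readers never touch a shared writable cache line, which is exactly why
this style scales with reader count where a mutex or rw-lock does not.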


> For example, consider three threads, A, B, and C. A and B both want to
> manipulate a queue, C wants to do something else entirely. Assume two
> CPUs. With a lock, either A or B will grab the queue lock. A or B will run
> with C and both threads will run at full speed with minimal contention.
> With a lock-free queue, A can run concurrently with B. C will have to
> wait, while A and B fight over the cache lines that contain the queues and
> its structures.

IMHO, this is an example of poor application engineering/design. I would
have C queue a request and wait for something else to do... Then A or B
could do the work; after all, they do sound like they are "worker threads"
of some sort...

;)


> Lock free algorithms work great in kernels and low level threading
> libraries for a variety of reasons. Portability is not that important.

Lock-free algorithms can "easily" be hidden behind a portable API.


> But for application code, or even most libraries on top of threading
> libraries, it's a pure loss.

Application code would not use raw lock-free structures directly. It would
use a library that exports a "portable API".


> , it's a pure loss.

Yeah...

The ability for read-only iterations (frequent database searches?) to
achieve a substantial increase in the number of reads per-second,
per-reader-thread, would most certainly be a loss...

;)


David Schwartz

Sep 28, 2005, 6:34:52 PM

"Chris Thomasson" <_no_spam_cristom@no_spam_comcast._net> wrote in message
news:_42dnY09Tc9...@comcast.com...

> The ability for read-only iterations (frequent database searches?) to
> achieve a substantial increase in the number of reads per-second,
> per-reader-thread, would most certainly be a loss...

I can't figure out what scenario you're talking about. If you have a
large chunk of read-mostly data, how can the overhead of bouncing around one
cache line prior to these large accesses possibly matter? If you're talking
about lots of small accesses, well, then your lock scope is badly chosen.

However, I do agree that you picked one of the best examples. Dealing
with large collections of read-mostly data is probably one of the best cases
for a lock free algorithm.

As for hiding non-portable code behind a portable API, that argument
misses the point. If you use an API to do something, you do not care whether
there's a lock free or lock based algorithm behind the API. The issue is
about *writing* code, not using code.

DS



Joe Seigh

Sep 28, 2005, 8:06:26 PM
Oliver S. wrote:
>>>But for application code, or even most libraries
>>>on top of threading libraries, it's a pure loss.
>
>
>>Application code would not use raw lock-free structures directly.
>>They would use a library that exports a "portable API".
>
>
> I think what David means is that lock-free is over-emphasized and I
> must agree on that. With producer/consumer patterns typical in server
> applications, the intervals where queues are locked are extremely short
> (in the best case only a pointer is fetched from a queue) and the
> processing intervals, when a dequeued item is processed, are at least some
> thousand times longer. So there's no real need for lock-free queues.
> Lock-free lifo-pools may make sense for memory allocators, but good
> allocators manage thread-local pools so this need is minimized. And
> if you need maximum performance on memory allocation, you manage your
> own pools.
> Lock-free processing may look fascinating, but is primarily superfluous.

So JSR-166 was a waste of time?


David Hopwood

Sep 28, 2005, 11:11:01 PM
Joe Seigh wrote:
> Oliver S. wrote:
>
>> I think what David [Schwartz] means is that lock-free is over-emphasized
>> and I must agree on that. With producer/consumer patterns typical in server
>> applications, the intervals where queues are locked are extremely short
>> (in the best case only a pointer is fetched from a queue) and the
>> processing intervals, when a dequeued item is processed, are at least some
>> thousand times longer. So there's no real need for lock-free queues.
>> Lock-free lifo-pools may make sense for memory allocators, but good
>> allocators manage thread-local pools so this need is minimized. And
>> if you need maximum performance on memory allocation, you manage your
>> own pools.
>> Lock-free processing may look fascinating, but is primarily superfluous.

>
> So JSR-166 was a waste of time?

Mostly, yes.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Chris Thomasson

Sep 28, 2005, 10:13:13 PM
> I think what David means is that lock-free is over-emphasized and I
> must agree on that. With producer/consumer patterns typical in server
> applications, the intervals where queues are locked are extremely short
> (in the best case only a pointer is fetched from a queue) and the
> processing intervals, when a dequeued item is processed, are at least some
> thousand times longer. So there's no real need for lock-free queues.

This would be true for the Michael-and-Scott queue algorithm; it's loaded
with cache-thrashing CAS-loops. However, "zero-overhead" loopless
single-producer/consumer queues are ideal for various distributed
producer/consumer systems. IMHO, a thread that is bound to a particular cpu
core, and wants to send a message to another thread that is bound to a
"nearby core", should use a method that has no atomic operations; just a
StoreStore for producers, and a data-dependent load for the consumers should
be enough. I have a feeling that scalable designs for the new multi-core
stuff that's coming out will be NUMA-like...
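For a generic picture of such a queue (this is not the AppCore code; modern
C++ release/acquire stands in for the explicit StoreStore barrier and the
consumer's data-dependent load described above):

#include <atomic>
#include <cstddef>

// Minimal single-producer/single-consumer ring buffer. No CAS loops and no
// StoreLoad barriers: the producer's release store on 'head_' publishes the
// slot, and the consumer's acquire load pairs with it.
template <typename T, std::size_t N>      // N must be a power of two
class SpscQueue {
    T buf_[N];
    std::atomic<std::size_t> head_{0};    // written by the producer only
    std::atomic<std::size_t> tail_{0};    // written by the consumer only
public:
    bool push(const T& v) {               // call from the producer thread only
        std::size_t h = head_.load(std::memory_order_relaxed);
        if (h - tail_.load(std::memory_order_acquire) == N) return false;  // full
        buf_[h & (N - 1)] = v;
        head_.store(h + 1, std::memory_order_release);
        return true;
    }
    bool pop(T& out) {                    // call from the consumer thread only
        std::size_t t = tail_.load(std::memory_order_relaxed);
        if (head_.load(std::memory_order_acquire) == t) return false;      // empty
        out = buf_[t & (N - 1)];
        tail_.store(t + 1, std::memory_order_release);
        return true;
    }
};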


The ability to remove atomic operations and nasty memory barriers from a
distributed messaging system is a plus. As we all know, multi-processor
systems do not take kindly to these types of instructions...

;)


So, "certain types" of lock-free queues are useful for high-performance
cpu-to-cpu communications. Especially, when a core on one cpu has to send a
message to a core on another cpu; atomic operations and membars would
greatly reduce performance.


> Lock-free lifo-pools may make sense for memory allocators, but good
> allocators manage thread-local pools so this need is minimized. And
> if you need maximum performance on memory allocation, you manage your
> own pools.

> Lock-free processing may look fascinating, but is primarily superfluous.

They also work well as solutions to the reader/writer problem. They can be
used to remove the need for atomic operations and StoreLoad style memory
barriers; imho, that's not really superfluous.


Chris Thomasson

Sep 28, 2005, 10:16:00 PM
>>> Lock-free processing may look fascinating, but is primarily superfluous.
>>
>> So JSR-166 was a waste of time?
>
> Mostly, yes.

:O


David Butenhof

Sep 29, 2005, 8:21:11 AM

;-)

Lock-free, like threads, like Perl, like emacs, like news, like so many
other things, is an enormously powerful tool for certain things. And,
yes, absolutely useless (or even worse) for many other things.

Whether you're inclined to say that it's important (or relevant) for
"95%" or "5%" of "real work" depends entirely on your personal
definition of "real work"... which in turn depends on current position,
experience, and training.

As much as most programming today is pretty high level and abstract, it
IS still valuable to understand how transistors work, and chip logic
design, and memory busses and cache and pipelined instruction
processing. When you're programming a parallel application,
understanding what lock-free does and how it will affect your
application contributes to your understanding of the memory model and
implementation of your machine and application... and that can be
powerful even if you don't actually use lock-free techniques. (But then,
you probably do, even if you don't know it... and using them incorrectly
can be dangerous.)

David Hopwood

Sep 29, 2005, 11:04:07 AM
Chris Thomasson wrote:
>>I think what David means is that lock-free is over-emphasized and I
>>must agree on that. With producer/consumer patterns typical in server
>>applications, the intervals where queues are locked are extremely short
>>(in the best case only a pointer is fetched from a queue) and the
>>processing intervals, when a dequeued item is processed, are at least some
>>thousand times longer. So there's no real need for lock-free queues.
>
> This would be true for the Michael-and-Scott queue algorithm; it's loaded
> with cache thrashing CAS-loops. However, "zero-overhead" loopless
> single-producer/consumer queues are ideal for various distributed
> producer/consumer systems. IMHO, a thread that is bound to a particular cpu
> core, and wants to send a message to another thread that is bound to a
> "nearby core", should use a method that has no atomic operations; just a
> StoreStore for producers, and a data-dependent load for the consumers should
> be enough. I have a feeling that scalable designs for the new multi-core
> stuff that's coming out will be NUMA-like...

Right, but isn't this missing the point of David Schwartz's comments?

This newsgroup has had *many* threads with the following pattern:

- someone asks how to solve a problem. Typically it's clear from this
post that the poster has relatively little experience with concurrent
programming in general.

- one of the usual suspects suggests "using lock-free techniques".

- the thread wanders off into esoterica about how to implement stuff at
a completely disjoint level of abstraction to the original question.


To some extent this is just par for the Usenet course; OTOH, this group
does seem to have a particular collective obsession with lock-free at
the moment. Some of us are concerned that newbies are getting advice that
is, at best, unfollowable, and sometimes just plain bad, as a result.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Joe Seigh

Sep 29, 2005, 11:58:43 AM
David Hopwood wrote:
> Chris Thomasson wrote:
>
[...]

>
>
> Right, but isn't this missing the point of David Schwartz's comments?
>
> This newsgroup has had *many* threads with the following pattern:
>
> - someone asks how to solve a problem. Typically it's clear from this
> post that the poster has relatively little experience with concurrent
> programming in general.
>
> - one of the usual suspects suggests "using lock-free techniques".
>
> - the thread wanders off into esoterica about how to implement stuff at
> a completely disjoint level of abstraction to the original question.

Then both you and David Schwartz are missing the point. The OP specifically
asked about lock-free.

>
>
> To some extent this is just par for the Usenet course; OTOH, this group
> does seem to have a particular collective obsession with lock-free at
> the moment. Some of us are concerned that newbies are getting, at best
> unfollowable, and sometimes just plain bad advice as a result.

People get bad advice in this newsgroup all the time. Often from the
people who are allegedly concerned about people getting bad advice.
I've given up trying to correct these kinds of things, e.g. telling
people mutexes have forward progress guarantees about visibility,
anything about observable behavior from cache (apart from performance),
etc... The most blatant misinformation is that lock-free is too
complex to use, when the typical way of dealing with complexity is to
encapsulate it with an easier-to-use API. It's like saying that
programming with locks is too complex because most programmers would
have a hard time implementing a mutex correctly.


Joe Seigh

Sep 29, 2005, 1:28:56 PM
Oliver S. wrote:
>>This would be true for the Michael-and-Scott queue algorithm;
>>it's loaded with cache-thrashing CAS-loops.
>
>
> You misunderstand me; I talked about the producer/consumer pattern
> with the usual mutex-locks (which cause a kernel intervention only on
> collisions). In this case, the lock is held for an extremely short
> interval and the processing of the data taken from the queue lasts at
> least some thousand times longer than the dequeue process. So the
> likelihood of a collision is so small that there's no need for lock-free
> processing?
>

That would be for a small number of threads. If the probability of lock
contention for a single thread is .001 in your example, then for 100 threads
it's about .100, 200 threads about .199, 1000 threads about .630, etc...
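(For reference: if each thread independently spends a fraction p of its time
holding the lock, the chance that somebody is holding it at a given instant
is 1 - (1 - p)^n, which with p = .001 gives roughly .10 for 100 threads and
.63 for 1000.)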

Lock-free is about scalability, not the actual overhead. If you know your
application will never experience high contention or you don't care about
scalability then you should stick with conventional lock based solutions.

Michel

Sep 29, 2005, 1:53:35 PM
Joe Seigh wrote:
> That would be for a small number of threads. If the probability of lock
> contention for a single thread is .001 in your example, then for 100
> threads it's about .100, 200 threads about .199, 1000 threads about .630,
> etc...

>
> Lock-free is about scalability, not the actual overhead. If you know your
> application will never experience high contention or you don't care about
> scalability then you should stick with conventional lock based solutions.


Wouldn't you say there is a design or structural problem having 100-1000
threads contending for the same lock in any somewhat ordinary
application at the server or the client side? Wouldn't context switching
drain the performance anyways?

/Michel

Joe Seigh

Sep 29, 2005, 2:11:34 PM
Depends on how many processors you have. There may be other things causing
context switching also but blocking on a lock certainly isn't going to help
performance.

David Schwartz

Sep 29, 2005, 2:28:46 PM

"Joe Seigh" <jsei...@xemaps.com> wrote in message
news:XfydndTAWZc...@comcast.com...

> Lock-free is about scalability, not the actual overhead. If you know
> your
> application will never experience high contention or you don't care about
> scalability then you should stick with conventional lock based solutions.

The point is, locks allow you to avoid high contention (by scheduling
tasks that don't contend). Lock free algorithms, in general, force you to
suffer through it (by allowing contending tasks to continue attempting to
run in parallel).

DS


David Schwartz

Sep 29, 2005, 2:30:38 PM

"Joe Seigh" <jsei...@xemaps.com> wrote in message
news:AridnXxvsKI...@comcast.com...

> Depends on how many processors you have. There may be other things
> causing
> context switching also but blocking on a lock certainly isn't going to
> help
> performance.

No, that's not true. Blocking on a lock is going to help performance
because useful work will be done until it's possible for the blocked thread
to run without contention. Blocking on a lock is good because it allows
contention to be avoided.

Lock free algorithms only help if the thing contended for is so critical
and important that you can't do much of anything else (on the entire system)
without accessing it. That's why they're very useful inside kernels, where
this is often true, and almost useless in applications, where it almost
never is.

DS


David Hopwood

Sep 29, 2005, 2:34:46 PM
Oliver S. wrote:
>>So, "certain types" of lock-free queues are useful
>>for high-performance cpu-to-cpu communications.
>
> I doubt that there's a noteworthy need for this kind of high-performance
> cpu-to-cpu communication because of the usual code patterns
> mentioned above. Please tell me only a single case in which the
> enqueuing or dequeuing with usual mutexes takes a noteworthy time in
> relation to the time taken to process the dequeued items afterwards.

In the implementation of a message passing language, where this overhead
applies to every message send/reception.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

David Hopwood

Sep 29, 2005, 2:41:17 PM
David Schwartz wrote:

> "Joe Seigh" <jsei...@xemaps.com> wrote:
>
>>Lock-free is about scalability, not the actual overhead. If you know
>>your application will never experience high contention or you don't care about
>>scalability then you should stick with conventional lock based solutions.
>
> The point is, locks allow you to avoid high contention (by scheduling
> tasks that don't contend). Lock free algorithms, in general, force you to
> suffer through it (by allowing contending tasks to continue attempting to
> run in parallel).

Note that message passing also allows you to avoid high contention by
scheduling tasks that don't contend. This is independent of whether the
synchronization needed for message passing between processes scheduled on
different processors/cores is implemented in terms of locks or by lock-free
techniques.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Joe Seigh

Sep 29, 2005, 3:22:46 PM

So you're saying lock based solutions will scale better than lock-free ones?

David Schwartz

Sep 29, 2005, 3:44:52 PM

"Joe Seigh" <jsei...@xemaps.com> wrote in message
news:QradnVuGiqT...@comcast.com...

> So you're saying lock based solutions will scale better then lock-free
> ones?

It depends upon which of the dozens of possible meanings of "scale" you
intend. In general, if by "scale" you mean showing improved real-world
performance as load increases, then yes, lock-based solutions
will scale better than lock-free ones.

Consider a case where you have four threads. Threads A and B want to
access one collection, and threads C and D want to access another
collection. With only a single CPU, both a lock-free and lock-based solution
will run one thread at a time and it will run at full speed.

Now, let's scale this to two CPUs. The lock-based solution will never
run thread A and B at the same time. It will always run one of A/B and one
of C/D. The scaling is perfect and the performance doubles since contending
threads never run.

The lock-free solution will randomly match threads. Whenever it happens
to schedule A and B or C and D at the same time, each thread will run at
some small fraction of its maximum as the FSB bounces the shared/modified
data back and forth.

Granted this is an extreme example, but the point is that contention is
bad and lock free algorithms tend to maximize contention.

DS


Joe Seigh

Sep 29, 2005, 5:00:54 PM

What happens when you scale to 4 CPUs?

David Schwartz

Sep 29, 2005, 5:11:28 PM

"Joe Seigh" <jsei...@xemaps.com> wrote in message
news:EfSdnUtuDOz...@comcast.com...

> What happens when you scale to 4 CPUs?

The lock-based solution works *much* better than the lock-free one. The
lock-free one allows all four threads to run, totally trashing the FSB. The
locks allow only two threads to run at full speed, which not only gets more
work done than all four fighting for the FSB and playing cache ping-pong,
but also allows the system to accomplish other useful work as well.

DS


Joe Seigh

Sep 29, 2005, 7:20:12 PM
In your rather contrived example, you had the four threads partitioned
into two pairs, each pair doing totally separate work that could just as well
be done on separate machines. Then you have one of the threads in each
pair totally lock out the other so only one thread runs at a time,
meaning you are only getting the throughput of a single thread. The
question is why have two threads in the first place when a single thread
would do as well or better? This is the best example you could come up
with?

David Schwartz

Sep 29, 2005, 11:03:38 PM

"Joe Seigh" <jsei...@xemaps.com> wrote in message
news:8Y6dnavVYui...@comcast.com...

> David Schwartz wrote:

>> The lock-based works *much* better than the lock free. The lock free
>> allows all four threads to run, totally trashing the FSB. The locks allow
>> only two threads to run at full speed, which not only get more work done
>> than all four fighting for the FSB and playing cache ping-pong but also
>> allow the system to accomplish other useful work as well.

> In your rather contrived example, you had the four threads partitioned
> into 2 threads each doing totally separate work that could just as well
> be done on separate machines.

That's actually pretty typical. Generally the threads in a process will
subdivide into logical groups of one or more threads each. Each thread in a
group does roughly similar tasks, but tasks that are different from what the
threads in the other groups do. There will usually be common paths that all
the thread groups pass through, but they are usually in the OS and the
threading library. (Where lock free techniques *can* be very valuable.)

> Then you have one of the threads in each
> thread pair totally lock out the other so only one thread runs at a time
> meaning you are only getting the throughput of a single thread.

No, that's not what I'm doing. I'm saying that's what happens in the
case where they block on the same lock. Regardless of what each thread is
going to do once it acquires the lock or after releasing it, if one thread
blocks on a lock the other holds, it will be totally locked out until the
other thread either releases the lock or uses up its timeslice.

> The
> question is why have two threads in the first place when a single thread
> would do as well or better?

The most common reason is because if you use a single thread, and that
thread blocks unexpectedly (say on a page fault or file read), you are
totally screwed. Writing code so that you must be 100% certain that no tiny
bit of it (whether performance critical or not) ever blocks is just too
hard.

> This is the best example you could come up
> with?

Obviously I picked the example that best illustrated my point. The more
your situation varies from my example, the less applicable the example.

If you're going to fairly evaluate lock free algorithms, you should
understand very well the prototypical cases where they fail, if for no other
reason than to make the thoroughly informed decision that those
weaknesses aren't too important in your particular case.

DS


Chris Thomasson

Sep 29, 2005, 11:06:23 PM
> Ok, but in this case someone asked for resources on lock-free programming.

>
>> - one of the usual suspects suggests "using lock-free techniques".
>
> Yes, and it's pretty annoying when Chris Thomasson jumps into the
> scene just to emphasize his work

Well, the OP did ask for some open-source.


> (with a IMHO very ugly coding-style).

Sorry about that.

;)


> He suggests using lock-free processing although the cases where this
> is really necessary because of a notable performance advantage are
> rarer than the needle in a haystack.

The library's basic philosophy is to blend "lock-free with lock-based
programming". I also post a clear link to a discussion of some of the most
important "caveats" that come along with lock-free programming. I clearly
suggest that programmers "test" AppCore against "traditional lock-based"
solutions, that's all. Why not let application programmers try to
"experimentally improve" some of their existing designs? When the multi-core
stuff becomes available to "regular customers", it may require some
applications to "scale-up". In the next couple of years, you "might be"
receiving questions from customers like:


I bought one of those new dual-core systems with 4 processors, and your
application suite seems to execute critical tasks a little bit slower than
usual! Why is that? Do I have to set some special configuration options?
Please help...


> And in the current case, he suggests his nearly undocumented and hardly
> readable source as an origin for lock-free-processing knowledge.

It's only an example. I didn't really have time to document it, because I
was working hard on something else. AppCore is pretty much dead right now.
Anyway, I don't claim that AppCore is an "origin for lock-free-processing
knowledge"; after all, it's experimental. If I am coming across that way, I'm
sorry.


Chris Thomasson

Sep 29, 2005, 11:34:00 PM
> You misunderstand me; I talked about the producer/consumer-pattern
> with the usual mutex-locks (which cause a kernel-intervention only on
> collisions).

Yes. However, they also execute an atomic operation and a nasty StoreLoad
style membar for the lock and unlock, contention or not. IMO, these types of
operations are very expensive for modern processors to execute. Especially
if they have to lock the bus...


> In this case, the lock is held for an extremely short interval
> and the processing of the data taken from the queue lasts at least
> some thousand times longer than the dequeue process. So the likelihood
> of a collision is so small:

Actually, to further enhance your point, there is an efficient lock-based
queue out there that is simply guaranteed to get absolutely no "collisions"
between a producer and consumer (can show you pseudo-code if you want).
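A sketch of the kind of queue that fits this description is the classic
two-lock, dummy-node design in the style of Michael and Scott: producers take
only a tail lock and consumers only a head lock, so a producer and a consumer
never contend with each other. Illustrative modern C++ only, not the
pseudo-code being offered above:

#include <atomic>
#include <mutex>

template <typename T>
class TwoLockQueue {
    struct Node {
        T value;
        std::atomic<Node*> next{nullptr};
        Node() : value() {}
        explicit Node(const T& v) : value(v) {}
    };
    Node* head_;                           // dummy node, consumer side
    Node* tail_;                           // last node, producer side
    std::mutex head_lock_, tail_lock_;
public:
    TwoLockQueue() { head_ = tail_ = new Node(); }
    void push(const T& v) {                // producers contend only on tail_lock_
        Node* n = new Node(v);
        std::lock_guard<std::mutex> g(tail_lock_);
        tail_->next.store(n, std::memory_order_release);
        tail_ = n;
    }
    bool pop(T& out) {                     // consumers contend only on head_lock_
        std::lock_guard<std::mutex> g(head_lock_);
        Node* first = head_->next.load(std::memory_order_acquire);
        if (!first) return false;          // empty
        out = first->value;
        delete head_;                      // retire the old dummy
        head_ = first;                     // dequeued node becomes the new dummy
        return true;
    }
};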


> that there's no need for lock-free processing?

Well, the only thing that a lock-free queue would buy you here, is reduced
overhead wrt atomic operations and membars.


>> Especially, when a core on one cpu has to send a message to a
>> core on another cpu; atomic operations and membars would greatly
>> reduce performance.
>

> In nearly every case this does not reduce the performance because
> enqueuing an dequeuing are only a very tiny part of the whole pro-
> cessing in almost any case.

Granted. However, imho, I still think that the ability to remove atomic
operations and membars from producer/consumer systems and database searches
is going to gradually become more and more important...


Chris Thomasson

Sep 30, 2005, 12:03:30 AM
>> The ability for read-only iterations (frequent database searches?) to
>> achieve a substantial increase in the number of reads per-second,
>> per-reader-thread, would most certainly be a loss...
>
> I can't figure out what scenario you're talking about.

Reader threads can get a vast increase in the number of reads per second
they can perform on a shared database-like structure. IMHO, this can be very
important.


> If you have a large chunk of read-mostly data, how can the overhead of
> bouncing around one cache line prior to these large accesses possibly
> matter?

Lock-free or not, atomic operations and membars reduce shared reads
per-second in virtually all cases:

http://www.rdrop.com/users/paulmck/RCU/lockperf.2004.01.17a.pdf


In some cases, it can be a difference between tens-of-thousands of reads
per-second, or million+ reads-per-second...

No kidding.


> If you're talking about lots of small accesses, well, then your lock scope
> is badly chosen.

Probably.


> However, I do agree that you picked one of the best examples. Dealing
> with large collections of read-mostly data is probably one of the best
> cases for a lock free algorithm.

Very true.


> As for hiding non-portable code behind a portable API, that argument
> misses the point. If you use an API to do something, you do not care
> whether there's a lock free or lock based algorithm behind the API.

I have a feeling that users are going to start to care about how well an API
can "scale-up" to the new systems that are going to come out. Though I do
agree that the user won't really care about what a vendor uses, lock-free or
lock-based, to achieve that.


> The issue is about *writing* code, not using code.

Ahh. Yes, actually writing lock-free code can be very tedious, indeed:

http://groups.google.com/group/comp.programming.threads/msg/423df394a0370fa6


;)


Joe Seigh

Sep 30, 2005, 8:00:23 AM
David Schwartz wrote:
> "Joe Seigh" <jsei...@xemaps.com> wrote in message
> news:8Y6dnavVYui...@comcast.com...
>
>>In your rather contrived example, you had the four threads partitioned
>>into 2 threads each doing totally separate work that could just as well
>>be done on separate machines.
>
>
> That's actually pretty typical. Generally the threads in a process will
> subdivide into logical groups of one or more thread each. Each thread in a
> group does roughly similar tasks but tasks that are different from what the
> threads in the other groups do. There will usually be common paths that all
> the thread groups pass through, but they are usually in the OS and the
> threading library. (Where lock free techniques *can* be very valuable.)
>
You're just making this up. Applications can't be parallelized to some
arbitrary degree. If they could, whoever figured out how to do it
would become very famous. That's why scalability *is* an issue.

[...]

>
>
>>This is the best example you could come up
>>with?
>
>
> Obviously I picked the example that best illustrated my point. The more
> your situation varies from my example, the less applicable the example.
>
> If you're going to fairly evaluate lock free algorithms, you should
> understand the prototypical cases where they fail very well, if for no other
> reason than for you to make the thoroughly informed decision that those
> weaknesses aren't too important in your particular case.
>

Well, it was a fairly bad example. We're talking about scalability here.
All the literature and research that I'm aware of show lock-free algorithms
scale better than lock based ones. You're perfectly welcome to put out
your own paper showing lock based ones scale better than lock-free ones.


Sean Kelly

Sep 30, 2005, 1:29:40 PM
David Schwartz wrote:
> "Joe Seigh" <jsei...@xemaps.com> wrote in message
> news:AridnXxvsKI...@comcast.com...
>
> No, that's not true. Blocking on a lock is going to help performance
> because useful work will be done until it's possible for the blocked thread
> to run without contention. Blocking on a lock is good because it allows
> contention to be avoided.
>
> Lock free algorithms only help if the thing contended for is so critical
> and important that you can't do much of anything else (on the entire system)
> without accessing it. That's why they're very useful inside kernels where
> this is often true and almost useless in application where it almost never
> is.

What about deadlock resistance? In the papers I've read, scalability
is but one of the aspects of lock-free programming that is said to be
beneficial. I don't suppose the faq for this group covers any of these
issues? It might help to have a clear explanation of the various uses
of locks, message-passing, and lock-free, and why one might be
preferred over another in different instances. Probably more a topic
for a book than a few paragraphs in a faq, but it would be useful
either way.


Sean

David Schwartz

Sep 30, 2005, 2:32:29 PM

"Joe Seigh" <jsei...@xemaps.com> wrote in message
news:tLydnSFnu-S...@comcast.com...

> David Schwartz wrote:

>> "Joe Seigh" <jsei...@xemaps.com> wrote in message
>> news:8Y6dnavVYui...@comcast.com...

>>>In your rather contrived example, you had the four threads partitioned
>>>into 2 threads each doing totally separate work that could just as well
>>>be done on separate machines.

>> That's actually pretty typical. Generally the threads in a process
>> will subdivide into logical groups of one or more thread each. Each
>> thread in a group does roughly similar tasks but tasks that are different
>> from what the threads in the other groups do. There will usually be
>> common paths that all the thread groups pass through, but they are
>> usually in the OS and the threading library. (Where lock free techniques
>> *can* be very valuable.)

> You're just making this up.

Umm, no. That is very typical. For example, an application might have
server threads, I/O threads, and timer threads. It is very rare for an
application to keep all its threads in a single pool. Often the application
also uses libraries that have their own threads or even pools of threads.

> Applications can't be parallelized to some
> arbitrary degree. If they could, whoever figured out how to do it
> would become very famous.

I'm not just talking about individual applications, I'm talking about
entire systems.

> That's why scalability *is* an issue.

Scalability is not just more threads doing the same thing on more CPUs.
It's more processes doing the same thing on more CPUs. It's more threads
doing different things on more CPUs. It's more processes doing different
things on the same CPUs.

Contention is bad. Lock free algorithms are about maximizing contention.

>Well, it was a fairly bad example. We're talking about scalability here.
>All the literature and research that I'm aware of show lock-free algorithms
>scale better than lock based ones. You're perfectly welcome to put out
>your own paper showing lock based ones scale better than lock-free ones.

For the limited case where scalability means more threads manipulating
the same collection at the same time and all you care about is how fast
those specific threads run. That's a tremendously unrealistic case. Unless
you're writing an OS or a threading library, there's almost always something
more useful the system could do than playing ping-pong with two contending
threads fighting over the cache.

DS


Joe Seigh

Sep 30, 2005, 4:32:27 PM
Oliver S. wrote:
>
>>If probability of lock contention for a single thread is .001 in
>>your example, then for 100 threads it's about .100, 200 threads
>>about .199, 1000 threads about .630, etc...
>
>
> If you're handing over items to other threads at that granularity,
> you are doing something wrong. In those cases it's smarter to push
> a block of items to the queue and to pop about the same number of
> items from the queue at one time. So this problem (which would only
> occur on extremely large machines) is eliminated.

Well, until 2010 anyway. That's when Intel claims 100 core processors
will be commonly available. Individual cores probably won't be much
faster than they are now. If you want better performance than you have
now, you'll have to figure out how to do it with more threads. If you
know how to do that and keep synchronization low enough so the exponential
contention doesn't become a problem then good. Otherwise too bad I
guess.

David Schwartz

Sep 30, 2005, 5:25:23 PM

"Joe Seigh" <jsei...@xemaps.com> wrote in message
news:3uudnX6_Yr6...@comcast.com...

> Well, until 2010 anyway. That's when Intel claims 100 core processors
> will be commonly available. Individual cores probably won't be much
> faster than they are now. If you want better performance than you have
> now, you'll have to figure out how to do it with more threads. If you
> know how to do that and keep synchronization low enough so the exponential
> contention doesn't become a problem then good. Otherwise too bad I
> guess.

I don't think anybody does yet. A lot will depend on what the memory
access characteristics of those processors turn out to be.

DS


Markus Elfring

Oct 1, 2005, 2:45:55 PM
> No, that's not true. Blocking on a lock is going to help performance
> because useful work will be done until it's possible for the blocked thread
> to run without contention. Blocking on a lock is good because it allows
> contention to be avoided.

Do you also consider cases like signal handling, where locks cannot be used?

Regards,
Markus


David Schwartz

Oct 2, 2005, 4:24:07 AM

"Markus Elfring" <Markus....@web.de> wrote in message
news:3q81vpF...@individual.net...

If signal handling is part of your performance-critical code, you're
doing something very, very wrong. Generally you handle signals synchronously
by having a dedicated thread block waiting for them (e.g. in sigwait).
Having signals interrupt running code is really only appropriate when you
only have one thread (and thus don't have better choices).

DS


chris noonan

Oct 2, 2005, 5:52:32 PM
David Schwartz wrote (in various posts):

>Blocking on a lock is going to help performance because useful
>work will be done until it's possible for the blocked thread
>to run without contention. Blocking on a lock is good because it allows
>contention to be avoided.

>Contention is bad. Lock free algorithms are about maximizing contention

If locks are efficient, how is it that many programs on a quad
processor machine give more throughput if only one thread is
started, as opposed to four?

If locks are efficient, why does the technique called I/O Completion
Port exist?

This thread (in the other sense of the word) has been vague
about the mechanism by which contention between threads
for a lock can affect performance. Allow me to rectify
that:

Imagine a uniprocessor machine running 1000 threads (not
performing I/O). Contrary to popular belief, performance
is not affected in the slightest by this number of threads.
One context switch takes place every timeslice (say 20
milliseconds) whether there are two threads or 1000.
This frequency (50 Hz) of context switching reduces
throughput slightly but not greatly. How do I know?
Because the operating system timeslice is determined
as a tradeoff between throughput and latency.

Now introduce a lock, contended for by the threads,
such that the lock is held for 1% of a thread's
running time. Every hundredth timeslice, a thread
is switched out when it is holding the lock. During
the subsequent 20ms timeslice, all the other threads
run up against the lock, leading to 999 context
switches before the original thread gets the processor
again. Accordingly, every hundred timeslices there are
1099 switches instead of 100, an eleven-fold increase.
That *is* going to sap performance.

That example had threads holding a lock 1% of the
time. Things can be much worse. The literature suggests
that typical programs may spend 20% of their running
time inside the allocator (heap manager) routines.
It's lucky there are lock-free allocators available ;)

If it seems possible by clever design to reduce the time
spent holding a lock to almost nothing, bear in mind
that on recent Pentiums just acquiring the lock is going
to take over a hundred machine cycles; there is a
limit.

Chris

David Butenhof

Oct 2, 2005, 9:42:57 PM
chris noonan wrote:
> David Schwartz wrote (in various posts):
>
>> Contention is bad. Lock free algorithms are about maximizing contention
>
> If locks are efficient, how is it that many programs on a quad
> processor machine give more throughput if only one thread is
> started, as opposed to four?

Largely because many programs are written badly. It IS the contention
that's killing these applications. And a complication (and one of the
big wrenches in lock-free design) is that unless you're coding your own
embedded/monolithic system from scratch, there's an enormous amount of
contention in the system (both on behalf of and independent of your
application) that you can't control or even see.

> If locks are efficient, why does the technique called I/O Completion
> Port exist?

That's a completely different issue. It's not locking vs lock-free. It's
because while threads may be much lighter weight than processes, they're
not free. If you're using processes only to get I/O parallelism, you may
switch to threads because they're lighter weight (smaller resource
footprint, quicker to switch); but if you're pushing the boundaries
you'll soon discover that even threads are heavier than you really want
or need. I/O completions (a form of asynchronous I/O) are just a way to
parallelize your I/O operations at a lower level, without any scheduling
overhead that's not actually needed for the purpose.

> This thread (in the other sense of the word) has been vague
> about the mechanism by which contention between threads
> for a lock can affect performance. Allow me to rectify
> that:
>
> Imagine a uniprocessor machine running 1000 threads (not
> performing I/O). Contrary to popular belief, performance
> is not affected in the slightest by this number of threads.

At a simple level, that's true. However, it's not entirely true either.
For example, most schedulers don't scale perfectly with long run lists.
On a 16-way multiprocessor, a 10-thread application may never suffer a
context switch... but 1000 threads almost certainly will, even if "most"
are "usually" waiting for I/O. That's not even getting into subtle
effects on hardware, such as cache associativity.

> One context switch takes place every timeslice (say 20
> milliseconds) whether there are two threads or 1000.
> This frequency (50 Hz) of context switching reduces
> throughput slightly but not greatly. How do I know?
> Because the operating system timeslice is determined
> as a tradeoff between throughput and latency.

It's a tradeoff based on arbitrary data that doesn't apply to ANY real
application except occasionally and purely by accident. In many
application workloads the impact can indeed be "great". On average, no
application is average...

> Now introduce a lock, contended for by the threads,
> such that the lock is held for 1% of a thread's
> running time. Every hundredth timeslice, a thread
> is switched out when it is holding the lock. During
> the subsequent 20ms timeslice, all the other threads
> run up against the lock, leading to 999 context
> switches before the original thread gets the processor
> again. Accordingly, every hundred timeslices there are
> 1099 switches instead of 100, an eleven-fold increase.
> That *is* going to sap performance.
>
> That example had threads holding a lock 1% of the
> time. Things can be much worse. The literature suggests
> that typical programs may spend 20% of their running
> time inside the allocator (heap manager) routines.
> It's lucky there are lock-free allocators available ;)

And 20% is probably optimistic as an average. A lot of code is really badly
designed, and ends up being pretty much entirely serialized.

> If it seems possible by clever design to reduce the time
> spent holding a lock to almost nothing, bear in mind
> that on recent Pentiums just acquiring the lock is going
> to take over a hundred machine cycles; there is a
> limit.

Also remember that MOST (though not all) of the time required to acquire
a lock is the exact same time (and machine instructions) necessary for a
lock-free transaction. There ARE real savings, especially on machines
like Alpha, PowerPC, Itanium, where you can often avoid expensive
"fence" or "barrier" instructions that are critical for locks but not
necessarily for (some) lock-free operations; but that isn't much of a
factor on a Pentium.

Most (practical) lock-free really could be described more as "reducing
the time spent holding the lock to nothing" than as "no locking".
Sometimes you can reliably read without any memory system
synchronization, but you can't ever reliably write that way if the
result needs to be shared.

"Lock-free" really just recognizes that the hardware doesn't usually do
"lock" and "unlock" -- it's performing a synchronized operation on data.
A lock merely performs that operation on a special flag and "proper use"
implicitly extends the synchronization properties to your data.
Lock-free uses the same operations but performs them on the data you
really wanted to change in the first place. The "lock" model is
conceptually simpler and more flexible; but it also adds overhead
because you're working through the proxy of the "lock" data.
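A minimal sketch of that last point (a generic modern-C++ illustration, not
tied to any particular library): the compare-and-swap is aimed directly at
the list head instead of at a separate lock word. A safe pop needs a
reclamation scheme (ABA and use-after-free are the classic traps) and is
omitted here.

#include <atomic>

struct Node { int value; Node* next; };
std::atomic<Node*> g_top{nullptr};

// Lock-free push: the atomic operation targets the data (the stack head)
// rather than a proxy "lock" flag.
void push(int v) {
    Node* n = new Node{v, nullptr};
    n->next = g_top.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak reloads the current top into
    // n->next, so the loop simply retries with fresh data.
    while (!g_top.compare_exchange_weak(n->next, n,
                                        std::memory_order_release,
                                        std::memory_order_relaxed)) {
    }
}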

The first SMP machine I programmed (a short-lived and long gone variant
of the PDP-11) had only one instruction that synchronized memory across
processors: ASRB (arithmetic shift right, byte). That was it. The only
"lock free" operation you could do was an arithmetic divide by a power
of two (and only for very small ranges of data). That's as close as I've
come to a machine that only implements "a lock" in hardware. Once you
get to "compare and swap", you gain an extremely powerful tool to
operate on real data instead of a lock -- and we've finally reached the
point where this flexibility is pretty nearly universal. And that
enables lock-free programming to become something more than a curiosity.

--
Dave Butenhof, David.B...@hp.com
HP Utility Pricing software, POSIX thread consultant
Manageability Solutions Lab (MSL), Hewlett-Packard Company
110 Spit Brook Road, ZK2/3-Q18, Nashua, NH 03062

David Schwartz

Oct 2, 2005, 9:31:38 PM

"chris noonan" <use...@leapheap.co.uk> wrote in message
news:1128289952.8...@o13g2000cwo.googlegroups.com...

> David Schwartz wrote (in various posts):

>>Blocking on a lock is going to help performance because useful
>>work will be done until it's possible for the blocked thread
>>to run without contention. Blocking on a lock is good because it allows
>>contention to be avoided.

>>Contention is bad. Lock free algorithms are about maximizing contention

> If locks are efficient, how is it that many programs on a quad
> processor machine give more throughput if only one thread is
> started, as opposed to four?

Probably because the threads are fighting each other. If they used locks
more, then having four threads started would be the same as having one
thread started and they would give more performance. ;)

> If locks are efficient, why does the technique called I/O Completion
> Port exist?

I don't understand what one has to do with the other. IOCP is about
optimizing the number of concurrent threads without the risk of thread
starvation.

> This thread (in the other sense of the word) has been vague
> about the mechanism by which contention between threads
> for a lock can affect performance. Allow me to rectify
> that:

> Imagine a uniprocessor machine running 1000 threads (not
> performing I/O). Contrary to popular belief, performance
> is not affected in the slightest by this number of threads.
> One context switch takes place every timeslice (say 20
> milliseconds) whether there are two threads or 1000.
> This frequency (50 Hz) of context switching reduces
> throughput slightly but not greatly. How do I know?
> Because the operating system timeslice is determined
> as a tradeoff between throughput and latency.

Exactly. Context switches are only expensive if you somehow force a
large number of them.

> Now introduce a lock, contended for by the threads,
> such that the lock is held for 1% of a thread's
> running time. Every hundredth timeslice, a thread
> is switched out when it is holding the lock. During
> the subsequent 20ms timeslice, all the other threads
> run up against the lock, leading to 999 context
> switches before the original thread gets the processor
> again. Accordingly, every hundred timeslices there are
> 1099 switches instead of 100, an eleven-fold increase.
> That *is* going to sap performance.

Umm, no. Some thread will get the lock and unless the thread library is
poorly designed, it will be allowed to do useful work at full speed without
context switches. If other threads want the lock, they'll have to wait.

> That example had threads holding a lock 1% of the
> time. Things can be much worse. The literature suggests
> that typical programs may spend 20% of their running
> time inside the allocator (heap manager) routines.
> It's lucky there are lock-free allocators available ;)

Yes, as I pointed out, the allocator is one place where lock free
algorithms can be a big win because there may not be anything else to do.

> If it seems possible by clever design to reduce the time
> spent holding a lock to almost nothing, bear in mind
> that on recent Pentiums just acquiring the lock is going
> to take over a hundred machine cycles; there is a
> limit.

The lock-free algorithms generally increase the number of such expensive
operations, and they add extra cache misses.

DS


Chris Thomasson

unread,
Oct 2, 2005, 7:51:42 PM10/2/05
to
>> If it seems possible by clever design to reduce the time
>> spent holding a lock to almost nothing, bear in mind
>> that on recent Pentiums just acquiring the lock is going
>> to take over a hundred machine cycles; there is a
>> limit.
>
> The lock-free algorithms generally increase the number of such
> expensive operations,

Not all of them...


> and they add extra cache misses.

A good reader/writer solution can greatly enhance cache performance. If you
really want to be "cache friendly", try to avoid calling atomic operations
and/or StoreLoad-style membar instructions.


Message has been deleted
Message has been deleted

Chris Thomasson

unread,
Oct 2, 2005, 8:46:33 PM10/2/05
to
> Also remember that MOST (though not all) of the time required to acquire a
> lock is the exact same time (and machine instructions) necessary for a
> lock-free transaction. There ARE real savings, especially on machines like
> Alpha, PowerPC, Itanium, where you can often avoid expensive "fence" or
> "barrier" instructions that are critical for locks but not necessarily for
> (some) lock-free operations; but that isn't much of a

> factor on a Pentium.
^^^^^^^

mfence aside, sometimes a locked atomic operation (full fence) will
actually lock the bus. All requests will be blocked until the atomic
operation has completed. This can have a system-wide effect, and can reduce
"performance".


> Sometimes you can reliably read without any memory system synchronization,
> but you can't ever reliably write that way if the result needs to be
> shared.

Yes, writes generally use atomic operations and membars. Just to be clear,
when your readers are not using any membars you can still perform a high
number of modifications, in several ways:

(1) Use a mutex to atomically remove the node representing the object from
the collection; note that this does not affect concurrent readers, because
the object itself has not been modified. Send the node through the
"collector", and when it comes out on the other side you know that there are
no outstanding references. Now you can safely modify the object and use a
mutex to push it back into the collection.

(2) Pop the node; create a copy; modify the copy; membar; push the copy;
send the old node through the "collector". You could use a mutex to protect
the entire operation.

(3) Use an eventcount. Writers use a mutex to isolate the object, perform
their modifications, and increment the eventcount. Readers read the
eventcount, read the data, then read the eventcount again and retry the read
operation if the counts are not equal. Please note that the eventcount
method adds some overhead to the readers...
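
A rough sketch of the shape of that third method, using C11 atomics purely
for illustration (this is the common seqlock-style variant, in which the
writer bumps the count both before and after the update so readers can also
detect a write in progress; the names are invented, and it is not anyone's
actual eventcount code):

#include <stdatomic.h>
#include <pthread.h>

struct shared {
    atomic_uint seq;           /* even = stable, odd = write in progress */
    pthread_mutex_t lock;      /* serializes writers against each other */
    _Atomic int a, b;          /* the protected data; assume lock and seq
                                  are initialized elsewhere */
};

void write_pair(struct shared *s, int a, int b)
{
    pthread_mutex_lock(&s->lock);
    atomic_fetch_add(&s->seq, 1);                        /* count now odd */
    atomic_store_explicit(&s->a, a, memory_order_relaxed);
    atomic_store_explicit(&s->b, b, memory_order_relaxed);
    atomic_fetch_add(&s->seq, 1);                        /* even again */
    pthread_mutex_unlock(&s->lock);
}

void read_pair(struct shared *s, int *a, int *b)
{
    unsigned before, after;
    do {
        before = atomic_load(&s->seq);
        *a = atomic_load_explicit(&s->a, memory_order_relaxed);
        *b = atomic_load_explicit(&s->b, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);  /* the reader overhead:
                                                       keep the data loads
                                                       above the re-check */
        after = atomic_load_explicit(&s->seq, memory_order_relaxed);
    } while ((before & 1u) || before != after);     /* writer active or
                                                       raced: retry */
}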


> The first SMP machine I programmed (a short-lived and long gone variant of
> the PDP-11) had only one instruction that synchronized memory across
> processors: ASRB (arithmetic shift right, byte). That was it. The only
> "lock free" operation you could do was an arithmetic divide by a power of
> two (and only for very small ranges of data).

Humm... You could actually use asrb to implement my Virtually Zero-Overhead
Object Management (VZOOM) solution!

:)


> That's as close as I've come to a machine that only implements "a lock" in
> hardware. Once you get to "compare and swap", you gain an extremely
> powerful tool to operate on real data instead of a lock -- and we've
> finally reached the point where this flexibility is pretty nearly
> universal. And that enables lock-free programming to become something more
> than a curiosity.

Agreed.


Markus Elfring

unread,
Oct 3, 2005, 5:43:56 AM10/3/05
to
> Things can be much worse. The literature suggests
> that typical programs may spend 20% of their running
> time inside the allocator (heap manager) routines.
> It's lucky there are lock-free allocators available ;)

Which implementations have you got in mind?
Do you know any such memory allocation library that can be used for all applications?

Regards,
Markus


Joe Seigh

unread,
Oct 3, 2005, 8:28:43 AM10/3/05
to
Oliver S. wrote:
>>If you really want to be "cache friendly",
>>try to avoid calling atomic operation ...
>
>
> Do you have any numbers on the performance of atomic operations
> on non-x86-machines?
>
>
>>... and/or StoreLoad style membar instructions.
>
>
> Do you know the costs of this membars on current "RISC"-architectures?
>

A hazard pointer implementation with 2 memory barriers, one for release
semantics and the other for the "store/load" barrier, measures as follows:
on an 866 MHz P3, with XCHG to simulate the membars,
  81 nsec w/ membars and 8 nsec w/o membars;
on a 1.2 GHz PowerPC,
  108 nsec w/ membars and 5 nsec w/o membars.
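
For reference, here is roughly where those two membars sit in a hazard
pointer acquire/release pair (a simplified sketch in C11 atomics with
invented names, not Joe's code):

#include <stdatomic.h>
#include <stddef.h>

#define MAX_THREADS 8
struct node;                                  /* the protected object type */

_Atomic(struct node *) shared_ptr;            /* pointer being guarded */
_Atomic(struct node *) hazard[MAX_THREADS];   /* one published slot/thread */

struct node *acquire_hazard(int self)
{
    struct node *p;
    do {
        p = atomic_load(&shared_ptr);
        atomic_store(&hazard[self], p);       /* publish the hazard pointer */
        /* the "store/load" membar: the publication must be globally
           visible before we re-check shared_ptr */
        atomic_thread_fence(memory_order_seq_cst);
    } while (p != atomic_load(&shared_ptr));  /* changed? a reclaimer may
                                                 have scanned before our
                                                 hazard was visible: retry */
    return p;
}

void release_hazard(int self)
{
    /* the "release" membar: every access we made to the object (loads as
       well as stores) must complete before the slot is seen as clear */
    atomic_store_explicit(&hazard[self], NULL, memory_order_release);
}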

If you got rid of the release membar, you might reduce the time a little
but probably not that much since they're pretty close together.

Noticeable if you traverse a linked list. There are other factors like
dependent loads and cache hits, but I think it came out anywhere from
a 15 to 20 percent difference to as much as a 400% difference. Just
from the membars.

Joe Seigh

unread,
Oct 3, 2005, 11:01:46 AM10/3/05
to
Joe Seigh wrote:
> Noticable if you traverse a linked list. There are other factors like
> dependent loads and cache hits but I think it came out anywhere from
> a 15 to 20 percent differenct to as much as a 400% difference. Just
> from the membars.

Make that 300% to 400% difference. That 20% was from something else
not related.

chris noonan

unread,
Oct 3, 2005, 2:49:32 PM10/3/05
to

Oliver S. wrote:
> > The literature suggests that typical programs may spend 20% of
> > their running time inside the allocator (heap manager) routines.
>
> If you really need performance, use thread-local pooling for memory
> allocations (the local pool may swap memory-blocks with a global
> pool)

Swapping memory blocks - is that some sort of moveable memory?

> and if you need even more performance, use thread-local pools
> for equally sized objects. That's a magnitude faster than lock-free
> allocations on a global heap.

Which measurements of which lock-free allocator with a
global heap are you referring to?

Chris

chris noonan

unread,
Oct 3, 2005, 2:55:18 PM10/3/05
to

Markus Elfring wrote:
> > Things can be much worse. The literature suggests
> > that typical programs may spend 20% of their running
> > time inside the allocator (heap manager) routines.
> > It's lucky there are lock-free allocators available ;)
>
> Which implementations have you got in mind?

Take a look at http://www.leapheap.com

> Do you know any such memory allocation library that can be used for all applications?

There is no such library. There is (apparently, I've never
read it) a proof that any allocator can be defeated by some
application. This is because memory fragmentation cannot be
guaranteed to be controlled.

Chris

Message has been deleted

Chris Thomasson

unread,
Oct 3, 2005, 2:32:23 PM10/3/05
to
> I don't need any concrete implementation to claim, that thread-local
> pooling with equally sized-blocks ist much faster than any lock-free
> allocator.

Thread-local pools are lock-free by design?

;)


Message has been deleted
Message has been deleted
Message has been deleted

Joe Seigh

unread,
Oct 3, 2005, 9:40:24 PM10/3/05
to
Oliver S. wrote:
>>Well, until 2010 anyway. That's when Intel claims 100 core proces-
>>sors will be commonly available. Individual cores probably won't

>>be much faster than they are now. If you want better performance
>>than you have now, you'll have to figure out how to do it with
>>more threads.
>
>
> I think we'll have transactional memory in harware by then. Sun
> will implement transactional memory with Rock, Niagara's follow-up.

And what would you do with transactional memory? Lock-free
algorithms?

David Hopwood

unread,
Oct 3, 2005, 10:26:04 PM10/3/05
to
Oliver S. wrote:

>Joe Seigh wrote:
>>A hazard pointer implementation with 2 memory barriers, one
>>for release semantics and the other for the "store/load" barrier,
>>on a 866 Mhz P3 with XCHG to simulate the membars is 81 nsec w/
>>membars and 8 nsec w/o membars on a 1.2 Ghz powerpc it's
>
> Eh, you don't need membars on x86 except for some SSE-operations.

Probably he needs the store-load barrier, but not the release barrier.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Message has been deleted

Sergey P. Derevyago

unread,
Oct 4, 2005, 7:06:57 AM10/4/05
to
David Schwartz wrote:
> But worse, in most realistic cases, lock free tremendously *reduce*
> performance. The myth is that locks are expensive. The truth is that
> *contention* is expensive. Lock free algorithms allow code that is
> contending to run in parallel. Locks allow other code to run that probably
> won't contend.
>
Moreover, some "lock free papers" define the "lock free" term in such a way
that good old spin _locks_ become lock-free synchronization primitives!
Also, some authors propose their own "lock free" synchronization primitives
which are no more than flawed spin locks...
--
With all respect, Sergey. http://ders.angen.net/
mailto : ders at skeptik.net

Joe Seigh

unread,
Oct 4, 2005, 8:00:54 AM10/4/05
to
Oliver S. wrote:
>>And what would you do with tranactional memory?
>>Lock-free algorithms?
>
>
> Yes, but with transactional memory, they're *much* smarter. A lot of
> tricks aren't needed any more (f.e. aba-counters). And when you've got
> a timeout-implementation (which aren't necessary when you use transac-
> tional memory only between threads of the same process), lock-free
> processing is much more efficient than whith this idiotic (D)CAS-algo-
> rithms.

So the problem with lock-free was using compare and swap to implement
it with?

Chris Thomasson

unread,
Oct 4, 2005, 8:46:22 AM10/4/05
to
>> A hazard pointer implementation with 2 memory barriers, one
>> for release semantics and the other for the "store/load" barrier,
>> on a 866 Mhz P3 with XCHG to simulate the membars is 81 nsec w/
>> membars and 8 nsec w/o membars on a 1.2 Ghz powerpc it's
>
> Eh, you don't need membars on x86 except for some SSE-operations.

You do need a full membar for x86 when your algorithm can't handle the fact
that a store followed by a load to another location can, and will, be
reordered...
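
The classic illustration of that reordering, sketched with C11 relaxed
atomics standing in for plain x86 stores and loads:

#include <stdatomic.h>

atomic_int X, Y;              /* both initially 0 */
int r1, r2;

void thread1(void)
{
    atomic_store_explicit(&X, 1, memory_order_relaxed);
    /* without a full (StoreLoad) barrier here, the load below can be
       satisfied before the store above is visible to the other CPU */
    r1 = atomic_load_explicit(&Y, memory_order_relaxed);
}

void thread2(void)
{
    atomic_store_explicit(&Y, 1, memory_order_relaxed);
    r2 = atomic_load_explicit(&X, memory_order_relaxed);
}

/* The outcome r1 == 0 && r2 == 0 is allowed on x86 unless each thread puts
   a full barrier (MFENCE, a LOCK'ed instruction, or in C11 a seq_cst
   fence) between its store and its load. */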


David Hopwood

unread,
Oct 4, 2005, 12:20:47 PM10/4/05
to
Sergey P. Derevyago wrote:
> David Schwartz wrote:
>
>> But worse, in most realistic cases, lock free tremendously *reduce*
>>performance. The myth is that locks are expensive. The truth is that
>>*contention* is expensive. Lock free algorithms allow code that is
>>contending to run in parallel. Locks allow other code to run that probably
>>won't contend.
>
> Moreover, some "lock free papers" define the "lock free" term in such a way
> that old good spin _locks_ become lock free synchronization primitives!

You're probably confusing a statement about the *implementation* of the lock with
a statement about the lock itself. Of course the implementation of a lock will
be lock-free (unless one kind of lock is being implemented in terms of another).
It won't be wait-free, though.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Message has been deleted

Sergey P. Derevyago

unread,
Oct 5, 2005, 6:09:58 AM10/5/05
to
David Hopwood wrote:
> > Moreover, some "lock free papers" define the "lock free" term in such a way
> > that old good spin _locks_ become lock free synchronization primitives!
>
> You're probably confusing a statement about the *implementation* of the lock with
> a statement about the lock itself.
>
No, I'm not.

> Of course the implementation of a lock will
> be lock-free (unless one kind of lock is being implemented in terms of another).
>

Yes. The point is that some "lock free" algorithms just implement well-known
locking patterns in user code. Their argument is: I don't call
mutex.lock(), so my code is lock free...

> It won't be wait-free, though.
>
--

Peter Dimov

unread,
Oct 5, 2005, 7:32:06 AM10/5/05
to
Sergey P. Derevyago wrote:

> Yes. The point is that some "lock free" algorithms just implement well
> known locking patterns in user code. Their argumentation is: I don't call
> mutex.lock() so my code is lock free...

I think that this is not allowed by the definition of "lock free". A
locking algorithm will not make progress if the thread that has just
acquired the lock is preempted and never receives a chance to run.

Joe Seigh

unread,
Oct 5, 2005, 8:29:41 AM10/5/05
to

Sergey was pointing out that they were using their own definition of
lock-free, not Herlihy's: a definition that took "lock-free" literally.
It's similar to the argument that algorithms using interlocked
instructions aren't lock-free since the hardware uses locks.

Joe Seigh

unread,
Oct 5, 2005, 8:39:04 AM10/5/05
to
Joe Seigh wrote:

> Oliver S. wrote:
>>
>>
>> Do you know the costs of this membars on current "RISC"-architectures?
>>
>
> A hazard pointer implementation with 2 memory barriers, one for release
> semantics and the other for the "store/load" barrier, on a 866 Mhz P3
> with XCHG to simulate the membars is
> 81 nsec w/ membars and 8 nsec w/o membars
> on a 1.2 Ghz powerpc it's
> 108 nsec w/ membars and 5 nsec w/o membars
>
> If you got rid of the release membar, you might reduce the time a little
> but probably not that much since they're pretty close together.
>
With only a single membar (sync) the ppc version runs about 55 nsec. But
you can't get rid of the release semantics, since it's all accesses you
have to worry about, not just stores. That means loads as well. You
can't have a load occurring from a GC-managed object after it's been released
and reused. That would be a bad thing. I could probably go with an isync
instead of sync for the release membar if I restricted hazard pointer access
to read only, but it's a moot point since the membar version is only for
comparison with the membar-free version. It's not production code.
Neither is the membar-free version, but that's for different reasons.

Sergey P. Derevyago

unread,
Oct 5, 2005, 11:00:19 AM10/5/05
to
Yes, progress is a must.
But look: a spin lock loops briefly and obviously doesn't make any progress
while it spins. Unfortunately, some "lock free" algorithms make long loops
(e.g. they make some progress, determine that some critical data was
concurrently touched, and start the "progress" over again), and it is much
less obvious that they, too, can end up making no real progress.

Joe Seigh

unread,
Oct 5, 2005, 11:36:47 AM10/5/05
to
Sergey P. Derevyago wrote:

> Peter Dimov wrote:
>>I think that this is not allowed by the definition of "lock free". A
>>locking algorithm will not make progress if the thread that has just
>>acquired the lock is preempted and never receives a chance to run.
>>
>
> Yes, the progress is a must.
> But look: spin lock loops shortly and obviously doesn't make any progress.
> Unfortunately, some "lock free" algorithms make long loops (e.g. make some
> progress, determine that some critical data was concurrently touched and start
> the "progress" once again) and it's not so obvious that there can be no real
> progress as well.

If a lock-free algorithm loops, that usually means some other thread is
making progress. Looping would be a problem for wait-free algorithms
if there weren't a bound on the number of loops, but it's not a problem
for lock-free algorithms.

Non-lock-free algorithms usually loop waiting for a specific action by some
other thread. If lock-free algorithms loop, it's because they're waiting for
some interval of no actions by other threads.
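
For instance, in a CAS-based push (a sketch in C11 atomics, with invented
names), the loop can only repeat because some other thread's CAS succeeded
in the meantime, i.e. the system as a whole made progress:

#include <stdatomic.h>

struct node { struct node *next; };
_Atomic(struct node *) top;

void push(struct node *n)
{
    struct node *old = atomic_load(&top);
    do {
        n->next = old;
        /* aside from rare spurious failures of the weak CAS, this fails
           only when another thread successfully changed 'top' since we
           last read it */
    } while (!atomic_compare_exchange_weak(&top, &old, n));
}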

Sergey P. Derevyago

unread,
Oct 5, 2005, 12:00:16 PM10/5/05
to
Joe Seigh wrote:
> > Unfortunately, some "lock free" algorithms make long loops (e.g. make some
> > progress, determine that some critical data was concurrently touched and start
> > the "progress" once again) and it's not so obvious that there can be no real
> > progress as well.
>
> If a lock-free algorithm loops that usually means some other thread is
> making progress.
>
"Usually" isn't equal to "always", is it?

> Looping would be a problem for wait-free algorithms
> if there wasn't a bound on the number of loops but it's not a problem
> for lock-free algorithms.
>
> Non lock-free algorithms usually loop waiting for a specific action by some
> other thread. If lock-free algorihtms loop it's because their waiting for
> some interval of no actions by other threads.
>

IMHO even correctly implemented lock-free algorithms don't solve the
fundamental MT problems: if you really have to concurrently modify some
data, then you really have to implement some synchronization scheme, and
that scheme doesn't come for free.

Frankly, I see the lock-free approach as "make the frequent event faster but
make the rare event slower". Sometimes the whole program runs faster,
sometimes it doesn't...

chris noonan

unread,
Oct 6, 2005, 2:17:10 PM10/6/05
to

Oliver S. wrote:
> >> If you really need performance, use thread-local pooling for
> >> memory allocations (the local pool may swap memory-blocks with
> > a global pool)
>
> > Swapping memory blocks - is that some sort of moveable memory?
>
> No, it just makes sense to give back pooled memory when the
> thread-local pool's size becomes to large.

>
> >> and if you need even more performance, use thread-local pools
> >> for equally sized objects. That's a magnitude faster than lock-free
> >> allocations on a global heap.
>
> > Which measurements of which lock-free allocator with a
> > global heap are you referring to?
>
> I don't need any concrete implementation to claim, that thread-local
> pooling with equally sized-blocks ist much faster than any lock-free
> allocator.

"A magnitude faster"? I think you do.

In any case your thread-local pool method could form the basis of
a lock-free allocator, rather than being contrasted with one. But
you omit to say how a memory block in a thread-local pool could
be freed by a thread other than the one that allocated it. That
is possible to arrange (on a Pentium processor) without a lock or even
a bus-locking instruction, but the extra complexity involved will
to some extent negate the time saved.
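
One way to arrange it, as a rough sketch only (C11 atomics, invented names):
give each (freeing thread, owning thread) pair its own single-producer/
single-consumer ring. On x86 the release stores and acquire loads below
compile to plain MOVs, so neither path needs a LOCK'ed instruction; the cost
is the extra per-pair bookkeeping.

#include <stdatomic.h>
#include <stddef.h>

#define RING_SIZE 256                 /* power of two */

struct free_ring {                    /* one per (freeing, owning) pair */
    void *slot[RING_SIZE];
    atomic_uint head;                 /* advanced only by the owner */
    atomic_uint tail;                 /* advanced only by the freeing thread */
};

/* freeing thread: hand a block back to its owner */
int remote_free(struct free_ring *r, void *block)
{
    unsigned t = atomic_load_explicit(&r->tail, memory_order_relaxed);
    if (t - atomic_load_explicit(&r->head, memory_order_acquire) == RING_SIZE)
        return 0;                     /* full: caller must defer or spill */
    r->slot[t & (RING_SIZE - 1)] = block;
    atomic_store_explicit(&r->tail, t + 1, memory_order_release);
    return 1;
}

/* owning thread: drain one returned block into its local pool */
void *reclaim_one(struct free_ring *r)
{
    unsigned h = atomic_load_explicit(&r->head, memory_order_relaxed);
    if (h == atomic_load_explicit(&r->tail, memory_order_acquire))
        return NULL;                  /* nothing pending */
    void *block = r->slot[h & (RING_SIZE - 1)];
    atomic_store_explicit(&r->head, h + 1, memory_order_release);
    return block;
}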

Chris

chris noonan

unread,
Oct 6, 2005, 3:29:41 PM10/6/05
to

Oliver S. wrote:
> > And what would you do with tranactional memory?
> > Lock-free algorithms?
>
> Yes, but with transactional memory, they're *much* smarter. A lot of
> tricks aren't needed any more (f.e. aba-counters). And when you've got
> a timeout-implementation (which aren't necessary when you use transac-
> tional memory only between threads of the same process), lock-free
> processing is much more efficient than whith this idiotic (D)CAS-algo-
> rithms.

On the subject of ABA, I predict that this problem will be more
simply averted by the introduction into processor instruction sets
of atomic PUSH and POP. Atomic POP would copy into a register the
value at a memory location pointed to by another memory location
addressed by the instruction operand. It would simultaneously
decrement the contents of the latter memory location (i.e. the
pointer). Atomic PUSH the converse.

chris noonan

unread,
Oct 6, 2005, 3:54:43 PM10/6/05
to

David Schwartz wrote:
> > If locks are efficient, why does the technique called I/O Completion
> > Port exist?
> I don't understand what one has to do with the other. IOCP is about
> optimizing the number of concurrent threads without the risk of thread
> starvation.

But you are assuming that the optimum number of concurrent threads
is one per processor! (OK, I know IOCP sometimes schedules more).
That is only true if the performance of a large number of threads
would be destroyed by locks.
(I will have more to say on the IOCP fraud in a separate topic.)

> > Now introduce a lock, contended for by the threads,
> > such that the lock is held for 1% of a thread's
> > running time. Every hundredth timeslice, a thread
> > is switched out when it is holding the lock. During
> > the subsequent 20ms timeslice, all the other threads
> > run up against the lock, leading to 999 context
> > switches before the original thread gets the processor
> > again. Accordingly, every hundred timeslices there are
> > 1099 switches instead of 100, an eleven-fold increase.
> > That *is* going to sap performance.
> Umm, no. Some thread will get the lock and unless the thread library is
> poorly designed, it will be allowed to do useful work at full speed without
> context switches. If other threads want the lock, they'll have to wait.

I am sure you are missing the point. Read the analysis again.
It is bound to happen that occasionally a thread switch (caused
asynchronously by a timer interrupt) will deschedule a thread
in the middle of some critical section thingy. No other thread
(in a homogeneous collection of threads running the same code path)
can thereafter get significant work done UNTIL the original thread
leaves the critical section. The scheduling algorithm in Windows NT
(for example) ensures that is not before 999 further switches have
been incurred. What operating system is it that allows a thread
to run indefinitely?

Chris

Chris Thomasson

unread,
Oct 6, 2005, 2:50:24 PM10/6/05
to
> So the problem with lock-free was using compare and swap to implement
> it with?

;)


Chris Thomasson

unread,
Oct 6, 2005, 2:55:09 PM10/6/05
to
> In any case your thread-local pool method

Thread-local pools are nothing new.


> could form the basis of
> a lock-free allocator, rather than be contrasted to one. But
> you omit to say how a memory block in a thread-local pool could
> be freed by a thread other than the one that allocated it.

I posted a solution to this here:

http://groups.google.com/group/comp.programming.threads/browse_thread/thread/8245f4b48591fc69/e3efa5628aad4a82?hl=en#e3efa5628aad4a82


Any thoughts?


Chris Thomasson

unread,
Oct 6, 2005, 2:56:49 PM10/6/05
to

I am going to post some working code for my algorithm very soon.


David Schwartz

unread,
Oct 6, 2005, 5:16:24 PM10/6/05
to

"chris noonan" <use...@leapheap.co.uk> wrote in message
news:1128628483.1...@g49g2000cwa.googlegroups.com...

> David Schwartz wrote:

>> > If locks are efficient, why does the technique called I/O Completion
>> > Port exist?

>> I don't understand what one has to do with the other. IOCP is about
>> optimizing the number of concurrent threads without the risk of thread
>> starvation.

> But you are assuming that the optimum number of concurrent threads
> is one per processor! (OK, I know IOCP sometimes schedules more).
> That is only true if the performance of a large number of threads
> would be destroyed by locks.
> (I will have more to say on the IOCP fraud in a separate topic.)

The optimum number of concurrent threads is one per processor if the
threads are CPU bound. I still don't understand what you think the
connection between IOCP and lock efficiency is. Maybe it's obvious to you,
but I don't see it. I'm genuinely curious.

Again, you need the assumption that the system can get *no* useful work
done at all without this lock. You also need a very improbable interruption
inside a very small piece of code. You also need the thread to use up its
entire timeslice (which means it just did a *lot* of useful work at absolute
top speed).

In any event, if you don't have 1000 CPUs, it's pretty dumb for you to
have 1000 ready-to-run threads. If you do have 1000 CPUs, it's really just
the time it takes to do *one* context switch.

DS


Message has been deleted

Chris Thomasson

unread,
Oct 7, 2005, 12:10:02 PM10/7/05
to
> Yes, CAS is by far not that flexible like transactional memory.

Humm... IMHO, CAS is very flexible. You can use it to create so many
different algorithms. For instance, CAS/DWCAS is used to create virtually
every STM implementation I have seen. You can also do atomic pointers, proxy
gc, etc...


As for HTM:

http://groups.google.com/group/comp.programming.threads/msg/995379a16beb3b69?hl=en

Transactional memory has overhead. Mutexes are probably more efficient.


Chris Thomasson

unread,
Oct 7, 2005, 12:11:33 PM10/7/05
to
> BTW: CAS is a simple form of transactional-memory and LL/SC
> also a simple, but a more flexible (no need for aba etc).

Humm.. Not all LL/SC avoid ABA:

http://www.research.ibm.com/people/m/michael/disc-2004.pdf


:O


chris noonan

unread,
Oct 14, 2005, 12:39:52 PM10/14/05
to
David Schwartz wrote:
> The optimum number of concurrent threads is one per processor if the
> threads are CPU bound. I still don't understand what you think the
> connection between IOCP and lock efficiency is. Maybe it's obvious to you,
> but I don't see it. I'm genuinely curious.
I agree in principle about the optimum number of threads. However,
when there is no synchronisation between the threads, the
performance loss suffered by having a large number of threads
is so tiny as not to be measurable. My IOCP point was more
philosophical than technical: a "one thread good, many
threads bad" mantra has evolved through experience of
operating systems that take so many locks that multithreaded
performance dives (that is my guess, I can't
be sure). IOCP is a means of reducing the number of
threads running, at the cost of software complexity.

> In any event, if you don't have 1000 CPUs, it's pretty dumb for you to
> have 1000 ready-to-run threads. If you do have 1000 CPUs, it's really just
> the time it takes to do *one* context switch.

It's dumb to run 1000 threads if you expect performance
deterioration. If I were programming a thousand-client server
under Microsoft Windows, I'd use IOCP. What I am interested
in is the reason for the performance deterioration.

Chris

chris noonan

unread,
Oct 14, 2005, 12:52:15 PM10/14/05
to
Dave Butenhof wrote:
> > One context switch takes place every timeslice (say 20
> > milliseconds) whether there are two threads or 1000.
> > This frequency (50 Hz) of context switching reduces
> > throughput slightly but not greatly. How do I know?
> > Because the operating system timeslice is determined
> > as a tradeoff between throughput and latency.
> It's a tradeoff based on arbitrary data that doesn't apply to ANY real
> application except occasionally and purely by accident. In many
> application workloads the impact can indeed be "great". On average, no
> application is average...

I don't understand. If thread-switching is driven by I/O,
then it doesn't matter what the timeslice is. If the
application is CPU-heavy, then the OS designers say
something like "a 20 ms timeslice loses us 3% of the
CPU and that's acceptable". It doesn't matter what
real applications do. Admittedly Microsoft (e.g.) may not
have bothered updating their timeslices as processors
have become faster. Admittedly applications may differ
in their latency requirements. I don't think it's
a matter worth fussing over.

Chris

David Schwartz

unread,
Oct 14, 2005, 3:41:23 PM10/14/05
to

"chris noonan" <use...@leapheap.co.uk> wrote in message
news:1129307992.3...@g14g2000cwa.googlegroups.com...

> David Schwartz wrote:

>> The optimum number of concurrent threads is one per processor if the
>> threads are CPU bound. I still don't understand what you think the
>> connection between IOCP and lock efficiency is. Maybe it's obvious to
>> you,
>> but I don't see it. I'm genuinely curious.

> I agree in principle about the optimum number of threads. However
> when there is no synchronisation between the threads the
> performance loss suffered by having a large number of threads
> is so tiny as not to be measurable.

True, but there's still a loss of control. What you are doing is
transferring to the scheduler the job of deciding what work should be done.
You really want to keep that under application control.

> My IOCP point was more
> philosophical than technical: a "one thread good, many
> threads bad" mantra has evolved through experience of
> operating systems that take so many locks that multi-
> threaded performance dives (that is my guess, I can't
> be sure). IOCP is a means of reducing the number of
> threads running, at the cost of software complexity.

IOCP doesn't increase software complexity. In fact, in many cases it
reduces it because the implementation of a thread pool is simpler. Perhaps
you're subconsciously thinking something like "I have a library that does
I/O and thread pools and if I want IOCP, I'd have to code it myself". Well,
once you have a library that does IOCP thread pools and socket I/O, you have
it forever.

>> In any event, if you don't have 1000 CPUs, it's pretty dumb for you to
>> have 1000 ready-to-run threads. If you do have 1000 CPUs, it's really
>> just
>> the time it takes to do *one* context switch.

> It's dumb to run 1000 threads if you expect performance
> deterioration. If I were programming a thousand-client-server
> under Microsoft Windows, I'd use IOCP. What I am interested
> in is the reason for the performance deterioration.

It is, of course, a lot of things, and performance deterioration isn't
the only issue. For one thing, many schedulers don't handle large numbers of
ready-to-run threads well. If you want code that performs well on all
platforms, you can't rely on the scheduler having particular optimizations.

As I understand it, the main problem with the "use 1,000 threads to do
1,000 jobs" approach is that each job will likely be very small, and thus
you will need 1,000 context switches to get 1,000 jobs done. If you have
1,000 large jobs (that will take many timeslices each to complete), the
penalty for having 1,000 threads doing them is much less.

Consider, for example, a chat server. We have a message we need to send
to 1,000 clients. If each client has its own thread, we'll need 1,000
context switches in rapid succession to get all the data sent. If we use a
thread pool, we'll wind up with one thread on each CPU that will run until
all 1,000 messages are sent or they use up a full timeslice. As a result,
we'll need maybe 3 or 4 context switches to get the messages sent instead of
1,000.

DS


Sean Kelly

unread,
Oct 14, 2005, 7:21:21 PM10/14/05
to
David Schwartz wrote:
>
> IOCP doesn't increase software complexity. In fact, in many cases it
> reduces it because the implementation of a thread pool is simpler. Perhaps
> you're subconsciously thinking something like "I have a library that does
> I/O and thread pools and if I want IOCP, I'd have to code it myself". Well,
> once you have a library that does IOCP thread pools and socket I/O, you have
> it forever.

I'm not sure I agree. Multiplexed I/O typically requires the use of
some sort of state machine, while dedicating a thread to each client is
very straightforward.

> Consider, for example, a chat server. We have a message we need to send
> to 1,000 clients. If each client has its own thread, we'll need 1,000
> context switches in rapid succession to get all the data sent. If we use a
> thread pool, we'll wind up with one thread on each CPU that will run until
> all 1,000 messages are sent or they use up a full timeslice. As a result,
> we'll need maybe 3 or 4 context switches to get the messages sent instead of
> 1,000.

This is the example I tend to use when thinking about scalable I/O.
One thread per client is definitely the easiest to code, though the
alternatives tend to vary based on what the platform offers.
select() certainly isn't attractive once you're dealing with 1,000
simultaneous connections, but IOCP is no more complex for 1,000
connections than it is for 10. The only barrier I've found with IOCP
is the complete lack of documentation on it (as of a few years ago at
any rate)--the initial learning curve can be a bit steep.


Sean

David Schwartz

unread,
Oct 14, 2005, 11:25:22 PM10/14/05
to

"Sean Kelly" <se...@f4.ca> wrote in message
news:1129332081....@o13g2000cwo.googlegroups.com...

> David Schwartz wrote:

>> IOCP doesn't increase software complexity. In fact, in many cases it
>> reduces it because the implementation of a thread pool is simpler.
>> Perhaps
>> you're subconsciously thinking something like "I have a library that does
>> I/O and thread pools and if I want IOCP, I'd have to code it myself".
>> Well,
>> once you have a library that does IOCP thread pools and socket I/O, you
>> have
>> it forever.

> I'm not sure I agree. Multiplexed I/O typically requires the use of
> some sort of state machine, while dedicating a thread to each client is
> very straightforward.

Even if you dedicate a thread to each client, you still need some sort
of state machine. I can't see how you can avoid it. Doesn't each client have
to be in a state and perform processing appropriate for that state? What
about when an administrator wants to know what state every connected client
is in? What about when you need to check all authenticated connections and
disconnect any that authenticated with a particular credential?

>> Consider, for example, a chat server. We have a message we need to
>> send
>> to 1,000 clients. If each client has its own thread, we'll need 1,000
>> context switches in rapid succession to get all the data sent. If we use
>> a
>> thread pool, we'll wind up with one thread on each CPU that will run
>> until
>> all 1,000 messages are sent or they use up a full timeslice. As a result,
>> we'll need maybe 3 or 4 context switches to get the messages sent instead
>> of
>> 1,000.

> This is the example I tend to use when thinking about scalable I/O.
> One thread per client is definately the easiest to code, though the
> options beyond that tend to vary based on the available options.
> select() certainly isn't attractive once you're dealing with 1,000
> simultaneous connections, but IOCP is no more complex for 1,000
> connections than it is for 10. The only barrier I've found with IOCP
> is the complete lack of documentation on it (as of a few years ago at
> any rate)--the initial learning curve can be a bit steep.

One thread per client is the easiest to code if you already have a
library that provides that. IOCP is just as easy if you start with a library
that provides that. You write this kind of code just once and use it
hundreds of times. A tiny bit of extra complexity is utterly irrelevent.

DS


Sean Kelly

unread,
Oct 16, 2005, 2:18:18 PM10/16/05
to
David Schwartz wrote:
> "Sean Kelly" <se...@f4.ca> wrote in message
> news:1129332081....@o13g2000cwo.googlegroups.com...
>
> > David Schwartz wrote:
>
> >> IOCP doesn't increase software complexity. In fact, in many cases it
> >> reduces it because the implementation of a thread pool is simpler.
> >> Perhaps
> >> you're subconsciously thinking something like "I have a library that does
> >> I/O and thread pools and if I want IOCP, I'd have to code it myself".
> >> Well,
> >> once you have a library that does IOCP thread pools and socket I/O, you
> >> have
> >> it forever.
>
> > I'm not sure I agree. Multiplexed I/O typically requires the use of
> > some sort of state machine, while dedicating a thread to each client is
> > very straightforward.
>
> Even if you dedicate a thread to each client, you still need some sort
> of state machine. I can't see how you can avoid it. Doesn't each client have
> to be in a state and perform processing appropriate for that state?

Depends on the application. The client handling code for a simple chat
server with one thread per client amounts to something like this:

while( read into buffer succeeds ) {
    if( EOM symbol exists in buffer ) {
        post buffer from start to EOM to others
        move buffer start to EOM location
    }
}
post disconnect message to others

State in this case is implicit rather than tracked by some sort of key
value.

> What
> about when an administrator wants to know what state every connected client
> is in? What about when you need to check all authenticated connections and
> disconnect any that authenticated with a particular credential?

I would consider this application state more than I/O state, as it
would be the same regardless of the I/O method used.

> One thread per client is the easiest to code if you already have a
> library that provides that. IOCP is just as easy if you start with a library
> that provides that. You write this kind of code just once and use it
> hundreds of times. A tiny bit of extra complexity is utterly irrelevent.

If we assume that the programmer has a library to do all this then the
I/O method is irrelevant--complexity is a matter of library design
rather than any details of the I/O method. I grant that writing an
IOCP library isn't tremendously more complex than a BSD-style library,
but doing so requires somewhat more specialized knowledge--many
programmers are not particularly experienced with thread pooling and
such, for example.


Sean

David Schwartz

unread,
Oct 17, 2005, 1:30:41 AM10/17/05
to

"Sean Kelly" <se...@f4.ca> wrote in message
news:1129486698.0...@f14g2000cwb.googlegroups.com...

>> > I'm not sure I agree. Multiplexed I/O typically requires the use of
>> > some sort of state machine, while dedicating a thread to each client is
>> > very straightforward.

>> Even if you dedicate a thread to each client, you still need some
>> sort
>> of state machine. I can't see how you can avoid it. Doesn't each client
>> have
>> to be in a state and perform processing appropriate for that state?

> Depends on the application. The client handling code for a simple chat
> server with one thread per client amounts to something like this:
>
> while( read into buffer succeeds ) {
> if( EOM symbol exists in buffer ) {
> post buffer from start to EOM to others
> move buffer start to EOM location
> }
> }
> post disconnect message to others

Sure, but in the real world, people aren't writing simple servers.

> State in this case is implicit rather than tracked by some sort of key
> value.

Right, but in real applications, state needs to be accessible by other
clients or administrators anyway. Sure, this can be a difference in trivial
toy applications, but for real applications, being able to keep state on the
stack is not realistic. Other threads can't reliably access it.

>> What
>> about when an administrator wants to know what state every connected
>> client
>> is in? What about when you need to check all authenticated connections
>> and
>> disconnect any that authenticated with a particular credential?

> I would consider this application state more than I/O state, as it
> would be the same regardless of the I/O method used.

My point is that you are going to need a state machine no matter what.
There is nothing even remotely complicated about an I/O state machine.

>> One thread per client is the easiest to code if you already have a
>> library that provides that. IOCP is just as easy if you start with a
>> library
>> that provides that. You write this kind of code just once and use it
>> hundreds of times. A tiny bit of extra complexity is utterly irrelevent.

> If we assume that the programmer has a library to do all this then the
> I/O method is irrelevant--complexity is a matter of library design
> rather than any details of the I/O method. I grant that writing an
> IOCP library isn't tremendously more complex than a BSD-style library,
> but doing so requires somewhat more specialized knowledge--many
> programmers are not particularly experienced with thread pooling and
> such, for example.

If you assume the programmer is already familiar with A, then obviously
you'll reach the result that A is easier to code than non-A. The question is
whether there is something fundamental about A that makes it easier, and in
this case, there isn't. IOCP is simple and straightforward.

In general, I/O complexity is a library design issue because I/O will
generally be handled by some sort of library. This is critical in portable
applications because I/O varies from platform-to-platform.

In other words, the complexity-of-use argument is totally meaningless
for I/O methods, unless there were an I/O method that was dramatically more
complicated. Not only is IOCP not complicated, but it simplifies the task of
distributing I/O to threads.

In my company's portable I/O manager, the IOCP/WIN32 networking code is
2,100 lines. The corresponding UNIX code is 4,400 lines, more than twice as
many. The WIN32 code also supports fallback when IOCP is not available. The
UNIX code supports kqueue and epoll.

DS


David Butenhof

unread,
Oct 17, 2005, 8:12:24 AM10/17/05
to
chris noonan wrote:
> Dave Butenhof wrote:
>>> One context switch takes place every timeslice (say 20
>>> milliseconds) whether there are two threads or 1000.
>>> This frequency (50 Hz) of context switching reduces
>>> throughput slightly but not greatly. How do I know?
>>> Because the operating system timeslice is determined
>>> as a tradeoff between throughput and latency.
>>
>> It's a tradeoff based on arbitrary data that doesn't apply to ANY real
>> application except occasionally and purely by accident. In many
>> application workloads the impact can indeed be "great". On average, no
>> application is average...
>
> I don't understand. If thread-switching is driven by I/0,
> then it doesn't matter what the timeslice is.

Not entirely true. Timeslicing can have a substantial effect on LATENCY
in an I/O-bound application, because the issuing thread cannot do
anything with the I/O completion until it has control again. (Obviously
we're talking about thread-synchronous I/O rather than async I/O; though
even then there may be a direct effect depending on the form of I/O.)

> If the application is CPU-heavy, then the OS designers say
> something like "a 20 ms timeslice loses us 3% of the
> CPU and that's acceptable".

3% of which application(s)? Under what load and other conditions?
Acceptable to whom?

If your system runs a single application that has no threads that ever
run 20ms without blocking voluntarily, then your 20ms timeslice is just
stealing 3% of the cpu that you could have been using for real work. If
you have lots of threads doing 40ms, or even 21ms computations across
sets of threads that need to synchronize with each other to collaborate
on a joint solution, your application may easily spend most of its time
waiting for results (because computations are interrupted in the middle
and spend 20ms, or 40ms, sitting on a run queue behind other jobs you
might not even know about), and an objective measurement of APPLICATION
overhead could easily show you've lost 50% of the "theoretical" system
throughput.

Nevermind that it's still statistically "only 3% of the CPU". There are,
as the saying goes, "lies, d**n lies, and statistics." If you have to
rely on a statistic to convince yourself that you're safe, you're in
trouble 90% of the time. ;-)

> It doesn't matter what real applications do.

Right. "It doesn't matter what real applications do" is precisely, in
effect, what OS designers have to do -- because they can't design for
specific real applications, only for broad estimates of some classes of
"average" applications. And there's no such thing as "average".

Real application developers, and their users, DO care, very much, about
what real applications do. You can trivialize it all you want, but this
is a very real and consequential disconnect. This is precisely why there
have been custom monolithic realtime applications; their needs were
sufficiently "non-average" that they couldn't rely on "average" OS
design tradeoffs. This has become less common for economic reasons
rather than because "average" has really gotten all that better.

--
Dave Butenhof, David.B...@hp.com
HP Utility Pricing software, POSIX thread consultant
Manageability Solutions Lab (MSL), Hewlett-Packard Company
110 Spit Brook Road, ZK2/3-Q18, Nashua, NH 03062

chris noonan

unread,
Oct 17, 2005, 10:07:38 AM10/17/05
to

David Butenhof wrote:
> > I don't understand. If thread-switching is driven by I/0,
> > then it doesn't matter what the timeslice is.
[snip]

> If your system runs a single application that has no threads that ever
> run 20ms without blocking voluntarily, then your 20ms timeslice is just
> stealing 3% of the cpu that you could have been using for real work.
> If you have lots of threads doing 40ms, or even 21ms computations across
> sets of threads that need to synchronize with each other to collaborate
> on a joint solution, your application may easily spend most its time
> waiting for results (because computations are interrupted in the middle
> and spend 20ms, or 40ms, sitting on a run queue behind other jobs you
> might not even know about), and an objective measurement of APPLICATION
> overhead could easily show you've lost 50% of the "theoretical" system
> throughput.

If a thread does not ever run 20ms without voluntarily relinquishing
the processor, then the timeslice is irrelevant because no thread
ever gets timed out. That is what I meant. However I have looked at
some Windows NT documentation and it is contradictory as to whether
a thread that is descheduled for I/O is restarted with a full
timeslice.

If you consider that preemptive scheduling per se is a major
determinant of performance, see if you can visualise the typical
NT Server timeslice of 180ms slugging a machine.

Chris
