I have read a few discussions on lock-free programming lately in various
newsgroups. As it sounds quite interesting, I tried to google for it, to
find some introduction, uses, idioms, patterns, etc. to learn about it.
But I am pretty much lost in the results. Could anybody point me to a good
introduction, and some real-life (OpenSource?) examples of its use? (I
have read about boost::shared_ptr already, but I have the feeling there is
more to it.)
Thx
ImRe
Here is my "old and out-dated" experimental library:
http://appcore.home.comcast.net/
IMHO, there are three very useful APIs in here.
(1) The x86 lock-free support assembly library:
http://appcore.home.comcast.net/appcore/src/cpu/i686/ac_i686_gcc_asm.html
(2) The zero-overhead single-producer/consumer lock-free queue
http://appcore.home.comcast.net/appcore/include/ac_queue_spsc_h.html
http://appcore.home.comcast.net/appcore/src/ac_queue_spsc_c.html
(3) The zero-overhead eventcount
http://appcore.home.comcast.net/appcore/include/ac_eventcount_algo1_h.html
http://appcore.home.comcast.net/appcore/src/ac_eventcount_algo1_c.html
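For anybody totally new to this stuff: the primitive that an assembly layer
like (1) ultimately has to expose is an atomic compare-and-swap (CAS). Here
is a minimal sketch of the classic CAS-loop, expressed with GCC's __atomic
builtins instead of my asm (the node type and the names are mine, not from
AppCore):

/* lock-free LIFO push; a sketch, not production code */
struct node { struct node *next; int value; };

static struct node *top;   /* shared stack head */

void lifo_push(struct node *n)
{
    struct node *old = __atomic_load_n(&top, __ATOMIC_RELAXED);
    do {
        n->next = old;   /* link on top of the current head */
        /* the CAS publishes the node with release semantics; on
           failure 'old' is refreshed and we go around again --
           that retry is the "CAS-loop" */
    } while (!__atomic_compare_exchange_n(&top, &old, n, 1,
                                          __ATOMIC_RELEASE,
                                          __ATOMIC_RELAXED));
}

Note that pop is the hard part (ABA, node reclamation); that is exactly
where things like SMR and VZOOM come into the picture.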
I am currently working on incorporating my new VZOOM library into some
in-house server software; however, I would be happy to discuss AppCore, if
you need any more information...
****VERY IMPORTANT****
SMR Hazard Pointers went PATENT PENDING in 2002!
--- You can't use them for anything commercial!
:O
What do you think about this article?
http://en.wikipedia.org/wiki/Lock-free_and_wait-free_algorithms
Regards,
Markus
shared_ptr, though it's internally lock-free in some cases, isn't
really usable for lock-free programming, since it depends on you already
owning the shared_ptrs you copy.
Look at Java's JSR-133 volatiles, which are atomic with acquire and release
semantics. Also look at the java.util.concurrent (JSR-166) packages. The
Java environment is probably a better one to play around in, since the
memory model is better defined, and it has GC, atomicity, and interlocked
primitives.
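If you want to see what those volatile semantics boil down to outside of
Java: a JSR-133 volatile store/load pair behaves roughly like the
release/acquire pair in this little C sketch (GCC __atomic builtins; my
example, and note that real Java volatiles are actually a bit stronger
than plain release/acquire):

static int payload;   /* ordinary data */
static int ready;     /* plays the role of the volatile flag */

void producer(void)
{
    payload = 42;                                    /* plain store */
    __atomic_store_n(&ready, 1, __ATOMIC_RELEASE);   /* release: payload
                                                        becomes visible
                                                        before ready */
}

int consumer(void)
{
    while (!__atomic_load_n(&ready, __ATOMIC_ACQUIRE))
        ;               /* spin until published */
    return payload;     /* acquire guarantees we see 42 here */
}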
If you're going to use anything commercially, take care to only use stuff
in the public domain. Just because a paper has been published on a technique
doesn't mean a patent application isn't in the works; that caveat covers
pretty much everything from the last 6 years or so. And a lot of stuff
before that is already patented.
--
Joe Seigh
When you get lemons, you make lemonade.
When you get hardware, you make software.
But worse, in most realistic cases, lock-free tremendously *reduces*
performance. The myth is that locks are expensive. The truth is that
*contention* is expensive. Lock-free algorithms allow code that is
contending to run in parallel. Locks allow other code to run that probably
won't contend.
For example, consider three threads, A, B, and C. A and B both want to
manipulate a queue, C wants to do something else entirely. Assume two CPUs.
With a lock, either A or B will grab the queue lock. A or B will run with C
and both threads will run at full speed with minimal contention. With a
lock-free queue, A can run concurrently with B. C will have to wait, while A
and B fight over the cache lines that contain the queue and its structures.
Well, you say, this is a carefully crafted example. What if there's
another processor? Guess what, you're still screwed. Because A and B running
in parallel will saturate the front side bus as they bounce the queue
structures back and forth. C will still run slowly, and overall system
performance will likely be worse than if A ran by itself and then B ran by
itself.
Lock free algorithms work great in kernels and low level threading
libraries for a variety of reasons. Portability is not that important. The
probability that there are other things to do is lower in core code (because
everything passes through it). But for application code, or even most
libraries on top of threading libraries, it's a pure loss.
DS
So Java isn't portable or they shouldn't have put support for lock-free in
Java or what?
Java isn't just a language, or a compiler, or a runtime library. It's
also the MACHINE -- or at least the portable definition of a Virtual
Machine. It can specify its own rules, and the engineers who PORT the
Java VM have to do whatever is necessary (easy, hard, fast, slow,
whatever) to follow those rules for the target machine.
POSIX, on the other hand, can specify only "tweaks" on the language,
place loose requirements on the compiler, and provide a specification
for part of a runtime library... and can't say anything at all about the
machine.
The two cases are exactly the same except that everything is different.
--
Dave Butenhof, David.B...@hp.com
HP Utility Pricing software, POSIX thread consultant
Manageability Solutions Lab (MSL), Hewlett-Packard Company
110 Spit Brook Road, ZK2/3-Q18, Nashua, NH 03062
> David Schwartz wrote:
>> Let me just give you my general caveat about lock free code. First,
>> lock free code is generally non-portable and extremely complex. Such code
>> can easily contain bugs that are extremely difficult even for experts to
>> spot. The complexity is almost always unnecessary.
>> Lock free algorithms work great in kernels and low level threading
>> libraries for a variety of reasons. Portability is not that important.
>> The probability that there are other things to do is lower in core code
>> (because everything passes through it). But for application code, or even
>> most libraries on top of threading libraries, it's a pure loss.
> So Java isn't portable or they shouldn't have put support for lock-free in
> Java or what?
Java is quasi-portable. Java reflects a design decision that you'll take
all the performance penalties up front, and then you can use lots of clever
tricks to get the performance back. I find that an odd compromise.
If you care about performance, why would you force a memory model that
may or may not match the hardware at all? And if you don't care about
performance, why are you trying to mess with lock free algorithms?
In any event, I believe it is impossible to construct lock-free
algorithms in Java, the way the term "lock free" is ordinarily used. Java
does not let you get close enough to the hardware to know if there are locks
or not.
DS
The complexity can be hidden behind a simple and portable API that anybody
can use. Like the pthreads API. We all know that a particular pthreads
implementation may be very complex, and may very well use some lock-free
algorithms. But does the average user actually care about this kind of
stuff? My point is, again: "The complexity can be hidden behind a simple
API". Really.
Would there be bugs? Well, maybe; it depends on the implementer. There may
be very subtle bugs in your pthreads implementation as well...
> But worse, in most realistic cases, lock free tremendously *reduce*
> performance. The myth is that locks are expensive.
Humm... Well, it really depends on the type of lock-free algorithm you
compare to locks. I would agree that "CAS-loop" based lock-free algorithms
can thrash the cache...
http://groups.google.com/group/comp.lang.c++.moderated/msg/6b2ccf76ba145c9c?hl=en
http://www.hpl.hp.com/techreports/2004/HPL-2004-105.html
however:
Contended locks can adversely affect performance and simply prevent
scalability in some situations:
Take a global read-write lock protecting a "read-mostly" data-structure.
Under heavy read contention, forward progress can stall waiting for the
updates to the internal lock state to be shuttled around the CPUs' caches.
If the read-side critical section is of moderate size, the cost of moving
lock updates from one CPU's cache to another can actually rival that of
the critical section itself; this can prevent readers from truly executing
in parallel. The more processors you add, the worse it gets. This is not
true for some special high-performance reader/writer solutions...
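To make that concrete: even the read side of a pthread rwlock typically
performs an atomic read-modify-write on the shared lock word for every
rdlock/unlock pair, so it is the lock word itself that gets shuttled
around. A trivial sketch of the pattern I am describing (my code, not any
particular implementation):

#include <pthread.h>

static pthread_rwlock_t rw = PTHREAD_RWLOCK_INITIALIZER;
static int table[1024];   /* stand-in for the read-mostly structure */

void *reader(void *arg)
{
    long sum = 0;
    for (;;) {
        /* readers never exclude each other, but this call usually
           does an atomic RMW on the rwlock's internal state... */
        pthread_rwlock_rdlock(&rw);
        for (int i = 0; i < 1024; ++i)
            sum += table[i];   /* moderate read-side critical section */
        /* ...and so does this one; with N readers on N CPUs, that
           one cache line ping-pongs between all of them */
        pthread_rwlock_unlock(&rw);
    }
    return arg;   /* not reached */
}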
> The truth is that *contention* is expensive. Lock free algorithms allow
> code that is contending to run in parallel. Locks allow other code to run
> that probably won't contend.
Think of a portable algorithm that can totally avoid contention for very
high numbers of "reader threads"; NO atomic operations and NO memory
barriers required...
How can a mutex compete with that?
Think frequent iteration over large database-like "read-mostly"
data-structures, before you answer.
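Here is the rough shape of what I mean; an illustrative sketch in the
spirit of RCU / quiescent-state-based reclamation, NOT my actual VZOOM
code, with the writer side (copy the structure, wait for the readers,
free the old copy) only stubbed in as comments:

#define MAX_READERS 128

struct db { int data[1024]; };   /* big read-mostly structure */

static struct db *live;   /* current version, installed by the writer */

/* one counter per reader, padded to its own cache line so the plain
   store below never contends with anybody */
struct qctr { unsigned long count; char pad[64 - sizeof(unsigned long)]; };
static struct qctr qs[MAX_READERS];

long reader_iterate(int self)
{
    /* a data-dependent load of the root pointer; GCC treats consume
       as acquire, which on x86/SPARC-TSO is still just a plain load */
    struct db *snap = __atomic_load_n(&live, __ATOMIC_CONSUME);
    long sum = 0;
    for (int i = 0; i < 1024; ++i)
        sum += snap->data[i];     /* plain loads, full speed */
    qs[self].count++;             /* announce a quiescent state: a plain
                                     store to a line only we write */
    return sum;
}

/* Writer (sketch only): build a new struct db, install it with
   __atomic_store_n(&live, new_db, __ATOMIC_RELEASE), then wait until
   every reader's qs counter has advanced before freeing the old
   version. All of the cost lands here, on the rare update path. */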
> For example, consider three threads, A, B, and C. A and B both want to
> manipulate a queue, C wants to do something else entirely. Assume two
> CPUs. With a lock, either A or B will grab the queue lock. A or B will run
> with C and both threads will run at full speed with minimal contention.
> With a lock-free queue, A can run concurrently with B. C will have to
> wait, while A and B fight over the cache lines that contain the queue and
> its structures.
IMHO, this is an example of poor application engineering/design. I would
have C queue a request and go find something else to do... Then A or B
could do the work; after all, they do sound like they are "worker threads"
of some sort...
;)
> Lock free algorithms work great in kernels and low level threading
> libraries for a variety of reasons. Portability is not that important.
Lock-free algorithms can "easily" be hidden behind a portable API.
> But for application code, or even most libraries on top of threading
> libraries, it's a pure loss.
Application code would not use raw lock-free structures directly; it would
use a library that exports a "portable API".
> , it's a pure loss.
Yeah...
The ability for read-only iterations (frequent database searches?) to
achieve a substantial increase in the number of reads per-second,
per-reader-thread, would most certainly be a loss...
;)
> The ability for read-only iterations (frequent database searches?) to
> achieve a substantial increase in the number of reads per-second,
> per-reader-thread, would most certainly be a loss...
I can't figure out what scenario you're talking about. If you have a
large chunk of read-mostly data, how can the overhead of bouncing around one
cache line prior to these large accesses possibly matter? If you're talking
about lots of small accesses, well, then your lock scope is badly chosen.
However, I do agree that you picked one of the best examples. Dealing
with large collections of read-mostly data is probably one of the best cases
for a lock free algorithm.
As for hiding non-portable code behind a portable API, that argument
misses the point. If you use an API to do something, you do not care whether
there's a lock free or lock based algorithm behind the API. The issue is
about *writing* code, not using code.
DS
So JSR-166 was a waste of time?
Mostly, yes.
--
David Hopwood <david.nosp...@blueyonder.co.uk>
This would be true for the Michael-and-Scott queue algorithm; it's loaded
with cache-thrashing CAS-loops. However, "zero-overhead" loopless
single-producer/consumer queues are ideal for various distributed
producer/consumer systems. IMHO, a thread that is bound to a particular cpu
core, and wants to send a message to another thread that is bound to a
"nearby core", should use a method that has no atomic operations; just a
StoreStore for producers, and a data-dependent load for the consumers should
be enough. I have a feeling that scalable designs for the new multi-core
stuff that's coming out will be NUMA-like...
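To illustrate, a sketch of the kind of loopless single-producer/
single-consumer ring I am talking about (my names and code, not the
AppCore source; assumes exactly one producer thread and one consumer
thread, ever):

#define SLOTS 1024   /* power of two */

struct spsc {
    void *buf[SLOTS];
    unsigned head;   /* written only by the consumer */
    unsigned tail;   /* written only by the producer */
};

int spsc_push(struct spsc *q, void *msg)   /* producer side */
{
    unsigned t = q->tail;   /* plain load: only we write tail */
    if (t - __atomic_load_n(&q->head, __ATOMIC_ACQUIRE) == SLOTS)
        return 0;                           /* full */
    q->buf[t & (SLOTS - 1)] = msg;          /* write the element... */
    __atomic_store_n(&q->tail, t + 1, __ATOMIC_RELEASE);
    return 1;   /* ...then publish; that release is the StoreStore */
}

void *spsc_pop(struct spsc *q)             /* consumer side */
{
    unsigned h = q->head;   /* plain load: only we write head */
    if (__atomic_load_n(&q->tail, __ATOMIC_ACQUIRE) == h)
        return NULL;                        /* empty */
    void *msg = q->buf[h & (SLOTS - 1)];
    __atomic_store_n(&q->head, h + 1, __ATOMIC_RELEASE);
    return msg;
}

On x86 every one of those acquire loads and release stores compiles to an
ordinary MOV; there is not a single LOCKed instruction or explicit membar
anywhere on the fast path.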
The ability to remove atomic operations and nasty memory barriers from a
distributed messaging system is a plus. As we all know, multi-processor
systems do not take kindly to these types of instructions...
;)
So, "certain types" of lock-free queues are useful for high-performance
cpu-to-cpu communications. Especially when a core on one cpu has to send a
message to a core on another cpu; atomic operations and membars would
greatly reduce performance.
> Lock-free lifo-pools may make sense for memory-allocators, but good
> allocators manage thread-local pools so this need is minimized. And
> if you need maximum performance on memory-allocation, you manage your
> own pools.
> Lock-free processing may look fascinating, but it is primarily superfluous.
They also work well as solutions to the reader/writer problem. They can be
used to remove the need for atomic operations and StoreLoad style memory
barriers; imho, that's not really superfluous.
:O
;-)
Lock-free, like threads, like Perl, like emacs, like news, like so many
other things, is an enormously powerful tool for certain things. And,
yes, absolutely useless (or even worse) for many other things.
Whether you're inclined to say that it's important (or relevant) for
"95%" or "5%" of "real work" depends entirely on your personal
definition of "real work"... which in turn depends on current position,
experience, and training.
As much as most programming today is pretty high level and abstract, it
IS still valuable to understand how transistors work, and chip logic
design, and memory busses and cache and pipelined instruction
processing. When you're programming a parallel application,
understanding what lock-free does and how it will affect your
application contributes to your understanding of the memory model and
implementation of your machine and application... and that can be
powerful even if you don't actually use lock-free techniques. (But then,
you probably do, even if you don't know it... and using them incorrectly
can be dangerous.)
Right, but isn't this missing the point of David Schwartz's comments?
This newsgroup has had *many* threads with the following pattern:
- someone asks how to solve a problem. Typically it's clear from this
post that the poster has relatively little experience with concurrent
programming in general.
- one of the usual suspects suggests "using lock-free techniques".
- the thread wanders off into esoterica about how to implement stuff at
a completely disjoint level of abstraction to the original question.
To some extent this is just par for the Usenet course; OTOH, this group
does seem to have a particular collective obsession with lock-free at
the moment. Some of us are concerned that newbies are getting, at best
unfollowable, and sometimes just plain bad advice as a result.
--
David Hopwood <david.nosp...@blueyonder.co.uk>
Then both you and David Schwartz are missing the point. The OP specifically
asked about lock-free.
>
>
> To some extent this is just par for the Usenet course; OTOH, this group
> does seem to have a particular collective obsession with lock-free at
> the moment. Some of us are concerned that newbies are getting, at best
> unfollowable, and sometimes just plain bad advice as a result.
People get bad advice in this newsgroup all the time. Often from the
people who are allegedly concerned about people getting bad advice.
I've given up trying to correct these kinds of things, e.g. telling
people mutexes have forward progress guarantees about visibility,
anything about observable behavior from cache (apart from performance),
etc... The most blatant misinformation is that lock-free is too
complex to use, when the typical way of dealing with complexity is to
encapsulate it behind an easier-to-use API. It's like saying that
programming with locks is too complex because most programmers would
have a hard time implementing a mutex correctly.
That would be for a small number of threads. If the probability of lock
contention for a single thread is .001 in your example, then for 100 threads
it's about .095, for 200 threads about .181, and for 1000 threads about
.632, etc...
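(To spell the arithmetic out: assuming independent contention events with
per-thread probability p = .001, the chance that at least one of n threads
hits contention is 1 - (1 - p)^n:

  n = 100:   1 - 0.999^100   ~=  .095
  n = 200:   1 - 0.999^200   ~=  .181
  n = 1000:  1 - 0.999^1000  ~=  .632 )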
Lock-free is about scalability, not the actual overhead. If you know your
application will never experience high contention, or you don't care about
scalability, then you should stick with conventional lock-based solutions.
Wouldn't you say there is a design or structural problem having 100-1000
threads contending for the same lock in any somewhat ordinary
application at the server or the client side? Wouldn't context switching
drain the performance anyway?
/Michel
> Lock-free is about scalability, not the actual overhead. If you know your
> application will never experience high contention, or you don't care about
> scalability, then you should stick with conventional lock-based solutions.
The point is, locks allow you to avoid high contention (by scheduling
tasks that don't contend). Lock free algorithms, in general, force you to
suffer through it (by allowing contending tasks to continue attempting to
run in parallel).
DS
> Depends on how many processors you have. There may be other things
> causing context switching also but blocking on a lock certainly isn't
> going to help performance.
No, that's not true. Blocking on a lock is going to help performance
because useful work will be done until it's possible for the blocked thread
to run without contention. Blocking on a lock is good because it allows
contention to be avoided.
Lock-free algorithms only help if the thing contended for is so critical
and important that you can't do much of anything else (on the entire system)
without accessing it. That's why they're very useful inside kernels, where
this is often true, and almost useless in applications, where it almost
never is.
DS
In the implementation of a message passing language, where this overhead
applies to every message send/reception.
--
David Hopwood <david.nosp...@blueyonder.co.uk>
Note that message passing also allows you to avoid high contention by
scheduling tasks that don't contend. This is independent of whether the
synchronization needed for message passing between processes scheduled on
different processors/cores is implemented in terms of locks or by lock-free
techniques.
--
David Hopwood <david.nosp...@blueyonder.co.uk>
So you're saying lock-based solutions will scale better than lock-free ones?
> So you're saying lock-based solutions will scale better than lock-free
> ones?
It depends upon which of the dozens of possible meanings of "scale" you
intend. In general, if by "scale" you mean show improved real world
performance as load, in general, increases, then yes, lock-based solutions
will scale better than lock-free ones.
Consider a case where you have four threads. Threads A and B want to
access one shared collection, and threads C and D want to access another.
With only a single CPU, both a lock-free and a lock-based solution will run
one thread at a time, and that thread will run at full speed.
Now, let's scale this to two CPUs. The lock-based solution will never
run thread A and B at the same time. It will always run one of A/B and one
of C/D. The scaling is perfect and the performance doubles, since contending
threads never run at the same time.
The lock-free solution will randomly match threads. Whenever it happens
to schedule A and B or C and D at the same time, each thread will run at
some small fraction of its maximum as the FSB bounces the shared/modified
data back and forth.
Granted this is an extreme example, but the point is that contention is
bad and lock free algorithms tend to maximize contention.
DS
What happens when you scale to 4 cpu's?
> What happens when you scale to 4 cpu's?
The lock-based works *much* better than the lock free. The lock free
allows all four threads to run, totally trashing the FSB. The locks allow
only two threads to run at full speed, which not only get more work done
than all four fighting for the FSB and playing cache ping-pong but also
allow the system to accomplish other useful work as well.
DS
> David Schwartz wrote:
>> The lock-based works *much* better than the lock free. The lock free
>> allows all four threads to run, totally trashing the FSB. The locks allow
>> only two threads to run at full speed, which not only get more work done
>> than all four fighting for the FSB and playing cache ping-pong but also
>> allow the system to accomplish other useful work as well.
> In your rather contrived example, you had the four threads partitioned
> into 2 threads each doing totally separate work that could just as well
> be done on separate machines.
That's actually pretty typical. Generally the threads in a process will
subdivide into logical groups of one or more threads each. Each thread in a
group does roughly similar tasks but tasks that are different from what the
threads in the other groups do. There will usually be common paths that all
the thread groups pass through, but they are usually in the OS and the
threading library. (Where lock free techniques *can* be very valuable.)
> Then you have one of the threads in each
> thread pair totally lock out the other so only one thread runs at a time
> meaning you are only getting the throughput of a single thread.
No, that's not what I'm doing. I'm saying that's what happens in the
case where they block on the same lock. Regardless of what each thread is
going to do once it acquires the lock or after releasing it, if one thread
blocks on a lock the other holds, it will be totally locked out until the
other thread either releases the lock or uses up its timeslice.
> The
> question is why have two threads in the first place when a single thread
> would do as well or better?
The most common reason is because if you use a single thread, and that
thread blocks unexpectedly (say on a page fault or file read), you are
totally screwed. Writing code so that you must be 100% certain that no tiny
bit of it (whether performance critical or not) ever blocks is just too
hard.
> This is the best example you could come up
> with?
Obviously I picked the example that best illustrated my point. The more
your situation varies from my example, the less applicable the example.
If you're going to fairly evaluate lock free algorithms, you should
understand the prototypical cases where they fail very well, if for no other
reason than for you to make the thoroughly informed decision that those
weaknesses aren't too important in your particular case.
DS
Well, the OP did ask for some open-source.
> (with a IMHO very ugly coding-style).
Sorry about that.
;)
> He suggests using lock-free processing although the cases where this
> is really necessary because of a notable performance advantage are
> rarer than the needle in a haystack.
The library's basic philosophy is to blend "lock-free with lock-based
programming". I also post a clear link to a discussion of some of the most
important "caveats" that come along with lock-free programming. I clearly
suggest that programmers "test" AppCore against "traditional lock-based"
solutions, that's all. Why not let application programmers try to
"experimentally improve" some of their existing designs? When the multi-core
stuff becomes available to "regular customers", it may require some
applications to "scale-up". In the next couple of years, you "might be"
receiving questions from customers like:
I bought one of those new dual-core systems with 4 processors, and your
application suite seems to execute critical tasks a little bit slower than
usual! Why is that? Do I have to set some special configuration options?
Please help...
> And in the current case, he suggests his nearly undocumented and hardly
> readable source as an origin for lock-free-processing knowledge.
It's only an example. I didn't really have time to document it, because I
was working hard on something else. AppCore is pretty much dead right now.
Anyway, I don't claim that AppCore is an "origin for lock-free-processing
knowledge"; after all, its experimental. If I am coming across that way, I'm
sorry.
Yes. However, lock-based solutions also execute an atomic operation and a
nasty StoreLoad-style membar for each lock and unlock, contention or not.
IMO, these types of
operations are very expensive for modern processors to execute. Especially
if they have to lock the bus...
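For the concrete picture, here is a toy test-and-set spinlock (my sketch;
real mutexes add a futex/kernel slow path, but the fast-path atomics are
the point):

static int lockword;   /* 0 = free, 1 = held */

void toy_lock(void)
{
    /* an atomic RMW: on x86 this is a LOCKed instruction, which also
       acts as a full fence -- including the StoreLoad ordering */
    while (__atomic_exchange_n(&lockword, 1, __ATOMIC_ACQUIRE))
        while (__atomic_load_n(&lockword, __ATOMIC_RELAXED))
            ;   /* spin read-only so we don't hammer the bus */
}

void toy_unlock(void)
{
    /* a release store; free on x86, a membar on some other archs */
    __atomic_store_n(&lockword, 0, __ATOMIC_RELEASE);
}

Uncontended or not, every lock/unlock round-trip pays for that atomic RMW.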
> In this case, the lock is held for an extremely short interval and the
> processing of the data taken from the queue lasts at least some thousand
> times longer than the dequeue-process. So the likelihood of a collision
> is that small:
Actually, to further enhance your point, there is an efficient lock-based
queue out there that is simply guaranteed to get absolutely no "collisions"
between a producer and consumer (a pseudo-code sketch is below).
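Here it is, roughly; a sketch in the style of the well-known two-lock queue
with a dummy node (my names and code, from memory). The producer and the
consumer take *different* locks, and the dummy node keeps them working on
different ends of the list:

#include <pthread.h>
#include <stdlib.h>

struct qnode { struct qnode *next; void *msg; };

struct queue {
    pthread_mutex_t hlock;       /* taken by consumers only */
    pthread_mutex_t tlock;       /* taken by producers only */
    struct qnode *head, *tail;   /* head always points at a dummy */
};

void queue_init(struct queue *q)
{
    struct qnode *dummy = calloc(1, sizeof *dummy);   /* next == NULL */
    q->head = q->tail = dummy;
    pthread_mutex_init(&q->hlock, NULL);
    pthread_mutex_init(&q->tlock, NULL);
}

void queue_push(struct queue *q, void *msg)
{
    struct qnode *n = calloc(1, sizeof *n);
    n->msg = msg;
    pthread_mutex_lock(&q->tlock);
    /* the release pairs with the consumer's acquire at the
       empty<->nonempty boundary, the one place the two sides
       can touch the same node */
    __atomic_store_n(&q->tail->next, n, __ATOMIC_RELEASE);
    q->tail = n;
    pthread_mutex_unlock(&q->tlock);
}

void *queue_pop(struct queue *q)
{
    pthread_mutex_lock(&q->hlock);
    struct qnode *dummy = q->head;
    struct qnode *first = __atomic_load_n(&dummy->next, __ATOMIC_ACQUIRE);
    void *msg = NULL;
    if (first) {
        msg = first->msg;
        q->head = first;   /* old first node becomes the new dummy */
        free(dummy);
    }
    pthread_mutex_unlock(&q->hlock);
    return msg;
}

A producer and a consumer never block each other; producers collide only
with other producers, and consumers only with other consumers.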
> that there's no need for lock-free processing?
Well, the only thing that a lock-free queue would buy you here, is reduced
overhead wrt atomic operations and membars.
>> Especially, when a core on one cpu has to send a message to a
>> core on another cpu; atomic operations and membars would greatly
>> reduce performance.
>
> In nearly every case this does not reduce the performance, because
> enqueuing and dequeuing are only a very tiny part of the whole
> processing in almost any case.
Granted. However, imho, I still think that the ability to remove atomic
operations and membars from producer/consumer systems and database searches
is going to gradually become more and more important...
Reader threads getting a vast increase in the number of reads per-second
they can perform against a shared database-like structure. IMHO, this can
be very important.
> If you have a large chunk of read-mostly data, how can the overhead of
> bouncing around one cache line prior to these large accesses possibly
> matter?
Lock-free or not, atomic operations and membars reduce shared reads
per-second in virtually all cases:
http://www.rdrop.com/users/paulmck/RCU/lockperf.2004.01.17a.pdf
In some cases, it can be a difference between tens-of-thousands of reads
per-second, or million+ reads-per-second...
No kidding.
> If you're talking about lots of small accesses, well, then your lock scope
> is badly chosen.
Probably.
> However, I do agree that you picked one of the best examples. Dealing
> with large collections of read-mostly data is probably one of the best
> cases for a lock free algorithm.
Very true.
> As for hiding non-portable code behind a portable API, that argument
> misses the point. If you use an API to do something, you do not care
> whether there's a lock free or lock based algorithm behind the API.
I have a feeling that users are going to start to care about how good an API
can "scale-up" to the new systems that are going to come out. Though I do
agree that the user won't really care about what a vendor uses, lock-free or
lock-based, to achieve that.
> The issue is about *writing* code, not using code.
Ahh. Yes, actually writing lock-free code can be very tedious, indeed:
http://groups.google.com/group/comp.programming.threads/msg/423df394a0370fa6
;)
[...]
>
>
>>This is the best example you could come up
>>with?
>
>
> Obviously I picked the example that best illustrated my point. The more
> your situation varies from my example, the less applicable the example.
>
> If you're going to fairly evaluate lock free algorithms, you should
> understand the prototypical cases where they fail very well, if for no other
> reason than for you to make the thoroughly informed decision that those
> weaknesses aren't too important in your particular case.
>
Well, it was a fairly bad example. We're talking about scalability here.
All the literature and research that I'm aware of shows lock-free algorithms
scaling better than lock-based ones. You're perfectly welcome to put out
your own paper showing lock-based ones scale better than lock-free ones.