Depending on the testcase parms I can get up to 32000 reads/sec/thread on
Linux. So it's clear it's fast enough that all I'm seeing on a single
processor system is scheduler artifacts.
Now all I have to do (after the rewrite of course) is find a killer
app to write for it.
--
Joe Seigh
When you get lemons, you make lemonade.
When you get hardware, you make software.
I can think of lots of non-threaded programs to write.
Interesting. Reference counting wouldn't be an option here as it
would be for normal hazard pointers. Plus arbitrarily requiring
reference counting turns reference counting into a problem rather
than a solution. I think I have something that will work though.
If C++ goes with deferred reference counting, they're going to run into
the same issues RCU+SMR would have with reference counting so it wouldn't
even be very feasible.
> Joe Seigh wrote:
>> Joe Seigh wrote:
>>> Joe Seigh wrote:
>>>> Joe Seigh wrote:
>>>>> Joe Seigh wrote:
>>>>>> Joe Seigh wrote:
>>>>>>> Joe Seigh wrote:
>>>>>>>> Joe Seigh wrote:
>>>>>>>>> Joe Seigh wrote:
>>>>>>>>>> Joe Seigh wrote:
Is this a blog site, or a newsgroup?
There's some lock-free synchronization stuff here.
http://atomic-ptr-plus.sourceforge.net/
The testcase isn't meant to be an example. It uses a doubly
linked circular queue for some reason. I've changed that to
a singly linked list but that's not out there yet. The
RCU+SMR package that's out there doesn't work for linked
lists, just simple objects. I have a version that does
work for linked lists but it's not released yet. I'm
thinking of using it in some suitable application if I can
find one. Plus it's still in the middle of what we quaintly
call refactoring, plus documentation, plus some form of
quasi-formal proofs. And I'm currently having fun trying
to make sense of the scheduler artifacts I'm seeing on
a single processor system.
And as you can see by this topic practically being a monologue,
there's no huge demand for this kind of stuff, so I'm taking
advantage of the more leisurely schedule that allows.
What would be really useful is a lock-free *multiple-writers*,
single-reader FIFO message queue. An implementation of a message-
passing concurrent language uses this kind of queue for all
communication between processes, and would benefit from an
implementation with very low overhead on multiprocessors. The
queue entries have to be traceable by a GC.
--
David Hopwood <david.nosp...@blueyonder.co.uk>
What's wrong with Maged Michael's lock-free FIFO and variations
on that? The only problem I'm aware of is that if you use the
GC to avoid the ABA problem you can trash the GC if you have
a high rate of usage.
Way back, I came up with a scheme where you make the queue a lock-free
LIFO stack using compare and swap. The single reader just grabbed the
entire queue using compare and swap, reversed order making it FIFO
and worked off of that until it was empty and then repeated the whole
process by grabbing the shared LIFO stack again.
Haven't done that kind of stuff in a while. Usually, I'm doing single
writer, multiple reader lock-free queues for multi-casting.
I wonder if lack of portability is partly to blame?
What would be the minimal number of extensions to C (or
whatever) that would allow this to be implemented portably by a
compiler supporting, for example 'C99+atomic'?
If you can quantify that, you might actually be able to make a
case for language extensions so that lock-free algorithms can be
used in cross-platform production projects.
I would be very interested in anything along those lines, but as
long as it is either OS-centric, CPU-centric, or both, I don't
see much value in it for the type of software I am involved
with.
--
Randy Howard (2reply remove FOOBAR)
That topic pops up from time to time. It doesn't go anywhere
due to lack of interest. Very few people use that kind of stuff.
There are some api's out there like atomic_ops here
http://www.hpl.hp.com/research/linux/qprof/
but it's fairly limited. If you have to add your own primitives,
you might as well do the whole api on your own, as it's not
that much more work. If I used atomic_ops, I'd still have
to provide my own atomic ops package.
>
> If you can quantify that, you might actually be able to make a
> case for language extensions so that lock-free algorithms can be
> used in cross-platform production projects.
It's mostly a chicken and egg question. Present usage doesn't
justify putting in the features.
>
> I would be very interested in anything along those lines, but as
> long as it is either OS-centric, CPU-centric, or both, I don't
> see much value in it for the type of software I am involved
> with.
>
You have to start somewhere and I can't exactly port to hardware
I don't have. So probably ia32-64 on Linux for the time being.
Apple is getting off of ppc and I'm not going to buy proprietary
hardware just for the privilege of porting to it. Ditto with
Sun. Also, their list of Solaris qualified hardware is unusable.
You'd likely end up getting hardware you really didn't want,
only to find it doesn't work with Solaris. I can't afford
to do that kind of experimentation.
You get what you pay for. Free software gets little or no
support.
The other thing here is api's are a really hard sell. I did
a low level api once, a program that translated boolean
queries into decision trees. Almost no interest apart
from my own use. I tacked it onto a program to scan files,
a sort of flat file database, and it became hugely popular.
More than 10 years after I stopped supporting it and after I
left IBM, someone tracked me down to get it enhanced (couldn't,
I didn't have the source anymore.) So that's why I'm looking
for a moderate sized project to do instead.
> Randy Howard wrote:
>> On Thu, 30 Jun 2005 20:06:52 -0500, Joe Seigh wrote
>> (in article <3cqdnbE9cbI...@comcast.com>):
>>
>>> And as you can see by this topic practically being a monologue,
>>> there's no huge demand for this kind of stuff, so I'm taking
>>> advantage of the more leisurely schedule that allows.
>>
>>
>> I wonder if lack of portability is partly to blame?
>>
>> What would be the minimal number of extensions to C (or
>> whatever) that would allow this to be implemented portably by a
>> compiler supporting, for example 'C99+atomic'?
>
> That topic pops up from time to time. It doesn't go anywhere
> due to lack of interest. Very few people use that kind of stuff.
> There are some api's out there like atomic_ops here
> http://www.hpl.hp.com/research/linux/qprof/
> but it's fairly limited.
I'll take a look at it, thanks.
>> If you can quantify that, you might actually be able to make a
>> case for language extensions so that lock-free algorithms can be
>> used in cross-platform production projects.
>
> It's mostly a chicken and egg question. Present usage doesn't
> justify putting in the features.
Fair enough.
>> I would be very interested in anything along those lines, but as
>> long as it is either OS-centric, CPU-centric, or both, I don't
>> see much value in it for the type of software I am involved
>> with.
>
> You have to start somewhere and I can't exactly port to hardware
> I don't have.
Understood. For the record, I was not shooting darts at your
approach, just stating why I personally can't use it as is.
Hopefully you didn't misread that as an attack or something
along those lines.
> Apple is getting off of ppc and I'm not going to buy proprietary
> hardware just for the privilege of porting to it.
Well, I can't argue with that approach. I just bought a G5 dual
powermac (about a month before the switch). First Apple
computer in my life, having avoided them for decades, then boom,
I get shot between the eyes. :-)
Oh well, at least it's a decent box and gives me another testbed
to play with, despite its long-term boat anchor status.
> You get what you pay for. Free software gets little or no
> support.
It seems to depend upon the software's usage (in volume). A
very few OSS projects get massive support because they appeal to
a very broad audience, but they are the exception.
> The other thing here is api's are a really hard sell.
True. That's why I was wondering about standardization. With
all the pushing for SMP, dual-core, multi-core, (and the
abomination known as "hyperthreading"), you would think some of
the big names (like Intel and AMD) would be pushing for ways to
get the most out of new processor features and parallelism
without everyone having to roll their own.
Oh well, I guess it will be a long time before this sort of
thing is widely used enough to become more generally applicable
to portable software.
Intel and AMD are basically hardware companies and are run by
hardware not software types. They know HPC (high performance
computing) and parallel programming. Anything outside of that
is pretty much alien to them. They also have a view on ROI
(return on investment) that's hardware rather than software
oriented. We're still running Cobol programs written in the 60's
because it cost too much to rewrite them. Intel and AMD will
stick an extension on just long enough to sell a few million
processors and then drop it once they've got their money back.
So while you could do some really neat things with hyperthreading,
no one will because it will be gone by the time the OSes could
provide an api to exploit it.
A good example of what you can do with hyperthreading is
producer/consumer. One of the problems with producer/consumer
is that either the consumers or the producers will get ahead of the other.
In normal threading, you use condition variables to suspend one
and resume it when the other catches up. That suspend and resume
adds overhead. One way people try to avoid this is to use
lock-free techniques, but if you spin wait you're wasting cycles. With hyperthreading,
the faster thread can just spin wait using the PAUSE instruction and
the cpu resources are given to the other thread to make it run faster.
As soon as it catches up, the waiting thread resumes again with no
overhead. There is no way to do the equivalent in normal threading
except to say that condition variables act that way sometimes except
with more overhead.
It's not just Intel and AMD. Sun Research has a whole group that
does synchronization, mostly lock-free. But I have yet to see any
evidence that Sun exploits any of this. And they're coming out with
Niagara which will have up to 8 cores, 4 threads per core. You
would think.
>
> Oh well, I guess it will be a long time before this sort of
> thing is widely used enough to become more generally applicable
> to portable software.
>
Well, a killer app will speed things up. But people expect bigger
and more feature rich apps. It's not like the old days when you
could write something from scratch in a few days or weeks.
I was going to suggest to Intel that they list and sponsor projects
that would benefit from being "parallelized" much like IBM and
Google did with some projects. But I'm not sure they know what
projects those would be.
> Randy Howard wrote:
>>
>> That's why I was wondering about standardization. With
>> all the pushing for SMP, dual-core, multi-core, (and the
>> abomination known as "hyperthreading"), you would think some of
>> the big names (like Intel and AMD) would be pushing for ways to
>> get the most out of new processor features and parallelism
>> without everyone having to roll their own.
>
> Intel and AMD are basically hardware companies and are run by
> hardware not software types.
True, but I think they (especially Intel) realize that the
profit is in systems, not isolating hardware from software.
IOW, recognizing that they go together. Intel has had a lot of
effort in software over the years, including compilers, IDF,
software development forums on their website, reams of
documentation, white papers, example code, etc. They have been
harping on threading and the need to develop parallel solutions
for a long time.
> They know HPC (high performance computing) and parallel programming.
Exactly. If they have the energy to push something like OpenMP
and their own compilers, then working on pushing lock-free
adoption doesn't seem /that/ unlikely.
> So while you could do some really neat things with hyperthreading,
> no one will because it will be gone by the time the OSes could
> provide an api to exploit it.
I hope so. Hyperthreading is beneficial in a very few special
cases, but in general it sucks compared to SMP or multi-core
solutions. I suppose keeping it around inside of multi-core
processors isn't horrible, as long as you can turn it off. In
some cases, it makes things much worse enabled than disabled.
> It's not just Intel and AMD.
True, but they represent the bulk of the market. If they go in
a direction, then it can happen. If they don't, then the idea
will die on the vine, more often than not.
> Sun Research has a whole group that does synchronization, mostly
> lock-free. But I have yet to see any evidence that Sun exploits any of this.
If sun doesn't use it themselves, you can't expect any of their
customers or developer community to do so. They *could* push
it, but they've got bigger fish to fry right now trying to keep
Java alive in the onslaught of alternatives coming along now.
> And they're coming out with Niagara which will have up to 8 cores,
> 4 threads per core. You would think.
I hope those folks in India and China are up on threaded
programming. :-)
>> Oh well, I guess it will be a long time before this sort of
>> thing is widely used enough to become more generally applicable
>> to portable software.
>>
> Well, a killer app will speed things up.
True. I've yet to hear of any commercial or popular OSS
application bragging about the massive performance benefits
achieved through the use of lock-free techniques. Real-world
examples convince people more than research papers. Perhaps it
is being widely used but not talked about, or I just missed the
claims?
...
>
>>>Oh well, I guess it will be a long time before this sort of
>>>thing is widely used enough to become more generally applicable
>>>to portable software.
>>>
>>
>>Well, a killer app will speed things up.
>
>
> True. I've yet to hear of any commercial or popular OSS
> application bragging about the massive performance benefits
> achieved through the use of lock-free techniques. Real-world
> examples convince people more than research papers. Perhaps it
> is being widely used but not talked about, or I just missed the
> claims?
>
>
Well, Linux used RCU to speed it up. SCO initially claimed IBM
stole it from them but I think SCO had no idea what RCU actually
was.
ANTs Software has a "lock-free" database. They specialize in putting
out a press release every month about their "lock-free" database.
They have two patents,
6,760,726 System and method of managing concurrent operations on linked lists
6,763,447 Lock-free list for use with computer system utilizing FIFO queue for tracking order of various sublists
They look like a LIFO stack using double-wide CAS with a counter field and a circular FIFO
queue of some sort to me, but what do I know. I guess databases aren't the killer app. :)
A major open source file server is hitting enough contention on some session tables
that they need a lock-free reader/writer solution.
There's a debugger for multi-threaded programs that's experiencing serious
contention on some AVL trees they use to verify valid storage accesses and
they need a lock-free reader/writer solution.
These last two aren't desperate enough to actually try any lock-free
solutions yet, but if they're having problems it means others are as well.
Do you mean <http://www.cs.rochester.edu/u/michael/PODC96.html>, or
something more recent? Possibly there's nothing wrong with it. However,
the correctness argument seems very informal, and I don't see any
discussion of what memory consistency model is being assumed (apart
from an assumption that the ABA problem does not occur).
It also looks expensive: two CAS operations for enqueue in the best
case, and one CAS to dequeue each item (rather than being able to
dequeue multiple items at once, which should be possible since we
have a single reader). Although a message queue must in general
handle multiple writers, most such queues will in fact only be written
by a single writer; the problem is that the language implementation
doesn't know that statically. So it should be possible to do much
better than Michael's algorithms for this application.
> The only problem I'm aware of is that if you use the
> GC to avoid the ABA problem you can trash the GC if you have
> a high rate of usage.
>
> Way back, I came up with a scheme where you make the queue a lock-free
> LIFO stack using compare and swap. The single reader just grabbed the
> entire queue using compare and swap, reversed order making it FIFO
> and worked off of that until it was empty and then repeated the whole
> process by grabbing the shared LIFO stack again.
These comments prompted me to think of another way of implementing message
passing that greatly reduces the need for (and hence the overhead of)
synchronization. I'll post it here with the subject "Implementing message
passing on multiprocessors".
--
David Hopwood <david.nosp...@blueyonder.co.uk>
>>So while you could do some really neat things with hyperthreading,
>>no one will because it will be gone by the time the OSes could
>>provide an api to exploit it.
>
> I hope so. Hyperthreading is beneficial in a very few special
> cases, but in general it sucks compared to SMP or multi-core
> solutions. I suppose keeping it around inside of multi-core
> processors isn't horrible, as long as you can turn it off. In
> some cases, it makes things much worse enabled than disabled.
I'm not a fan of hyperthreading either, but putting it in processors
and then turning it off is the worst of all worlds: the processor
designers have to make compromises to support HT and spend time
debugging it, that could have been spent more profitably with other
optimizations and with verification.
(On the subject of excessive complexity in processors, watch this
talk by Bob Colwell, formerly of Intel:
<http://stanford-online.stanford.edu/courses/ee380/040218-ee380-100.asx>.)
> True. I've yet to hear of any commercial or popular OSS
> application bragging about the massive performance benefits
> achieved through the use of lock-free techniques.
I very much doubt that there is enough benefit from using lock-free
techniques in *applications* to justify the complexity cost, and the
effort required to learn how to use them correctly. They belong in
operating systems and language implementations, where these costs are
amortized over many apps.
--
David Hopwood <david.nosp...@blueyonder.co.uk>
> I very much doubt that there is enough benefit from using lock-free
> techniques in *applications* to justify the complexity cost, and the
> effort required to learn how to use them correctly. They belong in
> operating systems and language implementations, where these costs are
> amortized over many apps.
And where portability is less important since you already know you're
going to have to reimplement and tweak.
DS
Yep. That works very well. I have used the simple scheme for a while as
well. I used an atomic exchange for the consumers:
typedef struct n_
{
    struct n_ *next;    /* = 0 */
    void *state;
} n_t;

typedef struct q_
{
    n_t *front;         /* = 0 */
} q_t;

static q_t q;

// producer: push a new node onto the LIFO with a CAS loop
n_t *o, *c, *n = malloc( sizeof( *n ) );
o = q.front;
do
{
    n->next = o;
    c = o;
    o = InterlockedCompareExchangePointer
        ( &q.front,
          n,
          c );
}
while ( c != o );

// consumer: grab the whole stack with a single atomic exchange
n_t *f = InterlockedExchangePointer( &q.front, 0 );
if ( ! f ) { return EAGAIN; }
// reverse order of f, and process.
LIFO producers are inherently not affected by ABA. On the other hand, LIFO
consumers are affected, but only if there is a "compare" type method being
used for synchronization. Using an atomic exchange to set the stack anchor
to zero gets around ABA wrt LIFO lock-free stacks. Was that similar to
your initial solution?
--
http://appcore.home.comcast.net/
(portable lock-free data-structures)
Joe's idea of converting an IBM freelist into a queue does reduce the number
of instructions normally needed for a lock-free queue. Consumers only need an
atomic exchange, and producers need a normal CAS loop. Look on my site for
my implementation of a zero-overhead queue that does not need any
synchronization instructions, just a dependent load for the consumer and a
store barrier for the producer. However, it is a single-producer/consumer
queue. But you can use it in various distributed message passing schemes to
help reduce their synchronization overhead.
> I'll post it here with the subject "Implementing message
> passing on multiprocessors".
There was one scheme I used that avoided synchronization instructions for a
certain type of thread-to-thread message passing. The basic scheme was that
each producer thread had its own zero-overhead single-producer/consumer
queue that would be sampled by dedicated consumer threads. This allowed me
to create a fairly efficient "many-producer/few-consumer" distributed
message passing system.
I don't think it does, really. Note that I was talking about using lock-free
techniques directly in applications; not using "infrastructure" libraries
that happen to be implemented using lock-free stuff, but that do not expose
it to their clients. My argument is that lock-free primitives are inherently
too complex and error-prone to be exposed directly for general-purpose
application programming, no matter how well-designed the API.
> Some people believe threading is too difficult to use and
> should not be used by applications. Some have gone as far as to say that
> they would fire any of their employees that used multi-threading. I've
> even seen that in job ads, that the candidate should know how to not
> use threading.
Concurrency is essential; threading isn't.
--
David Hopwood <david.nosp...@blueyonder.co.uk>
My experience is that people who believe their low level api is too complex
for normal programmers to use, and only allow usage through a high level
api, generally suck at designing api's. If the low level api was available,
at least someone else who isn't so bad at api design could try to design a
better one. And they certainly couldn't design a new or different api if the
low level api wasn't available. If the only api available is a lock-free
linked list and someone wants to design a lock-free tree, they won't be able
to do one.
Java is a good example here. There are a lot of classes that are final and
not extendable simply because the class author doesn't trust users to extend
their class even though that is the whole *point* of OOP. There are design
patterns whose main purpose seems to be to get around such restrictions,
which is certainly ironic.
My experience is that people generally suck at designing APIs. This does not
imply that all APIs suck.
> If the low level api was available, at least someone else who isn't so bad
> at api design could try to design a better one. And they certainly couldn't
> design a new or different api if the low level api wasn't available.
That's not the case: they can always drop down to assembler. Or they can reuse
the low-level guts of an open-source implementation of some other high-level
API.
> If the only api available is a lock-free linked list and someone wants to
> design a lock-free tree, they won't be able to do one.
Yes they will, by modifying the lock-free linked list code to implement a tree.
"Black box" reuse doesn't work, but reuse based on modification of source
code has been quite successful.
If the lock-free linked list implementation doesn't come with source, throw
it away; you can't rely on it.
> Java is a good example here. There are a lot of classes that are final and
> not extendable simply because the class author doesn't trust users to
> extend their class even though that is the whole *point* of OOP.
'final' was included in Java for security reasons, because some of the
system classes would have been exploitable by subclasses (it can be argued
that this was addressing the symptom and not the cause of other security
design problems).
If an API designer is making basic mistakes in the use of 'final', they're
probably making other basic mistakes, and their code would not have been
reusable anyway.
--
David Hopwood <david.nosp...@blueyonder.co.uk>
If documentation is available for it somehow, like for the Linux futex.
It's an unstable api, so you can't depend on it as if it were an official api.
I've done an async safe eventcount using it. If I wanted to port it I
would have to figure out low level api's for other platforms though.
>
>> If the only api available is a lock-free linked list and someone wants to
>> design a lock-free tree, they won't be able to do one.
>
>
> Yes they will, by modifying the lock-free linked list code to implement
> a tree.
> "Black box" reuse doesn't work, but reuse based on modification of source
> code has been quite successful.
>
Assuming the linked list hasn't been hard coded into the low level api.
Unfortunately, I have seen low level api's screwed up that way.
Have there been any studies that show that lack of documentation for a low level
api has never inhibited anyone from using the low level api? How would they
measure that? Source isn't really a substitute. Otherwise I wouldn't have
to document RCU+SMR which would save a lot of trouble for me.