
Double-checked locking and memory barriers


Filbert

May 13, 2003, 9:13:25 PM
< This posting is meant for self-educational purposes. I am not trying
to write production code, but I want to _understand_ the necessary
mechanisms>

I would be most grateful if some people would shed some light on some
questions I have about barriers.

http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html

From the above URL we see:

// C++ implementation with explicit memory barriers
// Should work on any platform, including DEC Alphas
// From "Patterns for Concurrent and Distributed Objects",
// by Doug Schmidt
template <class TYPE, class LOCK> TYPE *
Singleton<TYPE, LOCK>::instance (void) {
    // First check
    TYPE* tmp = instance_;
    // Insert the CPU-specific memory barrier instruction
    // to synchronize the cache lines on multi-processor.
    asm ("memoryBarrier");
    if (tmp == 0) {
        // Ensure serialization (guard
        // constructor acquires lock_).
        Guard<LOCK> guard (lock_);
        // Double check.
        tmp = instance_;
        if (tmp == 0) {
            tmp = new TYPE;
            // Insert the CPU-specific memory barrier instruction
            // to synchronize the cache lines on multi-processor.
            asm ("memoryBarrier");
            instance_ = tmp;
        }
    }
    return tmp;
}

From the POSA 2 book, for Alpha, they write asm("mb");

O.K. I know that the comment about the cache lines is not accurate -
it's just that the whole code snippet has been copied from the
above-mentioned URL.

Question 1: Would someone explain why a memory barrier is needed
between the two lines of code (taken from above and shown below)?
I am having trouble picturing this in terms of the supposed
requirement (hence the memory barrier) to restrict the re-ordering of
reads and writes resulting from the two lines of source code. Could
there be speculative loads and stores, and if so, what would they be?

The two lines of source code are:

TYPE* tmp = instance_;
if (tmp == 0) {

By the way, the line from the example at the top:

Guard<LOCK> guard (lock_);

results in a pthread_mutex_lock() call, which has a CPU-specific
memory barrier instruction folded into it; when the Guard goes out
of scope, its destructor calls pthread_mutex_unlock(), which also has
a CPU-specific memory barrier instruction folded into it. (Of course,
this assumes the platform requires memory barriers for the relevant
pthread APIs.)
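(For what it's worth, the usual illustration of why that first barrier
matters on an Alpha-class machine concerns the loads that follow the
pointer load rather than the null test itself. The sketch below is my
own, not from the POSA 2 text; use() and some_member are made-up names:)

TYPE* tmp = instance_;    // load the shared pointer
asm ("mb");               // read barrier: order this load before the
                          // member loads below -- Alpha may otherwise
                          // satisfy them from a stale cache line even
                          // though they are data-dependent
if (tmp != 0)
    use(tmp->some_member);    // without the barrier, these loads could
                              // observe a not-yet-initialized object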

Question 2: Let's say I want to implement on a multiprocessor IBM
Power4 machine.
Assuming, from Question 1, that a memory barrier is needed, what is
the minimum I can get away with, and why?

http://www-1.ibm.com/servers/esdd/articles/power4_mem.html

Will a "lwsync" suffice? "lwsync" prevents Load/Load, Store/Store, and
Load/Store re-orderings, but not Store/Load re-orderings.

Or do I need something stronger, like "isync", or heaven forbid
"sync"? Could you provide an explanation for a neophyte like me?

Informational note from
http://www.cs.rochester.edu/u/scott/458/notes/03-coherence_etc
--------------------------------------------------------------------------------------------

- Neither sync nor isync actually subsumes the other, but the stuff
that isync does that sync doesn't matters only in kernel mode.
Isync is cheaper, but doesn't force all previously issued
instructions to complete. It just makes sure that subsequent
instructions haven't started. (For kernel-level code, isync
enforces some dependences that the processor doesn't normally
enforce, even with sync. In particular, if you change processor
status registers [e.g. the page table pointer], isync makes sure
that subsequent instructions see the change.)

Example:
If processor A does a bunch of data stores and then sets a flag,
it needs to put a sync or lwsync (lwsync is cheaper) right before
the flag store to make sure the data stores have globally performed.
If processor B spins on the flag and then tries to read the data, it
needs to do a sync or isync (isync is cheaper) right after testing
the flag to make sure subsequent loads haven't been performed
speculatively.
-------------------------------------------------------------------------------------------------------
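(An illustrative rendering of that note in C with PowerPC inline asm;
the variable names and the exact asm spelling are my own assumptions,
not from the course notes:)

volatile int data = 0, flag = 0;          /* shared between processors */

void producer(void)                       /* processor A */
{
    data = 42;                                 /* ordinary data stores      */
    asm volatile ("lwsync" ::: "memory");      /* order the data stores     */
    flag = 1;                                  /* before the flag store     */
}

int consumer(void)                        /* processor B */
{
    while (flag == 0)                          /* spin on the flag          */
        ;
    asm volatile ("isync" ::: "memory");       /* squash speculative loads  */
    return data;                               /* now guaranteed to see 42  */
}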

Any helpful explanations would be greatly appreciated.

Hoping you can shed some light.

-- Filbert

Alexander Terekhov

May 14, 2003, 4:01:15 AM

Filbert wrote:
[...]

> Question 1: Would some people explain why a memory barrier is needed

Read this thread (and follow the embedded links, please):

http://www.google.com/groups?selm=3EBB7807.7E084553%40web.de
(Subject: The Inventor of Portable DCI-aka-DCL (using TSD) is... ;-) )

> Question 2: Let's say I want to implement on a multiprocessor IBM
> Power4 machine.

You're quite welcome. ;-)

> Assuming, from Question 1, that a memory barrier is needed, what is
> the minimum I can get away with, and why?
>
> http://www-1.ibm.com/servers/esdd/articles/power4_mem.html
>
> Will a "lwsync" suffice? "lwsync" prevents Load/Load, Store/Store, and
> Load/Store re-orderings, but not Store/Load re-orderings.
>
> Or do I need something stronger, like "isync", or heaven forbid
> "sync"? Could you provide an explanation for a neophyte like me?

Read this message (twice, please):

http://www.google.com/groups?selm=3EBBE66B.98885A41%40web.de
(Subject: The Inventor of Portable DCI-aka-DCL (using TSD) is... ;-) )

regards,
alexander.

P.S. If you'll have questions, please post a followup to the
"inventor" thread -- I just love it. ;-)

P.P.S. http://www.google.com/groups?selm=3EBF5B97.A501D252%40web.de

Here's one more "illustration":

atomic<const stuff*> instance_ptr = ATOMIC_INITIALIZER(0); // static
pthread_mutex_t instance_mtx = PTHREAD_MUTEX_INITIALIZER; // static

const stuff & instance() {
  stuff * ptr;
  if (0 == (ptr = instance_ptr.load_ddhlb())) {
    mutex_guard guard(instance_mtx); // named guard: holds the lock for the block
    if (0 == (ptr = instance_ptr.load())) {
      ptr = new stuff(/*...*/);
      instance_ptr.store_ssb(ptr);
      static deleter<stuff> cleanup(ptr);
    }
  }
  return *ptr;
}

P.P.P.S. Give up! You'll NEVER "beat" a static local synchronized
by a threads-aware C++ implementation (similar stuff to what a "C"
POSIX threads implementation does for pthread_once(), but in much
more 'convenient' way designed specifically for static lazy stuff
in C++):

http://www.google.com/groups?selm=3EB430BD.D039CC%40web.de
http://www.google.com/groups?selm=3EBB8DDC.C81DF4AE%40web.de
(Subject: Re: scoped static singleton question)
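(For what it's worth, the "static local synchronized by the implementation"
that Alexander is pointing at would, from the user's side, be nothing more
than the sketch below -- assuming a hypothetical threads-aware C++
implementation, which the 1998 standard does not require:)

const stuff & instance() {
    static stuff s;   // a threads-aware implementation would emit its own
                      // pthread_once-style guard around this construction
    return s;
}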

Joseph Seigh

May 14, 2003, 7:46:32 AM

Alexander Terekhov wrote:
>
> P.P.P.S. Give up! You'll NEVER "beat" a static local synchronized
> by a threads-aware C++ implementation (similar stuff to what a "C"
> POSIX threads implementation does for pthread_once(), but in much
> more 'convenient' way designed specifically for static lazy stuff
> in C++):
>

It specifically states that (being unbeatable) about pthread_once()
in the Posix standard? I doubt it.

The only reason you can argue that about Posix mutexes, for instance,
is that their usage is so common and that the consequences would be so
devastating to any vendor who produced a badly performing implementation.
You can't make that same argument about pthread_once.

If Posix refuses to make performance guarantees, then Posix adherents
have no basis for complaining about attempts to implement solutions with
*known* efficiency.

Joe Seigh

David Butenhof

May 15, 2003, 8:41:21 AM
Joseph Seigh wrote:

> Alexander Terekhov wrote:
>>
>> P.P.P.S. Give up! You'll NEVER "beat" a static local synchronized
>> by a threads-aware C++ implementation (similar stuff to what a "C"
>> POSIX threads implementation does for pthread_once(), but in much
>> more 'convenient' way designed specifically for static lazy stuff
>> in C++):
>>
> It specifically states that (being unbeatable) about pthread_once()
> in the Posix standard? I doubt it.

Of course not. POSIX is an "Application Programming Interface" (i.e., SOURCE
LEVEL) standard, not an IMPLEMENTATION standard. There are many possible
ways to correctly implement the pthread_once() interface. Some are also
fast, and others aren't. This is a "quality of implementation" issue, and
of no direct concern to the standard.

However, Alexander wasn't talking about pthread_once(), but rather a
carefully tuned C++ "DCI" class that's analogous to what pthread_once()
does. (I still disagree with him in the most general sense, simply because
you're dealing with a GENERAL implementation, which can nearly always be
beaten by a strictly customized implementation that fits a particular
need... presuming the developers are willing to pay for that customization
in engineering, testing, and porting time.)

> The only reason you can argue that about Posix mutexes, for instance,
> is that their usage is so common and that the consequences would be so
> devastating to any vendor who produced a badly performing implementation.
> You can't make that same argument about pthread_once.

That, of course, depends on how widely pthread_once() is used. If it's
rarely used, then you're absolutely right. But if it's commonly used then
the exact same "quality of implementation" rules apply. Of course, as a
user of the implementation, it's part of your responsibility to test and
give feedback to the developer so they KNOW that the quality is a concern.
Everyone needs to make engineering decisions and tradeoffs... and if the
implementors make tradeoffs that aren't appropriate for their users,
they'll know only if you tell them.

> If Posix refuses to make performance guarantees, then Posix adherents
> have no basis for complaining about attempts to implement solutions with
> *known* efficiency.

And, in many cases, known weaknesses, such as the "classical" DCI ("DCL")
which many people used for ages before finally admitting it wasn't
thread-safe.

Often, too, the "known efficiency" is largely illusory. That's not to say
that nobody can ever do anything faster on their own. In fact, a "highly
skilled engineer" can almost always do a particular thing faster, because
you can trade off customized efficiency against generality. However, all
too many people start "optimizing" with no idea where the real problems
are. Making your locking faster won't help if the real problem is
CONTENTION rather than the cost of the locking operations. In fact you're
likely to make the problem worse (sometimes by orders of magnitude) by
letting your faster threads generate more contention and context switches.

Still, the reason I generally advocate sticking with the POSIX operations is
simply that it's relatively easy to get CORRECT and PORTABLE code that way;
and that's a big advantage.

If you find that the performance isn't adequate for your needs, by all means
{custo|opti}mize. IF you're absolutely sure you understand the
architecture, and the problem, well enough to do it RIGHT as well as
FASTER, and are willing and able to predict and control the increased
development costs of a custom solution. I see far too many people turn down
the inviting country path with no idea at all where they're really going or
what the toll will be, and that's disturbing. When you explain the joy and
benefits of rolling your own, it's not a bad idea once in a while to put in
a reminder that not everyone should try this stunt at home...

--
/--------------------[ David.B...@hp.com ]--------------------\
| Hewlett-Packard Company Tru64 UNIX & VMS Thread Architect |
| My book: http://www.awl.com/cseng/titles/0-201-63392-2/ |
\----[ http://homepage.mac.com/dbutenhof/Threads/Threads.html ]---/

Alexander Terekhov

May 15, 2003, 10:54:49 AM

David Butenhof wrote:
[...]

> However, Alexander wasn't talking about pthread_once(), but rather a
> carefully tuned C++ "DCI" class that's analogous to what pthread_once()

Well, just to clarify. At various places and different times I was
talking about several things:

- Plusifying pthread_once() which can and shall be done by WG21.

The idea here is that C++ should adopt major portions of
POSIX threading and the existing "C" interfaces and provide
a standard header <cthread> that would declare

extern "C" pthread_...
extern "C++" pthread_...

stuff for all functions that need function pointer arguments so
that people could use C++ functions (e.g. static class methods)
instead of rather annoying extern "C" wrappers. That would solve
the problem of boost.once that attempts to hide the lack of
extern "C++" pthread_once() via "extending" pthread_once() with
an additional parameter via TSD (and another "internal"
pthread_once() to init the key -- terribly ugly solution).

- Extend pthread_once() to allow non-static usage. This would
require init/destroy calls plus parameter (I'd also keep track
of once_routine return value [void*] in pthread_once_t object).

- Provide a C++ once_call<> template that would "wrap" an extended
pthread_once(). Well, here we can also have it the other way
around: extended pthread_once() can be implemented using the
once_call<void* (*)(void*)>. I'd do it that way. (A rough sketch of
the plain, parameterless flavor follows after this list.)

- For the *static* lazy stuff in C++, define a keyword, attribute,
whatever to hide all "pthread_once() details" under the umbrella
of the explicitly synchronized static locals.
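Just to make the simplest flavor of the once_call<> idea concrete (my own
sketch, not Alexander's design: the name once_call, the trampoline, and the
usage are assumptions; it only wraps the standard, parameterless
pthread_once(), not the extended form described above):

#include <pthread.h>

// Sketch: wraps the existing, parameterless pthread_once() for the
// static/no-argument case only.  Strictly, POSIX wants an extern "C"
// function pointer here -- which is exactly the wart the <cthread>
// proposal above is meant to remove.
template<void (*Init)()>
class once_call {
public:
    void operator()() { pthread_once(&once_, &trampoline); }
private:
    static void trampoline() { Init(); }
    static pthread_once_t once_;
};

template<void (*Init)()>
pthread_once_t once_call<Init>::once_ = PTHREAD_ONCE_INIT;

// usage (hypothetical init function):
//   void init_tables();
//   static once_call<&init_tables> init_once;
//   ...
//   init_once();   // init_tables() runs exactly once, in exactly one thread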

> does. (I still disagree with him in the most general sense, simply because
> you're dealing with a GENERAL implementation, which can nearly always be
> beaten by a strictly customized implementation that fits a particular
> need... presuming the developers are willing to pay for that customization
> in engineering, testing, and porting time.)

Implementation provided DCI can be done with membars or TSD.
Compiler barriers aside for a moment, hardware membars aren't
needed on UP (or when all threads are bound to one processor).
TSD COULD be faster (for MP-safe DCI). Even if one had a fully
portable atomic<> with barriers, there is just no portable
interface that would allow one to create a PORTABLE "customized
implementation" that takes all these considerations into account
and optimizes "dynamically", for example. This is what I meant
with "you'll NEVER beat it".

regards,
alexander.

Joseph Seigh

May 15, 2003, 3:55:26 PM

David Butenhof wrote:
>
> Joseph Seigh wrote:
>
> > Alexander Terekhov wrote:
> >>
> >> P.P.P.S. Give up! You'll NEVER "beat" a static local synchronized
> >> by a threads-aware C++ implementation (similar stuff to what a "C"
> >> POSIX threads implementation does for pthread_once(), but in much
> >> more 'convenient' way designed specifically for static lazy stuff
> >> in C++):
> >>
> > It specifically states that (being unbeatable) about pthread_once()
> > in the Posix standard? I doubt it.
>
> Of course not. POSIX is an "Application Programming Interface" (i.e., SOURCE
> LEVEL) standard, not an IMPLEMENTATION standard. There are many possible
> ways to correctly implement the pthread_once() interface. Some are also
> fast, and others aren't. This is a "quality of implementation" issue, and
> of no direct concern to the standard.

But you can give a strong performance hint without specifying or requiring
an implementation. Take pthread_once for example. You could have specified
that it be at least as fast as DCL if DCL is possible to implement
correctly on that platform. Performance *is* important. It doesn't really
mean anything to be able to port a Posix program to a particular platform
if its performance really sucks wind on that platform.

...


>
> > If Posix refuses to make performance guarantees, then Posix adherents
> > have no basis for complaining about attempts to implement solutions with
> > *known* efficiency.
>
> And, in many cases, known weaknesses, such as the "classical" DCI ("DCL")
> which many people used for ages before finally admitting it wasn't
> thread-safe.

Actually, it did work at one time and still does on some architectures,
but hardware developers broke it. To argue that DCL was bad is an
a posteriori fallacy. We only knew it was bad after the fact. You can't
really future-proof your code. The only way to guarantee your code will
be thread-safe forever is to write only single-threaded code, but even I wouldn't
be willing to bet on that. And I can't even begin to predict what programming
idiom will be the next to break and be considered a Bad Thing to Do.


>
> Often, too, the "known efficiency" is largely illusory. That's not to say
> that nobody can ever do anything faster on their own. In fact, a "highly
> skilled engineer" can almost always do a particular thing faster, because
> you can trade off customized efficiency against generality. However, all
> too many people start "optimizing" with no idea where the real problems
> are. Making your locking faster won't help if the real problem is
> CONTENTION rather than the cost of the locking operations. In fact you're
> likely to make the problem worse (sometimes by orders of magnitude) by
> letting your faster threads generate more contention and context switches.

Exactly! That's why DCL is important. It's an example of a lock-free
algorithm (nearly). Lock-free algorithms have no contention by definition
and that's of huge performance import. Good point. I'm glad you brought
it up.


>
> Still, the reason I generally advocate sticking with the POSIX operations is
> simply that it's relatively easy to get CORRECT and PORTABLE code that way;
> and that's a big advantage.

That is generally true. 99.9% of my or anybody's code is not performance
sensitive enough to warrant the trouble of resorting to exotic techniques.
I use pthread_once a lot. Usually in library code that does things like
syscalls and printf where even if pthread_once used mutexes you wouldn't
notice. But in some performance sensitive code, I've used a portably
correct version of DCL, one similar to the correct DCL using TSD.
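For readers wondering what such a thing looks like, here is a minimal
C/pthreads sketch of the general shape (my own illustration, not Joe's
code; create_instance() is a made-up init function, and g_key is assumed
to have been created once via pthread_key_create()). The per-thread fast
path is a pthread_getspecific() rather than an unsynchronized read of the
shared pointer, so no memory-barrier reasoning is required:

#include <pthread.h>
#include <stddef.h>

extern void *create_instance(void);        /* hypothetical initializer      */

static pthread_mutex_t g_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_key_t   g_key;              /* created elsewhere, once, with */
static void           *g_shared = NULL;    /* pthread_key_create(&g_key, 0) */

void *get_instance(void)
{
    void *p = pthread_getspecific(g_key);  /* first check: thread-local     */
    if (p == NULL) {
        pthread_mutex_lock(&g_mtx);        /* slow path: take the real lock */
        if (g_shared == NULL)              /* second check, under the lock  */
            g_shared = create_instance();
        p = g_shared;
        pthread_mutex_unlock(&g_mtx);
        pthread_setspecific(g_key, p);     /* cache it for this thread      */
    }
    return p;
}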


>
> If you find that the performance isn't adequate for your needs, by all means
> {custo|opti}mize. IF you're absolutely sure you understand the
> architecture, and the problem, well enough to do it RIGHT as well as
> FASTER, and are willing and able to predict and control the increased
> development costs of a custom solution. I see far too many people turn down
> the inviting country path with no idea at all where they're really going or
> what the toll will be, and that's disturbing. When you explain the joy and
> benefits of rolling your own, it's not a bad idea once in a while to put in
> a reminder that not everyone should try this stunt at home...

You might see a lot more people trying that other path when JSR 133 & 166
hit general release. It wasn't much of a problem when Java and Posix threads
were functionally the same. There wasn't much that would work in one and not
the other. But that's about to change.

There's no law that says that the Posix threads standard has to be the one
to adopt new functions to make DCL work. It can just as easily be done as another standard.

Joe Seigh

SenderX

May 15, 2003, 4:19:06 PM
> There's no law that says that the Posix threads standard has to be the one
> adopt new function to make DCL work. It can just as easily be done as
> another standard.

I should do a win32 pthreads that uses as many lock-free operations as it
can.

The install program will put the current win32 pthreads on the user's hard
drive if the cmpxchg8b ( 32-bit lock-free ) or the cmp8bxchg16b (
lock-free ) isn't available.

If the lock-free opcodes are usable, the installer installs the super fu&kin
fast lock-free win32 pthreads.

;)

lol.

--
The designer of the SMP and HyperThread friendly, AppCore library.

http://AppCore.home.attbi.com


John Hickin

May 16, 2003, 9:08:37 AM

"Joseph Seigh" <jsei...@xemaps.com> wrote in message
news:3EC3F1C1...@xemaps.com...
>

>
> Exactly! That's why DCL is important. It's an example of a lock-free
> algorithm (nearly). Lock-free algorithms have no contention by definition
> and that's of huge performance import. Good point. I'm glad you brought
> it up.
>

The singleton is a global object that will be accessed by concurrently
running threads. I think that there is no point in accessing the singleton
without locking it unless it is created prior to all other threads and
thereafter it never changes. So to use the singleton we do this:

acquire lock
use singleton
release lock

It seems that DCL isn't really needed in this scenario. In any other
scenario:

int foo = dclSingletonPointer() -> get();

the get() is wrong, no matter how well-programmed dclSingletonPointer() is,
unless it returns a temporary whose destructor releases a lock.
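One way to read that last caveat (a sketch with hypothetical names --
Singleton, singleton_mutex, singleton_ptr -- not John's code): make the
accessor hand back an object that holds the lock for as long as the caller
holds the object.

// (assumes <pthread.h> and a hypothetical Singleton type with a get() member)
class locked_singleton {
public:
    locked_singleton(pthread_mutex_t &m, Singleton *p) : mtx_(m), ptr_(p) {
        pthread_mutex_lock(&mtx_);          // lock taken when the proxy is made
    }
    ~locked_singleton() {
        pthread_mutex_unlock(&mtx_);        // ... and released when it dies
    }
    Singleton *operator->() { return ptr_; }
private:
    pthread_mutex_t &mtx_;
    Singleton *ptr_;
};

// usage:
//   locked_singleton s(singleton_mutex, singleton_ptr);
//   int foo = s->get();     // get() runs with the lock held
// A factory returning such a proxy by value (so that the original
// dclSingletonPointer()->get() expression stays safe) needs some extra
// care about copies, but the idea is the same.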


Regards, John.

David Butenhof

May 16, 2003, 9:37:51 AM
Joseph Seigh wrote:

> David Butenhof wrote:
>>
>> Joseph Seigh wrote:
>>
>> > Alexander Terekhov wrote:
>> >>
>> >> P.P.P.S. Give up! You'll NEVER "beat" a static local synchronized
>> >> by a threads-aware C++ implementation (similar stuff to what a "C"
>> >> POSIX threads implementation does for pthread_once(), but in much
>> >> more 'convenient' way designed specifically for static lazy stuff
>> >> in C++):
>> >>
>> > It specifically states that (being unbeatable) about pthread_once()
>> > in the Posix standard? I doubt it.
>>
>> Of course not. POSIX is an "Application Programming Interface" (i.e.,
>> SOURCE LEVEL) standard, not an IMPLEMENTATION standard. There are many
>> possible ways to correctly implement the pthread_once() interface. Some
>> are also fast, and others aren't. This is a "quality of implementation"
>> issue, and of no direct concern to the standard.
>
> But you can give strong performance hint without specifying or requiring
> an implementation. Take pthread_once for example. You could have
> specified that it be as least as fast as DCL if DCL is possible to
> implement correctly on that platform. Performance *is* important. It
> doesn't really mean anything to be able to port a Posix program to a
> particular platform if its performance really sucks wind on that platform.

No, POSIX couldn't say that -- nor would it do any good if it had. POSIX
doesn't work that way.

> ...
>>
>> > If Posix refuses to make performance guarantees, then Posix adherents
>> > have no basis for complaining about attempts to implement solutions
>> > with *known* efficiency.
>>
>> And, in many cases, known weaknesses, such as the "classical" DCI ("DCL")
>> which many people used for ages before finally admitting it wasn't
>> thread-safe.
>
> Actually, it did work at one time and still does on some architectures,
> but hardware developers broke it. To argue that DCL was bad is an
> a posteriori fallacy.

No, not at all. Someone invented a cute hack that worked in restricted
circumstances without considering that it wasn't extensible or general; and
others used it with similar lack of analysis. Then "they" took it into new
environments (modern hardware platforms) that were beyond the safe little
harbor where it grew up in sheltered bliss... and again failed to perform
that analysis. That just means they may have been simply naive instead of
evil or stupid. The point is that the universe doesn't care whether you
knew that what you did was foolish.

Or, as the physicist said to the man who fell off a wall: ignorance of the
law (of gravity) is no excuse.

> We only knew it was bad after the fact. You can't
> really future proof your code.

No, there's not. But when you bring your code along to new systems and
hardware, you need to evaluate whether it makes sense there. The fact that
DCI was brought along where it didn't belong wasn't the fault of the
designers of the new and faster systems -- it was the fault of the people
who brought it along.

> The only way to guarantee your code will be thread-safe forever is only
> write single threaded code, but even I wouldn't
> be willing to bet on that. And I can't even begin to predict what
> programming idiom will be the next to break and be considered a Bad Thing
> to Do.

Not now, of course. The point is that before you deploy existing code on a
new platform that changes rules, however subtly, it's YOUR responsibility
to be sure that code makes sense.

When Intel releases a new Pentium that breaks existing X86 conventions, and
claims it's fully software compatible, THAT'S their fault. But when you
move code "born and bred" on X86 limitations to an Alpha, or Itanium, and
expect it to work the same as on an X86, that's YOUR fault... and THAT'S
what happened with DCI.

>> Often, too, the "known efficiency" is largely illusory. That's not to say
>> that nobody can ever do anything faster on their own. In fact, a "highly
>> skilled engineer" can almost always do a particular thing faster, because
>> you can trade off customized efficiency against generality. However, all
>> too many people start "optimizing" with no idea where the real problems
>> are. Making your locking faster won't help if the real problem is
>> CONTENTION rather than the cost of the locking operations. In fact you're
>> likely to make the problem worse (sometimes by orders of magnitude) by
>> letting your faster threads generate more contention and context
>> switches.
>
> Exactly! That's why DCL is important. It's an example of a lock-free
> algorithm (nearly). Lock-free algorithms have no contention by definition
> and that's of huge performance import. Good point. I'm glad you brought
> it up.

I've been told it's impolite to tell people they're wrong. Usually, I
forget, though the alternative is often more humorous: Sorry, but that
turns out not to be the case. "Lock-free" doesn't mean "contention-free".
You're still touching shared memory, or there's no point, and when multiple
processors contend for the same memory, that is, exactly, precisely, and by
definition, "contention". And it costs. Contention represents most of the
cost of well-designed explicit synchronization. The advantage in lock-free
is truly more in the fact that the contention is limited to what's
essential in the current transaction: it's nothing more than customized
synchronization. It may not BLOCK, but if you think it won't STALL in the
execution, cache, and memory pipelines, you're severely deluding yourself.

Remember, what you're doing is fiddling with the low-level instruction
sequences on which a mutex is built. You're not waving your wand and doing
magic. You're not doing anything radically different. You're buying a live
chicken from the farm instead of packaged chicken parts in a store. You're
OPTIMIZING. Sometimes, optimization is a wondrous and grand thing. Other
times it's a silly waste of time and resources that cost far more in
maintenance and complication than they're conceivably worth. Just be sure
you know what you're doing and WHY you're doing it.

TANSTAAFL. Period. (That's "there ain't no such thing as a free lunch", by
the way.)

> There's no law that says that the Posix threads standard has to be the one
> adopt new function to make DCL work. It can just as easily be done as
> another standard.

You know, I can't help grinning every time I see someone write "DCL". First
off, that's the standard command "shell" on VMS. Second, the name's got
nothing to do with the optimization it labels... the "locking" is NOT being
"doubly checked", nor would there be any reason or purpose in doing so. It
comes across every time as a dumb joke. It's like using "umm" all the time
when one talks -- no matter what one has to say, one comes off sounding
ignorant and unprepared.

I've proposed "DCI" (double-checked initialization) as a substitution, which
makes a whole lot more sense, though it still doesn't really describe the
intent. (But then, if we wanted to be explicit and thorough, we'd hardly be
so strongly attracted to TLAs in the first place.) Maybe "RLI" would be
better: "reduced locking initialization". ;-)

Joseph Seigh

May 16, 2003, 11:07:59 AM

I think you are digressing here. The point is that lock-free algorithms, of
which DCL is an example, are a good way of solving contention problems that can
arise when using mutexes. The fact that DCL can somehow be misused is not
relevant. Any technique can be misused.

But to address your point if I understand it. References to singletons are
not tracked. This is what you appear to be worried about and why you want
to use locks to track these. Well, as with any object that is not explicitly
tracked, it should be deleted when the scope it exists in exits. So, if this is
global scope, the singleton should be deleted then. If the scope is local
or some object, then it gets deleted when that local scope exits or the
object is destroyed.

I don't think the temp trick is applicable here, though I have used it with
refcounted stuff (atomic_ptr) to guarantee the object is not deleted during
the evaluation of the expression.

Joe Seigh

SenderX

May 17, 2003, 11:33:58 PM
> "Lock-free" doesn't mean "contention-free".
> You're still touching shared memory, or there's no point, and when
multiple
> processors contend for the same memory, that is, exactly, precisely, and
by
> definition, "contention". And it costs.

Not nearly as much as using a single critical section! You simply cannot
compare lock-free to a " novice " locking scheme that uses single critical
sections to protect shared data.

When people make a SMP system run through a single critical section, well
that is a shame! ;)

Here are some performance differences on " lock-free " vs. critical section:

http://eca.cx/lad/2000/Jul/0319.html

Lock-free will increase your application throughput, for sure.


> advantage in lock-free
> is truly more in the fact that the contention is limited to what's
> essential in the current transaction: it's nothing more than customized
> synchronization.

Beautiful synchronization. =)

Concurrent threads inside a single CAS critical section will always try to
push or pop nodes, they will never get into a performance destroying
conveyor-belt-syndrome.

You can actually use critical sections if you are clever, and get OK
performance.

Just look at my Queue code:

http://appcore.home.attbi.com/src/Queue.c

This code makes any other " normal " queue code that uses critical sections
look like sh%t!

David Butenhof

May 19, 2003, 9:39:49 AM
SenderX wrote:

>> "Lock-free" doesn't mean "contention-free".
>> You're still touching shared memory, or there's no point, and when multiple
>> processors contend for the same memory, that is, exactly, precisely, and by
>> definition, "contention". And it costs.
>
> Not nearly as much as using a single critical section! You simply cannot
> compare lock-free to a " novice " locking sheme, that uses single critical
> sections to protect shared data.

(Sigh.) Yeah, WELL DESIGNED and WELL IMPLEMENTED and PROPERLY APPLIED custom
synchronization will perform better than POORLY DESIGNED or IMPROPERLY
APPLIED synchronization. But it's a lot easier to poorly design or
improperly apply custom low-level synchronization, and that represents a
true risk that, for many applications, many choose to ignore. When
you're designing your own systems for your own use, that's your
prerogative. I'm far more concerned with general education.

And putting all your threads through a common CONTENTION POINT is of course
bad. It doesn't really matter, though, whether that contention point is a
simple "critical section" or a "lock-free" transaction (which is as bad a
term as "double-checked locking"). It's still contention, and it still
costs. IN THEORY, it's easier to distribute the contention when your
synchronization is fine grained vs coarse grained. That applies no matter
what TYPE of synchronization you use.

Unless you're doing it just for fun, or as an academic exercise,
optimization should be the result of detailed trace data from real life
system loads. Optimize where it makes a difference, and verify that it
really does make a difference. Otherwise you're just wasting your time.

> When people make a SMP system run through a single critical section, well
> that is a shame! ;)

Sure, but so what? ;-)

I'm not disputing the THEORY, but the PRACTICE of your hype. In theory,
there's no difference between theory and practice; in practice, there's no
similarity.

> Here are some performance differences on " lock-free " vs. critical
> section:
>
> http://eca.cx/lad/2000/Jul/0319.html
>
> Lock-free will increase your application throughput, for sure.

A bold but foolish claim. Bad design can triumph over any "beautiful
synchronization". Presuming your "lock-free" library has no bugs (an issue
I won't dispute, since I have no time or interest in reviewing it, much
less in experimenting with it), and presuming that it was used exactly as
you intended in a well-designed application, I have no doubt that such
techniques could in some cases increase throughput. However, most uses of
such a library are likely to gain no real improvement because the problems
in the application were elsewhere to begin with, and would remain
elsewhere.

And I've seen many (way too many) major commercial applications that SLOWED
DOWN (often substantially) when library improvements reduced straightline
computing costs or contention in synchronization operations. Why? Because
they were so badly designed that the only thing keeping threads from
tripping over each other was the cost of synchronization. Applying correct
and efficient lock-free techniques to such an application would KILL its
performance. And applications like that don't get redesigned from scratch;
they're too big, too complicated, and have too much history.

SenderX

May 19, 2003, 7:36:07 PM
> And putting all your threads through a common CONTENTION POINT is of course
> bad. It doesn't really matter, though, whether that contention point is a
> simple "critical section" or a "lock-free" transaction (which is as bad a
> term as "double-checked locking"). It's still contention, and it still
> costs.

You simply cannot compare lock-free, to a normal lock. Apples and oranges.

http://216.239.53.100/search?q=cache:Lkcf9uxfxlsC:perso.wanadoo.fr/gmem/evenements/jim2002/articles/L17_Fober.pdf+lock+free+fifo+queues&hl=en&ie=UTF-8

This only explains a couple of reasons why you can't do the comparison that
you seem to want to do.


> > When people make a SMP system run through a single critical section, well
> > that is a shame! ;)
>
> Sure, but so what? ;-)

=)

> However, most uses of
> such a library are likely to gain no real improvement because the problems
> in the application were elsewhere to begin with, and would remain
> elsewhere.

Ahh yes, the bad designer meets beautiful synchronization. You have a point
there. ;)


> And I've seen many (way too many) major commercial applications that SLOWED
> DOWN (often substantially) when library improvements reduced straightline
> computing costs or contention in synchronization operations. Why? Because
> they were so badly designed that the only thing keeping threads from
> tripping over each other was the cost of synchronization.

Crappy programmers trying to improve a badly designed app... lol.


> Applying correct
> and efficient lock-free techniques to such an application would KILL its
> performance. And applications like that don't get redesigned from scratch;
> they're too big, too complicated, and have too much history.

Let's say the app replaced all its single-lock stacks with lock-free
stacks to decrease calls into the kernel for contention ( that is what a
critical section does you know, it waits on a kernel object ). Yuck!

Explain how a lock-free lifo stack would KILL the fine performance that a
single-lock lifo stack was giving the app?

Joseph Seigh

May 20, 2003, 9:01:14 AM

David Butenhof wrote:


>
> SenderX wrote:
>
> >> And putting all your threads through a common CONTENTION POINT is of course
> >> bad. It doesn't really matter, though, whether that contention point is a
> >> simple "critical section" or a "lock-free" transaction (which is as bad a
> >> term as "double-checked locking"). It's still contention, and it still
> >> costs.
> >
> > You simply cannot compare lock-free, to a normal lock. Apples and oranges.
>

> Pardon me, but that's simply silly.
>
> The machine provides a CAS instruction. I build a lock, that's used to
> protect a shared data cell, by looping on a CAS until the lock flag is
> clear. You loop on a CAS until the shared data cell has a particular value.

That's not an entirely correct characterization. In a lock-free algorithm a
thread can make forward progress regardless of the progress of other
threads. So a thread looping until another thread sets a particular value would
by definition not be lock-free.

...


> And most forms of lock-free synchronization need to spin; a CAS insert or
> remove doesn't usually FAIL when the 'compare' fails... it reloads and
> TRIES AGAIN. Often "forever", as in the examples in the paper you linked.
> Though some implementations might try to be more robust by placing a limit
> on the spin count, there's really no way to set a "guaranteed safe" limit;
> you have to just guess. Even when the current thread is sufficiently
> important to justify that "selfishness", it would have still be better for
> the application if the thread had blocked, when the spin exceeds the time
> that would have been required to block and unblock.
>
> Of course, "in theory", the CAS will rarely fail, and will "almost never"
> fail more than once or twice. Because contention, "of course", is always
> assumed to be low...

Well, I know for a fact that on at least one architecture by a company that
takes their SMP architecture definition seriously, simple CAS loops are required
to be "fair", that no processor can have a permanent advantage over others.

The other problem with locks is that, unless they are spin locks, threads
that are waiting for a lock can be suspended and that can make the lock
throughput slow down dramatically. Locks that allow queue jumping, non-FIFO,
can ameliorate this somewhat, but the problem is still there. You don't get
this effect with lock-free.

Joe Seigh

John Hickin

May 20, 2003, 10:10:51 AM

"Joseph Seigh" <jsei...@xemaps.com> wrote in message
news:3ECA2835...@xemaps.com...

>
>
> David Butenhof wrote:
> >
> > Of course, "in theory", the CAS will rarely fail, and will "almost
never"
> > fail more than once or twice. Because contention, "of course", is always
> > assumed to be low...
>
> Well, I know for a fact that on at least one architecture by a company
that
> takes their SMP architecture definition seriously, simple CAS loops are
required
> to be "fair", that no processor can have a permanent advantage over
others.
>

Joe,

It has been a while since I read papers on lock-free algorithms but what
Dave says rings a bell. I remember that these types of loops could fail from
time to time. If I remember correctly, when used to manage allocator free
lists, a lock free algorithm could actually leak memory.

Is this in fact true or is it just a figment of my imagination?


Regards, John.


Alexander Terekhov

May 20, 2003, 10:22:31 AM

David Butenhof wrote:
[...]

> I've proposed "DCI" (double-checked initialization) as a substitution, which
> makes a whole lot more sense, though it still doesn't really describe the
> intent. (But then, if we wanted to be explicit and thorough, we'd hardly be
> so strongly attracted to TLAs in the first place.) Maybe "RLI" would be
> better: "reduced locking initialization". ;-)

How about "DCSI-MBR", "DCSI-TLS"... and "DCCI" (I'd love AC/DC ;-) ):

DCSI (double-checked serialized initialization)
-----------------------------------------------

A) DCSI-MBR (DCSI with memory barriers)

class stuff { /* ... */ };

class thing {
public:
  thing(/* ... */) : /* ... */ stuff_ptr(0) /* ... */ { /*...*/ }
  ~thing() { /* ... */ delete stuff_ptr.load(); /* ... */ }
  /* ... */
  const stuff & stuff_instance();
  /* ... */
private:
  /* ... */
  mutex stuff_mtx;
  atomic<stuff *> stuff_ptr;
  /* ... */
};

const stuff & thing::stuff_instance() { // "lazy" one
  stuff * ptr;
  // hoist load barrier (with data dependency "hint")
  if (0 == (ptr = stuff_ptr.load_ddhlb())) {
    mutex::guard guard(stuff_mtx);
    if (0 == (ptr = stuff_ptr.load())) {
      ptr = new stuff(/*...*/);
      // sink store barrier
      stuff_ptr.store_ssb(ptr);
    }
  }
  return *ptr;
}

B) DCSI-TSD (DCSI with thread-specific data)

class stuff { /* ... */ };

class thing {
public:
  thing(/* ... */) : /* ... */ stuff_shared_ptr(0) /* ... */ { /*...*/ }
  ~thing() { /* ... */ delete stuff_shared_ptr; /* ... */ }
  /* ... */
  const stuff & stuff_instance();
  /* ... */
private:
  /* ... */
  mutex stuff_mtx;
  stuff * stuff_shared_ptr;
  thread_specific_ptr<stuff *, no_cleanup> stuff_thread_ptr;
  /* ... */
};

const stuff & thing::stuff_instance() { // "lazy" one
  stuff * ptr;
  if (0 == (ptr = stuff_thread_ptr.get())) {
    { mutex::guard guard(stuff_mtx);
      if (0 == (ptr = stuff_shared_ptr))
        ptr = stuff_shared_ptr = new stuff(/*...*/);
    }
    stuff_thread_ptr.set(ptr);
  }
  return *ptr;
}

DCCI (double-checked concurrent initialization)
-----------------------------------------------

(only one variation -- with memory barriers and CAS; lock-free)

class stuff { /* ... */ };

class thing {
public:
  thing(/* ... */) : stuff_ptr(0) /* ... */ { /*...*/ }
  ~thing() { delete stuff_ptr.load(); /* ... */ }
  /* ... */
  const stuff & stuff_instance();
  /* ... */
private:
  /* ... */
  atomic<stuff *> stuff_ptr;
  /* ... */
};

const stuff & thing::stuff_instance() { // "lazy" one
  stuff * ptr;
  // hoist load barrier (with data dependency "hint")
  if (0 == (ptr = stuff_ptr.load_ddhlb())) {
    ptr = new stuff(/*...*/);
    // sink store barrier
    if (!stuff_ptr.attempt_update_ssb(ptr, 0)) {
      delete ptr;
      // hoist load barrier (with data dependency "hint")
      if (0 == (ptr = stuff_ptr.load_ddhlb()))
        abort();
    }
  }
  return *ptr;
}

regards,
alexander.

Joseph Seigh

May 20, 2003, 11:15:09 AM

John Hickin wrote:
>
> Joe,
>
> It has been a while since I read papers on lock-free algorithms but what
> Dave says rings a bell. I remember that these types of loops could fail from
> time to time. If I remember correctly, when used to manage allocator free
> lists, a lock free algorithm could actually leak memory.
>
> Is this in fact true or is it just a figment of my imagination?
>

There is nothing inherent in lock-free algorithms that makes them leak memory. There
are lock-free algorithms that turn out not to work correctly after all. There was
also one algorithm that temporarily swiped the entire queue, dequeued from it, and
queued the remainder back, all in order to avoid the ABA problem. But that is
not lock-free since other threads would have to wait for the queue to show back
up. It also had a rather catastrophic failure mode, since a thread failure could
lose the entire queue, not just the thread's resources.

If you want a good overview of what's wrong with some of the lock-free algorithms
out there, read M. Michael's more recent papers on lock-free stuff. He pretty much
rips into everyone else's algorithms in the overview part of his paper.

Joe Seigh

SenderX

May 20, 2003, 4:57:12 PM
> The machine provides a CAS instruction. I build a lock, that's used to
> protect a shared data cell, by looping on a CAS until the lock flag is
> clear. You loop on a CAS until the shared data cell has a particular value.

You don't know the basics about lock-free algo's, everybody gets their tries
at the CAS.

Your solution is to FORCE all contending threads to WAIT in a LINE for the
DATA.

I let ALL the threads IN, and they ALWAYS get a try at the DATA.

> though it does so at a fairly simplistic level.
> (And I don't say that to belittle the article, because it was clearly
> intended to be a fairly basic treatment of the subject.)

Normal locks get priority inversion, conveyor belt syndrome, waits on a
kernel object, spins for nothing (spinlock), etc., etc. Lock-free has NONE of
those. You can't compare them, it's unfair to the normal lock. =)

> In fact, I've several times been beaten up for making optimizations in our
> thread library that favor well-written applications while penalizing (even
> slightly) poorly written applications.

Damn shame. You get condemned for your intelligence? Not good. ;)

> And most forms of lock-free synchronization need to spin;

Lock-free tries to access the data on EACH spin. A spin-lock lets each
contending thread spin for NOTHING? What a waste of processor usage. ;(

> Often "forever", as in the examples in the paper you linked.

Lock-free CAS critical sections do NOT spin forever. Threads get through in
a finite number of tries.

> You haven't built, or even defined, the app; so how can anyone possibly know
> the performance characteristics, much less how it would be affected by a
> single massive redesign of some component?

Trust me, if any app's producers or consumers went through several
highly contended shared LIFO " normal lock " stacks, and you changed them to
lock-free... The app's throughput would increase.

=)

SenderX

May 20, 2003, 5:19:28 PM
> If you want a good overview of what's wrong with some of the lock-free algorithms
> out there, read M. Michael's more recent papers on lock-free stuff. He pretty much
> rips into everyone else's algorithms in the overview part of his paper.

There are only 2 FIFO lock-free algo's and 1 LIFO algo that I know of.
Does he solve lock-free FIFO & LIFO in a radically different way?

We could write a paper on the " new " way we solved the ABA problem, and rip
apart the " old, unstable " ABA solution. We could say our solution is
100%, everybody else's could crash at any time ( which is true! )... ;)

Really, our solution beats the pants off of every other ABA solution I've
ever seen. I think it is rock-solid.

Have you looked at the atomic_ptr< t >::reserve function I added yet? It
makes atomic_ptr really cool!

Sean Burke

May 20, 2003, 7:22:45 PM

"SenderX" <x...@xxx.com> writes:

> > The machine provides a CAS instruction. I build a lock, that's used to
> > protect a shared data cell, by looping on a CAS until the lock flag is
> > clear. You loop on a CAS until the shared data cell has a particular value.
>
> You don't know the basics about lock-free algo's, everybody gets their tries
> at the CAS.
>
> Your solution is to FORCE all contending threads to WAIT in a LINE for the
> DATA.
>
> I let ALL the threads IN, and they ALWAYS get a try at the DATA.

I worry about livelock. The CPU's cannot modify the data concurrently,
they have to pass it back and forth between caches. I can believe that
backoff works well for a datum that two or three CPUs are contending
for, but how does this scale when 10 or 20 CPUs are in contention?
(Shades of ethernet vs. token-ring! :-)

In other words, how far do you have to back off to give one CPU a
reasonable chance of holding the cache line long enough to complete
an operation, and is there a point (number of contending CPUs) at
which the traditional lock outperforms the lock-free algorithm?

I would note that your LIFO solution may be worse than the counter-
protected lock-free LIFO, on related grounds. Your solution requires
the thread to acquire exclusive access to the cache line twice in
order to complete an operation: once to set read ID, and again when
performing the CAS.

In contrast, the counter-protected LIFO can fetch the queue head
and counter under shared access, and only needs exclusive access
when doing the CAS. It seems possible that this would allow the
algorithm to scale better.
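As an aside, to make the "head pointer plus counter" structure concrete
(an illustrative sketch only: CAS() here is a hypothetical double-width
compare-and-swap in the style used elsewhere in this thread, e.g.
cmpxchg8b on 32-bit x86, and node reclamation is deliberately ignored):

struct node { struct node *next; /* payload ... */ };

struct head_t {                /* must fit in one double-width CAS          */
    struct node  *ptr;         /* top of the LIFO                           */
    unsigned long ver;         /* version counter, bumped on every update,  */
};                             /* which is what defeats ABA                 */

static struct head_t stack_head;    /* shared */

struct node *pop(void)
{
    struct head_t oldh, newh;
    do {
        oldh = stack_head;               /* shared read of pointer + counter */
        if (oldh.ptr == NULL)
            return NULL;
        newh.ptr = oldh.ptr->next;       /* real code must also keep nodes   */
        newh.ver = oldh.ver + 1;         /* alive while others may read them */
    } while (!CAS(&stack_head, oldh, newh));  /* hypothetical 8/16-byte CAS  */
    return oldh.ptr;
}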

A nice theory, anyway. All I need now is access to a 20+CPU
machine to test it out on! :-}

-SEan

--
Remove the pi to email me.

David Schwartz

May 20, 2003, 8:32:08 PM

"Sean Burke" <burke_...@pacbell.net> wrote in message
news:x7d6idf...@bolo.xenadyne.com...

> I worry about livelock. The CPU's cannot modify the data concurrently,
> they have to pass it back and forth between caches. I can believe that
> backoff works well for a datum that two or three CPUs are contending
> for, but how does this scale when 10 or 20 CPUs are in contention?
> (Shades of ethernet vs. token-ring! :-)

Bingo! Give the man a cigar.

In the corresponding locked implementation, either there's a lot of
contention or there's very little. If there's very little contention,
there's very little benefit to a lock-free algorithm. If there's a lot of
contention, a lock-free algorithm will mean a lot of cache ping-ponging that
a lock-based algorithm would largely avoid.

Lock-free algorithms are fascinating, fun to work on, and of serious
academic interest. But they're more like sports cars than trucks. Most
real-world programming problems are about how much trucks can hold and how
fast they can go, not how fast sports cars can go.

Bottom line: Don't take risks in serious applications. Benchmark
competing implementations inside an application as much like your target
application as possible, then benchmark inside your application.

DS


SenderX

May 20, 2003, 9:29:00 PM
> I worry about livelock. The CPU's cannot modify the data concurrently,
> they have to pass it back and forth between caches. I can believe that
> backoff works well for a datum that two or three CPUs are contending
> for, but how does this scale when 10 or 20 CPUs are in contention?
> (Shades of ethernet vs. token-ring! :-)

Good thoughts...

I think if any algo is prone to live-lock it will suffer from it on a 2
processor box as well...

If two processors kept changing the comparands faster than the exchange, then
live-lock would happen.

The Mike & Scott FIFO lock-free Queue algo has been around for a long time
now, which I use. It does not suffer from livelock.

Here is the paper on the algo. It heavily tests the queue using a
12-processor SGI challenge multi-processor:

http://www.cs.rochester.edu/u/scott/papers/1996_PODC_queues.pdf

No livelock reported, nor have I seen any live lock under the SMP Xeon box I
test on.

> In other words, how far do you have to back off to give one CPU a
> reasonable chance of holding the cache line long enough to complete
> an operation, and is there a point (number of contending CPUs) at
> which the traditional lock outperforms the lock-free algorithm?

Na, the 12-processor test proves that.

There is an http link in the paper to the code they used in the article; it
uses an exponential backoff.
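The shape of such a backoff is simple (my own sketch, not the code from
the paper; CAS(), backoff_delay() and compute_update() are hypothetical
stand-ins for a compare-and-swap, a pause/yield, and the actual update):

void update_with_backoff(long *shared)
{
    long oldv, newv;
    int  attempts = 0;
    do {
        if (attempts++ > 0)
            backoff_delay(1 << attempts);   /* wait ~2^n units after the    */
                                            /* n-th failed CAS (capped in   */
                                            /* real code)                   */
        oldv = *shared;                     /* re-read the current value    */
        newv = compute_update(oldv);        /* build the replacement value  */
    } while (!CAS(shared, oldv, newv));     /* retry until the CAS succeeds */
}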

> In contrast, the counter-protected LIFO can fetch the queue head
> and counter under shared access, and only needs exclusive access
> when doing the CAS. It seems possible that this would allow the
> algorithm to scale better.

Maybe. ;)

But the counter protected version has to use big version counts ( even
32-bit is small now ) and can STILL overflow and result in an ABA, which
will crash the system.

My solution will never crash due to an ABA problem. You can even use a
16-bit version counter, using my solution:

http://appcore.home.attbi.com/zips/AtomicPtrABATest.zip

The AtomicPtr class uses a 16-bit version stamp.

> A nice theory, anyway. All I need now is access to a 20+CPU
> machine to test it out on! :-}

=)

I believe I have a way that can simulate a live-lock test for my current ABA
solution...

Let me think here.

SenderX

May 20, 2003, 9:58:31 PM
Thanx Sean for making me do my needed livelock test on my ABA solution now,
instead of later!

This is exactly what this newsgroup is for, the sharing of great ideas.


My solution DOES suffer from livelock under " certain " test situations I
put it under.

The way I can kill my ABA solution with a livelock is as follows:


do
{

/* Set the ABA id */
SharedStack.lAbaId = lMyAbaId;

/* Un-comment to get a livelock */
/*Sleep( 1 );*/

...

}

while( ! CAS( ..., ABA id ) );


The livelock doesn't seem to happen unless I put the Sleep(), but that
doesn't mean the problem isn't still there. ;)


This livelock does NOT happen when using the version-counters, instead of
thread id's.

This is because other threads don't cause contention by setting the AbaID,
like mine does.


I will change all the source code on my site to use version-counts, as that
is a trivial task, until I look at the livelock problem with my ABA solution
some more.

But, I still can't get real livelock... Unless I put a sleep inside the algo
itself. Humm...

Anyway, thank you!

Alexander Terekhov

May 21, 2003, 6:57:46 AM
Interesting...

This "DCSI-TSD"-thing won't seem to work on True64... unless DRB brought
it into compliance in spite of ala-"Making pthread_getspecific slower may
be acceptable to you, but not to me" [http://tinyurl.com/cabk] remarks.
Hmm, I must conceded that Ulrich was probably {partially} right defending
the standard. Well, ideally, the TSD operations *overloads* for the
"no_cleanup" keys [more expensive due to versioning/whatever] would solve
it without impacting the performance of "with-cleanup" TSD. This can de
done easily in POSIX.C++... Oder? ;-)

Joseph Seigh

May 21, 2003, 7:17:24 AM

Sean Burke wrote:
>
> I worry about livelock. The CPU's cannot modify the data concurrently,
> they have to pass it back and forth between caches. I can believe that
> backoff works well for a datum that two or three CPUs are contending
> for, but how does this scale when 10 or 20 CPUs are in contention?
> (Shades of ethernet vs. token-ring! :-)
>
> In other words, how far do you have to back off to give one CPU a
> reasonable chance of holding the cache line long enough to complete
> an operation, and is there a point (number of contending CPUs) at
> which the traditional lock outperforms the lock-free algorithm?
>
> I would note that your LIFO solution may be worse that the counter-
> protected lock-free LIFO, on related grounds. Your solution requires
> the thread to acquire exclusive access to the cache line twice in
> order to complete an operation: once to set read ID, and again when
> performing the CAS.
>
> In contrast, the counter-protected LIFO can fetch the queue head
> and counter under shared access, and only needs exclusive access
> when doing the CAS. It seems possible that this would allow the
> algorithm to scale better.

Livelock is possible at least in theory. Remember though that this
is in a CAS loop which in theory can suffer from starvation anyways.
There's supposed to be some hardware implementation magic so that
does not happen. But it's possible that the unique id reservation
scheme would work better since it pre-fetches the cache line in
exclusive mode which you need to do anyway for the CAS update to
succeed.

This is a general problem with almost all the hardware synchronization
primitives in that they don't guarantee fairness or forward progress.
So in a sense, at that level, all algorithms are flawed.

I also have a feeling that when you get to a very large number of processors
or on NUMA systems, that you may not be able to depend on CAS scaling
very well or at all. This is why I prefer loop free algorithms based
on fetch-and-op and loop-free non-fair CAS. If fair CAS breaks, you
are likely to see these if you see any arbiter mechanisms.
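For the trivial counter case, the contrast Joe is drawing looks like this
(a sketch; the GCC __sync builtins are used merely as one concrete spelling
of compare-and-swap and fetch-and-add):

/* CAS-loop increment: an individual thread may retry indefinitely under
   contention -- no per-thread forward-progress guarantee */
void inc_cas(volatile long *p)
{
    long old;
    do {
        old = *p;
    } while (!__sync_bool_compare_and_swap(p, old, old + 1));
}

/* fetch-and-add increment: no retry loop at all; the hardware arbitrates,
   so every caller completes in a bounded number of steps */
void inc_faa(volatile long *p)
{
    __sync_fetch_and_add(p, 1);
}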

The other approach you might see being taken with a large number of processors
is distributed algorithms such as Peterson's algorithm. However, distributed
algorithms have a large number of problems of their own.

Joe Seigh

Joseph Seigh

May 21, 2003, 7:41:20 AM

David Schwartz wrote:
>
> "Sean Burke" <burke_...@pacbell.net> wrote in message
> news:x7d6idf...@bolo.xenadyne.com...
>
> > I worry about livelock. The CPU's cannot modify the data concurrently,
> > they have to pass it back and forth between caches. I can believe that
> > backoff works well for a datum that two or three CPUs are contending
> > for, but how does this scale when 10 or 20 CPUs are in contention?
> > (Shades of ethernet vs. token-ring! :-)
>
> Bingo! Give the man a cigar.
>
> In the corresponding locked implementation, either there's a lot of
> contention or there's very little. If there's very little contention,
> there's very little benefit to a lock-free algorithm. If there's a lot of
> contention, a lock-free algorithm will mean a lot of cache ping-ponging that
> a lock-based algorithm would largely avoid.

I don't think that necessarily follows. There's nothing inherent in lock-free
that requires cache ping ponging. You can come up with examples both ways
which doesn't prove anything.

>
> Lock-free algorithms are fascinating, fun to work on, and of serious
> academic interest. But they're more like sports cars than trucks. Most
> real-world programming problems are about how much trucks can hold and how
> fast they can go, not how fast sports cars can go.

So Linux is not a real-world solution? That's interesting, since RCU (Read-
Copy-Update) was put into Linux as part of the Linux scalability work, RCU
is a lock-free solution, and scalability is definitely considered a real-world
problem.

I don't think it's wise to dismiss out of hand a whole category
of solutions when problem solving, but that's your choice. It just means that
there are problems that you will not be able to solve that others will be
able to solve. Do you really want to put yourself into that situation?

Joe Seigh

David Butenhof

unread,
May 21, 2003, 11:02:04 AM5/21/03
to
Alexander Terekhov wrote:

> Interesting...
>
> Alexander Terekhov wrote:
>>
>> David Butenhof wrote:
>> [...]
>> > I've proposed "DCI" (double-checked initialization) as a substitution,
>> > which makes a whole lot more sense, though it still doesn't really
>> > describe the intent. (But then, if we wanted to be explicit and
>> > thorough, we'd hardly be so strongly attracted to TLAs in the first
>> > place.) Maybe "RLI" would be better: "reduced locking initialization".
>> > ;-)
>>
>> How about "DCSI-MBR", "DCSI-TLS"... and "DCCI" (I'd love AC/DC ;-) ):

Um, let's stay away from "AC/DC", OK? But DCSI and DCCI are much improved
over DCL or DCI. I like it. I also like the distinction between them, since
anyone considering DCCI as an alternative needs to be very careful that the
initialization is "idempotent" -- it can't be something that really needs
to be done only once, since multiple threads may end up running it
simultaneously and only one of the constructed values will (at random) be
retained.

From the user perspective, there's no functional difference between DCSI-MBR
and DCSI-TLS, but there are certainly contexts where it's handy to know.

>
> This "DCSI-TSD"-thing won't seem to work on True64... unless DRB brought
> it into compliance in spite of ala-"Making pthread_getspecific slower may
> be acceptable to you, but not to me" [http://tinyurl.com/cabk] remarks.
> Hmm, I must conceded that Ulrich was probably {partially} right defending
> the standard. Well, ideally, the TSD operations *overloads* for the
> "no_cleanup" keys [more expensive due to versioning/whatever] would solve
> it without impacting the performance of "with-cleanup" TSD. This can de
> done easily in POSIX.C++... Oder? ;-)

I'm not entirely sure what the connection is, perhaps because you haven't
shown much of your DCSI-TLS example infrastructure. Are you dynamically
creating and destroying TSD keys without cleaning up the mess you leave in
various threads? If so, then yes, that could be a problem in any efficient
implementation of TSD where pthread_key_delete() actually makes a key
available for reuse.

I really don't want to repeat the whole discussion, but the simple fact is
that the late and incompletely considered addition of pthread_key_delete
broke the standard. It IS possible to implement the implications of the
current standard, but it's messy and inefficient, and violates some
principal tenets of the working group -- anyone who had proposed such a
solution would have gone down in flames, fast.

Basically, if you use pthread_key_delete(), you're responsible for
synchronizing TSD access across all threads. You at least need to be sure
all threads are "done" with the current key and will never read it again,
because pthread_key_delete() cannot invalidate the key. Individual threads
should confirm that synchronization by clearing their value for the key.
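A minimal sketch of that discipline (illustrative only: the helper names and
the application-level synchronization that orders them are assumptions; only
the pthread_* calls themselves are real):

#include <pthread.h>
#include <stddef.h>   /* for NULL */

static pthread_key_t key;            /* created earlier with pthread_key_create() */

/* Each thread calls this once it is completely done with the key. */
void done_with_key(void)
{
    void *p = pthread_getspecific(key);
    /* ...dispose of p as appropriate... */
    (void)p;
    pthread_setspecific(key, NULL);  /* clear the slot so no stale value survives */
}

/* Called exactly once, only after every thread has run done_with_key(). */
void delete_the_key(void)
{
    pthread_key_delete(key);         /* the key value must never be used again */
}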

The best solution is to stay away from pthread_key_delete(). ;-)

David Butenhof

unread,
May 21, 2003, 11:19:56 AM5/21/03
to
Joseph Seigh wrote:

> David Butenhof wrote:
>>
>> The machine provides a CAS instruction. I build a lock, that's used to
>> protect a shared data cell, by looping on a CAS until the lock flag is
>> clear. You loop on a CAS until the shared data cell has a particular
>> value.
>
> That's not an entirely correct characterization. In a lock-free algorithm
> a thread can make forward progress regardless of the progress of other
> threads. So a thread looping until another thread sets a particular value
> would by definition not be lock-free.

That's not at all true. When you do a lock-free queue insert, you fetch the
current first entry, decide what you WANT the first entry to be (your
insert), and you try to make the exchange. Now, yes, if that attempt fails
you retry after fetching the NEW first value, rather than continuing to
test against the original expectation -- but all I was saying is that you
still need to be prepared to loop when your expected value isn't there,
whether that's a queue header link or a mutex lock bit. The higher level of
code is different, but that's not a difference that's visible at the
machine instruction or memory level. Either way, you load, cas, and loop
when the cas fails. The impact on instruction scheduling and memory
contention, locally, is the same.
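To make that concrete, here is a minimal sketch of the two loops being
compared (illustrative only; cas_ptr() and cas_int() stand for an atomic
compare-and-swap that returns nonzero on success -- they are not a real API):

typedef struct node { struct node *next; } node_t;

extern int cas_ptr(node_t **addr, node_t *expected, node_t *desired);    /* hypothetical CAS */
extern int cas_int(volatile int *addr, int expected, int desired);       /* hypothetical CAS */

/* Lock-free LIFO insert: load, cas, loop when the cas fails. */
void lifo_push(node_t **head, node_t *n)
{
    node_t *old_head;
    do {
        old_head = *head;                   /* fetch the current first entry       */
        n->next  = old_head;                /* decide what we WANT the head to be  */
    } while (!cas_ptr(head, old_head, n));  /* on failure, retry with the NEW value */
}

/* Spin-lock acquire: load, cas, loop when the cas fails -- the same shape locally. */
void spin_lock(volatile int *lockbit)
{
    do {
        while (*lockbit != 0)
            ;                               /* wait for the expected value (a real
                                               mutex would block rather than spin) */
    } while (!cas_int(lockbit, 0, 1));
}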

In the broader picture, the behavior is different; that's not relevant at
this point, because I'm simply addressing the contention that a lock and a
'lock-free' sequence are so absolutely fundamentally different that "you
simply cannot compare" them.

>> Of course, "in theory", the CAS will rarely fail, and will "almost never"
>> fail more than once or twice. Because contention, "of course", is always
>> assumed to be low...
>
> Well, I know for a fact that on at least one architecture by a company
> that takes their SMP architecture definition seriously, simple CAS loops
> are required to be "fair", that no processor can have a permanent
> advantage over others.

And that's nice, as far as it goes, because it benefits locks... which are,
after all, merely CAS loops. ;-)

Still, "permanent" is a big and slippery word. What's "permanent" in a
processor that executes a billion instructions a second? What may seem
fleeting in our perception is an awfully long time to the computer.

In any case, access to a CAS-based lock will be as "fair" as a CAS-based
lock-free transaction. Higher level algorithms will affect the
application's view of the situation, but again that's not relevant here.

> The other problem with locks is that, unless they are spin locks, threads
> that are waiting for a lock can be suspended and that can make the lock
> throughput slow down dramatically. Locks that allow queue jumping,
> non-FIFO,
> can ameliorate this somewhat, but the problem is still there. You don't
> get this effect with lock-free.

And, once more, heavy contention is bad for any concurrent application.
Although the effects on some types of mutex, for some application
algorithms, may be measurably worse than with some lock-free behaviors, to
claim that lock-free implementations are immune to contention effects is
naive at best.

I'm dealing with the extremist viewpoint here, not the moderate and
pragmatic. We started out here with "lock-free WILL increase your
application throughput" and "you simply cannot compare". Once we've
dispensed with the fantasy world, we can start worrying about reality. One
step at a time...

David Butenhof

unread,
May 21, 2003, 11:29:44 AM5/21/03
to
SenderX wrote:

>> I worry about livelock. The CPU's cannot modify the data concurrently,
>> they have to pass it back and forth between caches. I can believe that
>> backoff works well for a datum that two or three CPUs are contending
>> for, but how does this scale when 10 or 20 CPUs are in contention?
>> (Shades of ethernet vs. token-ring! :-)
>
> Good thoughts...
>
> I think if any algo is prone to live-lock it will suffer from it on a 2
> processor box as well...

In theory, probably; however, in practice you might have problems that are
so small on a 2-way system that you won't be able to measure them,
whereas on a 4-way they might increase exponentially. It all depends on the
cache and memory architecture of the systems you're using.

And then you try running on a 16-way or 64-way NUMA box, and suddenly you
see an entirely different world. ;-)

Joseph Seigh

unread,
May 21, 2003, 12:40:24 PM5/21/03
to

David Butenhof wrote:
>
>
> I'm dealing with the extremist viewpoint here, not the moderate and
> pragmatic. We started out here with "lock-free WILL increase your
> application throughput" and "you simply cannot compare". Once we've
> dispensed with the fantasy world, we can start worrying about reality. One
> step at a time...
>

Rather than looping forever on some contention points here :),
what exactly is your point? That there is no reason or
need for lock-free, given that contention is the problem and
should never be allowed to become that significant? Or that
there is a use for lock-free -- and if so, what do you think it is?

Joe Seigh

Alexander Terekhov

unread,
May 21, 2003, 1:49:04 PM5/21/03
to

David Butenhof wrote:
[...]

> > This "DCSI-TSD"-thing won't seem to work on True64... unless DRB brought
> > it into compliance in spite of ala-"Making pthread_getspecific slower may
> > be acceptable to you, but not to me" [http://tinyurl.com/cabk] remarks.
> > Hmm, I must conceded that Ulrich was probably {partially} right defending
> > the standard. Well, ideally, the TSD operations *overloads* for the
> > "no_cleanup" keys [more expensive due to versioning/whatever] would solve
> > it without impacting the performance of "with-cleanup" TSD. This can de
> > done easily in POSIX.C++... Oder? ;-)
>
> I'm not entirely sure what the connection is, perhaps because you haven't
> shown much of your DCSI-TLS example infrastructure.

Imagine that we've got a <cthread> header that would provide:

extern "C++" int pthread_key_create(pthread_key_t *, void (*)(void *));

(and etc.) to solve the {ugly-ugly-ugly} linkage problem. Now, here's how
the implementation could look (no_cleanup overloading aside for a moment,
and modulo bugs, of course):

#include <new> // for std::bad_alloc, std::try_again aside for a moment
#include <cthread>
#include <cassert>

struct no_cleanup { void operator()(void*) { } };

template<typename T, typename cleanup>
class thread_specific_ptr /* noncopyable */ {

static void TSD_dtor(void * p) {
cleanup()(static_cast<T *>(p));
}

pthread_key_t m_key;

public:

thread_specific_ptr() throw(std::bad_alloc/*, std::try_again*/) {
int status = pthread_key_create(&m_key, &TSD_dtor);
// throw std::bad_alloc if status == ENOMEM...
// throw std::try_again if status == EAGAIN...
assert(!status);
}

~thread_specific_ptr() throw() {
int status = pthread_key_delete(m_key);
assert(!status);
}

T * get() const throw() {
return static_cast<T *>(pthread_getspecific(m_key));
}

void set(T * p) const throw(std::bad_alloc) {
int status = pthread_setspecific(m_key, p);
// throw std::bad_alloc if status == ENOMEM...
assert(!status);
}

T * operator->() const throw() {
return get();
}

T & operator*() const throw() {
return *get();
}

T * release() throw() {
T * p = get();
if (p) set(0); // only an idiot will throw std::bad_alloc here
return p;
}

void reset(T * p = 0) throw(std::bad_alloc) {
T * old_p = get();
if (old_p != p) {
set(p);
cleanup()(old_p);
}
}

};

> Are you dynamically
> creating and destroying TSD keys without cleaning up the mess you leave in
> various threads?

Yes. The "problem" here is that TSD is an implementation detail and I don't
want to expose it to the clients...

> If so, then yes, that could be a problem in any efficient
> implementation of TSD where pthread_key_delete() actually makes a key
> available for reuse.
>
> I really don't want to repeat the whole discussion, but the simple fact is
> that the late and incompletely considered addition of pthread_key_delete
> broke the standard. It IS possible to implement the implications of the
> current standard, but it's messy and inefficient, and violates some
> principle tenets of the working group -- anyone who had proposed such a
> solution would have gone down in flames, fast.
>
> Basically, if you use pthread_key_delete(), you're responsible for
> synchronizing TSD access across all threads. You at least need to be sure
> all threads are "done" with the current key and will never read it again,
> because pthread_key_delete() cannot invalidate the key. Individual threads
> should confirm that synchronization by clearing their value for the key.

Yes, but I really don't want to expose thread_specific_ptr::release() or
thread_specific_ptr::reset() here -- the client shall not know about TSD
used as "just an implementation detail"!

class stuff { /* ... */ };

class thing {
public:
thing(/* ... */) : /* ... */ stuff_shared_ptr(0) /* ... */ { /*...*/ }
~thing() { /* ... */ delete stuff_shared_ptr; /* ... */ }
/* ... */
const stuff & stuff_instance();
/* ... */
private:
/* ... */
mutex stuff_mtx;
stuff * stuff_shared_ptr;

thread_specific_ptr<stuff, no_cleanup> stuff_thread_ptr;
/* ... */
};

const stuff & thing::stuff_instance() { // "lazy" one
stuff * ptr;
if (0 == (ptr = stuff_thread_ptr.get())) {
{ mutex::guard guard(stuff_mtx);
if (0 == (ptr = stuff_shared_ptr))
ptr = stuff_shared_ptr = new stuff(/*...*/);
}
stuff_thread_ptr.set(ptr);
}
return *ptr;
}

regards,
alexander.

Mike Mowbray

unread,
May 21, 2003, 11:18:45 PM5/21/03
to
David Butenhof wrote:

> [...] benefits locks... which are, after all,
> merely CAS loops. ;-)
> [...]

Are they "merely" CAS loops? I was under the impression
that for operations reliant on mutex-based locks we have
something like:

acquire:
read lockbit
if lockbit == 0, use cas to set it
else (or if the cas failed) goto acquire.

/* At this point we've set the lockbit. */

membar #StoreLoad | #StoreStore
...adjust list pointer(s)...
membar #LoadStore | #StoreStore
reset lock bit

In contrast, the same operation based solely on CAS is
something like:

start:
read list pointer(s)
use (d)cas to adjust the pointer(s)
if cas failed, goto start.

I.e: the CAS-based version relies on hardware-provided
atomicity of CAS and does not need explicit membars.

The reason I ask about the above is that I've seen
performance reports for mutex-based refcount incrementing
compared to cas-based refcount incrementing which
reported:

For 1 thread: cas version was 3 times
faster than mutex-based version.

For N>1 contending threads:
cas version became exponentially
preferable with increasing N.

This was on a multiprocessor using ultrasparc-II's, I believe.

(BTW, I do understand that this only matters if the
performance of one's app is genuinely limited by these
particular operations.)

But my basic question remains: since mutex-based locks involve
at least 2 membars, isn't the lock-based version intrinsically
quite a bit slower than a purely CAS-based solution?


- MikeM.

David Holmes

unread,
May 22, 2003, 2:34:51 AM5/22/03
to
"Mike Mowbray" <mi...@despammed.com> wrote in message
news:3ECC4180...@despammed.com...

> In contrast, the same operation based solely on CAS is
> something like:
>
> start:
> read list pointer(s)
> use (d)cas to adjust the pointer(s)
> if cas failed, goto start.
>
> I.e: the CAS-based version relies on hardware-provided
> atomicity of CAS and does not need explicit membars.

That's true as far as it goes, but usually some thread will want to follow
the pointers that have been adjusted and look at the data structures being
referred to. That means you still need to ensure the appropriate memory
synchronization occurs (either with explicit membars or because it's a
consequence/side-effect of the CAS).

David Holmes


Alexander Terekhov

unread,
May 22, 2003, 5:42:04 AM5/22/03
to

Mike Mowbray wrote:
>
> David Butenhof wrote:
>
> > [...] benefits locks... which are, after all,
> > merely CAS loops. ;-)
> > [...]
>
> Are they "merely" CAS loops? I was under the impression
> that for operations reliant on mutex-based locks we have
> something like:
>
> acquire:
> read lockbit
> if lockbit == 0, use cas to set it
> else (or if the cas failed) goto acquire.
>
> /* At this point we've set the lockbit. */
>
> membar #StoreLoad | #StoreStore
> ...adjust list pointer(s)...
> membar #LoadStore | #StoreStore
> reset lock bit
>
> In contrast, the same operation based solely on CAS is
> something like:
>
> start:
> read list pointer(s)
> use (d)cas to adjust the pointer(s)
> if cas failed, goto start.

No. As David Holmes noted, that's "incorrect". You'll need:

start:
read list pointer(s)
use (d)cas_WITH_STORE_SINK_BARRIER to adjust the pointer(s)
if cas failed, goto start.

and something a la atomic<>::load_ddhlb() (hoist load barrier
with data dependency "hint") on the reader/consumer side. You
might want to take a look at the DCCI example and/or the implementation
of pthread_refcount_t using atomic<>.
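A minimal consumer-side sketch of that last point (illustrative only;
membar_data_dependency() is a placeholder for whatever the platform needs --
e.g. "mb" on Alpha, typically a no-op elsewhere -- and is not a real API):

struct node { int data; };

extern struct node * volatile head;          /* pointer published by the CAS writer */
extern void membar_data_dependency(void);    /* hypothetical dependent-load barrier */

int consume(void)
{
    struct node *p = head;                   /* load the published pointer          */
    membar_data_dependency();                /* order the dependent load below      */
    return p->data;                          /* now safe to follow the pointer      */
}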

>
> I.e: the CAS-based version relies on hardware-provided
> atomicity of CAS and does not need explicit membars.

It still does need membars (but "less" than general locking
stuff).

>
> The reason I ask about the above is that I've seen
> performance reports for mutex-based refcount incrementing
> compared to cas-based refcount incrementing which
> reported:
>
> For 1 thread: cas version was 3 times
> faster than mutex-based version.
>
> For N>1 contending threads:
> cas version became exponentially
> preferable with increasing N.

AFAICS, that's because 'lock-free' reference counting (and other
'lock-free' stuff) helps to reduce contention (it can completely
eliminate some contention points; see, for example, the reference-
counted string stuff that was posted here [I mean the stuff using
pthread_refcount_t]).

>
> This was on a multiprocessor using ultrasparc-II's, I believe.
>
> (BTW, I do understand that this only matters if the
> performance of one's app is genuinely limited by these
> particular operations.)
>
> But my basic question remains: since mutex-based locks involve
> at least 2 membars, isn't the lock-based version intrinsically
> quite a bit slower than a purely CAS-based solution?

Yup. The lock-based version (using 'general' locks) is more expensive
in terms of the "amount" of reordering constraints it imposes, to begin with.

regards,
alexander.

Joseph Seigh

unread,
May 22, 2003, 5:48:01 AM5/22/03
to

It's probably conventional C++ refcounting pointers, which are not externally
thread-safe, just internally thread-safe. No membars for object visibility are
needed, since you already have proper object visibility to begin with. However,
a membar would be required before the refcount decrement to protect the heap
from late stores.
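A minimal sketch of that last point (illustrative only; membar_store_release()
and atomic_decrement() are hypothetical placeholders, not a real API): all of
this thread's stores into the object must complete before the count can reach
zero and the storage can be freed.

#include <stdlib.h>

struct object { long refcount; /* ...payload... */ };

extern void membar_store_release(void);       /* hypothetical: no earlier stores may pass */
extern long atomic_decrement(long *counter);  /* hypothetical: returns the new value      */

void object_release(struct object *obj)
{
    membar_store_release();                   /* keep late stores out of freed storage    */
    if (atomic_decrement(&obj->refcount) == 0)
        free(obj);                            /* last reference: safe to release the heap */
}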

Joe Seigh

Joseph Seigh

unread,
May 22, 2003, 6:01:28 AM5/22/03
to

Mike Mowbray wrote:
> I.e: the CAS-based version relies on hardware-provided
> atomicity of CAS and does not need explicit membars.

The membar requirement has nothing to do with atomicity.
You may or may not require membars depending on circumstances.
Conventional C++ refcounted pointers do not require membars
for object visibility, since that is already a fait accompli.
However, they may require membars for other reasons.


>
> The reason I ask about the above is that I've seen
> performance reports for mutex-based refcount incrementing
> compared to cas-based refcount incrementing which
> reported:
>
> For 1 thread: cas version was 3 times
> faster than mutex-based version.
>
> For N>1 contending threads:
> cas version became exponentially
> preferable with increasing N.
>
> This was on a multiprocessor using ultrasparc-II's, I believe.
>
> (BTW, I do understand that this only matters if the
> performance of one's app is genuinely limited by these
> particular operations.)
>
> But my basic question remains: since mutex-based locks involve
> at least 2 membars, isn't the lock-based version intrinsically
> quite a bit slower than a purely CAS-based solution?
>

I don't think membars contribute that much of an effect here.
It's more likely that a single global mutex is being used to
protect the refcount. I'd be more interested in measurements
using one mutex per refcount, or using the Microsoft hashed-
mutexes scheme.
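For reference, a minimal sketch of the hashed-mutexes idea (illustrative only;
the pool size and the address hash are arbitrary):

#include <pthread.h>
#include <stddef.h>

#define NLOCKS 64
static pthread_mutex_t lock_pool[NLOCKS];

void lock_pool_init(void)                     /* call once at startup */
{
    for (int i = 0; i < NLOCKS; ++i)
        pthread_mutex_init(&lock_pool[i], NULL);
}

static pthread_mutex_t *lock_for(const void *addr)
{
    return &lock_pool[((size_t)addr >> 4) % NLOCKS];   /* cheap address hash */
}

void refcount_inc(long *rc)
{
    pthread_mutex_t *m = lock_for(rc);
    pthread_mutex_lock(m);
    ++*rc;                                    /* any refcount hashing to this slot contends */
    pthread_mutex_unlock(m);
}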

Joe Seigh

Alexander Terekhov

unread,
May 22, 2003, 6:14:30 AM5/22/03
to

Joseph Seigh wrote:
>
> David Holmes wrote:
> >
> > "Mike Mowbray" <mi...@despammed.com> wrote in message
> > news:3ECC4180...@despammed.com...
> > > In contrast, the same operation based solely on CAS is
> > > something like:
> > >
> > > start:
> > > read list pointer(s)
> > > use (d)cas to adjust the pointer(s)
> > > if cas failed, goto start.
> > >
> > > I.e: the CAS-based version relies on hardware-provided
> > > atomicity of CAS and does not need explicit membars.
> >
> > That's true as far as it goes, but usually some thread will want to follow
> > the pointers that have been adjusted and look at the data structures being
> > referred to. That means you still need to ensure the appropriate memory
> > synchronization occurs (either with explicit membars or because it's a
> > consequence/side-effect of the CAS).
> >
>
> It's probably C++ conventional refcounting pointers which are not externally
> thread-safe, just internally. No membars for object visibility are needed
> since you already have proper object visibility to begin with.

Not quite. It's kinda "problematical case for finalizers" which
works in C++ quite well. ;-) ;-)



> However,
> a membar would be required before refcount decrement to protect the heap
> from late stores.

One would need a consistent memory view for CLEANUP/DTORS (immutable
stuff aside for a moment), to begin with.

regards,
alexander.

Alexander Terekhov

unread,
May 22, 2003, 6:18:18 AM5/22/03
to

Joseph Seigh wrote:
>
> Mike Mowbray wrote:
> > I.e: the CAS-based version relies on hardware-provided
> > atomicity of CAS and does not need explicit membars.
>
> Membar requirement has nothing to do with the atomicity.
> You may or may not require membars depending on circumstances.
> Conventional C++ refcounted pointers do not require membars
> for object visibility since that is already fait acompli.

Joe, that's true for immutable stuff ONLY. As for mutable... well,
please see <http://terekhov.de/pthread_refcount_t/draft-edits.txt>.

regards,
alexander.

Joseph Seigh

unread,
May 22, 2003, 6:34:23 AM5/22/03
to

My understanding of C++ practices and conventions is that refcounts
are only incremented for objects that are "owned", in order to avoid the
race condition where the refcount goes to zero. Refcounting is only being used
for garbage collection. If you are sharing mutable objects, then you
should be using an appropriate mechanism for this, such as conventional
locks. If you are doing some kind of copy-on-write scheme, then
you are using a specialized application of refcounting and, depending
on how you do it, additional membars may be called for. But that's
not true for refcounting in general.

Joe Seigh

SenderX

unread,
May 22, 2003, 6:45:44 AM5/22/03
to
> Joe, that's true for immutable stuff ONLY. As for mutable... well,
> please see <http://terekhov.de/pthread_refcount_t/draft-edits.txt>.

When it comes to reference counted pointers...


Joe's atomic_ptr + the ABA scheme I added to it

= the best pointer C API ever!

http://appcore.home.attbi.com/src/AtomicPtr.c

;)


Are you developing a new reference counting scheme?

If so, here are some questions for ya:


1. Can we use your reference counted pointers in lock-free algos. like we
can with atomic_ptr?


2. Can we CAS your reference counted pointers, will they avoid the ABA
problem?


3. Can you #ifdef between a lock-free and a non-lock-free version?


4. Can a thread reference a shared object even though it didn't own a
marshaled reference already?

SenderX

unread,
May 22, 2003, 6:48:19 AM5/22/03
to
> I'd be more interested in measurements
> using one mutex per refcount or used the Microsoft hashed
> mutexes scheme.

You're in luck!

I have already implemented atomic_ptr using concurrent sections ( msdn mag
article I showed you ):

http://AppCore.home.attbi.com/src/AtomicPtr.c

They rock dude.

No need for a lock per-atomic_ptr at all, and they perform great!

Alexander Terekhov

unread,
May 22, 2003, 7:35:45 AM5/22/03
to

SenderX wrote:
>
> > Joe, that's true for immutable stuff ONLY. As for mutable... well,
> > please see <http://terekhov.de/pthread_refcount_t/draft-edits.txt>.
>
> When it comes to reference counted pointers...
>
> Joes atomic_ptr + The ABA scheme I added to it
>
> = the best pointer C API ever!
>
> http://appcore.home.attbi.com/src/AtomicPtr.c

Show me .txt, first.

>
> ;)
>
> Are you developing a new reference counting scheme?

I'm pretty much done with it. The scheme provides thread-safe
reference counting (as thread-safe as an 'int'). It's meant to
be used for things like COW std::string and the upcoming
std::shared_ptr<>.

>
> If so, here are some questions for ya:
>
> 1. Can we use your reference counted pointers in lock-free algos. like we
> can with atomic_ptr?

To tell the truth, I have no idea what atomic_ptr is meant for. Show
me the spec. My guess is that it's supposed to emulate Java's
volatile references with garbage collection, aiming at:

http://www.hpl.hp.com/personal/Hans_Boehm/gc/example.html
("Expensive Explicit Deallocation: An Example". Note that even with GC
one would still need memory barriers to ensure proper visibility
[Java's revised volatiles would work really well here], but GC makes
it possible to get rid of rather expensive locking/ref.counting.)

or similar problem(s) using DCAS/whatever (avoiding blocking, making
it "lock-free"/"non-blocking"[1]).

But, really, show me the spec., please.

>
> 2. Can we CAS your reference counted pointers, will they avoid the ABA
> problem?

You can't "CAS" pthread_refcount_t. It's a synchronization object
(either blocking or non-blocking).

>
> 3. Can you #ifdef between a lock-free and a non-lock-free version?

Yes. The idea is that there will be a feature test macro
PTHREAD_REFCOUNT_NONBLOCKING that you can use for #ifdef.

>
> 4. Can a thread reference a shared object even though it didn't own a
> marshaled reference already?

What?

regards,
alexander.

Alexander Terekhov

unread,
May 22, 2003, 7:38:19 AM5/22/03
to

Joseph Seigh wrote:
>
> Alexander Terekhov wrote:
> >
> > Joseph Seigh wrote:
> > >
> > > Mike Mowbray wrote:
> > > > I.e: the CAS-based version relies on hardware-provided
> > > > atomicity of CAS and does not need explicit membars.
> > >
> > > Membar requirement has nothing to do with the atomicity.
> > > You may or may not require membars depending on circumstances.
> > > Conventional C++ refcounted pointers do not require membars
> > > for object visibility since that is already fait acompli.
> >
> > Joe, that's true for immutable stuff ONLY. As for mutable... well,
> > please see <http://terekhov.de/pthread_refcount_t/draft-edits.txt>.
> >
>
> My understanding of C++ practices and conventions is that refcounts
> are only incremented for objects that are "owned" in order to avoid the
> refcount going to zero race condition. Refcounting is only being used
> for garbage collection. If you are sharing mutable objects then you
> should be using an appropriate mechanism for this such as conventional
> locks.

Joe, really...

http://google.com/groups?q=unref+inside+synchronized+regions&scoring=d

regards,
alexander.

Joseph Seigh

unread,
May 22, 2003, 8:06:47 AM5/22/03
to

Alexander Terekhov wrote:


>
> SenderX wrote:
> >
> >
> > Are you developing a new reference counting scheme?
>
> I'm pretty much done with it. The scheme provide thread-safe
> reference counting (thread-safe as an 'int'). It's meant to
> be used for things like COW std::string and the upcomming
> std::shared_ptr<>.
>
> >
> > If so, here are some questions for ya:
> >
> > 1. Can we use your reference counted pointers in lock-free algos. like we
> > can with atomic_ptr?
>
> To tell the truth, I have no what atomic_ptr meant for. Show
> me the spec. My guess is that it's supposed to emulate Java's
> volatile refernces with garbage collection aiming at:
>
> http://www.hpl.hp.com/personal/Hans_Boehm/gc/example.html
> ("Expensive Explicit Deallocation: An Example". Note that even with GC
> one would still need memory barriers to insure proper visibility
> [Java's revised volatiles would work really good here], but GC makes
> it possible to get rid of rather expensive locking/ref.counting.)
>
> or similar problem(s) using DCAS/whatever (avoiding blocking, making
> it "lock-free"/"non-blocking"[1]).
>

Yes, it's basically equivalent to Java references. If I bother to
add membars to atomic_ptr then it would be more or less equivalent to
Java references under the revised Java memory model. But it's refcounted
GC, so the usual caveats apply.

As to whether it's more expensive than Boehm-style GC, it all depends.
I'd be real careful about making broad generalizations here.

As to what they're good for: well, they're good for pretty much what
atomic Java references are good for.

Joe Seigh

SenderX

unread,
May 22, 2003, 8:13:51 AM5/22/03
to
> I'm pretty much done with it. The scheme provide thread-safe
> reference counting (thread-safe as an 'int'). It's meant to
> be used for things like COW std::string and the upcomming
> std::shared_ptr<>.

Humm, sounds like a " normal " counter then. It will suffer from
race-conditions and crash if you try to do atomic_ptr stuff with your
reference counting scheme for sure.

I'll explain later...


> To tell the truth, I have no what atomic_ptr meant for. Show
> me the spec.

Here is the patent, # 5,295,262, for Joe's atomic_ptr. The patent description
describes the great algo somewhat. ;)

The patent does NOT contain my ABA code; atomic_ptr was just a " real "
atomic pointer before I added the ability for it to be used in
state-of-the-art lock-free algos by making it avoid the deadly ABA problem.


The algo is basically a " Differential Reference Count ". One count is
negative and one is positive, there both tied to a single reference count.
The count can never get the famous reference counting race-condition that I
fear yours suffers from.


Joe and I should do a formal write-up of atomic_ptr w/ my ABA addition...
Even though the 16-bit version counter is small, it gets CAS'd along with the
atomic_ptr ephemeral count. So a thread would have to push and pop the same
node well over 65,535 times and keep a constant ephemeral count inside
another thread's CAS critical section, which is virtually impossible. So
atomic_ptr is virtually ABA-proof.
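A minimal, generic sketch of that version-counter idea (illustrative only --
this is not Joe's or SenderX's actual code, and dcas() stands for a
hypothetical double-width compare-and-swap):

typedef struct node { struct node *next; } node_t;

typedef struct {
    node_t   *ptr;
    unsigned  ver;                 /* bumped on every successful update           */
} tagged_ptr_t;

extern int dcas(tagged_ptr_t *addr, tagged_ptr_t expected, tagged_ptr_t desired);

void tagged_push(tagged_ptr_t *head, node_t *n)
{
    tagged_ptr_t oldh, newh;
    do {
        oldh     = *head;          /* atomic double-width read assumed            */
        n->next  = oldh.ptr;
        newh.ptr = n;
        newh.ver = oldh.ver + 1;   /* a recycled pointer alone no longer matches  */
    } while (!dcas(head, oldh, newh));
}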


> My guess is that it's supposed to emulate Java's
> volatile refernces with garbage collection aiming at:

It is not a garbage-collected pointer; it is a " real " atomic reference-
counted pointer that is more efficient than a garbage-collecting stop-the-
world or mark & sweep scheme.

In fact, my library is going to show how atomic_ptr can be used for a
lock-free reference-counted pointer scheme that ties into lock-free IBM
FreeLists to create 100% lock-free garbage collection in C! I am coding
this up very soon. ;)

Here is the FreeList API that will create the lock-free garbage bins:

http://appcore.home.attbi.com/src/FreeList.c

Here is the FreeQueue API that will pass lock-free garbage bins from thread
to thread.

http://appcore.home.attbi.com/src/FreeQueue.c


> You can't "CAS" pthread_refcount_t. It's a synchronization object
> (either blocking or non-blocking).

Damn. ;)


> > 3. Can you #ifdef between a lock-free and a non-lock-free version?
>
> Yes. I idea is that there will be a feature test macro
> PTHREAD_REFCOUNT_NONBLOCKING that you can use for #ifdef.

Thats good.


> > 4. Can a thread reference a shared object even though it didn't own a
> > marshaled reference already?
>
> What?

This is where atomic_ptr and your reference counting schemes part ways...

You see, a thread doesn't need to own a copy of an atomic_ptr in order to use
it. Yours probably does, because atomic_ptr is the only code I have ever seen
in my life that can do that!

In other words, there is absolutely no need to marshal atomic_ptrs off to
threads.

Joseph Seigh

unread,
May 22, 2003, 8:51:45 AM5/22/03
to

SenderX wrote:
>
> Here is the patent # 5,295,262, for Joes atomic_ptr. The patent description
> describes the great algo somewhat. ;)
>

as written by a patent attorney. Probably not the best way to describe anything
technical. Also, the patent only covers "ephemeral" pointers held locally
by a thread or process. It doesn't cover tracking of explicit pointers. atomic_ptr
is an application of classical explicit refcounting in conjunction with ephemeral
pointer counting to take care of the race condition that one usually runs into.

AFAIK the patent has expired due to non-payment of patent maintenance fees, so it should
be in the public domain. So has the original RCU patent, so if you implement RCU
via the original patent, you shouldn't need IBM's permission or have to pay royalties.

I've explained how both work before, but the c.p.t. attention span seems to be too
short to explain anything non-trivial. :)

Joe Seigh

Alexander Terekhov

unread,
May 22, 2003, 9:02:32 AM5/22/03
to

SenderX wrote:
>
> > I'm pretty much done with it. The scheme provide thread-safe
> > reference counting (thread-safe as an 'int'). It's meant to
> > be used for things like COW std::string and the upcomming
> > std::shared_ptr<>.
>
> Humm, sounds like a " normal " counter then. It will suffer from
> race-conditions and crash if you try to do atomic_ptr stuff with your
> reference counting scheme for sure.
>
> I'll explain later...
>
> > To tell the truth, I have no what atomic_ptr meant for. Show
> > me the spec.
>
> Here is the patent # 5,295,262, for Joes atomic_ptr. The patent description
> describes the great algo somewhat. ;)

Unfortunately, patents are written by lawyers** for lawyers**. E.g.:

http://terekhov.de/OO-Persistence/DE9-2000-0074-ep-9-2000.pdf
http://terekhov.de/OO-Persistence/ep-d-9-2000.pdf

**) most often, *clueless lawyers who are also not in the loop*.

[...]


> This is where atomic_ptr and your reference counting schemes part ways...
>

> You see, ...

Yeah. I see. I see that what you're aiming at... is nothing but a
<http://www.boost.org/libs/smart_ptr/shared_ptr.htm> pointer with
"a bit stronger" thread-safety guarantee -- clients don't need to
synchronize pointer mutations (see non-const methods). Fine. The
pthread_refcount_t beast does NOT provide such machinery. It's
meant to be an implementation detail for the CURRENT shared_ptr<>
(and <http://www.boost.org/libs/smart_ptr/weak_ptr.htm>) with the
'conventional' thread safety guarantee -- pointer mutations shall
always be synchronized *by the clients*. That's it. Kapis?

regards,
alexander.

Alexander Terekhov

unread,
May 22, 2003, 9:32:20 AM5/22/03
to

Joseph Seigh wrote:
[...]

> > > 1. Can we use your reference counted pointers in lock-free algos. like we
> > > can with atomic_ptr?
> >
> > To tell the truth, I have no what atomic_ptr meant for. Show
> > me the spec. My guess is that it's supposed to emulate Java's
> > volatile refernces with garbage collection aiming at:
> >
> > http://www.hpl.hp.com/personal/Hans_Boehm/gc/example.html
> > ("Expensive Explicit Deallocation: An Example". Note that even with GC
> > one would still need memory barriers to insure proper visibility
> > [Java's revised volatiles would work really good here], but GC makes
> > it possible to get rid of rather expensive locking/ref.counting.)
> >
> > or similar problem(s) using DCAS/whatever (avoiding blocking, making
> > it "lock-free"/"non-blocking"[1]).
> >
>
> Yes, it's basically equivalent to java references. If I bother to
> add membars to atomic_ptr then it would be more or less equivalent to
> java references under the revised java memory model. But it's refcounted
> GC so the usual caveats apply.
>
> As to whether it's more expensive than Boehm style GC, it all depends.
> I'd be real careful about making broad generalizations here.
>
> As to what they're good for. Well, they're good for pretty much what
> atomic java references are good for.

Okay. How about 'super_thread_safe_shared_ptr<>'? ;-)

Seriously, "atomic_ptr" doesn't really convey what it is. You might
also want to consider making it either blocking [maximum portability;
but pretty much useless ;-)] or non-blocking (with "ephemeral" stuff
and/or whatever tricks). Oder?

regards,
alexander.

Joseph Seigh

unread,
May 22, 2003, 12:46:22 PM5/22/03
to

Alexander Terekhov wrote:
>
> Okay. How about 'super_thread_safe_shared_ptr<>'? ;-)
>
> Seriously, "atomic_ptr" doesn't really convey what it is. You might
> also want to consider making it either blocking [maximum portability;
> but pretty much useless ;-)] or non-blocking (with "emphemeral" stuff
> and/or whatever tricks). Oder?
>

Well, Detlefs calls it lock-free reference counting. But Boost
would probably call what they're doing with thread-safe reference
counting lock-free also, and they are not the same. The meaning
of atomic w.r.t. Java pointers is, or should be, well understood, so
atomic-something would be indicated. Also, it wouldn't preclude
a non-lock-free implementation, though you would lose the benefits
of lock-free.

Maybe atomic_shared_ptr.

Joe Seigh

Alexander Terekhov

unread,
May 22, 2003, 1:37:13 PM5/22/03
to

Joseph Seigh wrote:
>
> Alexander Terekhov wrote:
> >
> > Okay. How about 'super_thread_safe_shared_ptr<>'? ;-)
> >
> > Seriously, "atomic_ptr" doesn't really convey what it is. You might
> > also want to consider making it either blocking [maximum portability;
> > but pretty much useless ;-)] or non-blocking (with "emphemeral" stuff
> > and/or whatever tricks). Oder?
> >
>
> Well, Detlefs calls it lock-free reference counting. But Boost
> would probably call what they're doing with thread-safe reference
> counting lock-free also, and they are not the same.

They don't call it "lock-free". Here's what they say:

http://std.dkuug.dk/JTC1/SC22/WG21/docs/papers/2003/n1450.html
(A Proposal to Add General Purpose Smart Pointers to...)

"....
The count structure C has the following members:

Virtual table pointer (C is polymorphic);
Use count;
Weak count;
Mutex (for multithreaded builds);
Original pointer passed to the constructor;
Deleter.
....
The mutex presents another target for platform-specific
optimizations. On Windows, we have been able to use a simple
one-word spinlock instead of a 6 word CRITICAL_SECTION. The
portable architecture of the implementation that abstracts
the mutex concept in a separate class prevents further
optimizations, like reusing the use count word as the
spinlock, or using the Windows API InterlockedIncrement,
InterlockedDecrement, and InterlockedCompareExchange
primitives to attempt to eliminate the mutex altogether.
Nevertheless, we believe that such optimizations are
possible.
...."

> The meaning
> of atomic w.r.t java pointers is or should be well understood, so
> atomic something would be indicated. Also, it wouldn't preclude
> a non lock-free implementation, though you would lose the bemefits
> of lock-free.
>
> Maybe atomic_shared_ptr.

Well, I'd love to have a policy-based framework that would
allow me to specify the *thread-safety policy*:

thread_safety::unsafe

"naked" count(s)

thread_safety::basic

pthread_refcount_t stuff (<http://tinyurl.com/cewa>)

thread_safety::strong

your "atomic" stuff

or something like that. Note that for the blocking stuff, one
would REALLY want to have some support for priority protocols
(priority protection or priority inheritance) to fight the
"unbounded priority inversion" problem in the "realtime" apps.

regards,
alexander.

--
http://terekhov.de/pthread_refcount_t/draft-edits.txt

Joseph Seigh

unread,
May 22, 2003, 2:36:58 PM5/22/03
to

Alexander Terekhov wrote:
>
> Joseph Seigh wrote:

The fact that they consider and have done simplistic spin locks
speaks for itself. Also, they make a whole lot of distinctions
that don't translate very well into a threaded environment.
Their style of semantic definition, with that pre- and post-condition
internal data state stuff, doesn't work in multi-threading. The one thing
you usually cannot reliably observe in multi-threading is internal data
state. You have to use other mechanisms to define semantics for
multi-threading. To say nothing of their multiple attempts to
deal with data "ownership". It's why they have so many pointer
types, all compromised to some degree or other for some small
incremental benefit.

>
> > The meaning
> > of atomic w.r.t java pointers is or should be well understood, so
> > atomic something would be indicated. Also, it wouldn't preclude
> > a non lock-free implementation, though you would lose the bemefits
> > of lock-free.
> >
> > Maybe atomic_shared_ptr.
>
> Well, I'd love to have a policy-based framework that would
> allow me to specify the *thread-safety policy*:
>
> thread_safety::unsafe
>
> "naked" count(s)
>
> thread_safety::basic
>
> pthread_refcount_t stuff (<http://tinyurl.com/cewa>)
>
> thread_safety::strong
>
> your "atomic" stuff
>
> or something like that. Note that for the blocking stuff, one
> would REALLY want to have some support for priority protocols
> (priority protection or priority inheritance) to fight the
> "unbounded priority inversion" problem in the "realtime" apps.
>

I'm never quite sure what "thread-safe" really means. It seems
to get used in different ways sometimes. Atomic is easy to define.
It just means that you will only ever read some previously stored
value.

Joe Seigh

Alexander Terekhov

unread,
May 22, 2003, 3:56:45 PM5/22/03
to

Yeah. http://lists.boost.org/MailArchives/boost/msg44460.php

> Also, they make a whole lot of distinctions
> that don't translate very well into a threaded environment.
> Their style of semantic definition with that pre and post internal
> data state stuff doesn't work in multi-threading.

I think it's fine. Please see the definitions of shared_ptr and
weak_ptr thread "safeties" (below).



> The one thing
> you usually cannot reliably observe in multi-threading is internal data
> state. You have to use other mechanisms to define semantics for
> multi-threading.

I don't think so.



> To say nothing of their multiple attempts to
> deal with data "ownership". It's why they have so many pointer
> types, all compromised in some degree or other for some small
> incremental benefit.

;-)

>
> >
> > > The meaning
> > > of atomic w.r.t java pointers is or should be well understood, so
> > > atomic something would be indicated. Also, it wouldn't preclude
> > > a non lock-free implementation, though you would lose the bemefits
> > > of lock-free.
> > >
> > > Maybe atomic_shared_ptr.
> >
> > Well, I'd love to have a policy-based framework that would
> > allow me to specify the *thread-safety policy*:
> >
> > thread_safety::unsafe
> >
> > "naked" count(s)
> >
> > thread_safety::basic
> >
> > pthread_refcount_t stuff (<http://tinyurl.com/cewa>)
> >
> > thread_safety::strong
> >
> > your "atomic" stuff
> >
> > or something like that. Note that for the blocking stuff, one
> > would REALLY want to have some support for priority protocols
> > (priority protection or priority inheritance) to fight the
> > "unbounded priority inversion" problem in the "realtime" apps.
> >
>
> I'm never quite sure what "thread-safe" really means. It seems
> to get used in different ways sometimes.

thread_safety::unsafe means that *copying* AND all mutation
operations on shared_ptr/weak_ptr copies (all pointing to the
same object; "typed-null" including) shall be *synchronized*;
otherwise, the behavior is undefined. Read-only stuff (copying
aside) can be done concurrently.

thread_safety::basic means that all "const" operations
(including copying) can be done concurrently, but mutations
shall be synchronized; otherwise, the behavior is undefined.

thread_safety::strong means "do-whatever-you-want-and-don't-
care-about-synchronization" (i.e. you can also mutate shared_ptr
and weak_ptr objects *concurrently* without any synchronization
because it's fully synchronized "internally").



> Atomic is easy to define.
> It just means that you will only ever read some previously stored
> value.

The C and C++ standards already mandate that! Here's what C99
says, for example:

"The lifetime of an object is the portion of program execution
during which storage is guaranteed to be reserved for it. An
object exists, has a constant address, and retains its last-
stored value throughout its lifetime."

regards,
alexander.

Joseph Seigh

unread,
May 22, 2003, 5:47:05 PM5/22/03
to

Alexander Terekhov wrote:
>
> Joseph Seigh wrote:
> >
> > Also, they make a whole lot of distinctions
> > that don't translate very well into a threaded environment.
> > Their style of semantic definition with that pre and post internal
> > data state stuff doesn't work in multi-threading.
>
> I think it's fine. Please see the definitions of shared_ptr and
> weak_ptr thread "safeties" (below).
>
> > The one thing
> > you usually cannot reliably observe in multi-threading is internal data
> > state. You have to use other mechanisms to define semantics for
> > multi-threading.
>
> I don't think so.

I mean things like a method to get the reference count. Not only
does it expose an implementation-dependent feature, but any post-
condition assertion based on that data is essentially meaningless.

I think you have some COW (copy-on-write) semantics leaking out
there. COW is internal implementation. It should have no
observable effect on external behavior, which can be "thread-safe" or
"thread-unsafe". I.e., the behavior should be the same as for a
non-COW object implementation.


>
> > Atomic is easy to define.
> > It just means that you will only ever read some previously stored
> > value.
>
> The C and C++ standards already mandate that! Here's what C99
> says, for example:
>
> "The lifetime of an object is the portion of program execution
> during which storage is guaranteed to be reserved for it. An
> object exists, has a constant address, and retains its last-
> stored value throughout its lifetime."
>

No, more like this. If x is atomic and the only values stored to
x are A and B, then a read can only return A or B. If x is not
atomic and the only values stored to x are A and B, then a read can
return A or B or anything -- in short, anything at all.

Joe Seigh

Beman Dawes

unread,
May 22, 2003, 8:58:00 PM5/22/03
to
Joseph Seigh <jsei...@xemaps.com> wrote in message news:<3ECD19F3...@xemaps.com>...
>... Also, they make a whole lot of distinctions

> that don't translate very well into a threaded environment.
> Their style of semantic definition with that pre and post internal
> data state stuff doesn't work in multi-threading. The one thing
> you usually cannot reliably observe in multi-threading is internal data
> state. You have to use other mechanisms to define semantics for
> multi-threading.

The C++ standard doesn't explicitly deal with multi-threading; however,
it seems to implicitly assume that a postcondition is evaluated
instantly. Thus postconditions are useful for specifying semantics
even though it isn't possible to write code which can reliably observe
post-conditions in a multi-threaded program.

> To say nothing of their multiple attempts to
> deal with data "ownership". It's why they have so many pointer
> types, all compromised in some degree or other for some small
> incremental benefit.

Could you explain that in a bit more detail? Are the two types
shared_ptr and weak_ptr what you mean by "so many pointer types"?
Which aspects do you feel compomise them? Or are you also including
auto_ptr? I can understand the comment much more clearly if it applies
to auto_ptr.

Thanks,

--Beman Dawes

Mike Mowbray

unread,
May 22, 2003, 11:57:38 PM5/22/03
to
Joseph Seigh wrote:

> Mike Mowbray wrote:

>> I've seen performance reports for mutex-based refcount
>> incrementing compared to cas-based refcount
>> incrementing which reported:
>>
>> For 1 thread: cas version was 3 times
>> faster than mutex-based version.
>>
>> For N>1 contending threads:
>> cas version became exponentially
>> preferable with increasing N.
>>

>> [...] my basic question remains: since mutex-based locks
>> involve at least 2 membars, isn't the lock-based
>> version intrinsically quite a bit slower than a purely
>> CAS-based solution?

> I don't think membars contribute that much of an effect
> here. It's more likely that a single global mutex is being
> used to protect the refcount. I'd be more interested in
> measurements using one mutex per refcount or used the
> Microsoft hashed mutexes scheme.

But single-vs-many mutexes wouldn't change the 1-thread result,
right? So if membars do not "contribute that much of an effect",
I'm still wondering why the cas-based version is 3 times faster?


- MikeM.

SenderX

unread,
May 23, 2003, 1:25:18 AM5/23/03
to
> But single-vs-many mutexes wouldn't change the 1-thread result,
> right? So if membars do not "contribute that much of an effect",
> I'm still wondering why the cas-based version is 3 times faster?

One is lock-free, and one isn't. You can't really compare them; it's unfair to
the lock. ;)


Disassemble the EnterCriticalSection and LeaveCriticalSection code...

I count around 11 opcodes on a non-contending critical section for enter,
and about 9 on the leave.

A single call to InterlockedIncrement is about 6 opcodes, and a call to
InterlockedDecrement is the same.

So, to increment a long using a non-contending critical section takes about
20 opcodes. To decrement takes the same.

Contrast with only using 6 to increment and 6 to decrement ( lock-free ).

Finally, it takes around 40 opcodes to increment and decrement a long,
protected by a non-contending critical section. It only takes about 12
opcodes to do the same thing using atomic ops.
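For reference, a minimal sketch of the two paths being counted above (Win32;
the critical section is assumed to be initialized elsewhere with
InitializeCriticalSection()):

#include <windows.h>

static CRITICAL_SECTION g_cs;     /* InitializeCriticalSection(&g_cs) at startup */
static LONG g_count = 0;

/* Lock-based increment: enter + plain increment + leave. */
void addref_locked(void)
{
    EnterCriticalSection(&g_cs);
    ++g_count;
    LeaveCriticalSection(&g_cs);
}

/* Lock-free increment: a single interlocked op. */
void addref_lockfree(void)
{
    InterlockedIncrement(&g_count);
}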

There is around three times more overhead (opcodes) using " non-contending "
locks than using atomic ops.

OS locks impact performance. A good programmer should use atomic ops
directly, and bypass OS locks altogether when possible.

;)

SenderX

unread,
May 23, 2003, 1:51:01 AM5/23/03
to
> That's it. Kapis?

For atomic_ptr, yes. ;)

It seems that your reference counters provide portable barriers ( store,
load )?

If so, your code has to keep up with the hardware like lock-free algos,
right?

If you can get portable membar & atomic ops #ifdef's for a lot of different
processors, that would be cool... ;)

Alexander Terekhov

unread,
May 23, 2003, 4:18:11 AM5/23/03
to

SenderX wrote:
[...]

> It seems that your reference counters provide portable barriers ( store,
> load )?

Yes. It provides the mutex-like acquire/release operations. This means
that BOTH software (compiler) AND hardware (MP) shall follow the rules
(constrain reordering of loads and stores). It's all in the spec. Why
don't you simply read it and, perhaps, provide some USEFUL comments?

>
> If so, your code is has to keep up with the hardware like lock-free algo's
> right?

Sort of. A "poor-man" (mutex-based; blocking) implementation can be
found here:

http://www.terekhov.de/pthread_refcount_t/poor-man/beta2/prefcnt.h
http://www.terekhov.de/pthread_refcount_t/poor-man/beta2/prefcnt.c

>
> If you can get portable membar & atomic ops #ifdef's for a lot of different
> processors, that would be cool... ;)

Yeah, that's for sure... http://tinyurl.com/cgyc

regards,
alexander.

Alexander Terekhov

unread,
May 23, 2003, 4:40:44 AM5/23/03
to

Joseph Seigh wrote:
[...]

> > > The one thing
> > > you usually cannot reliably observe in multi-threading is internal data
> > > state. You have to use other mechanisms to define semantics for
> > > multi-threading.
> >
> > I don't think so.
>
> I mean things like a method to get the reference count. Not only
> does it expose an implementation dependent feature, but any post
> condition assertion based on that data is essentialy meaningless.

That's not true. You clearly need synchronization for any "post
condition assertion", but ``that's it.''. And as for "data is
essentially meaningless"... that's not true either; well, an
illustration can be found here:

http://groups.google.com/groups?selm=3DE3A142.7DAACB3B%40web.de
(Subject: Re: threadsafe reference counting)

[...]


> > thread_safety::unsafe means that *copying* AND all mutation
> > operations on shared_ptr/weak_ptr copies (all pointing to the
> > same object; "typed-null" including) shall be *synchronized*;
> > otherwise, the behavior is undefined. Read-only stuff (copying
> > aside) can be done concurrently.
> >
> > thread_safety::basic means that all "const" operations
> > (including copying) can be done concurrently, but mutations
> > shall be synchronized; otherwise, the behavior is undefined.
> >
> > thread_safety::strong means "do-whatever-you-want-and-don't-
> > care-about-synchronization" (i.e. you can also mutate shared_ptr
> > and weak_ptr objects *concurrently* without any synchronization
> > because it's fully synchronized "internally").
>
> I think you have some COW (copy on write) semantics leaking out
> there. COW is internal implementation. It should have no
> observable affect on external behavior which can be "thread-safe" or
> "thread-unsafe". I.e., the behavior should be the same as a
> non COW object implementation.

There's no COW semantics here. It's rather simple, really. Any
operation that "updates" the use-count needs to be synchronized
with respect to other readers/writers. The basic thread safety
is pretty much the same stuff as POSIX's memory synchronization
rules stated in 4.10/Base Definitions volume. Note that things
like semas and pthread_refcount_t provide STRONG thread-safety
(for themselves; from "OO" point of view, if you look at their
interface); well, with some "exceptions"... please take a look
at pthread_refcount_setvalue().

regards,
alexander.

Joseph Seigh

unread,
May 23, 2003, 6:30:15 AM5/23/03
to

Mike Mowbray wrote:
>
> But single-vs-many mutexes wouldn't change the 1-thread result,
> right? So if membars do not "contribute that much of an effect",
> I'm still wondering why the cas-based version is 3 times faster?
>

If the mutex is experiencing contention, i.e. multiple threads trying
to acquire the lock at the same time, threads are likely to be getting
suspended. Context switching adds considerable overhead. If you are
using multiple mutexes, this is less likely to occur and you will just
be using the mutex's fast path which should compare more favorably to
CAS. Assuming that the mutex does have a fast path.

Joe Seigh

Joseph Seigh

unread,
May 23, 2003, 7:29:37 AM5/23/03
to

Alexander Terekhov wrote:
>
> Joseph Seigh wrote:
> [...]
> > > > The one thing
> > > > you usually cannot reliably observe in multi-threading is internal data
> > > > state. You have to use other mechanisms to define semantics for
> > > > multi-threading.
> > >
> > > I don't think so.
> >
> > I mean things like a method to get the reference count. Not only
> > does it expose an implementation dependent feature, but any post
> > condition assertion based on that data is essentialy meaningless.
>
> That's not true. You clearly need synchronization for any "post
> condition assertion", but ``that's it.''. And as for "data is
> essentially meaningless"... that's not true either; well, an
> illustration can be found here:
>
> http://groups.google.com/groups?selm=3DE3A142.7DAACB3B%40web.de
> (Subject: Re: threadsafe reference counting)

I meant that if you are exposing internals, that does nothing to define
semantics, besides the problem of locking you into a particular
implementation. It also does very little for proving correctness,
since the problem is likely to be a race condition or something, and the
synchronization added to allow post-condition assertions prevents the
race condition from ever occurring and thus from being detected by
post-condition assertion checking.

I don't want to go into specifics just now, but I think there is a better
way to express semantics for multi-threading.


> > I think you have some COW (copy on write) semantics leaking out
> > there. COW is internal implementation. It should have no
> > observable affect on external behavior which can be "thread-safe" or
> > "thread-unsafe". I.e., the behavior should be the same as a
> > non COW object implementation.
>
> There's no COW semantics here. It's rather simple, really. Any
> operation that "updates" the use-count needs to be synchronized
> with respect to other readers/writers. The basic thread safety
> is pretty much the same stuff as POSIX's memory synchronization
> rules stated in 4.10/Base Definitions volume. Note that things
> like semas and pthread_refcount_t provide STRONG thread-safety
> (for themselves; from "OO" point of view, if you look at their
> interface); well, with some "exceptions"... please take a look
> at pthread_refcount_setvalue().
>

There are too many different techniques out there besides the pthread_refcount_t
kind of stuff. It's not clear to me yet that we can come up with
some definitive list of categorizations that can be used to
define the various types of thread-safety. What you listed is what
you can think of. I can think of others.

So, if you want to come up with some kind of scheme now, that's fine.
Just be aware that as-yet-unknown techniques will show up and your
scheme will have to be able to handle them.

Joe Seigh

Joseph Seigh

unread,
May 23, 2003, 7:47:18 AM5/23/03
to

Beman Dawes wrote:
>
> Joseph Seigh <jsei...@xemaps.com> wrote in message news:<3ECD19F3...@xemaps.com>...
>

> > To say nothing of their multiple attempts to
> > deal with data "ownership". It's why they have so many pointer
> > types, all compromised in some degree or other for some small
> > incremental benefit.
>
> Could you explain that in a bit more detail? Are the two types
> shared_ptr and weak_ptr what you mean by "so many pointer types"?
> Which aspects do you feel compromise them? Or are you also including
> auto_ptr? I can understand the comment much more clearly if it applies
> to auto_ptr.
>

It's more the whole ownership thing, not actual pointer types. So it's
more auto_ptr vs. scoped_ptr vs. shared_ptr vs. ... The distinctions
may seem to be important if you are deliberately ignoring threading but
less so if you are not. It's a matter of perspective. If you are in
the former category, don't expect the latter to take it as seriously as
you do. And vice versa.

What I mean by "compromised" is compromised w.r.t. threading. A technique
that may confer some advantage in a non-threaded environment may not
translate well into a threaded environment.

Joe Seigh

Beman Dawes

unread,
May 23, 2003, 2:50:55 PM5/23/03
to
Joseph Seigh <jsei...@xemaps.com> wrote in message news:<3ECE0B74...@xemaps.com>...

> Beman Dawes wrote:
> >
> > Joseph Seigh <jsei...@xemaps.com> wrote in message news:<3ECD19F3...@xemaps.com>...
> >
> > > To say nothing of their multiple attempts to
> > > deal with data "ownership". It's why they have so many pointer
> > > types, all compromised in some degree or other for some small
> > > incremental benefit.
> >
> > Could you explain that in a bit more detail? Are the two types
> > shared_ptr and weak_ptr what you mean by "so many pointer types"?
> > Which aspects do you feel compromise them? Or are you also including
> > auto_ptr? I can understand the comment much more clearly if it applies
> > to auto_ptr.
> >
> It's more the whole ownership thing, not actual pointer types. So it's
> more auto_ptr vs. scoped_ptr vs. shared_ptr vs. ... The distinctions
> may seem to be important if you are deliberately ignoring threading but
> less so if you are not. It's a matter of perspective. If you are in
> the former category, don't expect the latter to take it as seriously as
> you do. And vice versa.

Well, that's an interesting perspective. I would have guessed a
non-shared-ownership smart pointer like scoped_ptr would have been
particularly interesting in a multi-threading environment because it
is non-blocking.

But regardless of the perspective, too many smart pointer types (or
smart pointer policies) can be confusing to users. That's why we
removed scoped_ptr, etc, from the standardization proposal to get
people focusing on the essential two, shared_ptr and weak_ptr.

> What I mean by "compromised" is compromised w.r.t threading. A technique
> that may confer some advantage in a non-threaded environment may not
> translate well into a threaded environment.

Yes, that's particularly true of some of the implementation
techniques. For example, a shared_ptr implementation using the linked
list approach rather than the reference counted approach can have some
performance advantages in a non-multithreaded environment, but
reference counting is easier to lock internally for multi-threaded
use.
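
A rough sketch of the two counting strategies being contrasted; the
names and layout below are illustrative assumptions, not Boost's
actual internals:

#include <mutex>

// "Linked" counting: each shared_ptr copy links itself into a
// circular ring of owners, and the object dies when the last owner
// unlinks itself.  Unlinking writes into two neighbouring nodes,
// so a thread-safe version has to lock a multi-word structure
// (shown here, crudely, with a single global mutex).
struct linked_owner {
    linked_owner* prev;
    linked_owner* next;
};

std::mutex ring_lock;

bool unlink(linked_owner& self) {
    std::lock_guard<std::mutex> guard(ring_lock);
    bool last = (self.next == &self);   // sole remaining owner?
    self.prev->next = self.next;
    self.next->prev = self.prev;
    return last;                        // caller destroys the object
}

// "Counted" version: all owners share one control block, and copy
// or destroy is a single increment/decrement of one word, which
// maps directly onto an atomic op or a tiny per-count lock.
struct control_block {
    long use_count;
};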

--Beman Dawes

Alexander Terekhov

unread,
May 26, 2003, 3:36:55 AM5/26/03
to
Here is a rather interesting POV. <Forward Quoted>

Peter Dimov wrote:
>
> William E. Kempf wrote:
> > Peter Dimov said:
> >> Alexander Terekhov wrote:
> >>> Try to convince Dimov that policy-based designs/interfaces aren't
> >>> that bad; and that, in general, the stuff managed by smart pointers
> >>> isn't necessarily process-wide/thread-shared stuff "by default"...
> >>> like global heap, current working directory, etc.
> >>
> >> Policy based designs aren't bad. It's just that when I look at
> >> boost::shared_ptr I see std::shared_ptr. The existing practice is
> >> that things in the standard library are as thread safe as an int "by
> >> default". Nobody complains that std::string should have a threading
> >> model parameter since it can be (much) more efficient if it's only
> >> accessed from a single thread.
> >
> > Things in the standard library aren't thread safe,
>
> Yes they are. :-)
>
> > unless an implementation chooses to make it so.
>
> They do. When they forget (cough MSVC6 STL), the real world kindly reminds
> them by submitting several quintillion bug reports.
>
> > The standard doesn't specify
> > *anything* about threads, including the thread safety of any
> > operations.
>
> Indeed. I've been careful to use "existing practice" and not to refer to any
> legalese.
>
> > Obviously, everyone knows that. I'm stating it now to make a point.
> > When the standard finally does have something to say about threads,
> > I'm not sure that it's a given that we'll be able to say "things in
> > the standard library are as thread safe as an int" (or what many call
> > thread-neutral). If some standard "thing" *must* rely on shared state
> > (not a given for string, like it is for shared_ptr), I can certainly
> > have sympathy with those who want control over whether or not a
> > particular instance of this thing needs any synchronization.
>
> It's not as simple as you make it seem.
>
> There are three separate problems that I see with shared_ptr WRT threading.
>
> 1. A thread safe shared_ptr cannot interoperate with its thread unsafe
> counterpart. This is technically an ODR violation, but it is common enough
> that it deserves support. To provide a concrete example, when a Boost
> library that uses shared_ptr is built on a Windows compiler using the single
> threaded runtime, but the application that uses that library is using the
> multithreaded runtime (or vice versa), some shared_ptr operations hang
> waiting forever for a spinlock that isn't there.
>
> 2. The pthread_mutex based version is suboptimal, both space- and
> performance wise. The problem here is that it is difficult to fix portably.
> On most platforms with a suboptimal (for our purposes, not in general)
> pthread_mutex, it is possible to implement a platform-specific shared_ptr
> that is no more than ~1.2 - 2.0 times slower than the unsynchronized version
> and doesn't use much extra space. In 95%+ of the cases (statistic made up on
> the spot) this performance difference doesn't even register on the profiler.
> Maybe one day we'll have a pthread_refcount_t. ;-)
>
> 3. The third problem is not technical but psychological/PR. Many shared_ptr
> users try hard to avoid the synchronization overhead without researching its
> actual impact on their code. Often shared_ptr operations are a relatively
> small amount of the code and don't affect its overall performance. Also,
> depending on the creation:copy ratio, it is often the allocator that is the
> bottleneck and not the count synchronization.
>
> This problem is fairly delicate since you can't just tell people to profile
> their code and get back to you. There is always the possibility that they
> have, indeed, profiled their code beforehand and found a real problem, and
> the "profile" answer is offensive and condescending.
>
> All that said.
>
> My preferred solution to (1) is to just make it work wherever possible, and
> to (2) is to make the synchronized version fast/tight enough. This will
> hopefully take care of (3).
>
> The good thing about a threading model policy is that it takes care of (3)
> instantly, but it doesn't solve (1) in general, and it doesn't solve (2). In
> fact, by allowing people to sweep (2) under the carpet, it takes the
> pressure off the implementors to ever fix (2).
>
> So I believe that long term, it is in the best interest of the C++ community
> to not introduce a threading model policy. We'll continue to receive user
> complaints until (2) is fixed, though. :-(
>
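
To make point (2) above concrete, one common shape of the
platform-specific fix Peter alludes to is to drop the per-count mutex
and manipulate the count with atomic operations. The sketch below uses
C++11 atomics for readability; real implementations of that era used
per-platform assembly or interlocked intrinsics, and the layout here
is an assumption, not Boost's:

#include <atomic>

// Assumed control-block shape for a shared_ptr-like count.
struct sp_count {
    std::atomic<long> use_count{1};

    void add_ref() {
        // Taking another reference to something you already
        // (transitively) own needs no ordering, so relaxed suffices.
        use_count.fetch_add(1, std::memory_order_relaxed);
    }

    bool release() {
        // The decrement must publish prior writes to the object, and
        // the final decrementer must see them before destruction;
        // acq_rel covers both.  Returns true to the last owner,
        // which then destroys the managed object.
        return use_count.fetch_sub(1, std::memory_order_acq_rel) == 1;
    }
};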

Alexander Terekhov

unread,
May 31, 2003, 4:04:52 PM5/31/03
to

Alexander Terekhov wrote: [well, "forwarded"]

>
> Here is a rather interesting POV. <Forward Quoted>
>
> Peter Dimov wrote:
[...]

> > So I believe that long term, it is in the best interest of the C++ community
> > to not introduce a threading model policy.

I don't think so.

<Forward-Inline>

-------- Original Message --------
Date: Sat, 31 May 2003 22:00:06 +0200
Newsgroups: gmane.comp.lib.boost.devel
Subject: Re: shared_ptr/weak_ptr and thread-safety
References: ... <bb1vmb$mfe$1...@main.gmane.org> <3ED4C0AE...@web.de>

Alexander Terekhov wrote:
>
> Russell Hind wrote:
> >
> > Trevor Taylor wrote:
> > >
> > > Who? Me?
> > >
> >
> > I think Peter meant Alexander
>
> I got the message. I'll post refcount<thread_safety,
> typename integer_t> once I have some time for it. It won't include
> atomic<> implementation(s), however. ;-)

http://terekhov.de/pthread_refcount_t/experimental/refcount.cpp
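
For readers who don't follow the link: a rough guess at the shape such
a policy-based interface might take. This is an assumption made up for
illustration, not the contents of refcount.cpp, and every name in it
is hypothetical:

#include <atomic>
#include <cstddef>

// Illustrative thread-safety policy tags.
struct thread_unsafe {};   // plain integer arithmetic
struct thread_safe   {};   // atomic arithmetic

template<class thread_safety, typename integer_t = std::size_t>
class refcount;            // primary template left undefined

template<typename integer_t>
class refcount<thread_unsafe, integer_t> {
    integer_t count_;
public:
    explicit refcount(integer_t v = 1) : count_(v) {}
    void increment()        { ++count_; }
    bool decrement()        { return --count_ == 0; }
    integer_t value() const { return count_; }
};

template<typename integer_t>
class refcount<thread_safe, integer_t> {
    std::atomic<integer_t> count_;
public:
    explicit refcount(integer_t v = 1) : count_(v) {}
    void increment()        { count_.fetch_add(1, std::memory_order_relaxed); }
    bool decrement()        { return count_.fetch_sub(1, std::memory_order_acq_rel) == 1; }
    integer_t value() const { return count_.load(std::memory_order_relaxed); }
};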

regards,
alexander.

--
http://groups.google.com/groups?selm=3EC4F194.2DA8701C%40web.de
