
std::string, reference counting and multithreading


Alexander Terekhov

Nov 16, 2002, 1:10:19 PM
< the referenced article(s) can be found on c.l.c++.mod >

Joseph Seigh wrote:
[...]
> I posted an early version to c.p.t. The basic logic works. I'm
> just messing with c++ design issues. When I get a little more resolution
> on those, I'll post it here.

C++ design issues aside, you are aiming at solving this:

http://www.hpl.hp.com/personal/Hans_Boehm/gc/example.html
("Expensive Explicit Deallocation: An Example". Note that even with GC
one would still need memory barriers to ensure proper visibility
[Java's revised volatiles would work really well here], but GC makes
it possible to get rid of rather expensive locking/ref.counting.)

or similar problem(s) using DCAS/whatever (avoiding blocking, making
it "lock-free"/"non-blocking"[1]), I believe.

AFAICT, this isn't the problem that is commonly solved by "reference
counting". ``shared_ptr is supposed to be as thread-safe as an int.''

Well, OTOH, it would be really great if POSIX would provide something
along the lines of: <SKETCHv2>

Opaque "mutex-like" pthread_refcount_t [and pthread_refcount_attr_t]
type for reference counting. The only reference counting operations
that should "synchronize memory with respect to other threads" [4.10,
but "execution" aside -- <http://tinyurl.com/2jb9>, except destruction
of course] would be pthread_refcount_decrement_msync() and
pthread_refcount_sub_msync(). Note that pthread_refcount_increment()
and pthread_refcount_add() need *NOT* "synchronize memory with respect
to other threads". For immutable stuff (like COW), and for "double"
counting ala strong/weak counts, one could also use "optimized"
memory-sync-less pthread_refcount_decrement_nomsync() and
pthread_refcount_sub_nomsync() operations. The rest is as follows:

PTHREAD_REFCOUNT_MAX // upper bound
PTHREAD_REFCOUNT_INITIALIZER(N) // macro for statics INIT
pthread_refcount_init() // mutex like but with initial value
pthread_refcount_destroy() // mutex like apart from EBUSY
pthread_refcount_getvalue() // just in case someone will need it
pthread_refcount_setvalue() // without any sync whatsoever
pthread_refcount_attr_*() // PROCESS_SHARED and etc. attr-stuff
std::size_t as "value type" // for get/set/add/sub "value" param
PTHREAD_REFCOUNT_DROPPED_TO_ZERO // for pthread_refcount_decrement*()
// and pthread_refcount_sub() to
// indicate "dropped to zero" condition
                                 // that MAY be used to safely destroy
// associated ref.counted object or
// simply continue -- increment/setval

regards,
alexander.

[1] http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/d/Detlefs:David.html
(Even Better DCAS-Based Concurrent Deques. DISC 2000: 59-73)

"A non-blocking implementation is one for which any history that has
invocations of some set O of higher-level operations but no associated
responses may contain any number of responses for high-level operations
concurrent with those in O. That is, even if some higher-level operations
(each of which may be continuously taking steps, or not) never complete,
other invoked operations may nevertheless continually complete. Thus the
system as a whole can make progress; individual processors cannot be
blocked, only delayed, by other processors continuously taking steps or
failing to take steps. Using locks would violate the above condition,
hence the alternate name lock-free."

Joseph Seigh

Nov 17, 2002, 2:23:53 PM
In his paper "Lock-Free Reference Counting", Detlefs et al. briefly
go over why you might want to use reference counting instead of GC.
The paper can be found here: http://citeseer.nj.nec.com/481457.html

As far as proposing a pthread interface for reference counting goes,
the problem is that reference counting doesn't really abstract
very well in C. It's going to be implementation dependent, so
a reference counting scheme based on DCAS could be different from
one based on some other mechanism. You don't want to hard-code
an implementation into POSIX.

The reason for doing smart pointers in C++ is that, while far
from perfect (extremely so), C++'s abstraction mechanism lets
you hide the implementation details. So in theory you could have
completely different implementations with the same external
"api", as it were.

Also C++ has features that can be exploited to create a more
robust thread-safe solution. Though it would be nice if the
C/C++ designers didn't have such antipathy towards threading
issues.

Joe Seigh

Joseph Seigh

Nov 19, 2002, 5:16:27 PM
Well, it's mostly done. It's as thread-safe or atomic as
Java references are.

It still has the compare and swap logic simulated with
mutexes for the time being. If I have to learn assembler
programming on 2 or 3 platforms that I don't want to, it's
going to get really low priority.

Joe Seigh

Alexander Terekhov

Nov 19, 2002, 5:28:38 PM

Joseph Seigh wrote:
>
> Well, it's mostly done. It's as thread-safe or atomic as
> Java references are.

Ha! I knew it! ;-) I just wonder: do you use the "ptr& operator=
( ptr*********(ptr::*********null)() )" trick? ;-) ;-) Seriously,
how do you achieve ala-Java-volatile-refs ptr-"load-acquire" and
ptr-"store-release" semantics?

< Forward Inline >

-------- Original Message --------
Message-ID: <3DDA27BD...@web.de>
Date: Tue, 19 Nov 2002 12:59:57 +0100
Newsgroups: comp.lang.c++.moderated
Subject: Re: std::string, reference counting and multithreading

Joseph Seigh wrote:
[...]
> Smart pointers are or should be like Java pointers. They just
> guarantee that you point to a valid object. How you manage and
> access the object is up to you much like it is in Java.

IOW, you want to emulate Java's *volatile* references (REVISED Java
volatile semantics and "magical" [GC] object destruction; things
like <http://tinyurl.com/2tmf> ["soft"/"weak"/"phantom" Java stuff]
aside for a moment). Correct?

regards,
alexander.

Joseph Seigh

Nov 19, 2002, 6:29:08 PM

Alexander Terekhov wrote:
>
> Joseph Seigh wrote:
> >
> > Well, it's mostly done. It's as thread-safe or atomic as
> > Java references are.
>
> Ha! I knew it! ;-) I just wonder: do you use the "ptr& operator=
> ( ptr*********(ptr::*********null)() )" trick? ;-) ;-) Seriously,
> how do you achieve ala-Java-volatile-refs ptr-"load-acquire" and
> ptr-"store-release" semantics?

It doesn't compile and I'm not sure what it does except that
it has 9 levels of indirect referencing.

The Java semantics are in the form of a Java virtual machine
meta implementation. They're only useful for implementing
other JVMs the only way Sun thinks it can be done, and for
satisfying masochistic tendencies when used to do program
proofs. I don't know why Sun didn't do proper semantics.
Presumably, they knew how. They have semantics that can't
be verified by black-box testing. That means you could
do a non-compliant JVM implementation, and if you did not
release the source, they could not prove your JVM was not
compliant.

The smart pointers I wrote just have effective Java semantics.
They're atomic w.r.t. any operations on them. Basically you
can't break them.

> < Forward Inline >
>
> -------- Original Message --------
> Message-ID: <3DDA27BD...@web.de>
> Date: Tue, 19 Nov 2002 12:59:57 +0100
> Newsgroups: comp.lang.c++.moderated
> Subject: Re: std::string, reference counting and multithreading
>
> Joseph Seigh wrote:
> [...]
> > Smart pointers are or should be like Java pointers. They just
> > guarantee that you point to a valid object. How you manage and
> > access the object is up to you much like it is in Java.
>
> IOW, you want to emulate Java's *volatile* references (REVISED Java
> volatile semantics and "magical" [GC] object destruction; things
> like <http://tinyurl.com/2tmf> ["soft"/"weak"/"phantom" Java stuff]
> aside for a moment). Correct?
>

Well, the smart pointer doesn't need memory barriers for its own
internal integrity. But since almost everything interesting would
need them, it would be a good idea to have them.

No weak references for the time being.

Where are these revised Java volatile semantics anyway? They've
been talking about this for years.

Joe Seigh

Alexander Terekhov

Nov 19, 2002, 6:58:28 PM

Joseph Seigh wrote:
>
> Alexander Terekhov wrote:
> >
> > Joseph Seigh wrote:
> > >
> > > Well, it's mostly done. It's as thread-safe or atomic as
> > > Java references are.
> >
> > Ha! I knew it! ;-) I just wonder: do you use the "ptr& operator=
> > ( ptr*********(ptr::*********null)() )" trick? ;-) ;-) Seriously,
> > how do you achieve ala-Java-volatile-refs ptr-"load-acquire" and
> > ptr-"store-release" semantics?
>
> It doesn't compile

What?! <http://tinyurl.com/2ud5>

xxx> g++ -v
Reading specs from /usr/lib/gcc-lib/i386-unknown-openbsd3.1/2.95.3/specs
gcc version 2.95.3 20010125 (prerelease)
xxx> g++ -o joe joe.cpp
xxx> ./joe
That's for Joe. Let it be null!
= something (might carry null).
That's for Joe. Let it be null!
xxx> cat joe.cpp
#include <iostream>
using namespace std;

struct ptr {

ptr& operator=( ptr temp ) {
/* this->swap( temp ); */
cout << "= something (might carry null).\n";
return *this;
}

ptr& operator=( ptr*********(ptr::*********null)() ) {
cout << "That's for Joe. Let it be null!\n";
return *this;
}

};

int main() {
ptr p1;
p1 = 0;
ptr p2;
p2 = p1;
p1 = 777-666-111;
}
xxx>

[...]


> Where are these revised java volatile semantics anyway?

http://www.cs.umd.edu/~pugh/java/memoryModel/semantics.pdf

regards,
alexander.

Joseph Seigh

Nov 19, 2002, 8:57:09 PM

My gcc didn't like iostream. That's ok. I don't like it either.
Ok, I think I see the trick. You've got a lot of indirection
there and I'm not just talking about the *'s.

>
> [...]
> > Where are these revised java volatile semantics anyway?
>
> http://www.cs.umd.edu/~pugh/java/memoryModel/semantics.pdf
>

Oh, that stuff. I think the problem with Java semantics is
that they're in the form of a meta implementation. This
is another meta implementation. Has *anybody* ever used
Java semantics to prove correctness of a non-trivial program
other than as an exercise in masochism or as academic busy
work? I see the volatile stuff -- memory barriers get added.

I thought Sun was working on this. They've had a bug reported
against AWTEventMulticaster since forever, before I even
realized it had the problem. Unless you count as a fix the
fact that Swing uses a copy-on-write (sort of) vector for its
event multicaster.

Joe Seigh

Silicon Graphics

Nov 19, 2002, 9:30:50 PM
Joseph Seigh <jsei...@xemaps.com> wrote:
: Well, it's mostly done. It's as thread-safe or atomic as
: Java references are.

Cool, look forward to viewing the source.

: It still has the compare and swap logic simulated with


: mutexes for the time being. If I have to learn assembler
: programming on 2 or 3 platforms that I don't want to, it's
: going to get really low priority.

I can send you inline assembly for gcc/x86, Microsoft/x86,
and SunPro CC/SPARC 32- and 64-bit compare and exchange, and
read/write memory barriers. The only interesting tweak
I have found so far is eliminating LOCK prefix at run-time
if the machine is a uniprocessor x86.

Adam
ze...@best.com

: Joe Seigh

Alexander Terekhov

Nov 20, 2002, 5:36:05 AM

Silicon Graphics wrote:
[...]

> The only interesting tweak I have found so far is eliminating
> LOCK prefix at run-time if the machine is a uniprocessor x86.

WOW. That's almost Java, then. ;-)

regards,
alexander.

Alexander Terekhov

Nov 20, 2002, 5:39:00 AM

Joseph Seigh wrote:
[...]

> > #include <iostream>
> > using namespace std;
> >
> > struct ptr {
> >
> > ptr& operator=( ptr temp ) {
> > /* this->swap( temp ); */
> > cout << "= something (might carry null).\n";
> > return *this;
> > }
> >
> > ptr& operator=( ptr*********(ptr::*********null)() ) {
> > cout << "That's for Joe. Let it be null!\n";
> > return *this;
> > }
> >
> > };
> >
> > int main() {
> > ptr p1;
> > p1 = 0;
> > ptr p2;
> > p2 = p1;
> > p1 = 777-666-111;
> > }
> > xxx>
>
> My gcc didn't like iostream. That's ok. I don't like it either.
> Ok, I think I see the trick. You've got a lot of indirection

That's irrelevant. It's just a pointer type that can NOT be used in the
client code. As long as you don't have another =op(ptr-type), [and you
really shouldn't have it], the "= 0" (= ``null pointer constant'' thing)
will call that op=. Well, it's simply sort of "obfuscated" reset(). ;-)

regards,
alexander.

Joseph Seigh

Nov 20, 2002, 2:45:38 PM

That would help very much if you did.

Joe Seigh

Joseph Seigh

Nov 20, 2002, 3:57:59 PM
Here's what I have so far. No compare and swap in place of locking yet.
Ptrs are considered atomic for comparison purposes, which can't be
strictly relied upon. I'll need an atomic_read for that part.

local_ptr is roughly equivalent to shared_ptr, which is to say entirely
non-thread-safe. These should be local only and not shared. I may
make the entire class private if this seems to be too much of an
exposure.

The only part that isn't entirely crowbar proof is setting atomic_ptr from
an object reference. If you do it twice rather than copying atomic_ptr you'll
get a double free sooner or later. Most likely sooner.

Also, this solution is only one atomic operation more expensive than using
DCAS. Plus I suspect DCAS would be problematic to implement on some NUMA
architectures.

Joe Seigh

------------------------------------------
#include <cassert> // temporary
#include <pthread.h> // temporary
#include <errno.h> // temporary

// temporary mutex
class zmutex {
public:
zmutex() { assert (pthread_mutex_init(&xmutex, NULL) == 0); }
~zmutex() { assert (pthread_mutex_destroy(&xmutex) == 0); }
void lock() { assert (pthread_mutex_lock(&xmutex) == 0); }
void unlock() { assert (pthread_mutex_unlock(&xmutex) == 0); }
private:
pthread_mutex_t xmutex;

};

template<typename T> class atomic_ptr;
template<typename T> class local_ptr;

//=============================================================================
// atomic_ptr_ref
//
//
//=============================================================================
template<typename T> class atomic_ptr_ref {
friend class atomic_ptr<T>;
friend class local_ptr<T>;

private:

atomic_ptr_ref(T * p = 0) {
ephemeralCount = 0;
referenceCount = 1;
ptr = p;
};

int adjust(long xephemeralCount, long xreferenceCount) {
long ecount;
long rcount;

mutex.lock();
ecount = (ephemeralCount += xephemeralCount);
rcount = (referenceCount += xreferenceCount);
mutex.unlock();
return (ecount == 0 && rcount == 0) ? 0 : 1;
}

long ephemeralCount; // ephemeral reference count
long referenceCount; // reference count
T * ptr; // ptr to actual object

zmutex mutex; // temporary

}; // class atomic_ptr_ref


//=============================================================================
// local_ptr
//
//
//=============================================================================
template<typename T> class local_ptr {
public:
local_ptr() { refptr = 0; }

local_ptr(const local_ptr<T> & src) {
if ((refptr = src.refptr) != 0)
refptr->adjust(+1, 0);
}

local_ptr(atomic_ptr<T> & src) {
refptr = src.getrefptr();
}

~local_ptr() {
if (refptr != 0 && refptr->adjust(-1, 0) == 0) {
delete refptr->ptr;
delete refptr;
}
}

// only valid argument is 0 (null); the odd parameter type exists
// solely so that "p = 0" selects this overload
local_ptr<T> & operator = (const float ** (atomic_ptr<T>:: **)()) {
local_ptr<T> temp;
swap(temp); // non-atomic
return *this;
}

local_ptr<T> & operator = (local_ptr<T> & src) {
local_ptr<T> temp(src);
swap(temp); // non-atomic
return *this;
}

local_ptr<T> & operator = (atomic_ptr<T> & src) {
local_ptr<T> temp(src);
swap(temp); // non-atomic
return *this;
}

T * get() { return (refptr != 0) ? refptr->ptr : 0; }

T * operator -> () { return get(); }
T & operator * () { return *get(); }

bool operator == (T * rhd) { return (rhd == get() ); }
bool operator != (T * rhd) { return (rhd != get() ); }

// refptr == rhd.refptr iff refptr->ptr == rhd.refptr->ptr
bool operator == (local_ptr<T> & rhd) { return (refptr == rhd.refptr);}
bool operator != (local_ptr<T> & rhd) { return (refptr != rhd.refptr);}
bool operator == (atomic_ptr<T> & rhd) { return (refptr == rhd.refptr);}
bool operator != (atomic_ptr<T> & rhd) { return (refptr != rhd.refptr);}

private:
void * operator new (size_t); // auto only -- declared but not defined

atomic_ptr_ref<T> * refptr;

inline void swap(local_ptr & other) { // non-atomic swap
atomic_ptr_ref<T> * temp;
temp = refptr;
refptr = other.refptr;
other.refptr = temp;
}


}; // class local_ptr

template<typename T> inline bool operator == (T * lhd, local_ptr<T> & rhd)
{ return (rhd == lhd); }

template<typename T> inline bool operator != (T * lhd, local_ptr<T> & rhd)
{ return (rhd != lhd); }


//=============================================================================
// atomic_ptr
//
//
//=============================================================================
template<typename T> class atomic_ptr {
friend class local_ptr<T>;

public:

atomic_ptr(T * obj = 0) {
ephemeralCount = 0;
if (obj != 0)
refptr = new atomic_ptr_ref<T>(obj);
else
refptr = 0;
// membar
}

atomic_ptr(atomic_ptr & src) { // copy constructor

refptr = src.getrefptr(); // atomic
ephemeralCount = 0;

// adjust link count
if (refptr != 0)
refptr->adjust(-1, +1); // atomic
// membar
}

~atomic_ptr() { // destructor
if (refptr != 0 && refptr->adjust(ephemeralCount, -1) == 0) {
delete refptr->ptr;
delete refptr;
}
}

atomic_ptr & operator = (atomic_ptr & src) {
atomic_ptr temp(src); // copy
swap(temp); // atomic
return *this;
}

atomic_ptr & operator = (T * obj) {
atomic_ptr temp(obj);
swap(temp);
return *this;
}

// generate local temp ptr to guarantee validity of ptr
// during lifetime of expression. temp is dtor'd after
// expression has finished execution.
local_ptr<T> operator -> () { return local_ptr<T>(*this); }
local_ptr<T> operator * () { return local_ptr<T>(*this); }

bool operator == (T * rhd) {
if (rhd == 0)
return (refptr == 0);
else
return (local_ptr<T>(*this) == rhd);
}

bool operator != (T * rhd) {
if (rhd == 0)
return (refptr != 0);
else
return (local_ptr<T>(*this) != rhd);
}

// refptr == rhd.refptr iff refptr->ptr == rhd.refptr->ptr
bool operator == (local_ptr<T> & rhd) { return (refptr == rhd.refptr);}
bool operator != (local_ptr<T> & rhd) { return (refptr != rhd.refptr);}
bool operator == (atomic_ptr<T> & rhd) { return (refptr == rhd.refptr);}
bool operator != (atomic_ptr<T> & rhd) { return (refptr != rhd.refptr);}

private:

long ephemeralCount; // ephemeral reference count
atomic_ptr_ref<T> * refptr; // reference count object
zmutex mutex; // temporary

void swap(atomic_ptr & obj) {
long tempcount;
atomic_ptr_ref<T> *tempptr;

mutex.lock();
tempcount = ephemeralCount; tempptr = refptr;
ephemeralCount = obj.ephemeralCount; refptr = obj.refptr;
obj.ephemeralCount = tempcount; obj.refptr = tempptr;
mutex.unlock();
}

atomic_ptr_ref<T> * getrefptr() {
atomic_ptr_ref<T> *p;

mutex.lock();
ephemeralCount++;
p = refptr;
mutex.unlock();

return p;
}

}; // class atomic_ptr

template<typename T> inline bool operator == (T * lhd, atomic_ptr<T> & rhd)
{ return (rhd == lhd); }

template<typename T> inline bool operator != (T * lhd, atomic_ptr<T> & rhd)
{ return (rhd != lhd); }


/*-*/

Mike Mowbray

Nov 20, 2002, 7:40:22 PM
Joseph Seigh wrote:


> Here's what I have so far. [...]

I arrived late to this discussion so I'm trying to catch up.
Have you already posted some code that shows how this stuff
is intended to be used? - I'm trying to figure out whether
it's the same as the classic old handle-and-rep idea, or
something different, but there's no comments in your code
to give me any hint about this. :-(

If you haven't already posted usage examples, do you have
some that you could show us?


- MikeM.


Joseph Seigh

Nov 20, 2002, 8:43:51 PM

Here's a testcase. Not really a usage example. The main
thread pushes onto a queue and removes one item from the
middle. It's the only one doing it so it can be considered
to have exclusive access. The two other threads repeatedly
iterate thru the queue while it's being concurrently
modified. There's some checking in the dtor for memory
leaks. It's really hard to get random interleaving of
the threads though for testing purposes.

Joe Seigh

#include <stdio.h>
#include <sched.h>
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#include "atomic_ptr.h"

int count = 0;
int ncount = 0;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
struct timespec delay;

#define DELAY sched_yield();
//#define DELAY nanosleep(&delay, NULL);
#define JCOUNT 200000

void adjCount(int amount) {
pthread_mutex_lock(&lock);
count += amount;
ncount++;
pthread_mutex_unlock(&lock);
}

class qItem {

public:
qItem(atomic_ptr<qItem> & x) {
next = x;
adjCount(+1);
flag = 1;
}

~qItem() {
if (flag == -1)
printf("dtor error\n");
flag = -1;
adjCount(-1);
}


atomic_ptr<qItem> next;
int flag;

};

atomic_ptr<qItem> q;


void *thrash(void *x) {

int id = *(int *)x;
int j;
int xcount = 0;
local_ptr<qItem> p;

for (j = 0; j<JCOUNT; j++) {

p = q;
while (p != 0) {
DELAY;
assert(p->flag != -1);
if (p->flag == 0)
xcount++;
sched_yield();
DELAY;
p = p->next;
}
}
printf("id = %d xcount = %d count = %d\n", id, xcount, JCOUNT);

return NULL;
}


int main(int argc, char *argv[]) {
pthread_t tid1, tid2;
int id1, id2;
void *rval;
int j;
local_ptr<qItem> p2;
local_ptr<qItem> p3;

delay.tv_sec = 0;
delay.tv_nsec = 2000;

// create queue with 4 items
q = new qItem(q);
q = new qItem(q);
q = new qItem(q);
q = new qItem(q);

id1 = 1;
pthread_create(&tid1, NULL, &thrash, &id1);
id2 = 2;
pthread_create(&tid2, NULL, &thrash, &id2);

for (j = 0; j<JCOUNT; j++) {
q = new qItem(q); // push new item on queue
DELAY;
p2 = q->next; // 2nd item
p3 = p2->next; // 3rd item
p3->flag = 0;
p2->next = p3->next; // dequeue 3rd item
DELAY;
}

pthread_join(tid1, &rval);
pthread_join(tid2, &rval);

q = 0;
p2 = 0;
p3 = 0;

printf("jcount = %d\n", JCOUNT);
printf("count = %d\n", count);
printf("ncount = %d\n", ncount);

return 0;
}

/*-*/

Hillel Y. Sims

Nov 20, 2002, 10:49:05 PM

"Joseph Seigh" <jsei...@xemaps.com> wrote in message
news:3DDBF8B8...@xemaps.com...

>
> // temporary mutex
> class zmutex {
> public:
> zmutex() { assert (pthread_mutex_init(&xmutex, NULL) == 0); }
> ~zmutex() { assert (pthread_mutex_destroy(&xmutex) == 0); }
> void lock() { assert (pthread_mutex_lock(&xmutex) == 0); }
> void unlock() { assert (pthread_mutex_unlock(&xmutex) == 0); }
> private:
> pthread_mutex_t xmutex;
>
> };

I know you said the mutex is just temporary, but as given this will
generally only achieve the desired effects in debug mode (well really:
#ifndef NDEBUG, since the pthread calls sit inside assert() and
compile away entirely otherwise)

hys

--
(c) 2002 Hillel Y. Sims
FactSet Research Systems
hsims AT factset.com


Mike Mowbray

Nov 20, 2002, 10:56:06 PM
Joseph Seigh wrote:


> > Mike Mowbray wrote:
>
> >> If you haven't already posted usage examples,
> >> do you have some that you could show us?
>
> > Here's a testcase. Not really a usage example.

Not sure what this was supposed to show. Only the main
thread creates and destroys qItems apparently, while the
other threads just check consistency of items they think
are on the queue. The qItem flag they check is set to -1
in qItem's d'tor, but nothing stops the space of a deleted
qItem from being re-allocated almost immediately and a
valid new qItem being constructed therein.

But let's back up a bit... I scanned the Detlefs paper
on "Lock-Free Reference Counting" mentioned earlier in
this newsthread. It talks about a refcounting mechanism
that doesn't require having ownership of the reference
first. You alluded to this problem in another article:

> First you had naive reference counting in a single thread
> which worked fine. When that was used in a multi-threaded
> environment the increment/decrement of a shared counter
> problem quickly became apparent. That was fixed with atomic
> increment/decrement. But it turns out there was still the
> problem of count going to zero before increment. The
> solution to that appeared to be intractable, so a
> compromise was made and the restriction of having to own a
> reference to copy it was widely adopted.

I'm someone who has gone through exactly the stages you
describe, and I believe the "intractable" part of the above
(though I remain open-minded). Is this what you're trying
to solve? Being able to copy references between threads without
having to own a reference (and hence not needing to lock a mutex
to pass the reference safely between threads)?


- MikeM.


Joseph Seigh

Nov 21, 2002, 6:20:43 AM
Oops, one bug already.

The comparison operators for atomic_ptr aren't safe. For
the expression "a == b", the object pointed to by a could get
deallocated, and a new object could get allocated at the same
address and assigned to b, producing a spurious match. I
got a little confused, I guess, trying to figure out the
semantics of comparing unsynchronized pointers. The fix is
simple: just create a local_ptr temp via a ctor for
operands of atomic_ptr type.

Joe Seigh

Joseph Seigh

Nov 21, 2002, 6:35:57 AM

Mike Mowbray wrote:
>
> Joseph Seigh wrote:
>
> > > Mike Mowbray wrote:
> >
> > >> If you haven't already posted usage examples,
> > >> do you have some that you could show us?
> >
> > > Here's a testcase. Not really a usage example.
>
> Not sure what this was supposed to show. Only the main
> thread creates and destroys qItems apparently, while the
> other threads just check consistency of items they think
> are on the queue. The qItem flag they check is set to -1
> in qItem's d'tor, but nothing stops the space of a deleted
> qItem from being re-allocated almost immediately and a
> valid new qItem being constructed therein.

Correct but there are no references to the dtor'd object
at that point so it is ok.


>
> But let's back up a bit... I scanned the Detlefs paper
> on "Lock-Free Reference Counting" mentioned earlier in
> this newsthread. It talks about a refcounting mechanism
> that doesn't require having ownership of the reference
> first. You alluded to this problem in another article:
>
> > First you had naive reference counting in a single thread
> > which worked fine. When that was used in a multi-threaded
> > environment the increment/decrement of a shared counter
> > problem quickly became apparent. That was fixed with atomic
> > increment/decrement. But it turns out there was still the
> > problem of count going to zero before increment. The
> > solution to that appeared to be intractable, so a
> > compromise was made and the restriction of having to own a
> > reference to copy it was widely adopted.
>
> I'm someone who has gone through exactly these stages you
> describe, and believes the "intractible" part in the above
> (though I remain open-minded). Is this what you're trying
> to solve? Being able to copy references between threads without
> having to own a reference (and hence not needing to lock a mutex
> to pass the reference safely between threads)?
>

I mentioned how it was done somewhere back there. Differential
reference counting. Instead of a reference count, you keep track
of references started and references ended. The "reference
count" is the difference between the two. The references-started
count is kept in the smart pointer and atomically incremented
when the pointer value is fetched, and thus there's no race condition.
And yes, you can copy these smart pointers between threads without
having to own them, just like Java references.

Joe Seigh

Joseph Seigh

Nov 21, 2002, 6:52:35 AM

"Hillel Y. Sims" wrote:
>
> "Joseph Seigh" <jsei...@xemaps.com> wrote in message
> news:3DDBF8B8...@xemaps.com...
> >
> > // temporary mutex
> > class zmutex {
> > public:
> > zmutex() { assert (pthread_mutex_init(&xmutex, NULL) == 0); }
> > ~zmutex() { assert (pthread_mutex_destroy(&xmutex) == 0); }
> > void lock() { assert (pthread_mutex_lock(&xmutex) == 0); }
> > void unlock() { assert (pthread_mutex_unlock(&xmutex) == 0); }
> > private:
> > pthread_mutex_t xmutex;
> >
> > };
>
> I know you said the mutex is just temporary, but as given this will
> generally only achieve the desired effects in debug mode (well really:
> #ifndef NDEBUG)
>

Well, prototype mode actually. Production mode would use compare and swap
which isn't in there yet. But it might be a good idea to keep them
as an alternate implementation for debugging, and as part of a
reference implementation.

Also as an implementation for those platforms without compare and swap. I
like the idea of standards allowing inefficient implementations. It is
just so important to be able to port onto platforms with suboptimal
implementations.

Joe Seigh

Joseph Seigh

Nov 22, 2002, 6:12:26 PM
Some timings using compare and swap vs. simulated compare and swap using a single
global mutex.

On a Sun Blade 100 running Solaris 8, compare and swap is about 30% faster.

On an 866MHz P3 running RH 8.0 Linux (2.4 kernel), compare and swap is about
20X faster. Against the original atomic_ptr with individual mutexes, it's only
about 30% faster.

It would appear that Linux mutexes use FIFO queues for service, which really slows
things down when there is any lock contention. Solaris threads can jump the
queue, so a queue is less likely to form.

Joe Seigh

Alexander Terekhov

Nov 23, 2002, 10:53:59 AM

Joseph Seigh wrote:
[...]

> It would appear that linux mutexes use fifo queues for service which really slows
> things down when there is any lock contention.

Unless you use PTHREAD_MUTEX_ADAPTIVE_NP ones, IIRC.

regards,
alexander.

Joseph Seigh

Nov 26, 2002, 10:24:48 AM
Ok, updated with compare and swap. Memory barriers aren't in there yet, just comments where
they would go. They're not where you would think they would go. The atomic/compare
and swap operations don't need membars. Basically the store of a smart pointer will
act as if a store memory barrier occurred before the store, and a fetch of a smart
pointer will act as if a fetch memory barrier occurred after the fetch. This should
suffice to support DCL, COW (copy on write), immutable objects, and other wait-free
techniques, inasmuch as I understand them.

This is still prototyped and things are still a little kludgey. The cas64 will be an
external, probably with its own header file. Atomic_ptr will probably end up as a
trivial subclass so I can decouple the implementation. That will allow flexibility in
things like the atomic op implementations to use other techniques besides compare and
swap.

Also will go open source with it at some point.

Joe Seigh

atomic_ptr.h
-----------------------------
extern "C" {
extern int cas64(void *target, void *oldval, void *newval);
}

template<typename T> class atomic_ptr;
template<typename T> class local_ptr;

template<typename T> class atomic_ptr_ref;

struct refcount {
long ecount; // ephemeral count
long rcount; // reference count
};

template<typename T> struct ref {
long ecount; // ephemeral count
atomic_ptr_ref<T> *ptr;
};

//=============================================================================
// atomic_ptr_ref
//
//
//=============================================================================
template<typename T> class atomic_ptr_ref {
friend class atomic_ptr<T>;
friend class local_ptr<T>;

private:

atomic_ptr_ref(T * p = 0) {

count.ecount = 0;
count.rcount = 1;
ptr = p;
};

// atomic


int adjust(long xephemeralCount, long xreferenceCount) {

refcount oldval, newval;

oldval.ecount = count.ecount;
oldval.rcount = count.rcount;
do {
newval.ecount = oldval.ecount + xephemeralCount;
newval.rcount = oldval.rcount + xreferenceCount;

}
while (cas64(&count, &oldval, &newval) == 0);

return (newval.ecount == 0 && newval.rcount == 0) ? 0 : 1;
}

refcount count; // reference counts


T * ptr; // ptr to actual object

}; // class atomic_ptr_ref


//=============================================================================
// local_ptr
//
//
//=============================================================================
template<typename T> class local_ptr {
public:
local_ptr() { refptr = 0; }

local_ptr(const local_ptr<T> & src) {
if ((refptr = src.refptr) != 0)
refptr->adjust(+1, 0);
}

local_ptr(atomic_ptr<T> & src) {
refptr = src.getrefptr();

// fetch membar
}

~local_ptr() {
if (refptr != 0 && refptr->adjust(-1, 0) == 0) {
delete refptr->ptr;
delete refptr;
}
}

// only valid argument is 0 (null). All other values silently ignored.
local_ptr<T> & operator = (T * obj) {
local_ptr<T> temp;
swap(temp);
return *this;
}

T * operator -> () { return refptr->ptr; }

bool operator == (T * rhd) { return ((refptr != 0 ? refptr->ptr : 0) == rhd); }
bool operator != (T * rhd) { return ((refptr != 0 ? refptr->ptr : 0) != rhd); }
bool operator == (local_ptr<T> & rhd) { return (refptr == rhd.refptr); }
bool operator != (local_ptr<T> & rhd) { return (refptr != rhd.refptr); }

private:

atomic_ptr_ref<T> * refptr; // reference count object

void swap(local_ptr<T> & obj) {
atomic_ptr_ref<T> * temp = refptr;
refptr = obj.refptr;
obj.refptr = temp;
}

}; // class local_ptr

//=============================================================================
// atomic_ptr
//
//
//=============================================================================
template<typename T> class atomic_ptr {
friend class local_ptr<T>;

public:

atomic_ptr(T * obj = 0) { // constructor
xxx.ecount = 0;
if (obj != 0)
xxx.ptr = new atomic_ptr_ref<T>(obj);
else
xxx.ptr = 0;
// store membar
}

atomic_ptr(atomic_ptr & src) { // copy constructor

xxx.ecount = 0;
xxx.ptr = src.getrefptr(); // atomic

// adjust link count
if (xxx.ptr != 0)
xxx.ptr->adjust(-1, +1); // atomic
// store membar
}

~atomic_ptr() { // destructor
if (xxx.ptr != 0 && xxx.ptr->adjust(xxx.ecount, -1) == 0) {
delete xxx.ptr->ptr;
delete xxx.ptr;
}
}

atomic_ptr & operator = (atomic_ptr & src) {
atomic_ptr temp(src); // copy
swap(temp); // atomic
return *this;
}

atomic_ptr & operator = (T * obj) {
atomic_ptr temp(obj);
swap(temp);
return *this;
}

// generate local temp ptr to guarantee validity of ptr
// during lifetime of expression. temp is dtor'd after
// expression has finished execution.
local_ptr<T> operator -> () { return local_ptr<T>(*this); }
local_ptr<T> operator * () { return local_ptr<T>(*this); }

bool operator == (T * rhd) {
if (rhd == 0)
return (xxx.ptr == 0);
else
return (local_ptr<T>(*this) == rhd);
}

bool operator != (T * rhd) {
if (rhd == 0)
return (xxx.ptr != 0);
else
return (local_ptr<T>(*this) != rhd);
}

bool operator == (local_ptr<T> & rhd) {return (local_ptr<T>(*this) == rhd); }
bool operator != (local_ptr<T> & rhd) {return (local_ptr<T>(*this) != rhd); }
bool operator == (atomic_ptr<T> & rhd) {return (local_ptr<T>(*this) == local_ptr<T>(rhd)); }
bool operator != (atomic_ptr<T> & rhd) {return (local_ptr<T>(*this) != local_ptr<T>(rhd)); }

private:

//long ephemeralCount; // ephemeral reference count
//atomic_ptr_ref<T> * refptr; // reference count object

ref<T> xxx;

// atomic
void swap(atomic_ptr & obj) { // obj is local & non-shared
ref<T> temp;

temp.ecount = xxx.ecount;
temp.ptr = xxx.ptr;

while(cas64(&xxx, &temp, &obj.xxx) == 0);

obj.xxx.ecount = temp.ecount;
obj.xxx.ptr = temp.ptr;
}

// atomic
atomic_ptr_ref<T> * getrefptr() {
ref<T> oldval, newval;

oldval.ecount = xxx.ecount;
oldval.ptr = xxx.ptr;

do {
newval.ecount = oldval.ecount + 1;
newval.ptr = oldval.ptr;
}
while (cas64(&xxx, &oldval, &newval) == 0);

return oldval.ptr;
}

}; // class atomic_ptr
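The adjust() loop above updates both counts in one 64-bit CAS. For cross-checking, here is the same loop restated with std::atomic (an anachronism relative to this post; compare_exchange_weak plays the role of cas64, and the names are made up):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Portable restatement of atomic_ptr_ref::adjust(): both the ephemeral
// and the reference count are updated in a single CAS, and the return
// value reports whether the object is now dead (both counts zero).
struct counts {
    std::int32_t ecount;   // ephemeral count
    std::int32_t rcount;   // reference count
};

struct ref_block {
    std::atomic<counts> count;

    int adjust(std::int32_t de, std::int32_t dr) {
        counts oldval = count.load();
        counts newval;
        do {
            newval.ecount = oldval.ecount + de;
            newval.rcount = oldval.rcount + dr;
            // on failure, compare_exchange_weak refreshes oldval
        } while (!count.compare_exchange_weak(oldval, newval));
        return (newval.ecount == 0 && newval.rcount == 0) ? 0 : 1;
    }
};
```

A return of 0 is the "last reference dropped" signal the destructors above use to decide when to delete the pointee and the count block.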
Joseph Seigh

Nov 26, 2002, 10:28:54 AM
Quick and dirty Intel Pentium compare-and-swap, uniprocessor
version without the lock prefix.

Joe Seigh

--------
typedef unsigned long long32_t;
typedef struct {
long32_t high;
long32_t low;
} long64_t;

inline int cmpxchg64(long64_t *target, long64_t cmp, long32_t val1, long32_t val2) {
unsigned char ret;

__asm__
(
"cmpxchg8b %4\n\t"
"sete %0\n\t"
:
"=q" (ret) /* rc, 0 or 1 */
:
"A" (cmp),
"c" (val2),
"b" (val1),
"m" (*target)
:
"memory", "cc"
);

return ret;
}

int cas64(long64_t *target, long64_t *cmp, long64_t *value) {
int rc;
if ((rc = cmpxchg64(target, *cmp, value->high, value->low)) == 0) {
*cmp = *target;
}
return rc;
}

/*-*/
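For readers who prefer to avoid the inline asm: a hedged, portable stand-in with the same contract as cas64() (nonzero on success; on failure the current contents of the target are copied back into *cmp for the retry loop), written with std::atomic, which did not exist when this was posted:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Portable equivalent of the cas64() wrapper above. On x86 this compiles
// down to a (locked) cmpxchg8b-style operation; compare_exchange_strong
// already writes the observed value back into *cmp on failure.
inline int cas64_portable(std::atomic<std::uint64_t> *target,
                          std::uint64_t *cmp,
                          const std::uint64_t *value) {
    return target->compare_exchange_strong(*cmp, *value) ? 1 : 0;
}
```

The failure-writeback is what lets the CAS loops in atomic_ptr retry without an extra load of the target.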

Joseph Seigh

Nov 26, 2002, 6:25:46 PM

Joseph Seigh wrote:
>
> Ok, updated with compare and swap. Memory barriers aren't in there yet, just comments where
> they would go. They're not where you would think they would go. The atomic/compare
> and swap operations don't need membars. Basically the store of a smart pointer will
> act as if a store memory barrier occurred before the store, and a fetch of a smart
> pointer will act as if a fetch memory barrier occurred after the fetch. This should
> suffice to support DCL, COW (copy on write), immutable objects, and other wait-free
> techniques in as much as I understand them.
>

I'll have to take that back about no membars being required for the compare and swap atomic ops.
They will require membars; not because they have to be ordered with respect to other memory
operations, but because they have to be ordered with respect to other atomic operations. At least
for sparc this is the case: the sparc architecture manual says nothing about the ordering of
multiple compare and swaps.
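In later terminology, the point is that the ordering constraint should travel with the CAS itself rather than rely on the bare instruction. A sketch using std::atomic memory orders (an illustration only; the function name is hypothetical):

```cpp
#include <atomic>
#include <cassert>

// A CAS that carries its own ordering: acq_rel on success orders it with
// respect to other atomic operations -- the property the SPARC manual
// leaves unspecified for bare casx, per the discussion above.
bool ordered_cas(std::atomic<int> &target, int expected, int desired) {
    return target.compare_exchange_strong(
        expected, desired,
        std::memory_order_acq_rel,     // success: ordered against other atomics
        std::memory_order_acquire);    // failure: still an acquire
}
```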

Joe Seigh

Mike Mowbray

Nov 26, 2002, 7:00:13 PM
Joseph Seigh wrote:


> > Ok, updated with compare and swap. [...]

I'm trying to understand this in conjunction with the
"by-the-numbers" explanation you posted separately...

It all seems critically dependent on being able to
do a cas64 on a 64-bit word which contains both a 32-bit
refcount and a 32-bit pointer. That would mean it can't
work on real 64-bit platforms (e.g: alpha, sparcv9,
itanium?, ...) which use 64-bit pointers.
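That packing assumption can be written down explicitly; a sketch (static_assert and <cstdint> are later-C++ features, used here only to make the point concrete):

```cpp
#include <cstdint>

// The scheme needs {32-bit count, 32-bit pointer} to fit one CAS-able
// 64-bit unit. On an ILP32 target this holds; with 64-bit pointers the
// pair grows to 128 bits and a plain cas64 no longer covers it.
struct packed_ref {
    std::int32_t ecount;      // ephemeral count
    std::uint32_t ptr;        // 32-bit pointer representation (ILP32 only)
};
static_assert(sizeof(packed_ref) == 8,
              "count + pointer must fit a single 64-bit CAS");
```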

Does itanium have a cmpxchg128 instruction? I'm pretty
sure sparcv9 has only casx which has merely the semantics
of cas64. I doubt whether alpha has such a thing either,
though I don't know whether a load-locked/store-conditional
sequence could do it. (Perhaps Dave Butenhof can tell us?)

- MikeM.


Joseph Seigh

Nov 26, 2002, 8:04:10 PM


Well, you could simulate it using locks at the compare and swap
level or atomic operation level. Wouldn't be lock-free technically.

IBM z390 or z900 or whatever they're calling it has 128 bit
compare and swap. I would think that if 32 bit architectures
thought a doubleword compare and swap was useful then the
same might hold for 64 bit architectures. Of course it
could have been part of a 64 bit migration thing.
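The lock-based simulation mentioned above is easy to sketch (hypothetical names; correct, but, exactly as noted, not lock-free):

```cpp
#include <cstdint>
#include <mutex>

// A 128-bit CAS emulated under a mutex: same contract as the real thing
// (nonzero on success; on failure the current value is copied into *cmp),
// but any thread holding the lock can stall all the others.
struct u128 { std::uint64_t hi, lo; };

struct locked_cas128 {
    std::mutex m;
    u128 value;

    int cas(u128 *cmp, const u128 *desired) {
        std::lock_guard<std::mutex> hold(m);
        if (value.hi == cmp->hi && value.lo == cmp->lo) {
            value = *desired;
            return 1;
        }
        *cmp = value;   // refresh comparand for the caller's retry loop
        return 0;
    }
};
```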

Joe Seigh

Mike Mowbray

Nov 26, 2002, 10:02:44 PM
Joseph Seigh wrote:


> > Mike Mowbray wrote:
>
> >> It all seems critically dependent on being able to
> >> do a cas64 on a 64-bit word which contains both a 32-bit
> >> refcount and a 32-bit pointer. That would mean it can't
> >> work on real 64-bit platforms (e.g: alpha, sparcv9,
> >> itanium?, ...) which use 64-bit pointers.
>

> > Well, you could simulate it using locks at the compare and swap
> > level or atomic operation level. Wouldn't be lock-free technically.

In which case I don't see the point of the exercise.

> > IBM z390 or z900 or whatever they're calling it has 128 bit
> > compare and swap. I would think that if 32 bit architectures
> > thought a doubleword compare and swap was useful then the
> > same might hold for 64 bit architectures. Of course it
> > could have been part of a 64 bit migration thing.

But it remains true that if most of the commonly available
64-bit processors don't support some form of cas128, the
whole exercise is academic, or limited to running on some
small set of processors.


- MikeM.


Alexander Terekhov

Nov 27, 2002, 3:01:41 AM

Joseph Seigh wrote:
[...]
> IBM z390 or z900 or whatever they're calling it ...

whatever == "z/Architecture" ;-)

http://www.ibm.com/servers/s390/os390/bkserv/latest/z_arch.html
("Latest publications -- z/Architecture (64-bit architecture).
Links to the latest editions of publications are available
here when an edition of a publication becomes available that
supersedes the edition available on a current CD collection
kit bookshelf.")

regards,
alexander.

David Butenhof

Nov 27, 2002, 10:22:43 AM
Mike Mowbray wrote:

Heh heh. Yeah, I've been watching the philosophical ramblings on this
subject with minor interest, while trying to avoid distracting myself too
much from "real work" by actually thinking about it.

That is indeed the catch. Neither Alpha nor IPF has 128 bit compare-and-swap
"as such". On Alpha the rules of LDQ_L/STQ_C allow (with many caveats and
restrictions) touching a second "quadword" (64 bits) in the same cache
line, (which allows VMS to do atomic double-linked queue operations that
extend the equivalent VAX instructions to 64-bit pointers). But that's a
"hack" that depends on the low-level load-lock and store-conditional
primitives, and done in the interrupt-protected environment of "PAL code".
(More or less equivalent to a microcode-like machine environment.)

You could probably get away with it, though I've never had the courage to
try to write anything that depended on that leeway. (You'd never find
problems in normal testing, and the symptoms in a complicated application
when it finally broke would be an absolute nightmare to backtrack.)

In any case, though, you can't construct something like that with the
higher-level IPF compare-and-swap. Perhaps you could start working on
convincing Intel to put a cmpxchg16 (the 'sz' field is bytes, not bits)
into the next rev of the architecture. They're supposedly applying the
Alpha engineers they've "inherited" to develop substantial and possibly
even radical improvements for a future generation of IPF.

--
/--------------------[ David.B...@hp.com ]--------------------\
| Hewlett-Packard Company Tru64 UNIX & VMS Thread Architect |
| My book: http://www.awl.com/cseng/titles/0-201-63392-2/ |
\----[ http://homepage.mac.com/dbutenhof/Threads/Threads.html ]---/

Joseph Seigh

Nov 27, 2002, 11:55:55 AM

And I presume that the PowerPC 970 will continue to have it in the
form ldarx/stdcx (load doubleword reserved / store doubleword conditional).
If so then maybe Macs will have it also.

Joe Seigh
