
Strong thread safety and lock free?


Marcel Müller

Jul 25, 2009, 3:12:45 PM
Hi,

I am looking for a lock-free solution for strongly thread-safe
reference-counted pointers. Is this possible at all?

The problem is that I have to dereference the pointer and increment the
reference counter atomically. And I have absolutely no idea how to avoid
writing to memory that is no longer kept alive by the reference counter.
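
To make the problem concrete, here is a minimal sketch of the race I mean
(invented names, not real code): between step (1) and step (2) another
thread may drop the last reference and delete the object, so the increment
in (2) writes to freed memory.

_______________________________________________________________
#include <atomic>

struct node { std::atomic<int> refs; /* payload */ };

std::atomic<node*> g_shared(nullptr);   // pointer shared between threads

node* naive_acquire() {
    node* p = g_shared.load();          // (1) read the pointer
    // ... another thread may swap g_shared and drop the last
    // reference right here, deleting *p ...
    if (p) p->refs.fetch_add(1);        // (2) increment may hit freed memory
    return p;
}
_______________________________________________________________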

Any ideas?


Marcel

Dmitriy V'jukov

Jul 26, 2009, 5:12:19 PM


This is exactly what you are looking for (download the atomic-ptr-plus
package):
http://sourceforge.net/projects/atomic-ptr-plus/files/


--
Dmitriy V'yukov

Chris M. Thomasson

Jul 26, 2009, 9:17:16 PM
"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
news:0c692331-5bcc-4639...@c14g2000yqm.googlegroups.com...

Right. FWIW, here is an alternative "mostly lock-free" implementation:

http://webpages.charter.net/appcore/vzoom/refcount

;^)

Chris M. Thomasson

Jul 26, 2009, 10:47:45 PM
"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
news:0c692331-5bcc-4639...@c14g2000yqm.googlegroups.com...

I feel the need to point out that some yahoos from Sun Microsystems are
attempting to patent Joe Seigh's atomic-ptr invention. I cannot seem to find
the relevant post on Google Groups right now. Humm, it seems as if Google
trimmed the number of stored posts down rather significantly. I am having a
hard time finding posts from the past. Oh well. ;^(...

Chris M. Thomasson

Jul 26, 2009, 10:51:41 PM

"Chris M. Thomasson" <n...@spam.invalid> wrote in message
news:h4j4fs$2ohn$1...@news.ett.com.ua...

I found the patent application:


http://www.google.com/patents/about?id=cs6aAAAAEBAJ&dq=Lightweight+reference+counting+using+single-target+synchronization


This is NOT an invention from Sun. It was invented by Joe Seigh; here is
the prior art from which the very neat atomic-ptr algorithm was formed:


http://www.google.com/patents/about?id=DSIeAAAAEBAJ&dq=5,295,262

Marcel Müller

Jul 27, 2009, 3:51:44 AM
Hallo,

Dmitriy V'jukov wrote:
> This is exactly what you looking for (download atomic-ptr-plus
> package):
> http://sourceforge.net/projects/atomic-ptr-plus/files/

interesting concept. (I had only a rough look at it so far.)

Unfortunately it relies on cmpxchg8b on x86, which expects 64-bit
alignment, doesn't it? It could get rather hard to ensure that alignment
on my platform (eCS). At least the C runtime won't support it.

I will see if I can adapt the idea to intrusive pointers (a requirement)
and maybe compact the memory footprint of the reference counts to
32 bits by cutting some bits from the counter for the ecount. But I have
not caught the idea behind ecount and rcount so far. It looks like
ecount is used like a temporary spinlock to synchronize read access.


Marcel

Dmitriy V'jukov

Jul 27, 2009, 4:31:38 AM
On Jul 27, 11:51 am, Marcel Müller <news.5.ma...@spamgourmet.com>
wrote:


Yes, you definitely can pack the pointer and the counter into a single
word. No, there is no emulation of locks, just two reference counters
(ephemeral and persistent, or outer and inner), so the solution is
lock-free.

Take a look at my wait-free proxy algorithm:
http://groups.google.com/group/lock-free/browse_frm/thread/f20a631b3b4c3813
The main idea is the same as in Joe Seigh's atomic-ptr-plus. However,
that algorithm is totally wait-free (only InterlockedExchangeAdd and
InterlockedExchange are used, no loops). I do pointer packing: allocate
the object with some alignment and then steal the low bits for the
counter.
You may start reading from the usage example (at the bottom of the
post); reader threads do a strong acquire of the reference counter while
writer threads do simultaneous modification of the pointer.
The main trick is that dereferencing the pointer and incrementing the
counter is a single atomic action.
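
A minimal sketch of that single atomic action (invented names, two
stolen alignment bits assumed; this is not the actual atomic-ptr-plus or
proxy code):

_______________________________________________________________
#include <atomic>
#include <cassert>
#include <cstdint>

struct node { std::atomic<long> inner; /* payload */ }; // inner (persistent) count

// One word holds the aligned pointer plus a tiny outer (ephemeral)
// count in the two stolen low bits.
std::atomic<std::uintptr_t> g_slot;

node* strong_acquire() {
    // A single fetch-and-add reads the pointer and bumps the outer
    // count at the same time - the "single atomic action".
    std::uintptr_t old_word = g_slot.fetch_add(1);
    std::uintptr_t outer = (old_word & 3u) + 1;
    assert(outer <= 3u);                  // the stolen bits must not overflow
    return reinterpret_cast<node*>(old_word & ~(std::uintptr_t)3u);
}
// A writer later exchanges a new packed word into g_slot and folds the
// accumulated outer count into the old object's inner count, which is
// what finally decides when the object may be freed.
_______________________________________________________________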


--
Dmitriy V'jukov

Dmitriy V'jukov

Jul 27, 2009, 4:51:04 AM
On Jul 27, 6:51 am, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> > I feel the need to point out that some yahoos from Sun Microsystems are
> > attempting to patent Joe Seigh's atomic-ptr invention. I cannot seem to
> > find the relevant post on Google Groups right now. Humm, it seems as if
> > Google trimmed the number of stored posts down rather significantly. I am
> > having a hard time finding posts from the past. Oh well.  ;^(...
>
> I found the patent application:
>

> http://www.google.com/patents/about?id=cs6aAAAAEBAJ&dq=Lightweight+re...


>
> This is NOT an invention from Sun. It was invented by Joe Seigh; here is
> prior art in which the basis of the very neat atomic-ptr algorithm was
> formed:
>
> http://www.google.com/patents/about?id=DSIeAAAAEBAJ&dq=5,295,262


Btw, Chris, do they patent only Joe Seigh's lock-free implementation,
or does the patent also cover the improved wait-free implementation:
http://groups.google.com/group/lock-free/browse_frm/thread/f20a631b3b4c3813
?
I see the patent mentions only "lock-free" and is entirely based on
CASes; there is not a single mention of "wait-free" or FAA.
It seems that they did not come up with the idea of a wait-free
implementation. If the patent does not cover the wait-free
implementation, then... well, why would anybody want to use the
lock-free CAS-loop based solution when there is an improved wait-free,
loop-less algorithm?
What do you think?

I posted it to c.p.t back in 2007 (can't find the post now); however, I
have to mention the name of my former colleague Alexander Shuvaev,
because it was he who invented the wait-free version (you might have
seen him here on c.p.t).


--
Dmitriy V'jukov

Chris M. Thomasson

Jul 27, 2009, 7:00:45 AM
"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
news:b906e640-a9c7-4af8...@h11g2000yqb.googlegroups.com...

Well, I think the patent application is going to get rejected based on
prior art. However, they don't explicitly mention CAS in the actual claims.
They only refer to so-called "single-target synchronization". The counting
algorithm itself is identical to that of atomic-ptr. So, let's say that the
patent gets accepted and somebody creates atomic-ptr via the alignment
trick; well, I think that would still be patent infringement. Humm...

BTW, I think one might want to use the lock-free version because the
reference count can be a full 32 bits wide instead of the rather limited
size of the stolen bits.


> I've posted in to c.p.t back in 2007 (can't find the post now),

I am having trouble with Google Groups search as well; it seems they
dumped older posts.


> however I have to mention the name of my former colleague Alexander
> Shuvaev because it's he who invented the wait-free version (you might
> see him here on c.p.t).

Yeah. I think I remember reading some posts from him. However, I cannot find
any of them via Google Groups search; damn!

Chris M. Thomasson

Jul 27, 2009, 7:03:13 AM
"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
news:48fd7fa2-f978-492f...@j21g2000yqe.googlegroups.com...
On Jul 27, 11:51 am, Marcel Müller <news.5.ma...@spamgourmet.com>

wrote:
> > Hallo,
> >
> > Dmitriy V'jukov wrote:
> > > This is exactly what you looking for (download atomic-ptr-plus
> > > package):
> > >http://sourceforge.net/projects/atomic-ptr-plus/files/
> >
> > interesting concept. (I had only a rough look at it so far.)
> >
> > Unfortunately it relies on cmpxchg8b on x86, which expects 64-bit
> > alignment, doesn't it? It could get rather hard to ensure that alignment
> > on my platform (eCS). At least the C runtime won't support it.
> >
> > I will see if I can take over the idea to intrusive pointers
> > (requirement) and maybe compact the memory footprint of the reference
> > counts to 32 bits by cutting a some bits from the counter for the
> > ecount. But I did not catch the idea behind ecount and rcount so far. It
> > looks like ecount is used like a temporary spinlock to synchronize read
> > access.


> Yes, you definitely can pack pointer and counter into single word.

Indeed. The OP just has to make sure to steal enough bits to accommodate the
total number of references to an object that can exist at any one time.


[...]

Dmitriy V'jukov

Jul 27, 2009, 7:20:58 AM
On Jul 27, 3:00 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> Well, I think the patent application is going to get rejected based on
> prior-art. However, they don't explicitly mention CAS in the actuall claims.
> They only refer to so-called "single-target synchronization". The counting
> algorithm itself is identical to that of atomic-ptr. So, lets say that the
> patent gets accepted, and somebody creates atomic-ptr via alignment trick,
> well, I think that would still be patent infringement. Humm...
>
> BTW, I think one might want to use the lock-free version because the
> reference count can be a full 32-bits wide instead of the rather limited
> size of the stolen bits.


The point is not the alignment trick; the point is the wait-free
implementation that does not use CAS (only FAA and XCHG). The
algorithm is similar to Joe Seigh's, but not identical.
They mention "lock-free" in the claims, but not "wait-free". Wait-free
is a subset of lock-free, though.


--
Dmitriy V'jukov

Chris M. Thomasson

Jul 27, 2009, 7:24:22 AM
"Marcel Müller" <news.5...@spamgourmet.com> wrote in message
news:4a6d5c90$0$31863$9b4e...@newsspool3.arcor-online.net...

> Hallo,
>
> Dmitriy V'jukov wrote:
>> This is exactly what you looking for (download atomic-ptr-plus
>> package):
>> http://sourceforge.net/projects/atomic-ptr-plus/files/
>
> interesting concept. (I had only a rough look at it so far.)
>
> Unfortunately it relies on cmpxchg8b on x86, which expects 64-bit alignment,
> doesn't it?

No. However, you will get much better performance if the data is properly
aligned on a 64-bit boundary:

http://www.intel.com/Assets/PDF/manual/253668.pdf
(section 8.1.2.2)
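
That said, if the eCS runtime really only guarantees 4-byte alignment, you
can always over-allocate and align by hand. A rough sketch (invented names,
untested on eCS):

_______________________________________________________________
#include <stdlib.h>
#include <stdint.h>

/* Over-allocate, round the address up to the next 8-byte boundary and
   remember the original block just below the aligned one so it can be
   freed again. */
void* malloc_aligned8(size_t size)
{ unsigned char* raw = (unsigned char*)malloc(size + 8 + sizeof(void*));
  uintptr_t addr;
  unsigned char* p;
  if (!raw) return NULL;
  addr = (uintptr_t)(raw + sizeof(void*));
  p = (unsigned char*)((addr + 7u) & ~(uintptr_t)7u);  /* round up to 8 */
  ((void**)p)[-1] = raw;    /* stash the real start so it can be freed */
  return p;
}

void free_aligned8(void* p)
{ if (p) free(((void**)p)[-1]);
}
_______________________________________________________________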


> It could get rather hard to ensure that alignment on my platform (eCS). At
> least the C runtime won't support it.
>
> I will see if I can take over the idea to intrusive pointers (requirement)
> and maybe compact the memory footprint of the reference counts to 32 bits
> by cutting a some bits from the counter for the ecount. But I did not
> catch the idea behind ecount and rcount so far. It looks like ecount is
> used like a temporary spinlock to synchronize read access.

No spinlock logic exists in atomic-ptr. You can read through the following
patent application's teachings and drawings to learn how the counting logic
works:

http://www.google.com/patents/about?id=cs6aAAAAEBAJ

Chris M. Thomasson

Jul 27, 2009, 7:30:56 AM
"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
news:a23af500-d472-43aa...@l31g2000vbp.googlegroups.com...

On Jul 27, 3:00 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> > Well, I think the patent application is going to get rejected based on
> > prior-art. However, they don't explicitly mention CAS in the actuall
> > claims.
> > They only refer to so-called "single-target synchronization". The
> > counting
> > algorithm itself is identical to that of atomic-ptr. So, lets say that
> > the
> > patent gets accepted, and somebody creates atomic-ptr via alignment
> > trick,
> > well, I think that would still be patent infringement. Humm...
> >
> > BTW, I think one might want to use the lock-free version because the
> > reference count can be a full 32-bits wide instead of the rather limited
> > size of the stolen bits.


> The point is not in alignment trick, the point is in wait-free
> implementation that does not use CAS (only FAA and XCHG). The
> algorithm is similar to that of Joe Seigh, but not identical.

I need to clarify a point. When I wrote:

"The counting algorithm itself is identical to that of atomic-ptr."

I was referring to the counting algorithm in the patent application, not the
proxy collector algorithm you posted.


> They mention "lock-free" in the claims, but not "wait-free". Wait-free
> is a subset of lock-free, though.

Good point. It would be funny if they say, well, wait-free is lock-free by
default therefore you infringe on our stuff, and BTW, our lawyers are bigger
than yours!

;^/

Chris M. Thomasson

Jul 27, 2009, 7:50:14 AM
"Marcel Müller" <news.5...@spamgourmet.com> wrote in message
news:4a6d5c90$0$31863$9b4e...@newsspool3.arcor-online.net...
> Hallo,
>
> Dmitriy V'jukov wrote:
>> This is exactly what you looking for (download atomic-ptr-plus
>> package):
>> http://sourceforge.net/projects/atomic-ptr-plus/files/
>
> interesting concept. (I had only a rough look at it so far.)
>
> Unfortunately it relies on cmpxchg8b on x86, which expects 64-bit alignment,
> doesn't it? It could get rather hard to ensure that alignment on my platform
> (eCS). At least the C runtime won't support it.

You can make the reference counts 16 bits wide and use 32-bit atomic
operations to mutate them.


> I will see if I can take over the idea to intrusive pointers (requirement)
> and maybe compact the memory footprint of the reference counts to 32 bits
> by cutting a some bits from the counter for the ecount. But I did not
> catch the idea behind ecount and rcount so far. It looks like ecount is
> used like a temporary spinlock to synchronize read access.

FWIW, you can implement reference counting with a strong thread-safety
level using just about any deferred reclamation algorithm. The pattern
goes like this, using hazard pointers:
_______________________________________________________________
struct foo {
    int refs;
};


struct foo*
foo_strong_acquire(
    struct foo** pself
) {
    /* pin the target with a hazard pointer, then bump its count */
    struct foo* self = smr_load(pself);
    if (self) {
        int refs;
        do {
            refs = self->refs;
            if (refs < 1) {
                /* already dying; back off */
                smr_unload(self);
                return NULL;
            }
        } while (! ATOMIC_CAS(&self->refs, refs, refs + 1));
        smr_unload(self);
    }
    return self;
}


void
foo_release(
    struct foo* self
) {
    if (! ATOMIC_DEC(&self->refs)) {
        /* last reference gone; defer destruction until no hazard
           pointer references the object */
        smr_defer(self);
    }
}
_______________________________________________________________


For a proxy collector, the pattern could go like this:
_______________________________________________________________
struct foo {
    int refs;
};


struct foo*
foo_strong_acquire(
    struct foo** pself
) {
    /* the proxy reference protects *pself for the duration of the read */
    struct proxy* p = proxy_acquire(...);
    struct foo* self = *pself;
    if (self) {
        int refs;
        do {
            refs = self->refs;
            if (refs < 1) {
                self = NULL;
                break;
            }
        } while (! ATOMIC_CAS(&self->refs, refs, refs + 1));
    }
    proxy_release(p);
    return self;
}


void
foo_release(
    struct foo* self
) {
    if (! ATOMIC_DEC(&self->refs)) {
        proxy_defer(self);
    }
}
_______________________________________________________________

Chris M. Thomasson

Jul 27, 2009, 8:16:46 AM
"Chris M. Thomasson" <n...@spam.invalid> wrote in message
news:h4k490$323$1...@news.ett.com.ua...

> "Marcel Müller" <news.5...@spamgourmet.com> wrote in message
> news:4a6d5c90$0$31863$9b4e...@newsspool3.arcor-online.net...
>> Hallo,
>>
>> Dmitriy V'jukov wrote:
>>> This is exactly what you looking for (download atomic-ptr-plus
>>> package):
>>> http://sourceforge.net/projects/atomic-ptr-plus/files/
>>
>> interesting concept. (I had only a rough look at it so far.)
>>
>> Unfortunately it relies on cmpxchg8b on x86, which expects 64-bit
>> alignment, doesn't it? It could get rather hard to ensure that alignment on
>> my platform (eCS). At least the C runtime won't support it.

One point. The alignment issue aside, the algorithm as-is requires DWCAS,
which is not supported on most 64-bit systems. So, if you plan on porting to
a 64-bit system, you're going to have to take this issue into account...


> You can make the reference counts 16-bits wide and use 32-bit atomic
> operations to mutate them.

Something like:
_____________________________________________________________________
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <stdio.h>


struct xcount {
    __int16 ecount;   /* ephemeral count, low 16 bits  */
    __int16 rcount;   /* reference count, high 16 bits */
};


union count {
    struct xcount count;
    __int32 target;   /* both counts viewed as one 32-bit word */
};


LONG
adjust(
    union count* self,
    __int16 ecount,
    __int16 rcount
) {
    /* adjust both 16-bit counts with a single 32-bit atomic add */
    LONG addend = ecount + (rcount << 16U);
    return InterlockedExchangeAdd((LONG volatile*)&self->target, addend)
        + addend;
}


int main(void) {
    union count x = { 0 };
    adjust(&x, 3, 2);
    adjust(&x, -2, -1);
    adjust(&x, -1, -1);
    return 0;
}
_____________________________________________________________________

However, this does not address the need for DWCAS. You're going to need to
use an alignment or offset trick. The alignment trick steals bits, while the
offset trick uses an offset as a pointer (e.g., an index into an array).
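
A rough sketch of the offset trick on a 64-bit target (invented names, a
preallocated object table assumed): the 32-bit index and a 32-bit count fit
into one 64-bit word, so plain 64-bit atomics suffice and no DWCAS is
needed.

_____________________________________________________________________
#include <atomic>
#include <cstdint>

struct object { /* payload, inner count, ... */ };

// Objects are referenced by a slot index instead of a raw pointer.
static object* g_table[1u << 16];

// low 32 bits: slot index into g_table, high 32 bits: ephemeral count
static std::atomic<std::uint64_t> g_anchor;

object* strong_acquire() {
    // one 64-bit fetch-and-add bumps the count and reads the index together
    std::uint64_t old_word = g_anchor.fetch_add(std::uint64_t(1) << 32);
    return g_table[(std::uint32_t)old_word];
    // (the writer side that installs a new index and folds the count
    //  back into the old object is omitted here)
}
_____________________________________________________________________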

Marcel Müller

Jul 28, 2009, 11:53:39 AM
Hi,

Thanks for the links. It was a long document with many stories about
'persons of ordinary skill', and the only really important fact is in
section 72, which says that there is an additional counter at every
shared reference. However, the idea is very useful.
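
If I read it correctly, the layout boils down to something like this (my
own naming, not the patent's):

--------------------------------------------------------------
struct object {
  long rcount;          // main (persistent) reference count, lives in the object
  // payload ...
};

struct shared_ref {     // one of these per globally shared pointer
  object* ptr;
  long    ecount;       // additional ephemeral count, adjusted whenever
};                      // ptr is read or replaced (both fields as one unit)
--------------------------------------------------------------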


Marcel

Chris M. Thomasson

Jul 29, 2009, 1:26:31 AM
"Marcel Müller" <news.5...@spamgourmet.com> wrote in message
news:4a6f1f05$0$30227$9b4e...@newsspool1.arcor-online.net...

> Hi,
>
> Chris M. Thomasson wrote:
>> I found the patent application:
>>
>> http://www.google.com/patents/about?id=cs6aAAAAEBAJ&dq=Lightweight+reference+counting+using+single-target+synchronization
>
> thanks for the links. It was a long document with many stories about
> 'persons of ordinary skill'

;^)


> and the only really important fact is in section 72 telling that there is
> an additional counter at any shared reference. However, the idea is very
> useful.

The drawings basically explain everything, as they show how Joe's atomic-ptr
works.

Marcel Müller

Aug 13, 2009, 4:36:24 PM
Hi!

Dmitriy V'jukov wrote:
> On Jul 27, 11:51 am, Marcel Müller <news.5.ma...@spamgourmet.com>


>
>> I will see if I can take over the idea to intrusive pointers
>> (requirement) and maybe compact the memory footprint of the reference
>> counts to 32 bits by cutting a some bits from the counter for the
>> ecount. But I did not catch the idea behind ecount and rcount so far. It
>> looks like ecount is used like a temporary spinlock to synchronize read
>> access.
>
>
> Yes, you definitely can pack pointer and counter into single word.
> No, there is no emulation of locks, just 2 reference counters
> (ephemeral and persistent, or outer and inner), so the solution is
> lock-free.
>
> Take a look at my wait-free proxy algorithm:
> http://groups.google.com/group/lock-free/browse_frm/thread/f20a631b3b4c3813
> The main idea is the same as in Joe Seigh's atomic ptr plus. However
> that algorithm is totally wait-free (only InterlockedExchangeAdd and
> InterlockedExchange are used, no loops). I do pointer packing -
> allocate object with some alignment and then steal low bits for
> counter.
>
> You may start reading from the usage example (in the bottom of the
> post), reader threads do strong acquire of reference counter while
> writer threads do simultaneous modification of the pointer.
> The main trick is that dereferencing of a pointer and increment of the
> counter is single atomic action.

It took me a while to find time to get further into this.
It soon turned out that I cannot reasonably steal enough bits to cover
the number of concurrent references. That would blow up the memory
footprint of the application too much. In my case it happens that many
different shared pointers refer to the same object, and this can cause
significantly more than one active reference per thread.

But I think I finally have a solution.
I put together Joe Seigh's idea, your implementation with packed
structures, intrusive reference counting, and smarter handling of the
outer counters to reduce the number of additional bits. What came out
is an algorithm with the following properties:

- Size of a shared smart pointer instance: like a C pointer
- Size of a local smart pointer instance: like a C pointer
- Overhead of the intrusive reference count: like long
- Supported number of reader threads: in practice unlimited
- Atomic primitives:
  - InterlockedExchangeAdd
  - InterlockedExchange
  - InterlockedCompareExchange
  The other primitives used could be replaced by InterlockedExchangeAdd.
- Wait-free
- Intrusive reference counting based on a public interface
- Automatic alignment, but typically no explicit alignment required (*1)
- Local smart pointer instances are binary compatible with C-style
  pointers. Useful for fast access.
- The right access strategy is chosen automatically based on the
  volatile keyword.

The drawback is:
- 3 interlocked instructions for getting a local reference (amortized)

(*1) The trick is that the temporary reference count in the shared
pointers is transferred to the main reference count as soon as possible.
This reduces the number of required bits to nearly one. In fact I was
not able to trigger the overflow assertion even with only 1 stolen bit
unless I put (slow) debugging output between the critical instructions.
So 2 or 3 bits should be sufficient for most purposes. Most C runtime
libs provide at least 32-bit alignment anyway, so 2 bits are free to use
without additional alignment.


Marcel

Btw: it is amazing that these modern techniques do not require anything
that has not been available on an almost 20-year-old 486 CPU.


--------------------------------------------------------------
#define INCL_DOS
#include <os2.h>

#include <process.h>
#include <stdlib.h>
#include <stdio.h>
#include <stdarg.h>
#include <string.h>
#include <time.h>
#include <malloc.h>
#include <assert.h>

#define Sleep DosSleep

//#define DEBUGLOG(x)
#define DEBUGLOG(x) debuglog x

#define WORD_WIDTH 32
#define STOLEN_BITS 2U
#define CLIB_ALIGNMENT 4U // Intrinsic alignment of C runtime
#define ALIGNMENT (1U << STOLEN_BITS)
#define POINTER_MASK (~ALIGNMENT+1U)
#define COUNTER_MASK (ALIGNMENT-1U)
#define XCOUNT_SHIFT (WORD_WIDTH-STOLEN_BITS)
#define TYPE my_data


// log to stderr
static void debuglog( const char* fmt, ... )
{ va_list va;
  PTIB ptib;
  char buffer[14+1024+1];
  ULONG dummy;

  // Get TID
  DosGetInfoBlocks(&ptib, NULL);

  va_start(va, fmt);
  // 8+ 1+4+ 1 = 14
  sprintf(buffer, "%08ld %04ld ", clock(), ptib->tib_ptib2->tib2_ultid);
  vsnprintf(buffer+14, 1024, fmt, va);
  va_end( va );
  DosWrite(2, buffer, 14 + strlen(buffer+14), &dummy);
  //fputs( buffer, stderr );

  //Sleep(0);
}

// Intrinsics for gcc/x86
static __inline__ void _InterlockedIncrement(__volatile__ long *pu)
{ __asm__ __volatile__("lock; incl %0"
                       : "+m" (*pu) : : "cc");
}
static __inline__ void _InterlockedDecrement(__volatile__ long *pu)
{ __asm__ __volatile__("lock; decl %0"
                       : "+m" (*pu) : : "cc");
}
static __inline__ long _InterlockedExchange(__volatile__ long *pu, long u)
{ __asm__ __volatile__("xchgl %1, %0"
                       : "+m" (*pu), "+r" (u));
  return u;
}
static __inline__ void _InterlockedAdd(__volatile__ long *pu, const long uAdd)
{ __asm__ __volatile__("lock; addl %1, %0"
                       : "+m" (*pu) : "nr" (uAdd) : "cc");
}
static __inline__ void _InterlockedSub(__volatile__ long *pu, const long uSub)
{ __asm__ __volatile__("lock; subl %1, %0"
                       : "+m" (*pu) : "nr" (uSub) : "cc");
}
static __inline__ long _InterlockedExchangeAdd(__volatile__ long *pu, long uAdd)
{ __asm__ __volatile__("lock; xaddl %1, %0"
                       : "+m" (*pu), "+r" (uAdd) : : "cc");
  return uAdd;
}
static __inline__ long _InterlockedCompareExchange(__volatile__ long *pu, long uOld, const long uNew)
{ __asm__ __volatile__("lock; cmpxchgl %2, %0"
                       : "+m" (*pu), "+a" (uOld) : "r" (uNew) : "cc");
  return uOld;
}

// base class for reference counted objects
struct intrusive_count
{
  friend class intrusive_ptr;
  typedef unsigned char offset; /* must be able to hold ALIGNMENT */
  long count;

 protected:
  intrusive_count() : count(0) {}
  intrusive_count(const intrusive_count&) : count(0) {}
  ~intrusive_count() {}

#if ALIGNMENT > CLIB_ALIGNMENT
 public:
  // alignment
  void* operator new( unsigned int len )
  { char* p = (char*)::operator new(len + sizeof(offset)
                                    + ALIGNMENT - CLIB_ALIGNMENT);
    offset off = ((-(int)p-sizeof(offset)) & COUNTER_MASK) + sizeof(offset);
    p += off;
    ((offset*)p)[-1] = off;
    return p;
  }
  void operator delete( void* ptr )
  { char* p = (char*)ptr;
    offset off = ((offset*)p)[-1];
    ::operator delete(p - off);
  }
#endif
};


// test tracer - used as user data
long inst_counter = 0;
long id_counter = 0;

struct my_data : public intrusive_count
{
  const int i;
  const int j;

  my_data(int i) : i(i), j(_InterlockedExchangeAdd(&id_counter, 1))
  { _InterlockedIncrement(&inst_counter);
    DEBUGLOG(("ctor %p %d %d %d\r\n", this, i, j, inst_counter));
  }

  ~my_data()
  { DEBUGLOG(("dtor %p %d %d %d\r\n", this, i, j, inst_counter));
    _InterlockedDecrement(&inst_counter);
  }
};


class intrusive_ptr
{private:
  mutable long data;

  // Strongly thread safe read
  long acquire() volatile const
  { if (!data)
      return 0; // fast path
    long old_outer = _InterlockedExchangeAdd(&data, 1);
    ++old_outer;
    assert((old_outer & COUNTER_MASK) != 0); // overflow condition
    long new_outer = old_outer & POINTER_MASK;
    if (new_outer)
      // Transfer counter to obj->count.
      _InterlockedAdd(&((TYPE*)new_outer)->count,
        1 - (((old_outer & COUNTER_MASK) - 1) << XCOUNT_SHIFT));
    // And reset it in *this.
    if (_InterlockedCompareExchange(&data, old_outer, new_outer)
        != old_outer && new_outer)
      // Someone else does the job already => outer count.
      _InterlockedAdd(&((TYPE*)new_outer)->count,
        (old_outer & COUNTER_MASK) << XCOUNT_SHIFT);
      // The global count cannot return to zero here,
      // since we have an active reference.
    return new_outer;
  }

 public:
  intrusive_ptr()
  : data(0) {}
  intrusive_ptr(TYPE* obj)
  : data((long)obj)
  { if (obj)
      _InterlockedIncrement(&obj->count);
  }
  intrusive_ptr(const intrusive_ptr& r)
  : data(r.data & POINTER_MASK)
  { if (data)
      _InterlockedIncrement(&((TYPE*)data)->count);
  }
  intrusive_ptr(volatile const intrusive_ptr& r)
  : data(r.acquire())
  {}

  TYPE* get() const
  { // non-volatile instances may not have a non-zero counter.
    return (TYPE*)data;
  }

  ~intrusive_ptr()
  { long adjust = -((data & COUNTER_MASK) << XCOUNT_SHIFT) -1;
    TYPE* obj = (TYPE*)data;
    adjust += _InterlockedExchangeAdd(&obj->count, adjust);
    if (obj && adjust == 0)
      delete obj;
  }

  void swap(intrusive_ptr& r)
  { long new_data = r.data;
    r.data = data;
    data = new_data;
  }
  void swap(intrusive_ptr& r) volatile
  { long temp = r.data;
    r.data = _InterlockedExchange(&data, r.data);
    // transfer hold count
    temp = r.data & COUNTER_MASK;
    if (temp)
    { TYPE* obj = (TYPE*)(r.data & POINTER_MASK);
      _InterlockedSub(&obj->count, temp << XCOUNT_SHIFT);
      r.data = (long)obj;
    }
  }
  void swap(volatile intrusive_ptr& r)
  { r.swap(*this);
  }

  intrusive_ptr& operator=(TYPE* r)
  { intrusive_ptr(r).swap(*this);
    return *this;
  }
  intrusive_ptr& operator=(const intrusive_ptr& r)
  { intrusive_ptr(r).swap(*this);
    return *this;
  }
  intrusive_ptr& operator=(volatile const intrusive_ptr& r)
  { intrusive_ptr(r).swap(*this);
    return *this;
  }
  void operator=(TYPE* r) volatile
  { intrusive_ptr(r).swap(*this);
  }
  void operator=(const intrusive_ptr& r) volatile
  { intrusive_ptr(r).swap(*this);
  }
  void operator=(volatile const intrusive_ptr& r) volatile
  { intrusive_ptr(r).swap(*this);
  }
};


// And here is the test:

long thread_counter = 0;

static void reader_thread(void* p)
{ volatile intrusive_ptr& s = *(volatile intrusive_ptr*)p;
  for (int i = 0; i <= 100000; ++i)
  { intrusive_ptr h(s);
    // here is our object
    my_data* data = h.get();
    // work with data
    DEBUGLOG(("read %p %d %d\r\n", data, data->i, data->j));
    //Sleep(0);
    // get one more ref but now with normal thread safety
    intrusive_ptr h2(h);
    // ...
  }
  _InterlockedDecrement(&thread_counter);
}

static void writer_thread(void* p)
{ volatile intrusive_ptr& s = *(volatile intrusive_ptr*)p;
  for (int i = 1; i <= 100000; ++i)
  { my_data* data = new my_data(i);
    DEBUGLOG(("repl %p %d %d\r\n", data, i, data->j));
    s = data;
    //Sleep(0);
  }
  _InterlockedDecrement(&thread_counter);
}

static void starter(void (*fn)(void*), volatile void* p)
{ _InterlockedIncrement(&thread_counter);
  _beginthread(fn, NULL, 0, (void*)p);
}

int main()
{ { volatile intrusive_ptr s(new my_data(0));

    starter(reader_thread, &s);
    starter(writer_thread, &s);
    starter(reader_thread, &s);
    starter(writer_thread, &s);
    starter(reader_thread, &s);
    starter(reader_thread, &s);
    starter(writer_thread, &s);
    starter(reader_thread, &s);
    starter(reader_thread, &s);
    starter(reader_thread, &s);
    starter(writer_thread, &s);
    starter(reader_thread, &s);
    starter(reader_thread, &s);
    starter(reader_thread, &s);
    // join threads
    do {Sleep(100);} while (thread_counter != 0);
  }

  debuglog("done - %li instances\r\n", inst_counter);
  assert(inst_counter == 0);
  putchar('\7'); // bing!
}

Dmitriy V'jukov

Aug 17, 2009, 1:10:54 PM
On Aug 14, 12:36 am, Marcel Müller <news.5.ma...@spamgourmet.org>
wrote:


Hi Marcel,


>    // Strongly thread safe read
>    long acquire() volatile const
>    { if (!data)
>        return 0; // fast path
>      long old_outer = _InterlockedExchangeAdd(&data, 1);
>      ++old_outer;
>      assert((old_outer & COUNTER_MASK) != 0); // overflow condition


/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\

I can't see how you protect against that overflow... other than with
the assert.


>      long new_outer = old_outer & POINTER_MASK;
>      if (new_outer)
>        // Transfer counter to obj->count.
>        _InterlockedAdd(&((TYPE*)new_outer)->count,
>          1 - (((old_outer & COUNTER_MASK) - 1) << XCOUNT_SHIFT));
>      // And reset it in *this.
>      if (_InterlockedCompareExchange(&data, old_outer, new_outer)
>          != old_outer && new_outer)
>        // Someone else does the job already => outer count.
>        _InterlockedAdd(&((TYPE*)new_outer)->count,
>          (old_outer & COUNTER_MASK) << XCOUNT_SHIFT);
>        // The global count cannot return to zero here,
>        // since we have an active reference.
>      return new_outer;
>    }

>    ~intrusive_ptr()


>    { long adjust = -((data & COUNTER_MASK) << XCOUNT_SHIFT) -1;
>      TYPE* obj = (TYPE*)data;
>      adjust += _InterlockedExchangeAdd(&obj->count, adjust);
>      if (obj && adjust == 0)
>        delete obj;
>    }


I guess it must be something like:

~intrusive_ptr()
{ long adjust = -((data & COUNTER_MASK) << XCOUNT_SHIFT) -1;
  TYPE* obj = (TYPE*)(data & POINTER_MASK);
  if (obj) {
    adjust += _InterlockedExchangeAdd(&obj->count, adjust);
    if (adjust == 0)
      delete obj;
  }
}
?


--
Dmitriy V'jukov

Marcel Müller

Aug 18, 2009, 11:56:17 AM
Dmitriy V'jukov wrote:
> On Aug 14, 12:36 am, Marcel Müller <news.5.ma...@spamgourmet.org>

> wrote:
>
> Hi Marcel,
>
>
>> // Strongly thread safe read
>> long acquire() volatile const
>> { if (!data)
>> return 0; // fast path
>> long old_outer = _InterlockedExchangeAdd(&data, 1);
>> ++old_outer;
>> assert((old_outer & COUNTER_MASK) != 0); // overflow condition
>
>
> /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\
>
> I can't get how you protect against that overflow... other than
> assert.

Exactly. But the assertion does not trigger unless there are I/O calls
at this point. Normally I used STOLEN_BITS = 4. It is nearly impossible
to get more than 15 threads into the same few instructions for the same
object - even with logging in between.
In fact I was not able to trigger the assertion with only 2 stolen bits
without the logging. Even with only 1 bit, you have to test for a very
long time to get two threads into exactly the right place at the right time.

As far as I have seen, your code does not have any protection against
overflow either. Or did I miss something?
And furthermore, you have to be careful with aliasing, because when a
thread has more than one local pointer taken from the same shared
pointer (e.g. if you manage strings), you need bits for more than the
number of threads.


>> ~intrusive_ptr()
>> { long adjust = -((data & COUNTER_MASK) << XCOUNT_SHIFT) -1;
>> TYPE* obj = (TYPE*)data;
>> adjust += _InterlockedExchangeAdd(&obj->count, adjust);
>> if (obj && adjust == 0)
>> delete obj;
>> }
>
>
> I guess it must be something like:
>
> ~intrusive_ptr()
> { long adjust = -((data & COUNTER_MASK) << XCOUNT_SHIFT) -1;
> TYPE* obj = (TYPE*)(data & POINTER_MASK);
> if (obj) {
> adjust += _InterlockedExchangeAdd(&obj->count, adjust);
> if (adjust == 0)
> delete obj;
> }
> }
> ?

You are right. I already fixed this.
The same bug was in swap.

Additionally I swapped the bits in the intrusive reference counter. Now
the stolen bits are taken from the lower end. This removes all of the
bit shift operations, because 'normal' reference counts always increment
by one, which turns into the constant ALIGNMENT. The generated assembly
code is really pretty small. The whole binary takes about 2 kB.
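
Roughly like this (a sketch of the new layout, not the final code):

--------------------------------------------------------------
#include <atomic>

// The stolen bits now occupy the LOW end of the intrusive count, so an
// ordinary reference is worth the constant ALIGNMENT and no shifts remain.
enum { STOLEN_BITS = 2, ALIGNMENT = 1 << STOLEN_BITS };

std::atomic<long> count(0);  // references * ALIGNMENT + hold bits

void add_reference()  { count.fetch_add(ALIGNMENT); } // +1 normal reference
void add_hold(long h) { count.fetch_add(h); }         // transfer hold bits unshifted
--------------------------------------------------------------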

Furthermore I had to add some additional methods to the class to
disambiguate the overloaded function lookup for non-constant objects.

I may post the whole code if you are interested, but I should do some
further testing first. Currently I am replacing the smart pointer class
of a larger application with this one.


Marcel

Dmitriy V'jukov

Aug 19, 2009, 2:58:35 AM
On 18 Aug, 19:56, Marcel Müller <news.5.ma...@spamgourmet.org> wrote:
> Dmitriy V'jukov wrote:
> > On Aug 14, 12:36 am, Marcel Müller <news.5.ma...@spamgourmet.org>

> > wrote:
>
> > Hi Marcel,
>
> >>    // Strongly thread safe read
> >>    long acquire() volatile const
> >>    { if (!data)
> >>        return 0; // fast path
> >>      long old_outer = _InterlockedExchangeAdd(&data, 1);
> >>      ++old_outer;
> >>      assert((old_outer & COUNTER_MASK) != 0); // overflow condition
>
> > /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\
>
> > I can't get how you protect against that overflow... other than
> > assert.
>
> Exactly. But the assertion does not trigger unless there are I/O calls
> at this point. Normally I used STOLEN_BITS = 4. It is nearly impossible
> get more than 15 threads into the same few instructions for the same
> object - even with logging in between.
> In fact I was not able to trigger the assertion with only 2 stolen bits
> without the logging. Even with only 1 bit, you have to test very long to
> get two threads exactly in the right place at the right time.


Ok, I see.


> As far as I have seen your code did not have any protection against
> overflow too. Or did I miss something.


This code:
http://groups.google.com/group/lock-free/browse_frm/thread/f20a631b3b4c3813
supports up to 2^stolen_bits-1 references per object. In other
respects it must handle overflows gracefully, i.e. it's OK if the outer
or inner counters overflow.


Please post the updates if you fix something in the algorithm itself.
Btw, for testing you may consider using Relacy Race Detector:
http://groups.google.ru/group/relacy
It's especially designed for verification of this kind of stuff.


--
Dmitriy V'jukov

Marcel Müller

Aug 19, 2009, 4:11:30 AM
Dmitriy V'jukov wrote:
>> As far as I have seen your code did not have any protection against
>> overflow too. Or did I miss something.
>
> This code:
> http://groups.google.com/group/lock-free/browse_frm/thread/f20a631b3b4c3813
> supports up to 2^stolen_bits-1 references per object. In other
> respects it must handle overflows gracefully, i.e. it's ok if outer or
> inner counters overflow.

You say it's OK if the outer counter overflows?
But how do you avoid that
   _InterlockedExchangeAdd(&outer.whole, addend.whole);
increments outer.ptr, resulting in UB?


>> I may post the whole code if you are interested. But I should do some
>> further testing at first. Currently I replace the smart pointer class of
>> a larger application by this one.
>
> Please post the updates if you will fix something in the algorithm
> itself.

Except for the two bugs mentioned, I have only done optimizations so far.

> Btw, for testing you may consider using Relacy Race Detector:
> http://groups.google.ru/group/relacy
> It's especially designed for verification of such stuff.

Huh, this is a lot of stuff. As far as I can see, you use proxies for
every kind of variable or synchronization object. Is there a class
overview available? It is quite hard to grasp the whole thing from
analyzing the source and examples. And what is the meaning of $ in the
C++ code?

I did not catch how you finally identify a potential race condition by
tracking access to storage and resources. How would Relacy deal with the
undefined behavior in my acquire method (the result of the
CompareExchange is undefined)? It only turns into defined behavior once
the contributions of the other threads are done.

Furthermore, porting it to my platform may be plenty of work. I only
have a restricted set of the standard library available, and I can only
do these things in my spare time.


Marcel

Dmitriy V'jukov

Aug 20, 2009, 11:26:36 AM
On Aug 19, 1:11 am, Marcel Müller <news.5.ma...@spamgourmet.com>
wrote:

> Dmitriy V'jukov wrote:
> >> As far as I have seen your code did not have any protection against
> >> overflow too. Or did I miss something.
>
> > This code:
> >http://groups.google.com/group/lock-free/browse_frm/thread/f20a631b3b...

> > supports up to 2^stolen_bits-1 references per object. In other
> > respects it must handle overflows gracefully, i.e. it's ok if outer or
> > inner counters overflow.
>
> You say it's OK if the outer counter overflows?
> But how do you avoid that
>    _InterlockedExchangeAdd(&outer.whole,
> addend.whole);
> increments outer.ptr resulting in UB?


Well, yes, C++ declares overflow on signed types as UB. But C++ does
not have threads either, so this is not really C++: it's MSVC/x86, and
MSVC/x86 has defined semantics for signed overflow, namely modulo-N
arithmetic.
If/when you rewrite the algorithm for C++0x, you will use
std::atomic<uint32_t> or std::atomic<uint64_t>, and unsigned types in
C++ also have modulo-N arithmetic, so it must work too.
I.e. practically the algorithm requires modulo-N arithmetic, which is
IMHO quite a feasible requirement. I do not see any practical problems
here. Double-word CAS represents a much more urgent problem.
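
A trivial sanity check of the modulo-N argument (a tiny sketch, not from
the algorithm itself):

_____________________________________________________________________
#include <assert.h>
#include <stdint.h>

int main()
{ // Additions and the matching subtractions cancel even when the
  // intermediate value wraps around, so a counter that overflows still
  // returns exactly to its starting value once every acquire is released.
  uint32_t count = 0xFFFFFFF0u;  // counter close to the top
  count += 0x20;                 // 0x20 acquires - wraps past zero
  count -= 0x20;                 // 0x20 matching releases
  assert(count == 0xFFFFFFF0u);
  return 0;
}
_____________________________________________________________________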


--
Dmitriy V'jukov

Dmitriy V'jukov

Aug 20, 2009, 11:47:48 AM
On Aug 19, 1:11 am, Marcel Müller <news.5.ma...@spamgourmet.com>
wrote:

> > Btw, for testing you may consider using Relacy Race Detector:


> >http://groups.google.ru/group/relacy
> > It's especially designed for verification of such stuff.
>
> Huh, this is much stuff.

It's not as scary as it may look at first sight :)


> As far as I see you use proxies for any kind of
> variable or synchronization object. Is there some class overview
> available? It is quite hard to get the whole thing from analyzing source
> and examples.

Well, that's a weak side of Relacy right now: there is no extensive
documentation yet. First read this short tutorial:
http://groups.google.com/group/relacy/web/tutorial
Use RL_NEW/RL_DELETE/RL_MALLOC/RL_FREE for dynamic memory management.
That must be enough to express your algorithm.

Also, you may use the POSIX/Win32 API basically as-is, just add ($).


> And what is the meaning of $ in C++ code?

$ is a macro which is used for debug info collection (source file,
line, function name)


> I did not catch how you finally identify a potential race condition by
> tracking access to storage and resources.

I track the 'happens-before' relation for memory accesses. If two memory
accesses (at least one of which is a write) are not ordered by
'happens-before', then it's a race.
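
A tiny C++0x-style illustration of the relation being tracked (my sketch,
not Relacy code):

_____________________________________________________________________
#include <atomic>
#include <thread>

int data;
std::atomic<int> ready(0);

// The release store and the acquire load that observes it put the two
// plain accesses to `data` into a happens-before order, so there is no
// race. With relaxed ordering (or a plain int flag) the same two
// accesses would be unordered - which is what the detector reports.
void writer() { data = 42; ready.store(1, std::memory_order_release); }
void reader() { if (ready.load(std::memory_order_acquire) == 1) (void)data; }

int main()
{ std::thread a(writer), b(reader);
  a.join(); b.join();
  return 0;
}
_____________________________________________________________________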


> How could relacy deal with the
> undefined behavior in my acquire method (The result of CompareExchange
> is undefined).

What exactly do you mean here? The undefined result of signed overflow
in C++?


> It only turns into defined behavior when the
> contributions of the other threads are done.

Not sure I get you here.


> Futhermore, porting it to my platform may be plenty much of work. I only
> have a restricted set of the standard library available. And I only can
> do these things in my spare time.

No need for porting. That's a nice side of Relacy: you may use
Windows/x86 for verification of an algorithm which is intended to work
on Solaris/SPARC. Relacy verifies against the formal semantics, not
against the underlying OS/hardware.


--
Dmitriy V'jukov
