On
http://softwareforums.intel.com/ids/board/message?board.id=42&message.id=143#M143
a mr lockfree writes ``It is possible to create dynamic lock-free lists
that allow for concurrent pushes, pops, and iterations using a "special"
reference counted garbage-collector. You implement this using the
wonderful cmpxchg8b instruction... ( why didn't intel do a cmpxchg16b
for 64-bit? )''
I was wondering how this is different from what I wrote in my container
library (and whether it might be interesting for that library, if the
code is free/libre/open) so I wanted to take a look at the code, but the
links seem to be broken.
Would mr. lockfree be so kind as to post up-to-date links?
Thanks,
rlc
NB: I'm pretty sure mr lockfree == mr Seigh, which is why I'm posting
here, but if I'm wrong, please forgive the confusion :)
lockfree is SenderX in the Intel forums. I'm jseigh there. That's an old
post, probably referring to Itanium. The Intel ia32-64 has a cmpxchg16b instruction.
It's AMD that's missing it. I asked that question in AMD's developer forums
but I don't really expect an answer. Hardware architects generally don't
publish their rationale for their architectures. It would bring into question
their godlike infallible status.
Joe Seigh
That Intel link pointed to really old collector code. I did post my
"lock-free vs. lock-based" test on here, and I believe some people
downloaded it. If you did, could you post the assembler part of it here? I
have all of the old stuff on DVD, but I can't get to it right now. If nobody
has the assembler code I posted lying around, I will try to repost it before
Friday.
The old code is close to what the K42 project is doing; you can read about
how they use proxy-like gc to help detect quiescent periods:
http://groups.google.com/groups?selm=3F2E4165.75ED3D33%40xemaps.com
K42 might be using spin-locks for their RCU subsystem; my stuff used cas
loops to detect quiescent periods for specific collections that the gc was
protecting... Unfortunately I can't post the new stuff.
> NB: I'm pretty sure mr lockfree == mr Seigh,
Nope. :)
P.S.
I am currently putting the finishing touches on a POSIX compliant library
that shows how lock-free and lock-based can work together. It will even
provide a high-speed pthread interface for win32 versions. The rw-lock is
running 4-6 times faster than the current pthread-win32 library, under heavy
contention mixed with random cancellations.
Here is a strictly i586-i686 win32 ad hoc library of mine that will most
likely show you how to increase the speed of your inter-thread
communication. You can do lock-free FIFO and LIFO with it:
http://appcore.home.comcast.net
( You should use the code for experimental and educational purposes
only. )
The code uses an experimental pointer delay algorithm that heavily reduces
the chance of ABA:
http://groups.google.com/groups?selm=plCAa.475147%24Si4.412103%40rwcrnsc51.ops.asp.att.net&rnum=1
The Queue API shows off lock-based communication ( it doesn't guarantee FIFO
ordering all the time ) that performs damn near as well as the lock-free stuff.
There are concurrent lists implemented with hashed-locks. There is also an
advanced thread-pool and job-scheduler in there.
Have fun... ;)
Doesn't ia64 have a cmp8xchg16? I don't think it has a double-wide cas...
> I asked that question in AMD's developer forums
> but I don't really expect an answer. Hardware architects generally don't
> publish their rationale for their architectures. It would bring into
> question their godlike infallible status.
lol.
SenderX wrote:
>
> > Intel ia32-64 has a cmpxchg16b instruction.
> > It's AMD that's missing it.
>
> Doesn't ia64 have a cmp8xchg16? I don't think it has a double-wide cas...
>
Itanium has cmp8xchg16. The ia32-64, if that's the right designation, is
Intel's 64 bit extension to its 32 bit architecture and is mostly equivalent
to AMD's 64 bit extensions, except Intel has cmpxchg16b while AMD does not.
Joe Seigh
Ahh yes...
http://www.intel.com/technology/64bitextensions/faq.htm
( looks like they're scrambling to compete with amd64 )
> except Intel has cmpxchg16b while AMD does not.
I wonder why AMD would do such a thing; they're total idiots for making such
an "obvious and massive mistake"!
Come to think of it, the Itanium's cmp8xchg16 instruction is fu$%$cking
weird. Why not a straightforward double-wide cas? Do all hardware guys like
to play tricks?
SenderX wrote:
>
> > The ia32-64, if that's the right designation, is
> > Intel's 64 bit extension to its 32 bit architecture and is mostly
> > equivalent to AMD's 64 bit extensions
>
> Ahh yes...
>
> http://www.intel.com/technology/64bitextensions/faq.htm
> ( looks like they're scrambling to compete with amd64 )
>
> > except Intel has cmpxchg16b while AMD does not.
>
> I wonder why AMD would do such a thing; they're total idiots for making such
> an "obvious and massive mistake"!
I wouldn't say that they actually are idiots but it sort of looks that way.
I don't think that kernel and thread architects should reward AMD by
dumbing down their code. They should use cmpxchg16b when available and
slower work arounds when not, rather than slower work arounds all the
time.
>
> Come to think of it, the Itanium's cmp8xchg16 instruction is fu$%$cking
> weird. Why not a straightforward double-wide cas? Do all hardware guys like
> to play tricks?
I think they thought they could save a few clock cycles doing a 64 bit compare
rather than a 128 bit one, and assumed (incorrectly) that the only use of double
wide CAS was for versioning. Of course there's the possibility that there's a
brilliant reason for doing it that way, but they don't publish rationales, so we'll
never know and in our ignorance keep on thinking that all hw architects are
idiots. :)
Joe Seigh
> I am currently putting the finishing touches on a POSIX compliant library
> that shows how lock-free and lock-based can work together. It will even
> provide a high-speed pthread interface for win32 versions. The rw-lock is
> running 4-6 times faster than the current pthread-win32 library, under heavy
> contention mixed with random cancellations.
>
> Here is a strictly i586-i686 win32 ad hoc library of mine that will most
> likely show you how to increase the speed of your inter-thread
> communication. You can do lock-free FIFO and LIFO with it:
>
> http://appcore.home.comcast.net
> ( You should use the code for experimental and educational purposes
> only. )
>
> The code uses an experimental pointer delay algorithm that heavily reduces
> the chance of ABA:
> http://groups.google.com/groups?selm=plCAa.475147%24Si4.412103%40rwcrnsc51.ops.asp.att.net&rnum=1
>
> The Queue API shows off lock-based communication ( it doesn't guarantee FIFO
> ordering all the time ) that performs damn near as well as the lock-free stuff.
>
> There are concurrent lists implemented with hashed-locks. There is also an
> advanced thread-pool and job-scheduler in there.
>
> Have fun... ;)
>
>
As far as the ABA problem is concerned, I kinda like M. M. Michael's
"Safe Memory Reclamation for Dynamic Lock-Free Objects Using Atomic
Reads and Writes" (PODC 2002, July 21-24, 2002), which I've implemented
in libmemory (http://jail-ust.sf.net/index.php?section=3&page=2). It
isn't very well-tested yet (mostly due to lack of time) but it should work..
That is basically also the reason why I don't need a DCAS in libcontain
(for the moment) but the problem is that I have a race condition in my
vector implementation - which I was hoping you wouldn't have in yours..
I.e., you don't have the same problem because you use a realloc to
resize your array, which you apparently assume to be atomic.
I haven't had the time to look at your code in detail yet - mostly
because it doesn't compile /chez moi/ so I can't run the tests on it,
but I'd be interested to know why using realloc on the array is not a
problem, in your opinion/implementation.
In mine (as in: in my array/vector implementation - not in my opinion),
it is: I have no way of knowing no-one else is referencing the pointer
while I'm copying, so if another thread performs a write operation on
the buffer before my realloc is done, I can't define the behaviour of
the array..
I've been pondering making a readers/writers lock and only counting
resize() as a writer, but that would kinda defeat the idea of being
lock-free..
rlc
PS: libcontain also contains a lock-free concurrent list implementation
- also based on an algorithm by M. M. Michael. The latest version of the
code of the entire library is here:
http://jail-ust.sourceforge.net/index.php?section=3&page=1
and the code of the list implementation here:
http://cvs.sourceforge.net/viewcvs.py/jail-ust/libcontain/list.c?view=markup
and
http://cvs.sourceforge.net/viewcvs.py/jail-ust/libcontain/list.h?view=markup
It should work, and it would avoid ABA 100% of the time.
However, you have to do a lot of bookkeeping IMHO: per-thread lists, bin
sorting them, iterating them and comparing against per-thread
hazard-pointers, tracking of the participating threads, and ( I think ) the
number of hazards per-thread is static...
Could a thread reference more nodes than there are hazards per-thread?
> That is basically also the reason why I don't need a DCAS in libcontain
> (for the moment) but the problem is that I have a race condition in my
> vector implementation - which I was hoping you wouldn't have in yours..
> I.e., you don't have the same problem because you use a realloc to
> resize your array, which you apparently assume to be atomic.
Yes, the Array API is NOT thread-safe in any way, shape, or form.
> but I'd be interested to know why using realloc on the array is not a
> problem, in your opinion/implementation.
I just never got around to putting the proxy gc code in there..
> I haven't had the time to look at your code in detail yet - mostly
> because it doesn't compile /chez moi/ so I can't run the tests on it,
You need platform sdk and vc++ 6.0. What errors/warnings do you get?
> I've been pondering making a readers/writers lock and only counting
> resize() as a writer, but that would kinda defeat the idea of being
> lock-free..
You can do a fast rw-lock with double-wide cas.
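SenderX's actual rw-lock isn't posted in the thread, so here is only a minimal sketch of the general idea, written with C11 atomics: the reader count and a writer flag packed into one atomic word manipulated by a CAS loop. The names, the single-word packing (the double-wide version would add a waiter/ABA field), and the try-only interface with no slow path are all my assumptions, not his code.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Sketch only: reader count in the low 31 bits, writer flag in the top bit.
 * A real lock would spin or block instead of returning 0 on failure. */
#define RW_WRITER (1u << 31)

typedef struct { _Atomic uint32_t word; } rwlock;

static int rw_try_rdlock(rwlock *l)
{
    uint32_t old = atomic_load(&l->word);
    do {
        if (old & RW_WRITER) return 0;          /* writer active: fail fast */
    } while (!atomic_compare_exchange_weak(&l->word, &old, old + 1));
    return 1;                                   /* acquired as reader */
}

static void rw_rdunlock(rwlock *l) { atomic_fetch_sub(&l->word, 1); }

static int rw_try_wrlock(rwlock *l)
{
    uint32_t expected = 0;                      /* no readers, no writer */
    return atomic_compare_exchange_strong(&l->word, &expected, RW_WRITER);
}

static void rw_wrunlock(rwlock *l) { atomic_fetch_and(&l->word, ~RW_WRITER); }
```

The fast path for parallel readers is exactly the cas loop described later in the thread: uncontended readers never touch anything but that one word.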
P.S.
Here is some pseudo-code for my old collector:
/* Lock-Free Proxy Garbage Collector
 **************************************/
struct node
{
struct node *pNext;
struct node *pGcNext;
const void *pState;
void ( *fpDtor ) ( void* );
};
/* Must fit into double-wide cas for your specific cpu! */
struct gc
{
struct node *pTrash;
unsigned __int16 Proxy;
unsigned __int16 Aba;
};
void AddRef( struct gc *pThis )
{
do
{
read old and local copies
increment local.Proxy
}
while ( ! dwide_cas( pThis, old, local ) ); // Acquire membar
}
void Collect( struct gc *pThis, struct node *pNode )
{
do
{
read old and local copies
push pNode onto old.pTrash
}
while ( ! dwide_cas( pThis, old, local ) ); // Release membar
}
void Release( struct gc *pThis )
{
do
{
read old and local copies
decrement local.Proxy
increment local.Aba
if ( local.Proxy == 0 )
{
local.pTrash = 0;
}
}
while ( ! dwide_cas( pThis, old, local ) ); // Release membar
if ( local.Proxy == 0 )
{
we have hit a quiescent period! Flush the callbacks.
iterate through the nodes and call node.fpDtor( node.pState ) for
each.
}
}
Then you would use it like:
/* Lock-Free Stack
 **************************************/
typedef struct node lfstack;
typedef struct gc lfstack_gc;
void LockFreeStackNodeDtor( void *pNode )
{
free( pNode );
}
void LockFreeStackPush( lfstack *pThis, const void *pState )
{
struct node *pNode = malloc( sizeof( *pNode ) );
pNode->pNext = pNode->pGcNext = 0;
pNode->pState = pState;
pNode->fpDtor = LockFreeStackNodeDtor;
do
{
read old and local copies
push pNode onto old
}
while ( ! normal_cas( pThis, old, local ) ); // Release membar
}
void* LockFreeStackPop( lfstack *pThis, lfstack_gc *pGc )
{
AddRef( pGc );
do
{
read old and local copies
if empty { Release( pGc ); return 0; }
pop front off old
}
while ( ! normal_cas( pThis, old, local ) ); // Release membar
Collect( pGc, popped_node );
Release( pGc );
return the popped node's pState;
}
That's basically it, and I bet you can get it working from this pseudo-code.
Also, notice how you don't have to use double-wide cas for the stack?
I will try to post the assembler very soon.
I will have to study your implementation further.
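For anyone following along in modern C: the stack half of the pseudo-code above translates almost line-for-line into C11 atomics. This sketch is mine, not from SenderX's library, and it deliberately leaks popped nodes: making pop safe against ABA and use-after-free is exactly the job of the proxy collector (or SMR), which is elided here.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Sketch of the stack pseudo-code with reclamation omitted (nodes leak).
 * Names are illustrative, not from the library under discussion. */
struct snode {
    struct snode *next;
    const void   *state;
};

typedef struct { _Atomic(struct snode *) head; } lfstack_sketch;

static void push_sketch(lfstack_sketch *s, const void *state)
{
    struct snode *n = malloc(sizeof *n);
    n->state = state;
    n->next = atomic_load(&s->head);
    /* on CAS failure, n->next is reloaded with the current head */
    while (!atomic_compare_exchange_weak(&s->head, &n->next, n))
        ;
}

static const void *pop_sketch(lfstack_sketch *s)
{
    struct snode *old = atomic_load(&s->head);
    do {
        if (!old) return NULL;              /* empty */
    } while (!atomic_compare_exchange_weak(&s->head, &old, old->next));
    return old->state;                      /* old is leaked: gc's job */
}
```

Note that, as the pseudo-code points out, only single-word CAS is needed on the stack head itself; the double-wide CAS lives entirely in the collector.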
> while ( ! normal_cas( pThis, old, local ) ); // Release membar
                                                  ^^^^^^^^^^
DOH, that should be an Acquire membar!
> void Release( struct gc *pThis )
> {
> do
> {
> read old and local copies
>
> decrement local.Proxy
> increment local.Aba
>
> if ( local.Proxy == 0 )
> {
> local.pTrash = 0;
> }
> }
>
> while ( ! dwide_cas( pThis, old, local ) ); // Release membar
>
> if ( local.Proxy == 0 )
> {
> we have hit a quiescent period! Flush the callbacks.
>
> iterate through the nodes and call node.fpDtor( node.pState ) for
^^^^^^^^^^^^^^^^^^^
Should be:
iterate through the nodes and call node.fpDtor( node ) for
( *don't pass user object to dtor )
> each.
> }
> }
* - This is not a requirement. You can easily create a collection that "uses
the gc to delete user state, along with the nodes"...
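To make the counter manipulation concrete, here is a sketch of AddRef/Release (mine, not SenderX's code) with the double-wide cas replaced by a single 64-bit CAS over a packed word: a 32-bit trash handle plus the 16-bit Proxy and Aba counters. On 32-bit x86 the handle would be the real pTrash pointer and the CAS would be cmpxchg8b, as in the assembler posted later in the thread; the macro names are my own.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Pack { trash handle, Proxy count, Aba count } into one 64-bit word. */
#define PACK(trash, proxy, aba) \
    (((uint64_t)(trash) << 32) | ((uint64_t)(uint16_t)(proxy) << 16) | (uint16_t)(aba))
#define TRASH(w) ((uint32_t)((w) >> 32))
#define PROXY(w) ((uint16_t)((w) >> 16))
#define ABA(w)   ((uint16_t)(w))

typedef _Atomic uint64_t gc_word;

static void gc_addref(gc_word *g)
{
    uint64_t old = atomic_load(g), upd;
    do {    /* bump Proxy, keep trash and Aba */
        upd = PACK(TRASH(old), PROXY(old) + 1, ABA(old));
    } while (!atomic_compare_exchange_weak(g, &old, upd));
}

/* Returns the trash handle if this release hit a quiescent period
 * (Proxy reached zero), else 0. The caller would flush the dtors. */
static uint32_t gc_release(gc_word *g)
{
    uint64_t old = atomic_load(g), upd;
    uint16_t proxy;
    do {
        proxy = (uint16_t)(PROXY(old) - 1);
        upd = PACK(proxy ? TRASH(old) : 0, proxy, ABA(old) + 1);
    } while (!atomic_compare_exchange_weak(g, &old, upd));
    return proxy ? 0 : TRASH(old);
}
```

This version increments Aba on every Release, per the tweak recommended later in the thread.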
SenderX wrote:
>
> > As far as the ABA problem is concerned, I kinda like M. M. Michael's
> > "Safe Memory Reclamation for Dynamic Lock-Free Objects Using Atomic
> > Reads and Writes" (PODC 2002, July 21-24, 2002), which I've implemented
> > in libmemory (http://jail-ust.sf.net/index.php?section=3&page=2). It
> > isn't very well-tested yet (mostly due to lack of time) but it should
> > work..
>
> It should work, and it would avoid ABA 100% of the time.
>
> However, you have to do a lot of bookkeeping IMHO: per-thread lists, bin
> sorting them, iterating them and comparing against per-thread
> hazard-pointers, tracking of the participating threads, and ( I think ) the
> number of hazards per-thread is static...
>
> Could a thread reference more nodes than there are hazards per-thread?
Not concurrently, but you can reuse the hazard pointers. You have to
remember them, so in practice it's pthread_getspecific unless you
are reusing it a lot in local scope. Similar to RCU in that regard.
The store/load membar after storing into the hazard pointer is a little
expensive, but still cheaper than the sigsetjmp I'm using in RCU. The
sigsetjmp is an expensive way to simulate an efficient C exception throwing
mechanism w/ stack unwinding. Otherwise RCU would beat SMR hands down.
RCU is maybe a little better than SMR in GC overhead but I haven't really
looked at what SMR does for scalability. RCU doesn't have any overhead
increase with the number of pointers that you use.
SMR is a lot more portable. RCU for preemptive user threads depends on
/proc functionality and/or kernel extensions.
Ugh. Brain hurts. I've been writing one big ugly testcase with mutexes,
rwlocks, RCU, and atomic_ptr. I don't think I want to throw SMR in
there. :)
Joe Seigh
Joe Seigh wrote:
> Not concurrently, but you can reuse the hazard pointers. You have to
> remember them, so in practice it's pthread_getspecific unless you
> are reusing it a lot in local scope. Similar to RCU in that regard.
> The store/load membar after storing into the hazard pointer is a little
> expensive, but still cheaper than the sigsetjmp I'm using in RCU. The
> sigsetjmp is an expensive way to simulate an efficient C exception throwing
> mechanism w/ stack unwinding. Otherwise RCU would beat SMR hands down.
I'll have to take that back somewhat. If you use SMR to traverse a linked
list you have to set the hazard pointer for every node you visit. For
RCU the overhead is just once at the start, and it's only dependent load
membars after each load, which aren't needed on most architectures and
are just a load/load on the ones (alpha) that do. So RCU beats SMR in that
case.
Joe Seigh
You also have to reset the hazard pointers on each cas failure...
>>As far as the ABA problem is concerned, I kinda like M. M. Michael's
>>"Safe Memory Reclamation for Dynamic Lock-Free Objects Using Atomic
>>Reads and Writes" (PODC 2002, July 21-24, 2002), which I've implemented
>>in libmemory (http://jail-ust.sf.net/index.php?section=3&page=2). It
>>isn't very well-tested yet (mostly due to lack of time) but it should
>>work..
>
> It should work, and it would avoid ABA 100% of the time.
>
> However, you have to do a lot of bookkeeping IMHO: per-thread lists, bin
> sorting them, iterating them and comparing against per-thread
> hazard-pointers, tracking of the participating threads, and ( I think ) the
> number of hazards per-thread is static...
>
> Could a thread reference more nodes than there are hazards per-thread?
Yes, the number of hazard pointers per thread is static, and no, a thread
should not reference more nodes than there are hazards per thread.
However, the number of hazards per thread is defined by the algorithm
and hence known to the developer, so it doesn't need to be dynamic. The
*total* number of hazard pointers - which is also static in M. M.
Michael's description - is not static in my implementation: i.e. I've
tried to reduce the number of compile-time constants to an absolute
minimum when implementing..
There is book-keeping involved. I've been working on a way to keep only
a single dlist instead of one per thread, but the overhead of that
solution is (currently) more than it's worth - still working on it,
though (nothing in CVS yet).
>>That is basically also the reason why I don't need a DCAS in libcontain
>>(for the moment) but the problem is that I have a race condition in my
>>vector implementation - which I was hoping you wouldn't have in yours..
>>I.e., you don't have the same problem because you use a realloc to
>>resize your array, which you apparently assume to be atomic.
> Yes, the Array API is NOT thread-safe in any way, shape, or form.
That's what I was afraid of.. :(
>>but I'd be interested to know why using realloc on the array is not a
>>problem, in your opinion/implementation.
> I just never got around to putting the proxy gc code in there..
I'm going to take a close look at your proxy gc code: see if it really
beats SMR. I guess it's the code in FreePool.{c,h} that you call "proxy gc"?
>>I haven't had the time to look at your code in detail yet - mostly
>>because it doesn't compile /chez moi/ so I can't run the tests on it,
> You need platform sdk and vc++ 6.0. What errors/warnings do you get?
I guess I just don't have a recent enough version of the Platform SDK
installed: I have an OOTB MSDN Visual Studio 6.0 Enterprise edition
installation - no updates applied at all..
>>I've been pondering making a readers/writers lock and only counting
>>resize() as a writer, but that would kinda defeat the idea of being
>>lock-free..
> You can do a fast rw-lock with double-wide cas.
It would still defeat the purpose of being lock-free, though :(
The race condition in question is really a corner case: it only occurs
when array_resize is called and, between the copy of the array to the
new buffer and the time the buffer is put in place in the descriptor,
some write operation occurs on the old buffer (which is still hooked
into the descriptor). I really don't like adding a lock to the container
just for a corner case like that - but I guess I'll have to :(
[snipped gc code]
Interesting - I'll see if I can make a C implementation out of it :)
[snipped stack code]
Did you take a look at my stack code?
http://cvs.sourceforge.net/viewcvs.py/jail-ust/libcontain/stack.c?view=markup
No DCAS - just CAS and SMR. If you scrap SMR, you'd have to use a DCAS
or some other memory management (other than SMR) to prevent the ABA
problem, of course.
The thing I really like about SMR is the absence of reference counters:
despite the overhead of the book keeping that SMR implies, it doesn't
have the overhead of reference counting, which is (apparently) more
contention-prone..
> That's basically it, and I bet you can get it working from this pseudo-code.
> Also, notice how you don't have to use double-wide cas for the stack?
Yeah, but that's not really surprising if you're using a garbage
collector ;)
Thanks,
rlc
> RCU is maybe a little better than SMR in GC overhead but I haven't really
> looked at what SMR does for scalability. RCU doesn't have any overhead
> increase with the number of pointers that you use.
Increasing the number of hazards per thread (which in libcontain doesn't
get above three for the moment) increases the number of hazard pointers
per thread - and therefore the total number of hazard pointers. Going
through a dlist is linear wrt the total number of hazard pointers (i.e.
the dlist's smallest possible size is the total number of hazard
pointers + 1).
> SMR is a lot more portable. RCU for preemptive user threads depends on
> /proc functionality and/or kernel extensions.
That's what I thought - makes RCU a lot less interesting from my point
of view, because my container library should port anywhere GCC
ports.. (my arch-dependent code is GCC extended inline asm, so not
having gcc won't get you anywhere, but not having Linux is not a problem).
It remains interesting, though, if the cost of SMR is really higher than
RCU's and RCU is a possibility (e.g. on Linux)..
rlc
Ronald Landheer-Cieslak wrote:
>
> Joe Seigh wrote:
>
> > RCU is maybe a little better than SMR in GC overhead but I haven't really
> > looked at what SMR does for scalability. RCU doesn't have any overhead
> > increase with the number of pointers that you use.
> Increasing the number of hazards per thread (which in libcontain doesn't
> get above three for the moment) increases the number of hazard pointers
> per thread - and therefore the total number of hazard pointers. Going
> through a dlist is linear wrt the total number of hazard pointers (i.e.
> the dlist's smallest possible size is the total number of hazard
> pointers + 1).
Yes, but you have to set a hazard pointer for every node you visit. That
can get to be expensive.
Joe Seigh
Most GC schemes require some kind of action before a node can safely be
accessed - SMR isn't unique in this respect and, because it uses "local"
pointers which are only written by one thread at a time, the SMR way of
doing it avoids contention.
Like I said, I don't know how RCU does it, but if it doesn't require
special action on node access, it may indeed have a lot less overhead
than SMR..
rlc
Ronald Landheer-Cieslak wrote:
> Most GC schemes require some kind of action before a node can safely be
> accessed - SMR isn't unique in this respect and, because it uses "local"
> pointers which are only written by one thread at a time, the SMR way of
> doing it avoids contention.
>
> Like I said, I don't know how RCU does it, but if it doesn't require
> special action on node access, it may indeed have a lot less overhead
> than SMR..
>
AFAIK, SMR does a
loop:
load from shared pointer
store in thread local hazard pointer
store/load membar
reload from shared pointer
if new value != hazard pointer goto loop:
You have to repeat this for every shared pointer. In a linked list
this is every node visited. The store/load is expensive as far as
membars go.
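That acquire sequence, sketched in C11 atomics (the slot and pointer names are mine, not from any particular SMR implementation):

```c
#include <stdatomic.h>
#include <stddef.h>

/* Sketch of the SMR acquire loop described above: publish the loaded
 * pointer in a thread-local hazard slot, fence, then re-check that the
 * shared pointer hasn't moved before dereferencing it. */
static void *smr_acquire(_Atomic(void *) *shared, _Atomic(void *) *hazard)
{
    void *p;
    for (;;) {
        p = atomic_load(shared);                   /* load from shared ptr  */
        atomic_store(hazard, p);                   /* publish hazard ptr    */
        atomic_thread_fence(memory_order_seq_cst); /* the store/load membar */
        if (atomic_load(shared) == p)              /* still current?        */
            return p;                              /* safe to dereference   */
    }
}
```

The seq_cst fence is the expensive store/load membar being discussed; the whole loop repeats per node visited, which is the cost RCU avoids.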
RCU without the restart stuff is just
load from shared pointer
dependent load membar
On most architectures (alpha being the main exception) dependent load
membars are not needed. So it's just
load from shared pointer
Joe Seigh
Very true. A bit inflexible, but it will work.
> There is book-keeping involved. I've been working on a way to keep only
> a single dlist instead of one per thread, but the overhead of that
> solution is (currently) more than it's worth - still working on it,
> though (nothing in CVS yet).
It would be hard for a global list to compete with a list per-thread that
could be kept in its own cache line.
> > Yes, the Array API is NOT thread-safe in anyway, shape or form.
> That's what I was afraid of.. :(
Yup... Too lazy to make the array thread-safe. It's safe in the POSIX version,
but I am still polishing it.
;(...
> I'm going to take a close look at your proxy gc code: see if it really
> beats SMR. I guess it's the code in FreePool.{c,h} that you call "proxy
> gc"?
Nope. The freepool just creates a pool of reference counted objects. It's not
meant to be a collector for lock-free algos like SMR or RCU.
> I guess I just don't have a recent enough version of the Platform SDK
> installed: I have an OOTB MSDN Visual Studio 6.0 Enterprise edition
> installation - no updates applied at all..
The code is ad hoc and seems to only like vc++ 6.0 sp5. What types of errors
are you getting?
> It would still defeat the purpose of being lock-free, though :(
Yes, under write contention it would; however, when reads come in parallel,
the fast rw-lock just runs through a cas loop and no slow path is hit...
> I really don't like adding a lock to the container
> just for a corner case like that - but I guess I'll have to :(
Locks can be used to help lock-free, you just have to get tricky...
:O
> [snipped gc code]
> Interesting - I'll see if I can make a C implementation out of it :)
I already have a full blown test in C that I posted to this group... I can't
seem to find the test, but I did find the assembler code for the AddRef,
Collect, and *Release functions. It's at the end of this post.
> [snipped stack code]
> Did you take a look at my stack code?
>
> http://cvs.sourceforge.net/viewcvs.py/jail-ust/libcontain/stack.c?view=markup
> No DCAS - just CAS and SMR. If you scrap SMR, you'd have to use a DCAS
> or some other memory management (other than SMR) to prevent the ABA
> problem, of course.
I took a brief look, and I noticed that you seem to malloc a new node on
every push... Is your malloc special, does it pool memory? If not, malloc
could blow you away with heap contention. All you would have to do is pool
up the nodes ( if you don't already ) and you would see a performance
boost.
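The pooling suggestion can be as simple as a per-thread free list in front of malloc. A sketch (names and structure are mine): nodes go back on the list instead of being freed, so steady-state pushes and pops never touch the heap.

```c
#include <stdlib.h>
#include <stddef.h>

/* Sketch of node pooling: a per-thread free list recycling nodes instead
 * of calling malloc/free on every push/pop. Per-thread, so no atomics. */
struct pnode { struct pnode *next; const void *state; };

struct pool { struct pnode *free_list; };

static struct pnode *pool_get(struct pool *p)
{
    struct pnode *n = p->free_list;
    if (n) { p->free_list = n->next; return n; }   /* reuse a recycled node */
    return malloc(sizeof *n);                      /* pool empty: allocate  */
}

static void pool_put(struct pool *p, struct pnode *n)
{
    n->next = p->free_list;                        /* push back on the list */
    p->free_list = n;
}
```

A shared (rather than per-thread) pool would need the same lock-free LIFO treatment as the stack itself, ABA hazards included.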
> The thing I really like about SMR is the absence of reference counters:
> despite the overhead of the book keeping that SMR implies, it doesn't
> have the overhead of reference counting, which is (apparently) more
> contention-prone..
Yeah, SMR is nice in that respect. It's a little too inflexible IMHO, but it
gets the job done.
> > That's basically it, and I bet you can get it working from this
> > pseudo-code.
> > Also, notice how you don't have to use double-wide cas for the stack?
> Yeah, but that's not really surprising if you're using a garbage
> collector ;)
:)
P.S.
* All of the code here is really old; I simply can't post any of the new
stuff into the public domain. You should really tweak the Release function to
increment the ac_LfGarbage::Aba every time. The old code, as you can see,
does not increment Aba. It really should.
AddRef - Grabs a proxy reference count
Collect - Dumps nodes in the trash
Release - Dumps our count, and returns with any nodes that were in the
trash.
/** SenderX's old proxy collector
******************************/
typedef struct ac_LfNode_
{
struct ac_LfNode_ *pNext;
struct ac_LfNode_ *pGcNext;
const void *pState;
unsigned __int32 Flags;
} *ac_LfNode;
typedef struct ac_LfGarbage_
{
ac_LfNode pNode;
unsigned __int16 Proxy;
unsigned __int16 Aba;
} *ac_LfGarbage;
; Lock-Free Garbage Collect 64
; extern void pcpu_acLfGarbageCollect64( ac_LfGarbage pGarbage, ac_LfNode pNode );
pcpu_acLfGarbageCollect64 PROC
; Preserve
push esi
push ebx
mov esi, [esp + 12]
; Load the garbage
mov edx, [esi + 4]
mov eax, [esi]
; Load the node
mov ebx, [esp + 16]
pcpu_acLfGarbageCollect64_Retry:
; Prepare the Cas
mov ecx, edx
mov [ebx + 4], eax
; Try to link the new node
lock cmpxchg8b qword ptr [esi] ; Release
jne pcpu_acLfGarbageCollect64_Retry
; Restore
pop ebx
pop esi
ret
pcpu_acLfGarbageCollect64 ENDP
; References the garbage 64
; extern void pcpu_acLfGarbageAddRef64( ac_LfGarbage pGarbage );
pcpu_acLfGarbageAddRef64 PROC
; Preserve
push esi
push ebx
mov esi, [esp + 12]
; Load the garbage
mov edx, [esi + 4]
mov eax, [esi]
pcpu_acLfGarbageAddRef64_Retry:
; Prepare the Cas
mov ebx, eax
lea ecx, [edx + 1]
; Try to link the new node
lock cmpxchg8b qword ptr [esi] ; Acquire
jne pcpu_acLfGarbageAddRef64_Retry
; Restore
pop ebx
pop esi
ret
pcpu_acLfGarbageAddRef64 ENDP
; Releases the garbage 64
; extern ac_LfNode pcpu_acLfGarbageRelease64( ac_LfGarbage pGarbage );
pcpu_acLfGarbageRelease64 PROC
; Preserve
push esi
push ebx
mov esi, [esp + 12]
; Load the garbage
mov edx, [esi + 4]
mov eax, [esi]
pcpu_acLfGarbageRelease64_Retry:
; Prepare the Cas
mov ebx, eax
lea ecx, [edx - 1]
; Check for zero reference count
test ecx, ecx
jne pcpu_acLfGarbageRelease64_TryRelease
; Reset the garbage
mov ebx, 0
pcpu_acLfGarbageRelease64_TryRelease:
; Try to release the garbage
lock cmpxchg8b qword ptr [esi] ; Release
jne pcpu_acLfGarbageRelease64_Retry
; Check for zero reference count
test ecx, ecx
je pcpu_acLfGarbageRelease64_Done
mov eax, 0
pcpu_acLfGarbageRelease64_Done:
; Restore
pop ebx
pop esi
ret
pcpu_acLfGarbageRelease64 ENDP
>>I guess I just don't have a recent enough version of the Platform SDK
>>installed: I have an OOTB MSDN Visual Studio 6.0 Enterprise edition
>>installation - no updates applied at all..
> The code is ad hoc and seems to only like vc++ 6.0 sp5. What types of errors
> are you getting?
It chokes on every instance of the FORCEINLINE macro - every time you
use it, it gets a parse error:
...\appcore\include\atomic.h(144) : error C2054: expected '(' to follow
'FORCEINLINE'
>>It would still defeat the purpose of being lock-free, though :(
> Yes, under write contention it would; however, when reads come in parallel,
> the fast rw-lock just runs through a cas loop and no slow path is hit...
I.e., it would be "lock-poor" instead of "lock-free"..
;)
>>I really don't like adding a lock to the container
>>just for a corner case like that - but I guess I'll have to :(
> Locks can be used to help lock-free, you just have to get tricky...
<lol>
>>[snipped gc code]
>>Interesting - I'll see if I can make a C implementation out of it :)
> I already have a full blown test in C that I posted to this group... I can't
> seem to find the test, but I did find the assembler code for the AddRef,
> Collect, and *Release functions. Its at the end of this post.
>>[snipped stack code]
>>Did you take a look at my stack code?
> http://cvs.sourceforge.net/viewcvs.py/jail-ust/libcontain/stack.c?view=markup
>>No DCAS - just CAS and SMR. If you scrap SMR, you'd have to use a DCAS
>>or some other memory management (other than SMR) to prevent the ABA
>>problem, of course.
> I took a brief look, and I noticed that you seem to malloc a new node on
> every push... Is your malloc special, does it pool memory? If not, malloc
> could blow you away with heap contention. All you would have to do is pool
> up the nodes ( if you don't already ) and you would see a performance
> boost.
A memory-pool-driven malloc() implementation is next on my TODO list
(once I've got libthread ready to be committed: I'm still working on a
lock-free semaphore that freezes the current thread if it needs to wait,
but is still deadlock-safe.)
[snip]
>
> * All of the code here is really old; I simply can't post any of the new
> stuff into the public domain. You should really tweak the Release function to
> increment the ac_LfGarbage::Aba every time. The old code, as you can see,
> does not increment Aba. It really should.
[snipped actual code]
I'll have a closer look at the code when I have some time. It looks like
a more-or-less basic reference counter that happens to use CAS to
avoid locks (or rather: DCAS to avoid locks and ABA) but I'd have to at
least translate it to AT&T syntax to be able to read it more easily :)
(I really can't get used to Intel's right-to-left syntax with various
brackets, etc.)
When I have a bit of time, I'll get back to you :)
Thanks!
rlc
I guess you've seen this:
http://www.research.ibm.com/people/m/michael/pldi-2004.pdf
Ross.
rlc
I find it interesting that this article doesn't reference
Jeff Bonwick's and Jonathan Adams's paper:
http://www.usenix.org/event/usenix01/bonwick.html
Casper
--
Expressed in this posting are my opinions. They are in no way related
to opinions held by my employer, Sun Microsystems.
Statements on Sun products included here are not gospel and may
be fiction rather than truth.
> Could a thread reference more nodes than there are hazards per-thread?
Only for nodes it has created itself and has not yet linked into a list,
array, vector, binomial tree, etc.
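For readers unfamiliar with how a hazard pointer "references" a node, a minimal single-slot sketch in C11 follows. The names `head`, `hazard` and `acquire_head` are illustrative, not taken from libcontain or from Michael's SMR code; the point is the publish-then-revalidate loop:

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct node { struct node *next; int value; } node_t;

static _Atomic(node_t *) head;
static _Atomic(node_t *) hazard[1];   /* one slot per thread in real SMR */

/* SMR-style acquire: publish the pointer as a hazard, then re-read
 * the shared location.  If head changed between the load and the
 * publication, the node might already have been retired, so retry;
 * once the re-read matches, no scanner can reclaim it. */
static node_t *acquire_head(void) {
    node_t *h;
    do {
        h = atomic_load(&head);
        atomic_store(&hazard[0], h);     /* announce: do not reclaim h */
    } while (atomic_load(&head) != h);   /* raced with a pop: retry */
    return h;
}
```

This is also why one slot per thread usually suffices for a stack or queue traversal: at any instant the thread only needs one node pinned.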
>> That is basically also the reason why I don't need a DCAS in libcontain
>> (for the moment) but the problem is that I have a race condition in my
>> vector implementation - which I was hoping you wouldn't have in yours..
>> I.e., you don't have the same problem because you use a realloc to
>> resize your array, which you apparently assume to be atomic.
> Yes, the Array API is NOT thread-safe in any way, shape or form.
Right - that explains why I couldn't find that "atomic realloc
implementation" anywhere :)
>> but I'd be interested to know why using realloc on the array is not a
>> problem, in your opinion/implementation.
> I just never got around to putting the proxy gc code in there..
:)
>> I haven't had the time to look at your code in detail yet - mostly
>> because it doesn't compile /chez moi/ so I can't run the tests on it,
> You need platform sdk and vc++ 6.0. What errors/warnings do you get?
I have VC .NET. I don't remember the errors I got, but I'll have a look
when I have the time.
>> I've been pondering making a readers/writers lock and only counting
>> resize() as a writer, but that would kinda defeat the idea of being
>> lock-free..
> You can do a fast rw-lock with double-wide cas.
Would you care to elaborate that a bit?
Would that be a spinlock or would you use a couple of mutexes, split
binary semaphores, condvars or some other such lock somewhere?
(If you know of a lock-free implementation other than Joe Seigh's
rw-spinlock, I'd be interested :)
Thanks - I'll have a closer look at it when I have time (which I don't
really have at the moment...)
rlc
> SenderX wrote:
>>
>> However, you have to do a lot of bookkeeping IMHO: per-thread lists, bin
>> sorting them, iterating them and comparing against per-thread
>> hazard-pointers, tracking of the participating threads, and (I think) the
>> number of hazards per-thread is static...
> For the time being, it is. I have yet to find a "proper" way to reliably
> grow the number of per-thread hazard pointers without stopping the world..
It could be done but I don't think it should be done if it adds an extra level
of indirection. I don't think you need more than one hazard pointer per
thread unless you do some kind of lock-free recursive traversal of a tree or
something. You can always handle it as an error equivalent to stack overflow.
--
Joe Seigh
When you get lemons, you make lemonade.
When you get hardware, you make software.
I did an implementation that allows for any number of hazard pointers
per-thread, but it needed an extra memory barrier for the bit of extra state
that needed to be scanned by the polling thread. It used per-thread
hash-tables instead of hazard pointers. It hashed the pointer that needed to
be protected into an index of a per-thread table and incremented the
corresponding counter. This allows for a single hazard to protect multiple
objects. I guess I should call the algorithm hazard proxies instead of
hazard pointers.
> I don't think you need more than one hazard pointer per thread unless
> you do some kind of lock-free recursive traversal of a tree or something.
This is exactly why I implemented it.
> You can always handle it as an error equivalent to stack overflow.
Yes, that would make it a lot easier.
:)
You can get away with using per-thread hash-tables ( an unpublished
algorithm of mine ) to get an unlimited number of hazards per-thread.
However, it would add extra memory-barrier overhead and introduce some key
differences wrt the original SMR algorithm. One key difference is the
reduced granularity: one per-thread hash bucket ( I call them hazard
proxies ) could protect multiple objects, so a thread failure could hold up
a variable number of objects from being collected.
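Since the algorithm is unpublished, the following is only my guess at the shape of such a hazard-proxy table, not SenderX's actual code: hash the pointer to a fixed bucket and bump a counter, so one nonzero bucket protects every object that hashes there.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stddef.h>

#define NBUCKETS 64   /* assumed table size; any power of two works */

/* Per-thread "hazard proxy" table: instead of one hazard slot per
 * protected pointer, hash the pointer to a bucket and increment its
 * counter.  A nonzero bucket tells the collector that *some* object
 * hashing there is in use, so reclamation of everything in that
 * bucket must be deferred (the reduced granularity noted above). */
static __thread _Atomic uint32_t proxy[NBUCKETS];

static size_t bucket_of(const void *p) {
    return ((uintptr_t)p >> 4) % NBUCKETS;   /* crude hash */
}

static void proxy_protect(const void *p) {
    atomic_fetch_add(&proxy[bucket_of(p)], 1);
    /* A real implementation needs a memory barrier here before
     * re-validating p; that is the extra overhead mentioned above. */
}

static void proxy_release(const void *p) {
    atomic_fetch_sub(&proxy[bucket_of(p)], 1);
}
```

The counter (rather than a plain flag) is what lets one bucket protect several live references at once.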
>>> That is basically also the reason why I don't need a DCAS in libcontain
>>> (for the moment) but the problem is that I have a race condition in my
>>> vector implementation - which I was hoping you wouldn't have in yours..
>>> I.e., you don't have the same problem because you use a realloc to
>>> resize your array, which you apparently assume to be atomic.
>> Yes, the Array API is NOT thread-safe in any way, shape or form.
> Right - that explains why I couldn't find that "atomic realloc
> implementation" anywhere :)
>
>>> but I'd be interested to know why using realloc on the array is not a
>>> problem, in your opinion/implementation.
>> I just never got around to putting the proxy gc code in there..
> :)
>
>>> I haven't had the time to look at your code in detail yet - mostly
>>> because it doesn't compile /chez moi/ so I can't run the tests on it,
>> You need platform sdk and vc++ 6.0. What errors/warnings do you get?
> I have VC .NET. I don't remember the errors I got, but I'll have a look
> when I have the time.
I have a brand new portable demo library posted, you should check it out;
it's really tuned up nicely for SMP and/or HyperThreaded systems:
http://appcore.home.comcast.net/
(portable lock-free data-structures)
I believe you are referencing my old stuff for windows that I had up a year
or two ago...
>>> I've been pondering making a readers/writers lock and only counting
>>> resize() as a writer, but that would kinda defeat the idea of being
>>> lock-free..
>> You can do a fast rw-lock with double-wide cas.
> Would you care to elaborate that a bit?
> Would that be a spinlock or would you use a couple of mutexes, split
> binary semaphores, condvars or some other such lock somewhere?
No, it's not a spinlock. I got it to use CAS ( an unpublished algorithm ) for
the fast-paths and two condvars for the slow-paths. It was also cancellation
safe and allowed for timeouts.
> (If you know of a lock-free implementation other than Joe Seigh's
> rw-spinlock, I'd be interested :)
http://appcore.home.comcast.net/appcore/src/ac_rwspinlock_algo1_c.html
http://appcore.home.comcast.net/appcore/include/ac_rwspinlock_algo1_h.html
I think I am not the only person to come up with this algorithm. Notice this has
no specific ordering, unlike Joe's elegant algorithm...
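One common shape for such an unordered rw-spinlock packs a reader count and a writer bit into a single word updated by CAS. This is a generic sketch of the construction, not the linked appcore code, and it deliberately gives no fairness ordering (readers and writers simply race for the word):

```c
#include <stdatomic.h>
#include <stdint.h>

#define WRITER  0x80000000u   /* top bit: exclusive writer held */

static _Atomic uint32_t rw;   /* low 31 bits: reader count */

static void read_lock(void) {
    uint32_t old;
    do {
        /* Expect a word with no writer bit; if a writer holds the
         * lock the CAS will keep failing and we spin. */
        old = atomic_load(&rw) & ~WRITER;
    } while (!atomic_compare_exchange_weak(&rw, &old, old + 1));
}

static void read_unlock(void) { atomic_fetch_sub(&rw, 1); }

static void write_lock(void) {
    uint32_t expect = 0;      /* need zero readers and no writer */
    while (!atomic_compare_exchange_weak(&rw, &expect, WRITER))
        expect = 0;           /* CAS overwrote expect; reset and retry */
}

static void write_unlock(void) { atomic_store(&rw, 0); }
```

Since nothing arbitrates between waiting readers and writers, a steady stream of readers can starve a writer, which is exactly the ordering question that distinguishes this from Joe's algorithm.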
>> I will try to post the assembler very soon.
> Thanks - I'll have a closer look at it when I have time (which I don't
> really have the moment...
The assembler code for my old proxy collector was posted here:
http://groups-beta.google.com/group/comp.programming.threads/msg/a4a796e25a157ca1?hl=en
It got a bit garbled by Outlook though.
;(