It doesn't help that most programmers, even if they do know
something about multithreading, are blissfully unaware of
the specific synchronization techniques they are using or
relying on, techniques that either don't scale well or may
have subtle errors that will only show up in a heavily
multi-processed environment.
What might be useful is a list of the known synchronization
patterns and anti-patterns. Programmers could find their
implementations in the list and then realize there are better
patterns to use. So far I haven't found a list of these.
I don't want to write up my own list since I'd have to go
and write up an explanation of each technique. Note that
a lot of published papers compare their techniques to other
known techniques but they tend to pick and choose so the
comparisons are rather incomplete.
--
Joe Seigh
When you get lemons, you make lemonade.
When you get hardware, you make software.
How about this:
process A locks nodes
process B locks other nodes
process A locks some more
...
pattern ==> lock nest monster :-)
Too bad it's not well known to be broken :-)
Sean
exactly - I was thinking of it for the anti-pattern column - to help
let everyone know it was broken.
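For the writeup, the broken shape boils down to inconsistent lock ordering;
a minimal sketch, with made-up node/thread names and pthreads assumed:

/* two threads nest the same two locks in opposite order, so each ends
   up waiting for the lock the other already holds */
#include <pthread.h>

static pthread_mutex_t node_x = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t node_y = PTHREAD_MUTEX_INITIALIZER;

static void* thread_a( void *arg )
{
  pthread_mutex_lock( &node_x );      /* A locks node X ... */
  pthread_mutex_lock( &node_y );      /* ... then tries node Y */
  pthread_mutex_unlock( &node_y );
  pthread_mutex_unlock( &node_x );
  return 0;
}

static void* thread_b( void *arg )
{
  pthread_mutex_lock( &node_y );      /* B locks node Y ... */
  pthread_mutex_lock( &node_x );      /* ... then tries node X: deadlock */
  pthread_mutex_unlock( &node_x );
  pthread_mutex_unlock( &node_y );
  return 0;
}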
>
> Sean
The list of list traversal synchronization techniques
would be (very roughly):
1) conventional mutex for list
2) conventional rwlock for list
3) mutex or rwlock per node
4) reference counting
5) lock-free reference counting
6) optimized reference counting (patent 5,295,262)
7) lock-free (never deallocate deleted nodes)
8) lock-free (deallocate deleted nodes when reader count == 0)
9) RCU
10) SMR (modified hazard pointer algorithm)
11) RCU+SMR
12) proxy refcounting
13) proxy SMR or RCU+SMR
That's all I can think of offhand. There may be others.
There are performance, scalability, and other tradeoffs.
Naturally the best methods are towards the end of the
list.
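For reference, item 2) looks roughly like this (a minimal sketch using
POSIX rwlocks; the list and node types are just illustrative). Most of
the entries further down are about getting this lock, and eventually even
the atomic ops, off the readers' path:

#include <pthread.h>

typedef struct node_s
{
  struct node_s *next;
  void *data;
} node_t;

typedef struct list_s
{
  node_t *head;
  pthread_rwlock_t lock;     /* one rwlock protects the whole list */
} list_t;

/* readers share the lock, a writer excludes everyone; traversal never
   races with node deallocation, but readers still serialize against
   writers and bounce the lock's cache line */
void list_for_each( list_t *l, void (*fn)( void* ) )
{
  node_t *n;
  pthread_rwlock_rdlock( &l->lock );
  for ( n = l->head; n; n = n->next )
  {
    fn( n->data );
  }
  pthread_rwlock_unlock( &l->lock );
}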
I've called that "lock chaining" in the past, which seems like a fairly
decent vague description. ;-)
But then, I do like the sound of "lock nest monster", too...
--
Dave Butenhof, David.B...@hp.com
HP Utility Pricing software, POSIX thread consultant
Manageability Solutions Lab (MSL), Hewlett-Packard Company
110 Spit Brook Road, ZK2/3-Q18, Nashua, NH 03062
I almost forgot
14) tracing GC (e.g. Java lock-free linked lists)
>
> That's all I can think of offhand. There may be others.
> There are performance, scalability, and other tradeoffs.
> Naturally the best methods are towards the end of the
> list.
Except for that last one.
Would you like to add anything to this article?
http://en.wikipedia.org/wiki/Concurrency_pattern
Regards,
Markus
Would you like to add anything to this article?
http://en.wikipedia.org/wiki/Double-checked_locking
Regards,
Markus
A serviceable replacement for double-checked locking (at least for singleton
access) will often simply amount to using regular locking the first time (in
each thread) that you look up the singleton. Then you can cache a reference
or pointer per thread. This will work for singletons that are created once and
not destroyed while they can still be referenced. That sounds like a
variation of observable, smart pointer, or whatever.
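Something along these lines, assuming POSIX threads and a compiler with
__thread; the names are only illustrative:

#include <pthread.h>
#include <stdlib.h>

typedef struct singleton_s { int value; } singleton_t;

static singleton_t *g_instance;                  /* shared, guarded by g_lock */
static pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;
static __thread singleton_t *tls_instance;       /* per-thread cached pointer */

singleton_t* get_singleton( void )
{
  /* fast path: no locking once this thread has cached the pointer */
  if ( ! tls_instance )
  {
    /* slow path: plain locking, taken at most once per thread */
    pthread_mutex_lock( &g_lock );
    if ( ! g_instance )
    {
      g_instance = malloc( sizeof( *g_instance ) );
      g_instance->value = 42;
    }
    tls_instance = g_instance;                   /* cache it for this thread */
    pthread_mutex_unlock( &g_lock );
  }
  return tls_instance;
}

The mutex gives you the memory visibility that the broken double-checked
idiom was trying to get for free.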
Regards, John.
"Markus Elfring" <Markus....@web.de> wrote in message
news:3u3fg0F...@individual.net...
I probably shouldn't have used the term pattern, which is
so overused as to be totally meaningless. Reader/writer
isn't a pattern. It's a problem, as in "the readers/writers
problem", and has multiple solutions which are probably more
accurately characterized as algorithms than patterns.
You can blame the debasement of the term pattern on some
idiotic academics who have nothing better to do than to
publish dozens and dozens of trivial patterns which, if
they called them algorithms, would make it immediately
obvious how stupid their ideas were in the first place.
Another hat trick. It's secret at this point until I decide what to
do with it (assuming it works). It has the advantage of being
more portable and having fewer patent entanglement problems (unless
I patent it, of course).
That's the route I took with VZOOM; the memory visibility requirements can
be guaranteed to work via POSIX mutex semantics. So, I can add yet another
pattern to your list:
16) Distributed Proxy Ref-Counting (DPRC)?
Each thread has an array of reference counters that track the total number
of proxy references there are to an object. The sum of all threads' counters is
the total proxy count for an object. Two or more objects are allowed to share
the same per-thread reference count. Adjustments to the count can be made
without any atomic ops and/or membars. Repeatedly iterating over a large
shared circular linked-list could be as simple as this:
int loops = 0;
int loop_sync = 8192;
item *f, *nx;
f = vz_persistent_inc( &col.front );
while ( f )
{
  nx = vz_persistent_inc( &f->next );
  do_work( f );
  if ( ! ( (++loops) % loop_sync ) )
  {
    vz_sync();
  }
  vz_persistent_dec( f );
  f = nx;
}
This simple example happens to voluntarily execute a synchronization (membar);
however, synchronizations can also be tracked by relying on highly non-portable
operating system information, etc... The reason I use the term "persistent" is
that references can exist across calls to vz_sync() and/or operating system
induced synchronizations, and the number of references a thread can acquire
is basically unbounded. IMHO, this is what sets VZOOM apart from
most of the schemes that are currently out there. I will be able to provide
some working examples that are based on various Distributed Proxy
Ref-Counting schemes in early December so you can see VZOOM in action...
> and having less patent entanglement problems (unless
> I patent it of course).
Yes. It seems that the claims for most of the relevant patents in this area
are narrowed down to a point where you can slide around "some of them",
indeed... For instance, I can claim that VZOOM requires a thread/processor
in a plurality of threads/processors to periodically, episodically or
randomly execute a "memory barrier". This simple claim is very broad and
doesn't need to be narrowed down to support the patent teachings. It also
did not infringe on existing relevant patent claims...
;)
Damned rabbit. It was a little trickier than I thought. :)
>
>
> That's the route I took with VZOOM; the memory visibility requirements can
> be guaranteed to work via. POSIX mutex semantics. So, I can add yet another
> pattern to your list:
>
>
> 16) Distributed Proxy Ref-Counting (DPRC)?
>
Too many things on the list. :) The only real beneficiaries of the one
I'm working on are architectures which were too short-sighted to put in
a double-wide compare-and-swap or LL/SC. One guess which architecture that
is. They do have KCSS, but that's only obstruction-free.
Have you got more use cases besides list traversal in mind?
Can any thread-safe library for well-known data structures show practical
opportunities?
How do they fit to alternative categorizations for server architectures and
models?
1. single-threaded
2. process-per-request
3. thread-per-request
4. thread-per-socket
5. thread-per-task
6. thread-per-API
7. thread-per-servant
8. thread-per-object
9. thread pool
How much server software do you know where such details can be configured as a
policy parameter?
Examples:
- CORBA
- Apache multi-processing module
Can we hope that the growing availability of multicore processors will also
result in improved application cores?
How much will the design and corresponding strategy selection be affected by
this technical progress?
Regards,
Markus
--
urgh...
this sentence from the article is a bit off:
"Due to the semantics of most programming languages, the code generated
by the compiler is allowed to update the shared variable to point to a
partially constructed object before A has finished performing the
initialization"
It makes it sound like a compiler code generation (maybe even
optimization) problem, when really it is a memory reading/writing
problem. But I'm not sure I could come up with something that was
correct and/or comprehensible (not to mention concise) either - it is
easy to see wrong, hard to write right, or something like that. Maybe
I'll give it a try.
At least they reference the Meyers / Alexandrescu article.
Tony
List traversal is just a special case. People are more familiar with
lock-free linked lists than with other types of structures, so
I just used those as an example. The techniques could apply to most
data structures.
> How do they fit to alternative categorizations for server architectures and
> models?
> 1. single-threaded
> 2. process-per-request
> 3. thread-per-request
> 4. thread-per-socket
> 5. thread-per-task
> 6. thread-per-API
> 7. thread-per-servant
> 8. thread-per-object
> 9. thread pool
Well, threads per actions that are independent, i.e. can be run concurrently
without too much synchronization. That's always been the goal.
> How much server software do you know where such details can be configured as a
> policy parameter?
> Examples:
> - CORBA
> - Apache multi-processing module
Apache is probably a bad example since they screwed up their run time
library, e.g. the win32 part at least.
> Can we hope that the growing availability of multicore processors will also
> result in improved application cores?
We can hope but I don't think it's going to happen by itself.
> How much will the design and corresponding strategy selection be affected by
> this technical progress?
The industry has a bit of a herd mentality here. Nobody will make a move until
someone else moves first.
--
Joe Seigh
When you get lemons, you make lemonade.
When you get hardware, you make software.
The rabbit has a name, or an acronym anyway: rcpc, for (yet another)
reference counted proxy collector. Turns out not to be crowbar
proof so it's not worth patenting or anything. You can break it
if you hold the read lock long enough to wrap a 31 or 63 bit
sequence counter under certain conditions. So 63 bits would be
safer.
It's out on http://atomic-ptr-plus.sourceforge.net/
No atomix.h source for sparc which is probably the only place
where this would be useful. Don't have my SB100 anymore. It
was taking up too much room.
I think with the way I have the compare handle wraparounds, they're
effectively 30 and 62 bit sequence counters.
>
> It's out on http://atomic-ptr-plus.sourceforge.net/
> No atomix.h source for sparc which is probably the only place
> where this would be useful. Don't have my SB100 anymore. It
> was taking up too much room.
>
>
Hmm... It's possible to use the expired patent 5,295,262 trick
for 32 bit machines with double wide CAS, and the refcounting
trick for machines with LL/SC. So we could make proxy refcounting
work everywhere it matters.
Ok, a little bit kludgy but done. rcpc 0.0.2
I just very briefly skimmed over the code. Your sequence count/search scheme
seems like a pretty neat technique indeed. It seems like you are
implementing the collector by back-linking and recording the sequence
numbers into a shared linked-list. You are applying some nifty reference
counting tricks in the defer function that seem to involve the even or odd
parity of a collector's reference count in relation to a sequence count.
Humm, I can't quite figure out why exclusive access to proxy collector
objects is required... I have to study your design some more.
> Turns out not to be crowbar
> proof so it's not worth patenting or anything. You can break it
> if you hold the read lock long enough to wrap a 31 or 63 bit
> sequence counter under certain conditions.
Well, in order for it to break one thread would have to read a sequence
number and then stop inside its collected region long enough for another
thread to read the "exact same sequence number". Then both threads would
have to concurrently iterate through the "compare logic" in a "precise
order". Humm...
Still, I do see your point. Murphy's Law dictates that a couple of
application threads will execute in the exact order necessary to cause a
problem for your algorithm. Of course it will be a critical server
application and the problem will happen at peak usage hours and will
probably cause some major problems...
;)
> So 63 bits would be safer.
Ahhh yeah, 63 bits; that's a very nifty counting trick you whipped up.
:)
> It's out on http://atomic-ptr-plus.sourceforge.net/
> No atomix.h source for sparc which is probably the only place
> where this would be useful. Don't have my SB100 anymore. It
> was taking up too much room.
I finally setup a site for VZOOM at:
http://home.comcast.net/~vzoom/
Various demo applications will be available for download before 2006. Some
documentation will be posted sometime in January 2007. I am sorry for the
delay, but I am just so damn busy! Oh well. Anyway, I really do believe you
will find the techniques I used to implement VZOOM to be pretty nifty. IMHO,
it's an elegant and straightforward algorithm. I just hope that you are not
the only one out there that can understand how it all works...
;)
The parity bit is just a way of getting a reserved value out of a counter
that can wrap. If I make bits 63(31)..1 the counter and the parity bit
a flag then I can atomically test for count==0 && flag==0. That's used
for the reference count since for a long lived proxy object the reference
count decrements could wrap. For the sequence counts it's used to designate
an uninitialized sequence count. After a while those could wrap.
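Spelled out, the encoding is roughly this (a simplified sketch, not the
actual atomic-ptr-plus source):

/* bits 63(31)..1 hold the count, bit 0 is the flag; a single atomic
   load compared against zero tests count == 0 && flag == 0, and the
   count moves in steps of 2 so wrapping never disturbs the flag bit */
#define PCW_FLAG     0x1UL
#define PCW_ONE_REF  0x2UL

typedef unsigned long pcw_t;     /* 32 or 64 bits depending on the target */

static int pcw_is_free( pcw_t w )
{
  return w == 0;                 /* count zero and flag clear, in one test */
}

static pcw_t pcw_add_ref( pcw_t w )
{
  return w + PCW_ONE_REF;        /* the carry out of the top bit is dropped,
                                    the flag bit is untouched */
}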
Deleted objects have to be GC'd (deallocated) in FIFO order, i.e. the
same order they were deleted in. That's to allow safe lock-free traversal.
You don't want the situation where thread 1 deletes object A, thread
2 deletes object B, queues it for GC, and then thread 1 queues object
A for GC. You could get a situation where objects got deallocated while
they were still reachable. So both the deletion and queueing for GC must
be done while holding exclusive access to maintain FIFO order.
You don't notice this with RCU because it has larger granularity. No deallocations
are done while a reader is in the RCU critical region.
>
>
>
>
>
>>Turns out not to be crowbar
>>proof so it's not worth patenting or anything. You can break it
>>if you hold the read lock long enough to wrap a 31 or 63 bit
>>sequence counter under certain conditions.
>
>
> Well, in order for it to break one thread would have to read a sequence
> number and then stop inside its collected region long enough for another
> thread to read the "exact same sequence number". Then both threads would
> have to concurrently iterate through the "compare logic" in a "precise
> order". Humm...
>
> Still, I do see your point. Murphy's Law dictates that a couple of
> application threads will execute in the exact order necessary to cause a
> problem for your algorithm. Of course it will be a critical server
> application and the problem will happen at peak usage hours and will
> probably cause some major problems...
>
> ;)
You can break it in a fairly deterministic way. One thread gets a read lock
and doesn't release it, another thread does a write so there are two proxy
objects. Enough readers can come along so the sequence on the current
proxy object wraps relative to the older proxy object. You're supposed
to see the older proxy object dequeued long before that could happen.
>
>
>
>>So 63 bits would be safer.
Particularly since a 63 bit counter isn't likely to wrap on current hardware.
>
>
> Ahhh yeah, 63 bits; that's a very nifty counting trick you whipped up.
>
> :)
>
[...]
>
> I finally setup a site for VZOOM at:
>
> http://home.comcast.net/~vzoom/
>
> Various demo applications will be available for download before 2006. Some
> documentation will be posted sometime in January 2007. I am sorry for the
> delay, but I am just so damn busy! Oh well. Anyway, I really do believe you
> will find the techniques I used to implement VZOOM to be pretty nifty. IMHO,
> its an elegant and straightforward algorithm. I just hope that you are not
> the only one out there that can understand how it all works...
>
> ;)
>
Okay, I'll try to take a look at it.
Exactly. VZOOM is similar to RCU in that respect. Strict FIFO ordering is
not required because reference counters are only checked by the polling
logic after it has determined that each registered thread has recently
executed a synchronization operation.
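A stripped-down illustration of the distributed counting side (this is not
the actual VZOOM code, just per-thread counters that the polling logic sums
after its quiescence check; the sizes and names are invented):

#define MAX_THREADS 64

typedef struct dist_count_s
{
  /* one slot per registered thread; padding to a cache line omitted */
  long count[MAX_THREADS];
} dist_count_t;

/* called by thread 'tid' when it acquires (+1) or releases (-1) a proxy
   reference; a thread only ever touches its own slot, so a plain,
   naturally aligned store is enough - no atomic ops, no membars */
void dist_count_adjust( dist_count_t *c, int tid, long delta )
{
  c->count[tid] += delta;
}

/* called by the polling logic only after it has determined that every
   registered thread has executed a synchronization operation since the
   counters were last adjusted */
long dist_count_sum( dist_count_t *c, int nthreads )
{
  long sum = 0;
  int i;
  for ( i = 0; i < nthreads; ++i )
  {
    sum += c->count[i];
  }
  return sum;    /* zero means no outstanding proxy references */
}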
> You can break it in a fairly determistic way. One thread gets a read
> lock
> and doesn't release it, another thread does a write so there's two proxy
> objects. Enough readers can come along so the sequence on the current
> proxy object wraps relative to the older proxy object. You're supposed
> to see the older proxy object dequeued long before that could happen.
Ahhh, so a dangerous situation could occur if a thread that did not release
its read lock observes a fresh sequence number that is less than or equal to
the one it read... Is that basically correct?
>> I finally setup a site for VZOOM at:
>>
>> http://home.comcast.net/~vzoom/
>>
>> Various demo applications will be available for download before 2006.
>> Some documentation will be posted sometime in January 2007. I am sorry
>> for the delay, but I am just so damn busy! Oh well. Anyway, I really do
>> believe you will find the techniques I used to implement VZOOM to be
>> pretty nifty. IMHO, its an elegant and straightforward algorithm. I just
>> hope that you are not the only one out there that can understand how it
>> all works...
>>
>> ;)
> Okay, I'll try to take a look at it.
You might want to take a quick look at a particular algorithm that I used to
benchmark against VZOOM. It's your basic proxy collector scheme. It doesn't
require DWCAS. The basic trick is to align collector structures on a 128 or
greater byte boundary. The extra space is used as a master reference count.
Differential counting between the master count and a per-collector private
count, along with back-linked reference counting, is used to manage the
lifetime of the collectors and to ensure strict FIFO ordering. Humm, I
wonder if you have seen anything that is similar to my design?
The following sample implementation should compile fine with MSVC 6+:
pc_sample.c
---------
#include <stdlib.h>
#include <stdio.h>
/* proxy collector structures */
typedef void ( *pc_fp_dtor_t ) ( void* );
typedef struct pc_dtor_s pc_dtor_t;
typedef struct pc_s pc_t;
typedef struct pc_deferred_s pc_deferred_t;
struct pc_dtor_s
{
pc_fp_dtor_t fp_dtor;
void *state;
};
struct pc_deferred_s
{
pc_deferred_t *next; /* back-link list */
pc_t *owner;
pc_dtor_t dtor;
int refs; /* private refs */
int back_refs; /* back-link list refs */
void *base_mem;
};
struct pc_s
{
pc_deferred_t *pc_anchor; /* active collector & master refs */
};
/* proxy collector user api */
int pc_alloc( pc_t* );
void pc_free( pc_t* );
pc_deferred_t* pc_inc( pc_t* );
void pc_dec( pc_deferred_t* );
void pc_defer( pc_t*, pc_fp_dtor_t, void* );
/* proxy collector system api */
static pc_deferred_t* pc_deferred_sys_alloc( pc_t*, int );
static void pc_deferred_sys_free( pc_deferred_t* );
static void pc_deferred_sys_dec( pc_deferred_t* );
static int atomic_casptr( void volatile *dest, void const *_cmp, void const *_xchg );
static void* atomic_xchgptr( void volatile *dest, void const *_xchg );
static void* atomic_xaddptr( void volatile *dest, void const *_xadd );
#define PC_SYS_PTRREF_DEPTH 128
#define PC_SYS_ROUNDUP( p1, p2 ) \
( ( (p1) % (p2) ) ? (p1) + ( (p2) - ( (p1) % (p2) ) ) : (p1) )
#define PC_SYS_GETPTR( mp_p1 ) \
( (pc_deferred_t*)( ((int)(mp_p1)) & (-PC_SYS_PTRREF_DEPTH) ) )
#define PC_SYS_GETREFS( mp_p1 ) \
( ( ((int)(mp_p1)) & (PC_SYS_PTRREF_DEPTH - 1) ) )
#define atomic_casword( mp_dest, mp_cmp, mp_xchg ) \
( atomic_casptr \
( (void volatile*)(mp_dest), \
(void const*)(mp_cmp), \
(void const*)(mp_xchg) ) )
#define atomic_xchgword( mp_dest, mp_xchg ) \
( (int)atomic_xchgptr \
( (void volatile*)(mp_dest), \
(void const*)(mp_xchg) ) )
#define atomic_xaddword( mp_dest, mp_xadd ) \
( (int)atomic_xaddptr \
( (void volatile*)(mp_dest), \
(void const*)(mp_xadd) ) )
/* proxy collector user api impl */
int pc_alloc( pc_t *_this )
{
_this->pc_anchor = pc_deferred_sys_alloc( _this, 1 );
return 0;
}
void pc_free( pc_t *_this )
{
pc_defer( _this, 0, 0 );
}
void pc_defer( pc_t *_this, pc_fp_dtor_t fp_dtor, void *state )
{
/* obtain a new collector and swap it with the current one */
pc_deferred_t
*pc_new = ( fp_dtor ) ? pc_deferred_sys_alloc( _this, 2 ) : 0,
*pc_raw = atomic_xchgptr( &_this->pc_anchor, pc_new ),
*pc_old = PC_SYS_GETPTR( pc_raw );
/* back-link the old collector to the new one */
pc_old->next = pc_new;
pc_old->dtor.fp_dtor = fp_dtor;
pc_old->dtor.state = state;
/* obtain the correct count by incrementing
the old collectors private ref count
by the old collectors master ref count */
if ( ! PC_SYS_GETREFS( pc_raw ) ||
atomic_xaddword
( &pc_old->refs,
PC_SYS_GETREFS( pc_raw ) ) ==
-PC_SYS_GETREFS( pc_raw ) )
{
pc_deferred_sys_dec( pc_old );
}
}
pc_deferred_t* pc_inc( pc_t *_this )
{
/* inc master count */
return PC_SYS_GETPTR( atomic_xaddword( &_this->pc_anchor, 1 ) );
}
void pc_dec( pc_deferred_t *_this )
{
pc_deferred_t *cmp = _this->owner->pc_anchor;
do
{
if ( PC_SYS_GETPTR( cmp ) != _this )
{
/* the collector was swapped, we need to dec the private refs */
if ( atomic_xaddword( &_this->refs, -1 ) == 1 )
{
pc_deferred_sys_dec( _this );
}
return;
}
/* we have the same collector so try and
dec the master count */
} while ( ! atomic_casword
( &_this->owner->pc_anchor,
cmp,
((int)cmp) - 1 ) );
}
/* proxy collector system api impl */
pc_deferred_t* pc_deferred_sys_alloc( pc_t *owner, int back_refs )
{
/* align collector on ref boundary */
void *base_mem =
malloc( PC_SYS_ROUNDUP( sizeof( pc_deferred_t ),
PC_SYS_PTRREF_DEPTH ) + (PC_SYS_PTRREF_DEPTH - 1) );
pc_deferred_t *_this =
(pc_deferred_t*)((((int)base_mem) +
(PC_SYS_PTRREF_DEPTH - 1)) & -((int)PC_SYS_PTRREF_DEPTH) );
_this->base_mem = base_mem;
_this->owner = owner;
_this->dtor.fp_dtor = 0;
_this->dtor.state = 0;
_this->next = 0;
_this->refs = 0;
_this->back_refs = back_refs;
return _this;
}
void pc_deferred_sys_free( pc_deferred_t *_this )
{
/* process dtor */
if ( _this->dtor.fp_dtor )
{
_this->dtor.fp_dtor( _this->dtor.state );
}
/* 100% safe to reclaim collector memory */
free( _this->base_mem );
}
void pc_deferred_sys_dec( pc_deferred_t *_this )
{
/* the private refs are zero, we now need to dec
the back-link refs */
if ( atomic_xaddword( &_this->back_refs, -1 ) == 1 )
{
/* this collector has dropped to zero,
so queue it locally */
pc_deferred_t *next, *cur = _this->next, *dtorf = _this, *dtorb = _this;
_this->next = 0;
while ( cur )
{
next = cur->next;
/* we now need to dec the back-link refs of
   this collector */
if ( atomic_xaddword( &cur->back_refs, -1 ) != 1 )
{
break;
}
/* this collector has dropped to zero,
so queue it locally */
cur->next = 0;
dtorb->next = cur;
dtorb = cur;
cur = next;
}
/* process the dropped collectors in FIFO order */
cur = dtorf;
while ( cur )
{
next = cur->next;
pc_deferred_sys_free( cur );
cur = next;
}
/* that's all folks! :) */
}
}
__declspec( naked ) int atomic_casptr( void volatile *dest, void const *_cmp, void const *_xchg )
{
_asm
{
mov ecx, [esp + 4]
mov eax, [esp + 8]
mov edx, [esp + 12]
lock cmpxchg [ecx], edx
jne fail
mov eax, 1
ret
fail:
xor eax, eax
ret
}
}
__declspec( naked ) void* atomic_xchgptr( void volatile *dest, void const *_xchg )
{
_asm
{
mov ecx, [esp + 4]
mov eax, [esp + 8]
xchg [ecx], eax
ret
}
}
__declspec( naked ) void* atomic_xaddptr( void volatile *dest, void const *_xadd )
{
_asm
{
mov ecx, [esp + 4]
mov eax, [esp + 8]
lock xadd [ecx], eax
ret
}
}
/* a quick sample application */
static void prv_defer( void *state );
static pc_t pg_pc;
int main( void )
{
pc_deferred_t *pcd1, *pcd2;
pc_alloc( &pg_pc );
/* collected in region 0 */
pc_defer( &pg_pc, prv_defer, (void*)1 );
pc_defer( &pg_pc, prv_defer, (void*)2 );
pcd1 = pc_inc( &pg_pc );
/* collected in region 1 */
pc_defer( &pg_pc, prv_defer, (void*)3 );
pcd2 = pc_inc( &pg_pc );
/* collected in region 2 */
pc_defer( &pg_pc, prv_defer, (void*)4 );
pc_defer( &pg_pc, prv_defer, (void*)5 );
pc_dec( pcd2 );
/* collected in region 1 */
pc_defer( &pg_pc, prv_defer, (void*)6 );
pc_defer( &pg_pc, prv_defer, (void*)7 );
pc_dec( pcd1 );
/* collected in region 0 */
pc_defer( &pg_pc, prv_defer, (void*)8 );
pc_defer( &pg_pc, prv_defer, (void*)9 );
pc_free( &pg_pc );
return 0;
}
void prv_defer( void *state )
{
printf( "my_dtor - %i\n", (int)state );
}
Argh, cut-and-paste might have messed up the code so you may not be able to
just copy-and-paste it. Here is a link to the sample code:
http://home.comcast.net/~vzoom/demos/pc_sample.c
just in case you have any problems.
The compare is changed from x < y to (x - y) < 0 to handle wrap, but that
uses up 1 bit. The parity bit uses up the other bit.
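So the ordering test ends up as something like (illustrative; it leans on
two's-complement wraparound of the subtraction):

/* wrap-tolerant ordering test: true when x is "before" y, valid as long
   as the two sequence numbers are less than half the counter range apart */
static int seq_before( long x, long y )
{
  return ( x - y ) < 0;    /* instead of the naive x < y */
}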
>
>
>
>>>I finally setup a site for VZOOM at:
>>>
>>>http://home.comcast.net/~vzoom/
>>>
[...]
>
>
> You might want to take a quick look a particular algorithm that I used to
> benchmark against VZOOM. Its your basic proxy collector scheme. It doesn't
> require DWCAS. The basic trick is to align collector structures on a 128 or
> greater byte-boundary. The extra space is used as a master reference count.
> Differential counting between the master count and a per-collector private
> count along with back-linked reference counting are used to manage the
> lifetime of the collectors and to ensure strict FIFO ordering. Humm, I
> wonder if you have you seen anything that is similar to my design?
>
This is stuff you worked on from way back? I don't think anyone else
has been doing proxy type collectors.
Ahhh, thanks for that. I think I now understand how your rabbit performs
some of its tricks...
:)
>> You might want to take a quick look a particular algorithm that I used to
>> benchmark against VZOOM. Its your basic proxy collector scheme. It
>> doesn't require DWCAS. The basic trick is to align collector structures
>> on a 128 or greater byte-boundary. The extra space is used as a master
>> reference count. Differential counting between the master count and a
>> per-collector private count along with back-linked reference counting are
>> used to manage the lifetime of the collectors and to ensure strict FIFO
>> ordering. Humm, I wonder if you have you seen anything that is similar to
>> my design?
>>
>
> This is stuff you worked on from way back?
I designed a preliminary version of the basic idea in mid 2004. Then finally
cleaned it up enough so that it could be used as a benchmarking tool. I made
a very brief mention of the idea in this post:
http://groups.google.com/group/comp.arch/msg/7f56cfc0541ec7f0?hl=en&
Actually, some of my oldest stuff that went public was the DWCAS based
collector that did not rely on back-linking, but instead pushed deferred
objects onto a lock-free LIFO whose anchor was embedded with a reference
count:
/* two contiguous words */
struct old_pc_anchor_s
{
int refs; /* refs */
pc_deferred_t *defer; /* lock-free LIFO */
};
http://groups.google.com/group/comp.programming.threads/msg/9d15cb75717124e1
http://groups.google.com/group/comp.programming.threads/msg/e547340c08220b6b
(posts about the old algorithm)
Deferred object list got collected when refs == 0, FIFO ordering not
required...
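Roughly like this (a from-memory sketch rather than the code from those
posts; dwcas() stands in for a double-word compare-and-swap such as
cmpxchg8b):

typedef struct old_pc_anchor_s old_pc_anchor_t;

/* assumed primitive: atomically replace *dest with *xchg iff *dest == *cmp */
extern int dwcas( old_pc_anchor_t *dest,
                  old_pc_anchor_t *cmp,
                  old_pc_anchor_t *xchg );

/* reader-side release: the last reader out (refs drops to zero) detaches
   the whole deferred LIFO and collects it; a torn read of the two-word
   anchor simply fails the dwcas and retries */
static void old_pc_release( old_pc_anchor_t *anchor )
{
  old_pc_anchor_t cmp, xchg;
  do
  {
    cmp = *anchor;
    xchg = cmp;
    xchg.refs = cmp.refs - 1;
    if ( xchg.refs == 0 )
    {
      xchg.defer = 0;     /* take the deferred list with us */
    }
  } while ( ! dwcas( anchor, &cmp, &xchg ) );
  if ( xchg.refs == 0 && cmp.defer )
  {
    /* run the dtors for cmp.defer here; FIFO ordering not required */
  }
}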
> I don't think anyone else has been doing proxy type collectors.
You know, I think you're correct! I believe we are creating some novel
solutions here, indeed... I just wonder when the emperor will decide to wear
his new clothes...
lol
;)
I don't know. There's a lot of NIH syndrome out there. The problem with
scalability is that the applications where scalability is an issue tend to be
really large, like databases and such. And you really need to have a
success story to get the ball rolling as everyone rushes to catch up with
the competition. APIs by themselves are a hard sell. Plus, I don't know
if anyone noticed, but there's no profit margin in open source software, so
if anyone is waiting for me to buy an x86-64 or a Niagara processor so the
APIs can be ported there, they're going to have a long wait. :)
The fun part is I think I've established enough prior art here. So
eventually someone will have to do all the work on a lock-free solution
themselves. They just won't have an exclusive competitive advantage since
all their competitors will be able to copy their methods.
DOH!
I made a mistake in the code when I was stripping out some of my library
specific stuff. The cas loop in the pc_dec( ... ) function did not reload
the cmp var when a cas failed.
Here is the fixed code:
http://home.comcast.net/~vzoom/demos/pc_sample.c
Sorry!
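For reference, the corrected loop reloads the anchor each time around,
something like this (a sketch of the fix; the authoritative version is at
the link above):

void pc_dec( pc_deferred_t *_this )
{
  pc_deferred_t *cmp;
  do
  {
    /* reload the anchor on every iteration so a failed cas
       retries against the current value */
    cmp = _this->owner->pc_anchor;
    if ( PC_SYS_GETPTR( cmp ) != _this )
    {
      /* the collector was swapped, so dec the private refs */
      if ( atomic_xaddword( &_this->refs, -1 ) == 1 )
      {
        pc_deferred_sys_dec( _this );
      }
      return;
    }
    /* same collector, so try to dec the master count */
  } while ( ! atomic_casword
            ( &_this->owner->pc_anchor,
              cmp,
              ((int)cmp) - 1 ) );
}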
Is more drumming and "advertisement" needed, as demonstrated in the article
"Software and the Concurrency Revolution" by Herb Sutter and James Larus?
http://acmqueue.com/modules.php?name=Content&pa=showpage&pid=332
Regards,
Markus
Of course I would find out about it after I no longer have
an SB100.
Humm, I wonder how my proxy collector code would do:
http://home.comcast.net/~vzoom/demos/pc_sample.c
I should really read the rules and regulations to see how they feel about
submitting any algorithms that have patent(s) pending...
> Of course I would find out about it after I no longer have
> an SB100.
Murphy's law strikes again! Humm, I do have a version of VZOOM that
uses only pthread mutexes/condvars and only relies on dependent-load ordering
for readers. Do you know how good/complete the pthread API is on Solaris 10?
Also, do you know if Niagara processors need dependent-load membars like
Alphas do?