
Rise of the multicore machines


Joe Seigh

Nov 15, 2005, 11:04:49 AM
With the introduction of Sun's T1 (Niagara) processor we
can add yet another vendor to the list of vendors fretting
about whether applications will exploit the multiprocessing capabilities
of these processors so they can sell more of them.

It doesn't help that most programmers, even if they do know
something about multithreading, are blissfully unaware of
the specific synchronization techniques they are using or
relying on. Techniques that either don't scale well or may
have subtle errors that will only show up in a heavily
multi-processed environment.

What might be useful is a list of the known synchronization
patterns and anti-patterns. Programmers could find their
implementations in the list and then realize there are better
patterns to use. So far I haven't found a list of these.
I don't want to write up my own list since I'd have to go
and write up an explanation of each technique. Note that
a lot of published papers compare their techniques to other
known techniques but they tend to pick and choose so the
comparisons are rather incomplete.

--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.

Doug

Nov 15, 2005, 2:14:18 PM
Shame - I'd love to see such a list. Even a list of names would be
great - google could do the rest.

Joe Seigh

Nov 15, 2005, 3:20:15 PM
Doug wrote:
> Shame - I'd love to see such a list. Even a list of names would be
> great - google could do the rest.
>
That's the problem. I don't think they all have standard
well known names. What's the name for locking individual
nodes as you traverse a list? I don't know but it's been
around for at least 30 years.

John Hickin

Nov 15, 2005, 5:03:27 PM

"Joe Seigh" <jsei...@xemaps.com> wrote in message
news:TYmdnQYwnOUr3-fe...@comcast.com...

> Doug wrote:
> > Shame - I'd love to see such a list. Even a list of names would be
> > great - google could do the rest.
> >
> That's the problem. I don't think they all have standard
> well known names. What's the name for locking individual
> nodes as you traverse a list? I don't know but it's been
> around for at least 30 years.
>

How about this:

process A locks nodes
process B locks other nodes
process A locks some more
...
pattern ==> lock nest monster :-)


gottlo...@gmail.com

Nov 15, 2005, 5:18:26 PM
DCLP is getting to be well known.

Sean Kelly

Nov 15, 2005, 7:07:00 PM
gottlo...@gmail.com wrote:
> DCLP is getting to be well known.

Too bad it's not well known to be broken :-)


Sean

gottlo...@gmail.com

Nov 15, 2005, 11:28:12 PM

exactly - I was thinking of it for the anti-pattern column - to help
let everyone know it was broken.

>
> Sean

Joe Seigh

Nov 16, 2005, 7:35:48 AM
No, not even a lock mess monster :)
You acquire the locks in the order you traverse the list.
Also you release a lock after you acquire the lock for the
next node.
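
Roughly, that per-node locking looks like the following hand-over-hand
sketch with POSIX mutexes (illustrative only, not code from any of the
libraries mentioned in this thread):

#include <pthread.h>

struct node {
    pthread_mutex_t lock;
    struct node *next;
    int value;
};

/* visit every node while never holding more than two locks at once;
   the next node's lock is taken before the current one is released */
void traverse( struct node *head, void (*visit)( struct node * ) )
{
    struct node *cur = head, *next;

    if ( !cur )
        return;

    pthread_mutex_lock( &cur->lock );
    while ( cur )
    {
        visit( cur );
        next = cur->next;
        if ( next )
            pthread_mutex_lock( &next->lock );  /* acquire next first... */
        pthread_mutex_unlock( &cur->lock );     /* ...then release current */
        cur = next;
    }
}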

The list of list traversal synchronization techniques
would be (very roughly)

1) conventional mutex for list
2) conventional rwlock for list
3) mutex or rwlock per node
4) reference counting
5) lock-free reference counting
6) optimized reference counting (patent 5,295,262)
7) lock-free (never deallocate deleted nodes)
8) lock-free (deallocate deleted nodes when reader count == 0)
9) RCU
10) SMR (modified hazard pointer algorithm)
11) RCU+SMR
12) proxy refcounting
13) proxy SMR or RCU+SMR

That's all I can think of offhand. There may be others.
There are performance, scalability, and other tradeoffs.
Naturally the best methods are towards the end of the
list.

David Butenhof

Nov 16, 2005, 9:40:03 AM
Joe Seigh wrote:
> John Hickin wrote:
>> "Joe Seigh" <jsei...@xemaps.com> wrote in message
>> news:TYmdnQYwnOUr3-fe...@comcast.com...
>>
>>> Doug wrote:
>>>
>>>> Shame - I'd love to see such a list. Even a list of names would be
>>>> great - google could do the rest.
>>>>
>>> That's the problem. I don't think they all have standard
>>> well known names. What's the name for locking individual
>>> nodes as you traverse a list? I don't know but it's been
>>> around for at least 30 years.
>>>
>> How about this:
>>
>> process A locks nodes
>> process B locks other nodes
>> process A locks some more
>> ...
>> pattern ==> lock nest monster :-)
>>
>>
> No, not even a lock mess monster :)
> You acquire the locks in the order you traverse the list.
> Also you release a lock after you acquire the lock for the
> next node.

I've called that "lock chaining" in the past, which seems like a fairly
decent vague description. ;-)

But then, I do like the sound of "lock nest monster", too...

--
Dave Butenhof, David.B...@hp.com
HP Utility Pricing software, POSIX thread consultant
Manageability Solutions Lab (MSL), Hewlett-Packard Company
110 Spit Brook Road, ZK2/3-Q18, Nashua, NH 03062

dick.dunbar

Nov 16, 2005, 11:33:32 AM
Wikipedia might be a good place to collaborate on such a list.

Joe Seigh

Nov 16, 2005, 9:08:10 PM
Joe Seigh wrote:
> The list of list traversal synchronization techniques
> would be (very roughly)
>
> 1) conventional mutex for list
> 2) conventional rwlock for list
> 3) mutex or rwlock per node
> 4) reference counting
> 5) lock-free reference counting
> 6) optimized reference counting (patent 5,295,262)
> 7) lock-free (never deallocate deleted nodes)
> 8) lock-free (deallocate deleted nodes when reader count == 0)
> 9) RCU
> 10) SMR (modified hazard pointer algorithm)
> 11) RCU+SMR
> 12) proxy refcounting
> 13) proxy SMR or RCU+SMR

I almost forgot
14) tracing GC (e.g. Java lock-free linked lists)

>
> That's all I can think of offhand. There may be others.
> There are performance, scalability, and other tradeoffs.
> Naturally the best methods are towards the end of the
> list.

Except for that last one.

Markus Elfring

Nov 17, 2005, 8:36:32 AM
> [...]

> The list of list traversal synchronization techniques
> would be (very roughly)
> [...]

Would you like to add anything to this article?
http://en.wikipedia.org/wiki/Concurrency_pattern

Regards,
Markus


Markus Elfring

Nov 17, 2005, 8:32:33 AM
> exactly - I was thinking of it for the anti-pattern column - to help
> let everyone know it was broken.

Would you like to add anything to this article?
http://en.wikipedia.org/wiki/Double-checked_locking

Regards,
Markus


John Hickin

Nov 17, 2005, 9:17:03 AM
Isn't the point of an anti-pattern that it is the wrong tool for the job?

A serviceable replacement for double-checked locking (at least for singleton
access) will often simply amount to using regular locking the first time (in
each thread) that you look up the singleton. Then you can cache a reference
or pointer per thread. This will work for singletons that are created and
never destroyed until they are no longer referenced. That sounds like a
variation of observable, smart pointer, or whatever.
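
A rough sketch of that shape, using POSIX threads and GCC's __thread
extension for the per-thread cache (names are illustrative, and it assumes
the singleton is never destroyed):

#include <pthread.h>
#include <stdlib.h>

typedef struct singleton_s { int data; } singleton_t;

static singleton_t *g_instance;                          /* shared */
static pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;
static __thread singleton_t *tls_cached;                 /* per-thread cache */

singleton_t* get_singleton( void )
{
    if ( tls_cached )                /* fast path: no lock, no membar */
        return tls_cached;

    pthread_mutex_lock( &g_lock );   /* first lookup in this thread */
    if ( !g_instance )
        g_instance = calloc( 1, sizeof( *g_instance ) );
    tls_cached = g_instance;         /* cache it; the mutex ordered the init */
    pthread_mutex_unlock( &g_lock );

    return tls_cached;
}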


Regards, John.

"Markus Elfring" <Markus....@web.de> wrote in message
news:3u3fg0F...@individual.net...

Joe Seigh

Nov 17, 2005, 11:04:20 AM

I probably shouldn't have used the term pattern, which is
so overused as to be totally meaningless. Reader/writer
isn't a pattern. It's a problem as in "the readers/writers
problem", and has multiple solutions which are probably more
accurately characterized as algorithms than patterns.

You can blame the debasement of the term pattern on some
idiotic academics who have nothing better to do than to
publish dozens and dozens of trivial patterns; if
they called them algorithms it would become immediately
obvious how stupid their ideas were in the first place.

Joe Seigh

Nov 19, 2005, 12:22:27 PM
Joe Seigh wrote:
> Joe Seigh wrote:
>
>> The list of list traversal synchronization techniques
>> would be (very roughly)
>>
>> 1) conventional mutex for list
>> 2) conventional rwlock for list
>> 3) mutex or rwlock per node
>> 4) reference counting
>> 5) lock-free reference counting
>> 6) optimized reference counting (patent 5,295,262)
>> 7) lock-free (never deallocate deleted nodes)
>> 8) lock-free (deallocate deleted nodes when reader count == 0)
>> 9) RCU
>> 10) SMR (modified hazard pointer algorithm)
>> 11) RCU+SMR
>> 12) proxy refcounting
>> 13) proxy SMR or RCU+SMR
>
>
> I almost forgot
> 14) tracing GC (e.g. Java lock-free linked lists)
>
15) Yet Another Rabbit (YAR)

Another hat trick. It's secret at this point until I decide what
to do with it (assuming it works). It has the advantage of being
more portable and having less patent entanglement problems (unless
I patent it of course).

Chris Thomasson

Nov 19, 2005, 7:22:26 PM
>> I almost forgot
>> 14) tracing GC (e.g. Java lock-free linked lists)
>>
> 15) Yet Another Rabbit (YAR)
>
> Another hat trick. It's secret at this point until I decide what
> to do with it (assuming it works). It has the advantage of being
> more portable

That's the route I took with VZOOM; the memory visibility requirements can
be guaranteed to work via POSIX mutex semantics. So, I can add yet another
pattern to your list:


16) Distributed Proxy Ref-Counting (DPRC)?

Each thread has an array of reference counters that track the total number
of proxy references to an object. The sum of all threads' counters is the
total proxy count for an object. Two or more objects are allowed to share
the same per-thread reference count. Adjustments to the count can be made
without any atomic ops and/or membars. Repeatedly iterating over a large
shared circular linked-list could be as simple as this:


int loops = 0;
int loop_sync = 8192;
item *f, *nx;

f = vz_persistent_inc( &col.front );

while ( f )
{
    nx = vz_persistent_inc( &f->next );

    do_work( f );

    if ( ! ( (++loops) % loop_sync ) )
    {
        vz_sync();
    }

    vz_persistent_dec( f );
    f = nx;
}


This simple example happens to voluntarily execute a synchronization (membar);
however, they can also be tracked by relying on highly non-portable operating
system information, etc. The reason I use the term "persistent" is because
references can exist across calls to vz_sync() and/or operating system
induced synchronizations; also, the number of references a thread can acquire
is basically unbounded. IMHO, this is what I think sets VZOOM apart from
most of the schemes that are currently out there. I will be able to provide
some working examples that are based on various Distributed Proxy
Ref-Counting schemes in early December so you can see VZOOM in action...


> and having less patent entanglement problems (unless
> I patent it of course).

Yes. It seems that the claims for most of the relevant patents in this area
are narrowed down to a point where you can slide around "some of them",
indeed... For instance, I can claim that VZOOM requires a thread/processor
in a plurality of threads/processors to periodically, episodically or
randomly execute a "memory barrier". This simple claim is very broad and
doesn't need to be narrowed down to support the patent teachings. It also
did not infringe on existing relevant patent claims...

;)


Joe Seigh

Nov 21, 2005, 8:28:42 AM
Chris Thomasson wrote:
>>>I almost forgot
>>>14) tracing GC (e.g. Java lock-free linked lists)
>>>
>>
>>15) Yet Another Rabbit (YAR)
>>
>>Another hat trick. It's secret at this point until I decide what
>>to do with it (assuming it works). It has the advantage of being
>>more portable

Damned rabbit. It was a little trickier than I thought. :)


>
>
> That's the route I took with VZOOM; the memory visibility requirements can
> be guaranteed to work via POSIX mutex semantics. So, I can add yet another
> pattern to your list:
>
>
> 16) Distributed Proxy Ref-Counting (DPRC)?
>

Too many things on the list. :) The only real beneficiary of the one
I'm working on is architectures which were too short-sighted to put in
a double-wide compare-and-swap or LL/SC. One guess which architecture that
is. They do have KCSS, but that's only obstruction-free.

Markus Elfring

Nov 21, 2005, 3:09:33 PM

> [...]

> It doesn't help that most programmers, even if they do know
> something about multithreading, are blissfully unaware of
> the specific synchronization techniques they are using or
> relying on. Techniques that either don't scale well or may
> have subtle errors that will only show up in a heavily
> multi-processed environment.
>
> What might be useful is a list of the known synchronization
> patterns and anti-patterns. Programmers could find their
> implementations in the list and then realize there are better
> patterns to use. So far I haven't found a list of these.

Have you got more use cases besides list traversal in mind?
Can any thread-safe library for well-known data structures show practical
opportunities?

How do they fit to alternative categorizations for server architectures and
models?
1. single-threaded
2. process-per-request
3. thread-per-request
4. thread-per-socket
5. thread-per-task
6. thread-per-API
7. thread-per-servant
8. thread-per-object
9. thread pool

How many server software packages do you know where such details can be
configured as a policy parameter?
Examples:
- CORBA
- Apache multi-processing module

Can we hope that the growing availability of multicore processors will also
result in improved application cores?
How much will the design and corresponding strategy selection be affected by
this technical progress?

Regards,
Markus

--

gottlo...@gmail.com

Nov 21, 2005, 4:29:09 PM

urgh...

this sentence from the article is a bit off:

"Due to the semantics of most programming languages, the code generated
by the compiler is allowed to update the shared variable to point to a
partially constructed object before A has finished performing the
initialization"

It makes it sound like a compiler code generation (maybe even
optimization) problem, when really it is a memory reading/writing
problem. But I'm not sure I could come up with something that was
correct and/or comprehensible (not to mention concise) either - it is
easy to see wrong, hard to write right, or something like that. Maybe
I'll give it a try.
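
For reference, the usual broken shape is something like this sketch (not
the article's code); the first check reads the shared pointer with no
ordering, so nothing stops it from observing the pointer before the
pointed-to fields:

#include <pthread.h>
#include <stdlib.h>

typedef struct widget_s { int data; } widget_t;

static widget_t *g_widget;           /* read outside the lock: no ordering */
static pthread_mutex_t g_widget_lock = PTHREAD_MUTEX_INITIALIZER;

widget_t* get_widget_dclp_broken( void )
{
    if ( !g_widget )                        /* 1st check, unsynchronized */
    {
        pthread_mutex_lock( &g_widget_lock );
        if ( !g_widget )                    /* 2nd check, under the lock */
        {
            widget_t *w = malloc( sizeof( *w ) );
            w->data = 42;
            g_widget = w;                   /* this store can become visible
                                               before w->data on another CPU */
        }
        pthread_mutex_unlock( &g_widget_lock );
    }

    return g_widget;                        /* may point to a widget whose
                                               fields aren't visible yet */
}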

At least they reference the Meyers / Alexandrescu article.


Tony

Joe Seigh

Nov 21, 2005, 7:28:42 PM

Markus Elfring wrote:
>>What might be useful is a list of the known synchronization
>>patterns and anti-patterns. Programmers could find their
>>implementations in the list and then realize there are better
>>patterns to use. So far I haven't found a list of these.
>
>
> Have you got more use cases besides list traversal in mind?
> Can any thread-safe library for well-known data structures show practical
> opportunities?

List traversal is just a special case. People are more familiar with
lock-free linked lists than with other types of structures, so
I just used those as an example. The techniques could apply to most
data structures.

> How do they fit to alternative categorizations for server architectures and
> models?
> 1. single-threaded
> 2. process-per-request
> 3. thread-per-request
> 4. thread-per-socket
> 5. thread-per-task
> 6. thread-per-API
> 7. thread-per-servant
> 8. thread-per-object
> 9. thread pool

Well, threads per actions that are independent, i.e. can be run concurrently
without too much synchronization. That's always been the goal.

> How many server software packages do you know where such details can be
> configured as a policy parameter?
> Examples:
> - CORBA
> - Apache multi-processing module

Apache is probably a bad example since they screwed up their run time
library, e.g. the win32 part at least.



> Can we hope that the growing availability of multicore processors will also
> result in improved application cores?

We can hope but I don't think it's going to happen by itself.

> How much will the design and corresponding strategy selection be affected by
> this technical progress?

The industry has a bit of a herd mentality here. Nobody will make a move until
someone else moves first.


--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.


Joe Seigh

Nov 27, 2005, 11:19:59 AM
I wrote:
> 15) Yet Another Rabbit (YAR)
>
> Another hat trick. It's secret at this point until I decide what
> to do with it (assuming it works). It has the advantage of being
> more portable and having less patent entanglement problems (unless
> I patent it of course).
>

The rabbit has a name or acronym anyway. rcpc for (yet another)
reference counted proxy collector. Turns out not to be crowbar
proof so it's not worth patenting or anything. You can break it
if you hold the read lock long enough to wrap a 31 or 63 bit
sequence counter under certain conditions. So 63 bits would be
safer.

It's out on http://atomic-ptr-plus.sourceforge.net/
No atomix.h source for sparc which is probably the only place
where this would be useful. Don't have my SB100 anymore. It
was taking up too much room.

Joe Seigh

Nov 28, 2005, 7:19:45 AM
Joe Seigh wrote:
> I wrote:
>
>> 15) Yet Another Rabbit (YAR)
>>
>> Another hat trick. It's secret at this point until I decide what
>> to do with it (assuming it works). It has the advantage of being
>> more portable and having less patent entanglement problems (unless
>> I patent it of course).
>>
>
> The rabbit has a name or acronym anyway. rcpc for (yet another)
> reference counted proxy collector. Turns out not to be crowbar
> proof so it's not worth patenting or anything. You can break it
> if you hold the read lock long enough to wrap a 31 or 63 bit
> sequence counter under certain conditions. So 63 bits would be
> safer.

I think with the way I have the compare handle wrap arounds, they're
effectively 30 and 62 bit sequence counters.

>
> It's out on http://atomic-ptr-plus.sourceforge.net/
> No atomix.h source for sparc which is probably the only place
> where this would be useful. Don't have my SB100 anymore. It
> was taking up too much room.
>
>

Hmm... It's possible to use the expired patent 5,295,262 trick
for 32 bit machines with double wide CAS, and the refcounting
trick for machines with LL/SC. So we could make proxy refcounting
work everywhere it matters.

Joe Seigh

Nov 29, 2005, 3:30:10 PM
I wrote:
>> It's out on http://atomic-ptr-plus.sourceforge.net/
>> No atomix.h source for sparc which is probably the only place
>> where this would be useful. Don't have my SB100 anymore. It
>> was taking up too much room.
>>
>>
> Hmm... It's possible to use the expired patent 5,295,262 trick
> for 32 bit machines with double wide CAS, and the refcounting
> trick for machines with LL/SC. So we could make proxy refcounting
> work everywhere it matters.

Ok, a little bit kludgy but done. rcpc 0.0.2

Chris Thomasson

Dec 8, 2005, 11:07:30 AM
> The rabbit has a name or acronym anyway. rcpc for (yet another)
> reference counted proxy collector.

I just very briefly skimmed over the code. Your sequence count/search scheme
seems like a pretty neat technique indeed. It seems like you are
implementing the collector by back-linking and recording the sequence
numbers into a shared linked-list. You are applying some nifty reference
counting tricks in the defer function that seem to involve the even or odd
quality of a collector's reference count in relation to a sequence count.
Humm, I can't quite figure out why exclusive access to proxy collector
objects is required... I have to study your design some more.


> Turns out not to be crowbar
> proof so it's not worth patenting or anything. You can break it
> if you hold the read lock long enough to wrap a 31 or 63 bit
> sequence counter under certain conditions.

Well, in order for it to break one thread would have to read a sequence
number and then stop inside its collected region long enough for another
thread to read the "exact same sequence number". Then both threads would
have to concurrently iterate through the "compare logic" in a "precise
order". Humm...

Still, I do see your point. Murphy's Law dictates that a couple of
application threads will execute in the exact order necessary to cause a
problem for your algorithm. Of course it will be a critical server
application and the problem will happen at peak usage hours and will
probably cause some major problems...

;)


> So 63 bits would be safer.

Ahhh yeah, 63 bits; that's a very nifty counting trick you whipped up.

:)


> It's out on http://atomic-ptr-plus.sourceforge.net/
> No atomix.h source for sparc which is probably the only place
> where this would be useful. Don't have my SB100 anymore. It
> was taking up too much room.

I finally setup a site for VZOOM at:

http://home.comcast.net/~vzoom/

Various demo applications will be available for download before 2006. Some
documentation will be posted sometime in January 2007. I am sorry for the
delay, but I am just so damn busy! Oh well. Anyway, I really do believe you
will find the techniques I used to implement VZOOM to be pretty nifty. IMHO,
its an elegant and straightforward algorithm. I just hope that you are not
the only one out there that can understand how it all works...

;)


Joe Seigh

Dec 8, 2005, 1:33:25 PM
Chris Thomasson wrote:
>>The rabbit has a name or acronymn anyway. rcpc for (yet another)
>>reference counted proxy collector.
>
>
> I just very briefly skimmed over the code. Your sequence count/search scheme
> seems like a pretty neat technique indeed. It seems like you are
> implementing the collector by back-linking and recording the sequence
> numbers into a shared linked-list. You are applying some nifty reference
> counting tricks in the defer function that seem to involve the even or odd
> quality of a collector's reference count in relation to a sequence count.
> Humm, I can't quite figure out why exclusive access to proxy collector
> objects is required... I have to study your design some more.

The parity bit is just a way of getting a reserved value out of a counter
that can wrap. If I make bits 63(31)..1 the counter and the parity bit
a flag then I can atomically test for count==0 && flag==0. That's used
for the reference count since for a long lived proxy object the reference
count decrements could wrap. For the sequence counts it's used to designate
an uninitialized sequence count. After a while those could wrap.
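
In other words, something along these lines (an illustrative encoding only;
rcpc's actual field layout may differ):

#include <stdint.h>

#define CF_FLAG ((uint64_t)1)        /* bit 0: the parity/flag bit */
#define CF_ONE  ((uint64_t)1 << 1)   /* adding CF_ONE bumps the count in
                                        bits 63..1 without touching the flag */

/* pack a count and a flag into one word */
static uint64_t cf_make( uint64_t count, int flag )
{
    return ( count << 1 ) | ( flag ? CF_FLAG : 0 );
}

/* a single word-sized compare against zero tests
   count == 0 && flag == 0 atomically */
static int cf_count_zero_and_flag_clear( uint64_t word )
{
    return word == 0;
}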

Deleted objects have to be GC'd (deallocated) in FIFO order, i.e. the
same order they were deleted in. That's to allow safe lock-free traversal.
You don't want the situation where thread 1 deletes object A, thread
2 deletes object B, queues it for GC, and then thread 1 queues object
A for GC. You could get a situation where objects got deallocated while
they were still reachable. So both the deletion and queueing for GC must
be done while holding exclusive access to maintain FIFO order.

You don't notice this with RCU because it has larger granularity. No deallocations
are done while a reader is in the RCU critical region.


>
>
>
>
>
>>Turns out not to be crowbar
>>proof so it's not worth patenting or anything. You can break it
>>if you hold the read lock long enough to wrap a 31 or 63 bit
>>sequence counter under certain conditions.
>
>
> Well, in order for it to break one thread would have to read a sequence
> number and then stop inside its collected region long enough for another
> thread to read the "exact same sequence number". Then both threads would
> have to concurrently iterate through the "compare logic" in a "precise
> order". Humm...
>
> Still, I do see your point. Murphy's Law dictates that a couple of
> application threads will execute in the exact order necessary to cause a
> problem for your algorithm. Of course it will be a critical server
> application and the problem will happen at peak usage hours and will
> probably cause some major problems...
>
> ;)

You can break it in a fairly deterministic way. One thread gets a read lock
and doesn't release it, another thread does a write so there's two proxy
objects. Enough readers can come along so the sequence on the current
proxy object wraps relative to the older proxy object. You're supposed
to see the older proxy object dequeued long before that could happen.


>
>
>
>>So 63 bits would be safer.

Particularly since a 63 bit counter isn't likely to wrap on current hardware.


>
>
> Ahhh yeah, 63 bits; that's a very nifty counting trick you whipped up.
>
> :)
>

[...]


>
> I finally setup a site for VZOOM at:
>
> http://home.comcast.net/~vzoom/
>
> Various demo applications will be available for download before 2006. Some
> documentation will be posted sometime in January 2007. I am sorry for the
> delay, but I am just so damn busy! Oh well. Anyway, I really do believe you
> will find the techniques I used to implement VZOOM to be pretty nifty. IMHO,
> its an elegant and straightforward algorithm. I just hope that you are not
> the only one out there that can understand how it all works...
>
> ;)
>

Okay, I'll try to take a look at it.

Chris Thomasson

Dec 8, 2005, 4:48:07 PM
> Deleted objects have to be GC'd (deallocated) in FIFO order, i.e. the
> same order they were deleted in. That's to allow safe lock-free
> traversal.
> You don't want the situation where thread 1 deletes object A, thread
> 2 deletes object B, queues it for GC, and then thread 1 queues object
> A for GC. You could get a situation where objects got deallocated while
> they were still reachable. So both the deletion and queueing for GC must
> be done while holding exclusive access to maintain FIFO order.
>
> You don't notice this with RCU because it has larger granularity. No
> deallocations
> are done while a reader is in the RCU critical region.

Exactly. VZOOM is similar to RCU in that respect. Strict FIFO ordering is
not required because reference counters are only checked by the polling
logic after it has determined that each registered thread has recently
executed a synchronization operation.


> You can break it in a fairly deterministic way. One thread gets a read
> lock
> and doesn't release it, another thread does a write so there's two proxy
> objects. Enough readers can come along so the sequence on the current
> proxy object wraps relative to the older proxy object. You're supposed
> to see the older proxy object dequeued long before that could happen.

Ahhh, so a dangerous situation could occur if a thread that did not release
its read lock observes a fresh sequence number that is less than or equal to
the one it read... Is that basically correct?


>> I finally setup a site for VZOOM at:
>>
>> http://home.comcast.net/~vzoom/
>>
>> Various demo applications will be available for download before 2006.
>> Some documentation will be posted sometime in January 2007. I am sorry
>> for the delay, but I am just so damn busy! Oh well. Anyway, I really do
>> believe you will find the techniques I used to implement VZOOM to be
>> pretty nifty. IMHO, its an elegant and straightforward algorithm. I just
>> hope that you are not the only one out there that can understand how it
>> all works...
>>
>> ;)
> Okay, I'll try to take a look at it.

You might want to take a quick look at a particular algorithm that I used to
benchmark against VZOOM. It's your basic proxy collector scheme. It doesn't
require DWCAS. The basic trick is to align collector structures on a 128 or
greater byte-boundary. The extra space is used as a master reference count.
Differential counting between the master count and a per-collector private
count along with back-linked reference counting are used to manage the
lifetime of the collectors and to ensure strict FIFO ordering. Humm, I
wonder if you have seen anything that is similar to my design?


The following sample implementation should compile fine with MSVC 6+:


pc_sample.c
---------

#include <stdlib.h>
#include <stdio.h>


/* proxy collector structures */
typedef void ( *pc_fp_dtor_t ) ( void* );
typedef struct pc_dtor_s pc_dtor_t;
typedef struct pc_s pc_t;
typedef struct pc_deferred_s pc_deferred_t;


struct pc_dtor_s
{
pc_fp_dtor_t fp_dtor;
void *state;
};


struct pc_deferred_s
{
pc_deferred_t *next; /* back-link list */
pc_t *owner;
pc_dtor_t dtor;
int refs; /* private refs */
int back_refs; /* back-link list refs */
void *base_mem;
};


struct pc_s
{
pc_deferred_t *pc_anchor; /* active collector & master refs */
};


/* proxy collector user api */
int pc_alloc( pc_t* );
void pc_free( pc_t* );
pc_deferred_t* pc_inc( pc_t* );
void pc_dec( pc_deferred_t* );
void pc_defer( pc_t*, pc_fp_dtor_t, void* );


/* proxy collector system api */
static pc_deferred_t* pc_deferred_sys_alloc( pc_t*, int );
static void pc_deferred_sys_free( pc_deferred_t* );
static void pc_deferred_sys_dec( pc_deferred_t* );
static int atomic_casptr( void volatile *dest, void const *_cmp, void const *_xchg );
static void* atomic_xchgptr( void volatile *dest, void const *_xchg );
static void* atomic_xaddptr( void volatile *dest, void const *_xadd );


#define PC_SYS_PTRREF_DEPTH 128
#define PC_SYS_ROUNDUP( p1, p2 ) ( ( (p1) % (p2) ) ? (p1) + ( (p2) - ( (p1) % (p2) ) ) : (p1) )
#define PC_SYS_GETPTR( mp_p1 ) ( (pc_deferred_t*)( ((int)(mp_p1)) & (-PC_SYS_PTRREF_DEPTH) ) )
#define PC_SYS_GETREFS( mp_p1 ) ( ( ((int)(mp_p1)) & (PC_SYS_PTRREF_DEPTH - 1) ) )


#define atomic_casword( mp_dest, mp_cmp, mp_xchg ) \
( atomic_casptr \
( (void volatile*)(mp_dest), \
(void const*)(mp_cmp), \
(void const*)(mp_xchg) ) )


#define atomic_xchgword( mp_dest, mp_xchg ) \
( (int)atomic_xchgptr \
( (void volatile*)(mp_dest), \
(void const*)(mp_xchg) ) )


#define atomic_xaddword( mp_dest, mp_xadd ) \
( (int)atomic_xaddptr \
( (void volatile*)(mp_dest), \
(void const*)(mp_xadd) ) )


/* proxy collector user api impl */
int pc_alloc( pc_t *_this )
{
_this->pc_anchor = pc_deferred_sys_alloc( _this, 1 );
return 0;
}


void pc_free( pc_t *_this )
{
pc_defer( _this, 0, 0 );
}


void pc_defer( pc_t *_this, pc_fp_dtor_t fp_dtor, void *state )
{
/* obtain a new collector and swap it with the current one */
pc_deferred_t
*pc_new = ( fp_dtor ) ? pc_deferred_sys_alloc( _this, 2 ) : 0,
*pc_raw = atomic_xchgptr( &_this->pc_anchor, pc_new ),
*pc_old = PC_SYS_GETPTR( pc_raw );

/* back-link the old collector to the new one */
pc_old->next = pc_new;
pc_old->dtor.fp_dtor = fp_dtor;
pc_old->dtor.state = state;

/* obtain the correct count by incrementing
the old collectors private ref count
by the old collectors master ref count */
if ( ! PC_SYS_GETREFS( pc_raw ) ||
atomic_xaddword
( &pc_old->refs,
PC_SYS_GETREFS( pc_raw ) ) ==
-PC_SYS_GETREFS( pc_raw ) )
{
pc_deferred_sys_dec( pc_old );
}
}


pc_deferred_t* pc_inc( pc_t *_this )
{
/* inc master count */
return PC_SYS_GETPTR( atomic_xaddword( &_this->pc_anchor, 1 ) );
}


void pc_dec( pc_deferred_t *_this )
{
pc_deferred_t *cmp = _this->owner->pc_anchor;

do
{
if ( PC_SYS_GETPTR( cmp ) != _this )
{
/* the collector was swapped, we need to dec the private refs */
if ( atomic_xaddword( &_this->refs, -1 ) == 1 )
{
pc_deferred_sys_dec( _this );
}

return;
}

/* we have the same collector so try and
dec the master count */
} while ( ! atomic_casword
( &_this->owner->pc_anchor,
cmp,
((int)cmp) - 1 ) );
}


/* proxy collector system api impl */
pc_deferred_t* pc_deferred_sys_alloc( pc_t *owner, int back_refs )
{
/* align collector on ref boundary */
void *base_mem =
malloc( PC_SYS_ROUNDUP( sizeof( pc_deferred_t ),
PC_SYS_PTRREF_DEPTH ) + (PC_SYS_PTRREF_DEPTH - 1) );

pc_deferred_t *_this =
(pc_deferred_t*)((((int)base_mem) +
(PC_SYS_PTRREF_DEPTH - 1)) & -((int)PC_SYS_PTRREF_DEPTH) );

_this->base_mem = base_mem;
_this->owner = owner;
_this->dtor.fp_dtor = 0;
_this->dtor.state = 0;
_this->next = 0;
_this->refs = 0;
_this->back_refs = back_refs;

return _this;
}


void pc_deferred_sys_free( pc_deferred_t *_this )
{
/* process dtor */
if ( _this->dtor.fp_dtor )
{
_this->dtor.fp_dtor( _this->dtor.state );
}

/* 100% safe to reclaim collector memory */
free( _this->base_mem );
}


void pc_deferred_sys_dec( pc_deferred_t *_this )
{
/* the private refs are zero, we now need to dec
the back-link refs */
if ( atomic_xaddword( &_this->back_refs, -1 ) == 1 )
{
/* this collector has dropped to zero,
so queue it locally */
pc_deferred_t *next, *cur = _this->next, *dtorf = _this, *dtorb = _this;

_this->next = 0;

while ( cur )
{
next = cur->next;

/* we now need to dec the back-link refs of
the this collector */
if ( atomic_xaddword( &cur->back_refs, -1 ) != 1 )
{
break;
}

/* this collector has dropped to zero,
so queue it locally */
cur->next = 0;
dtorb->next = cur;
dtorb = cur;

cur = next;
}

/* process the dropped collectors in FIFO order */
cur = dtorf;

while ( cur )
{
next = cur->next;
pc_deferred_sys_free( cur );
cur = next;
}

/* that's all folks! :) */
}
}


__declspec( naked ) int atomic_casptr( void volatile *dest, void const *_cmp, void const *_xchg )
{
_asm
{
mov ecx, [esp + 4]
mov eax, [esp + 8]
mov edx, [esp + 12]
lock cmpxchg [ecx], edx
jne fail
mov eax, 1
ret

fail:
xor eax, eax
ret
}
}


__declspec( naked ) void* atomic_xchgptr( void volatile *dest, void const *_xchg )
{
_asm
{
mov ecx, [esp + 4]
mov eax, [esp + 8]
xchg [ecx], eax
ret
}
}


__declspec( naked ) void* atomic_xaddptr( void volatile *dest, void const *_xadd )
{
_asm
{
mov ecx, [esp + 4]
mov eax, [esp + 8]
lock xadd [ecx], eax
ret
}
}


/* a quick sample application */
static void prv_defer( void *state );
static pc_t pg_pc;


int main( void )
{
pc_deferred_t *pcd1, *pcd2;

pc_alloc( &pg_pc );


/* collected in region 0 */
pc_defer( &pg_pc, prv_defer, (void*)1 );
pc_defer( &pg_pc, prv_defer, (void*)2 );

pcd1 = pc_inc( &pg_pc );

/* collected in region 1 */
pc_defer( &pg_pc, prv_defer, (void*)3 );

pcd2 = pc_inc( &pg_pc );

/* collected in region 2 */
pc_defer( &pg_pc, prv_defer, (void*)4 );
pc_defer( &pg_pc, prv_defer, (void*)5 );

pc_dec( pcd2 );

/* collected in region 1 */
pc_defer( &pg_pc, prv_defer, (void*)6 );
pc_defer( &pg_pc, prv_defer, (void*)7 );

pc_dec( pcd1 );

/* collected in region 0 */
pc_defer( &pg_pc, prv_defer, (void*)8 );
pc_defer( &pg_pc, prv_defer, (void*)9 );

pc_free( &pg_pc );

return 0;
}


void prv_defer( void *state )
{
printf( "my_dtor - %i\n", (int)state );
}


Chris Thomasson

Dec 8, 2005, 5:16:30 PM
> The following sample implementation should compile fine with MSVC 6+:
>
>
>
>
> pc_sample.c
> ---------


Argh, cut-and-paste might have messed up the code so you may not be able to
just copy-and-paste. Here is a link to the sample code:

http://home.comcast.net/~vzoom/demos/pc_sample.c

just in case you have any problems.


Joe Seigh

Dec 8, 2005, 6:13:07 PM
In the read lock routine, after a reader gets the sequence number, it looks
for the proxy object corresponding to the sequence number. It doesn't
look at the sequence number of the oldest proxy object (which may be the
current one). So the difference between the sequence number on the next
oldest proxy object and the current sequence number has to be less than
half the sequence number range for the compares to work. That's 2**30
for a 32 bit int and 2**62 for a 64 bit int.

The compare is changed from x < y to (x - y) < 0 to handle overall wrap
but that uses up 1 bit. The parity bit uses up the other bit.
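
That is, something like the following (illustrative; it leans on the usual
two's-complement wraparound of unsigned subtraction):

#include <stdint.h>

/* nonzero if sequence a is "before" b, even across counter wrap,
   provided the two differ by less than half the counter range */
static int seq_before( uint64_t a, uint64_t b )
{
    return (int64_t)( a - b ) < 0;
}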

>
>
>
>>>I finally setup a site for VZOOM at:
>>>
>>>http://home.comcast.net/~vzoom/
>>>

[...]

>
>
> You might want to take a quick look a particular algorithm that I used to
> benchmark against VZOOM. Its your basic proxy collector scheme. It doesn't
> require DWCAS. The basic trick is to align collector structures on a 128 or
> greater byte-boundary. The extra space is used as a master reference count.
> Differential counting between the master count and a per-collector private
> count along with back-linked reference counting are used to manage the
> lifetime of the collectors and to ensure strict FIFO ordering. Humm, I
> wonder if you have you seen anything that is similar to my design?
>

This is stuff you worked on from way back? I don't think anyone else
has been doing proxy type collectors.

Chris Thomasson

Dec 8, 2005, 11:21:40 PM
>> Ahhh, so a dangerous situation could occur if a thread that did not
>> release its read lock observes a fresh sequence number that is less than
>> or equal to the one it read... Is that basically correct?
>>
>>
> In the read lock routine, after a reader gets the sequence number, it looks
> for the proxy object corresponding to the sequence number. It doesn't
> look at the sequence number of the oldest proxy object (which may be the
> current one). So the difference between the sequence number on the next
> oldest proxy object and the current sequence number has to be less than
> half the sequence number range for the compares to work. That's 2**30
> for a 32 bit int and 2**62 for a 64 bit int.
>
> The compare is changed from x < y to (x - y) < 0 to handle overall wrap
> but that uses up 1 bit. The parity bit uses up the other bit.

Ahhh, thanks for that. I think I now understand how your rabbit performs
some of its tricks...

:)


>> You might want to take a quick look at a particular algorithm that I used to
>> benchmark against VZOOM. It's your basic proxy collector scheme. It
>> doesn't require DWCAS. The basic trick is to align collector structures
>> on a 128 or greater byte-boundary. The extra space is used as a master
>> reference count. Differential counting between the master count and a
>> per-collector private count along with back-linked reference counting are
>> used to manage the lifetime of the collectors and to ensure strict FIFO
>> ordering. Humm, I wonder if you have seen anything that is similar to
>> my design?
>>
>
> This is stuff you worked on from way back?

I designed a preliminary version of the basic idea in mid 2004. Then finally
cleaned it up enough so that it could be used as a benchmarking tool. I made
a very brief mention of the idea in this post:

http://groups.google.com/group/comp.arch/msg/7f56cfc0541ec7f0?hl=en&

Actually, some of my oldest stuff that went public was the DWCAS-based
collector that did not rely on back-linking, but instead pushed deferred
objects onto a lock-free LIFO whose anchor was embedded with a reference
count:


/* two contiguous words */
struct old_pc_anchor_s
{
int refs; /* refs */
pc_deferred_t *defer; /* lock-free LIFO */
};


http://groups.google.com/group/comp.programming.threads/msg/9d15cb75717124e1
http://groups.google.com/group/comp.programming.threads/msg/e547340c08220b6b
(posts about the old algorithm)


Deferred object list got collected when refs == 0, FIFO ordering not
required...
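
A rough idea of how a reader could take a reference on that kind of anchor;
this sketch packs the two words into one 64-bit value so a plain 64-bit CAS
stands in for DWCAS, using GCC's __atomic builtins (names illustrative, not
the code from those posts):

#include <stdint.h>

typedef union old_pc_anchor_u
{
    struct { int32_t refs; uint32_t defer; } f;  /* defer holds pointer bits */
    uint64_t whole;
} old_pc_anchor_t;

static old_pc_anchor_t g_anchor;

/* bump the reader count while leaving the deferred-LIFO head untouched */
static void old_pc_acquire( void )
{
    uint64_t cmp, xchg;
    old_pc_anchor_t tmp;

    do
    {
        cmp = __atomic_load_n( &g_anchor.whole, __ATOMIC_ACQUIRE );
        tmp.whole = cmp;
        tmp.f.refs += 1;
        xchg = tmp.whole;
    } while ( !__atomic_compare_exchange_n( &g_anchor.whole, &cmp, xchg,
                                            0, __ATOMIC_ACQ_REL,
                                            __ATOMIC_ACQUIRE ) );
}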


> I don't think anyone else has been doing proxy type collectors.

You know, I think you're correct! I believe we are creating some novel
solutions here, indeed... I just wonder when the emperor will decide to wear
his new clothes...

lol

;)


Joe Seigh

Dec 9, 2005, 9:03:55 AM
Chris Thomasson wrote:
>
>
>
>
>>I don't think anyone else has been doing proxy type collectors.
>
>
> You know, I think you're correct! I believe we are creating some novel
> solutions here, indeed... I just wonder when the emperor will decide to wear
> his new clothes...
>

I don't know. There's a lot of NIH syndrome out there. The problem with
scalability is the applications where scalability is an issue tend to be
really large, like databases and such. And you really need to have a
success story to get the ball rolling as everyone rushes to catch up with
the competition. API's by themselves are a hard sell. Plus, I don't know
if anyone noticed but there's no profit margin in open source software so
if anyone is waiting for me to buy an x86-64 or a Niagara processor so the
api's can be ported there, they're going to have a long wait. :)

The fun part is I think I've established enough prior art here. So
eventually someone will have to do all the work on a lock-free solution
themselves. They just won't have an exclusive competitive advantage since
all their competitors will be able to copy their methods.

Chris Thomasson

Dec 13, 2005, 12:48:03 PM
> This is stuff you worked on from way back? I don't think anyone else
> has been doing proxy type collectors.

DOH!

I made a mistake in the code when I was stripping some of my library
specific stuff. The cas loop in the pc_dec( ... ) function did not reload
the cmp var when a cas failed.

Here is the fixed code:

http://home.comcast.net/~vzoom/demos/pc_sample.c

Sorry!


Markus Elfring

Dec 22, 2005, 3:51:02 PM
> There's a lot of NIH syndrome out there. The problem with
> scalability is the applications where scalability is an issue tend to be
> really large, like databases and such. And you really need to have a
> success story to get the ball rolling as everyone rushes to catch up with
> the competition. API's by themselves are a hard sell. Plus, I don't know
> if anyone noticed but there's no profit margin in open source software so
> if anyone is waiting for me to buy an x86-64 or a Niagara processor so the
> api's can be ported there, they're going to have a long wait. :)

Is more drumming and "advertisement" needed, as demonstrated in the article
"Software and the Concurrency Revolution" by Herb Sutter and James Larus?
http://acmqueue.com/modules.php?name=Content&pa=showpage&pid=332

Regards,
Markus


Joe Seigh

Dec 22, 2005, 5:15:32 PM
I was thinking something more along the lines of
https://coolthreads.dev.java.net/

Of course I would find out about it after I no longer have
an SB100.

Chris Thomasson

Dec 24, 2005, 1:06:48 AM
> I was thinking something more along the lines of
> https://coolthreads.dev.java.net/

Humm, I wonder how my proxy collector code would do:

http://home.comcast.net/~vzoom/demos/pc_sample.c

I should really read the rules and regulations to see how they feel about
submitting any algorithms that have patent(s) pending...


> Of course I would find out about it after I no longer have
> an SB100.

Murphy's law strikes again! Humm, I do have a version of VZOOM that
uses only pthread mutexes/condvars and only relies on dependent-load ordering
for readers. Do you know how good/complete the pthread api is on Solaris 10?
Also, do you know if Niagara processors need dependent-load membars like
Alphas do?

