Wait free queue

Jon Harrop

unread,

Jan 11, 2009, 12:27:43 AM1/11/09

to

I'm trying to find concurrent data structures implemented efficiently in C.
Where should I look?

--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u

Chris M. Thomasson

unread,

Jan 11, 2009, 12:44:31 AM1/11/09

to

"Jon Harrop" <j...@ffconsultancy.com> wrote in message
news:QpqdnRajkKbyH_TU...@posted.plusnet...

>
> I'm trying to find concurrent data structures implemented efficiently in
> C.

Do you trust the "efficiently" level of a "stock" OS provided "atomic" op
API? Solaris and Windows come to mind. FWIW, I generally hand craft assembly
code for my projects non-blocking algorihtms; it makes me less paranoid...

> Where should I look?

For C... Humm...

http://www.rdrop.com/users/paulmck/RCU

http://www.docs.hp.com/en/5992-3373/ch10s08.html

http://atomic-ptr-plus.sourceforge.net
(Joe Seighs excellent work!)

http://webpages.charter.net/appcore
(has wait-free unbounded single producer/consumer queue with eventcount)

-- http://www.intel.com/cd/ids/developer/asmo-na/eng/151201.htm
(search page text for `Chris Thomasson')

-- http://developers.sun.com/solaris/articles/chip_multi_thread.html
(refer to section 2.2.1.1.2; contains link to old AppCore site)

--
http://72.14.203.104/search?q=cache:JRTk2SzLVvUJ:www.cs.nyu.edu/artg/internet/Spring2006/lectures/DavidBuksbaum-BuildingHighThroughputMulti-threadedServersInCSharpAndDotNet.ppt+appcore+.ppt&hl=en&gl=us&ct=clnk&cd=20
(search page text for `AppCore')

http://www.cl.cam.ac.uk/research/srg/netos/lock-free

http://groups.google.com/group/comp.programming.threads/search?hl=en&group=comp.programming.threads&q=wait-free+queue

http://www.non-blocking.com
(has hard-core license that demands a "tax" on your total sales!)

http://groups.google.com/group/comp.programming.threads/search?hl=en&group=comp.programming.threads&q=vzoom

Any thoughts? Your looking for a wait-free queue; okay, well, your going to
have a hard time finding one that can handle multiple producers... There are
workarounds. How are you planning on using the queue? What type of
algorihtm/solution are you developing?

Jon Harrop

unread,

Jan 11, 2009, 9:09:41 AM1/11/09

to

Chris M. Thomasson wrote:
> Any thoughts? Your looking for a wait-free queue; okay, well, your going
> to have a hard time finding one that can handle multiple producers...
> There are workarounds. How are you planning on using the queue? What type
> of algorihtm/solution are you developing?

I am considering the simplest possible way to implement a GC. I almost have
a really simple single-threaded GC now:

1. Mark all heap blocks as unreachable.

2. Create a queue of heap blocks to traverse from the global roots (all
global variables and the shadow stacks of all threads).

3. Take a pointer to a heap block from the queue, or deallocate unreachable
blocks and complete if the queue is empty.

4. Mark the block as "reachable".

5. Add all pointers to blocks referred to by the current block to the queue.

6. Repeat from 3.

I'm thinking about parallelizing it and the easiest way to do this is with a
single global queue shared between multiple GC threads that both read and
write to the queue concurrently.

I was hoping I could just pull a concurrent queue implementation off the
shelf and use that as an easy solution. If not, the work required to
implement one warrants further investigation of other algorithms instead.

Dmitriy V'jukov

unread,

Jan 11, 2009, 9:28:23 AM1/11/09

to

IMVHO solution with single shared queue will have negative scalability
provided small amount of local work. Showstoppers are (1) contention
on both ends of the queue and (2) poor locality of processing.

The natural scalable solution is to:
1. Create worker thread per processor.
2. Create work-stealing deque per worker thread.
3. Worker threads use own deque in LIFO mode.
4. When own deque is empty then steal from other threads in FIFO mode.

This is classical Cilk-style scheduler, which known to have nearly
perfect characteristics for such problems.

--
Dmitriy V'jukov

Chris M. Thomasson

unread,

Jan 11, 2009, 1:36:40 PM1/11/09

to

"Jon Harrop" <j...@ffconsultancy.com> wrote in message

news:Be-dnQ8vLoNYYfTU...@posted.plusnet...

> Chris M. Thomasson wrote:
>> Any thoughts? Your looking for a wait-free queue; okay, well, your going
>> to have a hard time finding one that can handle multiple producers...
>> There are workarounds. How are you planning on using the queue? What type
>> of algorihtm/solution are you developing?
>
> I am considering the simplest possible way to implement a GC. I almost
> have
> a really simple single-threaded GC now:
>
> 1. Mark all heap blocks as unreachable.
>
> 2. Create a queue of heap blocks to traverse from the global roots (all
> global variables and the shadow stacks of all threads).

How are you keeping track of global variables? Are you doing this in C?

> 3. Take a pointer to a heap block from the queue, or deallocate
> unreachable
> blocks and complete if the queue is empty.
>
> 4. Mark the block as "reachable".
>
> 5. Add all pointers to blocks referred to by the current block to the
> queue.
>
> 6. Repeat from 3.
>
> I'm thinking about parallelizing it and the easiest way to do this is with
> a
> single global queue shared between multiple GC threads that both read and
> write to the queue concurrently.
>
> I was hoping I could just pull a concurrent queue implementation off the
> shelf and use that as an easy solution. If not, the work required to
> implement one warrants further investigation of other algorithms instead.

Windows provides the Interlocked SList API. You can easily use that to
create your queue. One major caveat, the SList is LIFO. Therefore, a
producer thread would need to flush all items, reverse order and process.
One big plus with this approach: No memory management is needed for the
queue nodes. However, a single global queue does not scale well at all. I
would suggest using single producer/consumer queue per thread. You can
interconnect them is several interesting ways... A particularly good method
is to use a Kautz graph, just like the SiCortex supercomputers do:

http://sicortex.com/content/download/296/1805/file/5832%20Winter2007.pdf

This scales very well... I already provided a link to an example single
producer/consumer queue x86 implementation. It can also port to basically
any architecture you need.

Chris M. Thomasson

unread,

Jan 11, 2009, 2:09:46 PM1/11/09

to

"Chris M. Thomasson" <n...@spam.invalid> wrote in message
news:U_qal.7061$tD1....@newsfe07.iad...

> "Jon Harrop" <j...@ffconsultancy.com> wrote in message
> news:Be-dnQ8vLoNYYfTU...@posted.plusnet...

>> [...]

>> I was hoping I could just pull a concurrent queue implementation off the
>> shelf and use that as an easy solution. If not, the work required to
>> implement one warrants further investigation of other algorithms instead.
>
> Windows provides the Interlocked SList API. You can easily use that to
> create your queue. One major caveat, the SList is LIFO. Therefore, a
> producer thread would need to flush all items, reverse order and process.

> [...]

You don't even need the SList API. You can do something like:

#define CAS_PTR_RELEASE(d, c, x) \
InterlockedCompareExchangePointerRelease((d), (x), (c))

#define SWAP_PTR_ACQUIRE(d, x) \
InterlockedExchangePointerAcquire((d), (x))

struct node {
struct node* next;
};

struct stack {
struct node* head; /* = NULL */
};

static struct node*
stack_push(
struct stack* const self,
struct node* const node
) {
struct node* cmptmp, cmp = self->head;

do {
node->next = cmp;

cmptmp = cmp;

cmp = CAS_PTR_RELEASE(&self->head, cmp, node);

} while (cmp != cmptmp);

return node;
}

static struct node*
stack_flush_ex(
struct stack* const self,
struct node* const node
) {
return SWAP_PTR_ACQUIRE(&self->head, node);
}

#define stack_flush(s) stack_flush_ex((s), NULL)

struct queue {
struct stack stack;
};

#define queue_push stack_push

static struct node*
queue_flush_ex(
struct queue* const self,
struct node* const node
) {
struct node* fifo = NULL;
struct node* lifo = stack_flush_ex(&self->stack, node);

while (lifo) {
struct node* const next = lifo->next;
lifo->next = fifo;
fifo = lifo;
lifo = next;
}

return fifo;
}

#define queue_flush(s) queue_flush_ex((s), NULL)

You can also use `libatomic' from HP to implement the simple non-blocking
algo above:

http://www.docs.hp.com/en/5992-3373/ch10s08.html

Jon Harrop

unread,

Jan 11, 2009, 7:14:14 PM1/11/09

to

Chris M. Thomasson wrote:
> "Jon Harrop" <j...@ffconsultancy.com> wrote in message
> news:Be-dnQ8vLoNYYfTU...@posted.plusnet...

>> 2. Create a queue of heap blocks to traverse from the global roots (all
>> global variables and the shadow stacks of all threads).
>
> How are you keeping track of global variables?

I'm writing the compiler. Whenever a new global variable (of a reference
type) is created it is automatically added to a list of global roots by my
compiler.

> Are you doing this in C?

No. I am writing the compiler in OCaml using LLVM for code generation.

I gave a description of my motives and an overview of my objectives here, if
you are interested:

http://flyingfrogblog.blogspot.com/2008/12/building-better-future-high-level.html

Interestingly, LLVM's IR is a mid-level assembler that includes memory
fences. So I thought it would be fun to implement these kinds of concurrent
algorithms directly using LLVM's IR to obtain good efficiency without
sacrificing architecture independence.

My dream is to implement a fully concurrent garbage collector but, for now,
I am just beginning to consider the simplest possible parallel GC.

>> I was hoping I could just pull a concurrent queue implementation off the
>> shelf and use that as an easy solution. If not, the work required to
>> implement one warrants further investigation of other algorithms instead.
>
> Windows provides the Interlocked SList API. You can easily use that to
> create your queue. One major caveat, the SList is LIFO. Therefore, a
> producer thread would need to flush all items, reverse order and process.
> One big plus with this approach: No memory management is needed for the
> queue nodes. However, a single global queue does not scale well at all.

Yes. I had assumed that it would be much easier to implement work sharing
but that does not appear to be the case.

> I
> would suggest using single producer/consumer queue per thread. You can
> interconnect them is several interesting ways... A particularly good
> method is to use a Kautz graph, just like the SiCortex supercomputers do:
>
> http://sicortex.com/content/download/296/1805/file/5832%20Winter2007.pdf
>
> This scales very well... I already provided a link to an example single
> producer/consumer queue x86 implementation. It can also port to basically
> any architecture you need.

I'll check that out in more detail now. Thanks.

Jon Harrop

unread,

Jan 11, 2009, 7:15:10 PM1/11/09

to

Dmitriy V'jukov wrote:
> IMVHO solution with single shared queue will have negative scalability
> provided small amount of local work. Showstoppers are (1) contention
> on both ends of the queue and (2) poor locality of processing.
>
> The natural scalable solution is to:
> 1. Create worker thread per processor.
> 2. Create work-stealing deque per worker thread.
> 3. Worker threads use own deque in LIFO mode.
> 4. When own deque is empty then steal from other threads in FIFO mode.
>
> This is classical Cilk-style scheduler, which known to have nearly
> perfect characteristics for such problems.

Yes. I see now that work sharing would be very inefficient indeed and is not
actually much simpler to implement.

Thanks.

Dmitriy V'jukov

unread,

Jan 12, 2009, 8:01:55 AM1/12/09

to

On Jan 12, 3:15 am, Jon Harrop <j...@ffconsultancy.com> wrote:
> Dmitriy V'jukov wrote:
> > IMVHO solution with single shared queue will have negative scalability
> > provided small amount of local work. Showstoppers are (1) contention
> > on both ends of the queue and (2) poor locality of processing.
>
> > The natural scalable solution is to:
> > 1. Create worker thread per processor.
> > 2. Create work-stealing deque per worker thread.
> > 3. Worker threads use own deque in LIFO mode.
> > 4. When own deque is empty then steal from other threads in FIFO mode.
>
> > This is classical Cilk-style scheduler, which known to have nearly
> > perfect characteristics for such problems.
>
> Yes. I see now that work sharing would be very inefficient indeed and is not
> actually much simpler to implement.

Interesting improvement is to take into account data affinity. If you
are using hooks to stop threads, you can extend them this way:

if (__garbage_collection_pending)
{
wait_for_gc_to_complete_and_note_my_affinity
(get_current_processor_number());
}

Then every gc worker thread will initially place all threads (root
objects on stack) that was last running on the processor to own work-
stealing deque in the LIFO order (i.e. worker thread will first
process the last running thread). This way you will get good initial
distribution of work.

The other possible improvement is to bound the size of the work-
stealing deques. For example, to number_of_processors*64. I.e. if
thread's work-stealing deque is already contains
number_of_processors*64 elements, then thread can put new work
elements to purely local container.

--
Dmitriy V'jukov

Chris M. Thomasson

unread,

Jan 12, 2009, 9:31:55 PM1/12/09

to

"Chris M. Thomasson" <n...@spam.invalid> wrote in message

news:Vtral.10772$Jy....@newsfe06.iad...

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

ummm:

struct node *cmptmp, *cmp = self->head;

is correct; perhaps;

struct node* cmptmp;
struct node* cmp = self->head;

is better... Sorry for any confusions.

> do {
> node->next = cmp;
>
> cmptmp = cmp;
>
> cmp = CAS_PTR_RELEASE(&self->head, cmp, node);
>
> } while (cmp != cmptmp);
>
> return node;
> }

> [...]

Chris M. Thomasson

unread,

Jan 12, 2009, 9:37:54 PM1/12/09

to

"Jon Harrop" <j...@ffconsultancy.com> wrote in message

news:CradnWHV-5HoF_fU...@posted.plusnet...

> Chris M. Thomasson wrote:
>> "Jon Harrop" <j...@ffconsultancy.com> wrote in message
>> news:Be-dnQ8vLoNYYfTU...@posted.plusnet...
>>> 2. Create a queue of heap blocks to traverse from the global roots (all
>>> global variables and the shadow stacks of all threads).
>>
>> How are you keeping track of global variables?
>
> I'm writing the compiler. Whenever a new global variable (of a reference
> type) is created it is automatically added to a list of global roots by my
> compiler.
>
>> Are you doing this in C?
>
> No. I am writing the compiler in OCaml using LLVM for code generation.

Okay. LLVM seems to follow the SPARC memory model because of its fine
granularity; it makes sense. Fine. It should work out great, and you should
be able to "efficiently" abstract impl details away with LLVM. Nice. I think
LLVM is a noble project indeed.

> I gave a description of my motives and an overview of my objectives here,
> if
> you are interested:
>
> http://flyingfrogblog.blogspot.com/2008/12/building-better-future-high-level.html

> [...]

Humm, well, I am definitely interested in creating efficient GC algorithms.
Humm...

Chris M. Thomasson

unread,

Jan 17, 2009, 6:03:14 AM1/17/09

to

"Chris M. Thomasson" <n...@spam.invalid> wrote in message

news:0Hfal.5329$1k1....@newsfe14.iad...

> "Jon Harrop" <j...@ffconsultancy.com> wrote in message
> news:QpqdnRajkKbyH_TU...@posted.plusnet...
>>
>> I'm trying to find concurrent data structures implemented efficiently in
>> C.
>
> Do you trust the "efficiently" level of a "stock" OS provided "atomic" op
> API? Solaris and Windows come to mind. FWIW, I generally hand craft
> assembly code for my projects non-blocking algorihtms; it makes me less
> paranoid...
>
>
>
>
>
>> Where should I look?
>
> For C... Humm...

> [...]

http://groups.google.com/group/comp.programming.threads/search?hl=en&group=comp.programming.threads&q=Dmitriy+V%27jukov+queue

Codewarp

unread,

Feb 12, 2009, 2:20:58 PM2/12/09

to

>From Dmitriy V'jukov, May 31, 2007:

(regarding his nested-list wait-free mpmc queue)

>I will post here some ideas that I already post to comp.programming.threads, just to create
>some initial content for this group. First I provide some primary points and then code sketch.

>There is no limitations on queue size and producer/consumer count. Implementation requires only CAS
>instruction. Values must be of word size (if DWCAS supported then values can be of dword size). There
>must be some reserved value that never enqueued (currently 0).

>The main point is:

>For underlying data-structure I use nested-list, i.e. values organized in continuous buffers of some
>constant size (nests), nests linked in single-linked list. Nest filled with values only once and only once
>consumed, then nest is recycled. So there is no cyclic buffers like in bounded queue on top of vector.
>I think that nested-list based lock-free data structures are very interesting and perspective. I haven't
>seen it before. Definitely some other lock-free data structures can be build on top of nested-list...

>Comments/suggestions/thoughts are welcome ;)

>Dmitriy V'jukov

I read your mpmc queue algorithm with keen interest. I like it
because the nests provide natural groupings of
multiple nodes you can work with in a variety of ways. I use a
similar arrangement to process a queue using
two CPUs, working from the same end(!). This dual dequeue operation
works like this:

(1) Each node nest has its own atomic counter as the sole means of
sychronization within the nest.
(2) Each nest has a head and tail pointer--one for each thread to
access the nest (non-atomic).
(3) Each thread processes the nest by decrementing the counter and
examining the result.
(4) If >0 then advance your pointer (head or tail) and pick up the
node there and process it.
(5) If ==0 then you are the first thread to finish the nest, go
prepare and start the next nest.
(6) if <0 then you are the 2nd thread to finish the nest, go catch
up with the next nest in progress.

This method naturally balances the work between CPUs so that they
always finish within one node of each other.
The knowledge of who finished 1st and 2nd permits different roles to
be played, at the transistion between nests.
I have tested this method on an artificial workload of mandelbrot
regions (for both load and variability) and achieved a
1.9x speed up of two processors (Opteron170), as compared against a
single-threaded computation of the same
thing. This is the highest ratio for two processors that I have been
able to attain by any method.

I would be most interested to hear comments about this method. Nested-
list methods have many interesting
characteristics and possibilities. I would concur that complexity is
a down side to nested-lists, but it appears to
give a lot in return.

Chris Cochran

Dmitriy V'jukov

unread,

Feb 12, 2009, 4:17:51 PM2/12/09

to

Recently I've implemented very similar algorithm. And it indeed
slaughters most implementations under high load. I've tested it versus
tbb::concurrent_queue on quad-core (the queue is full-fledged mpmc
queue), and the performance difference was around 20-30x.
But my algorithm differs a bit. First of all, it's mpmc, so I am
unable to determine "who is the last thread to finish the nest", so I
was using some form of Safe Memory Reclamation for nests. Then, nest
is cache-aware, i.e. nest is divided into several regions, and each
region represents a cache-line. Then, producer index is embed into
every region (i.e. several indexes per nest), and consumer index is
moved out of the nest (collocated with pointer to the nest). Such
measures allows for very cache-friendly behaviour - i.e. every
operation modifies no more than 1 cache-line (amortized).
In your design, if nest is bigger than single cache-line, producer has
to modify cache-line with index and cache-line with data item - this
can degrade performance several times.
I will post implementation in near future. If you have established
test-suite, it will be interesting to make a shootout (probably my
queue will be slower, who knows :) ).

--
Best regards,
Dmitriy V'jukov

Codewarp

unread,

Feb 14, 2009, 4:37:03 PM2/14/09

to

This is really validating news to hear. I am interested in building a
queue that is multi-producer, multi-consumer at one end (i.e. like a
stack) and single-consumer at the other end. This queue would be
employed by threads that advertise "help wanted", for a work-stealing
scheme. Threads that require fifo ordering on their tasks would not
"ask for help", so they can use the simpler and faster MPSC queue. A
nested-list approach, with cache-line awareness as you describe, looks
like the right direction to go in, for such a queue.

I have run into a bit of a snag in testing the performance of these
queues. Essentially, I get the same performance with 4 CPUs, as with
3 CPUs, as with 2 CPUs. In other words, my test is saturating the
system with cache-line loads, 4 cache-lines per transaction, to be
exact. Likely that at least one of these CL-loads is due to counters
and such, so some refinement on this is still necessary. I can
process 4.5 million transactions per second, on a system that can
access 18 million cache-lines per second. This effectively hides any
basis for comparison between two different methods that both access
the same number of cache-lines per transaction. Perhaps I could vary
the work performed by each transaction, to see how much overall effect
it has on total bandwidth. I suspect that many more CPU cycles are
available concurrently with these CL loads, than my testing reveals.

BTW I am getting 5% more out of the reversing stack method, than with
your lock-free mpmc, but I believe that is purely an artifact of this
particular test. Your queue simply does not blow away the L1, while
the reversing stack does, and that means much more to me than this
particular result.

I am not sure that a "shoot-out" would be very illuminating with cache-
line loads dominating the numbers like this. However, a function that
may be even more important to overall performance, multi-threaded heap
management, might be more revealing and productive to persue as a
shoot-out. So I think I might start a new topic, with a description
of my MT heap management (not GC), and see where it goes...

Chris Cochran

Chris M. Thomasson

unread,

Feb 14, 2009, 8:49:25 PM2/14/09

to

"Codewarp" <ch...@megabasic.com> wrote in message
news:ab49ea61-c019-4558...@p23g2000prp.googlegroups.com...

> This is really validating news to hear. I am interested in building a
> queue that is multi-producer, multi-consumer at one end (i.e. like a
> stack) and single-consumer at the other end. This queue would be
> employed by threads that advertise "help wanted", for a work-stealing
> scheme.

What do you mean by "help wanted"? It kinds of sounds like more of a
work-requesting scheme than work-stealing; humm:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/6019e377e4aa73ec
(a crude sketch of a possible work-requesting scheme)

For those who don't know, in work-stealing, when a CPU has nothing to do, it
tries to steal something from other CPUS running on the same node. In other
words, the CPU with nothing to do does all the actual work. On the other
hand, work-requesting involves a CPU making an explicit request for more
work to a busy CPU(s) on the same node, and then waits on that result. Busy
CPU notices the request and hands off some of its own work load off to the
idle CPU. In other words, the busy CPU does all the work.

> Threads that require fifo ordering on their tasks would not
> "ask for help", so they can use the simpler and faster MPSC queue. A
> nested-list approach, with cache-line awareness as you describe, looks
> like the right direction to go in, for such a queue.

You can look here for a discussion/dissection, of a fairly interesting
work-stealing deque:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/8ad297f61b369a41

Also, Clik++ has open sourced their project:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/ff6bf28d6663ce7
(full-blown work-stealing scheme)

Dmitriy V'jukov has an example of a work-stealing deque includes as an
example within his excellent Relacy product. His uses a lock for the
stealing; which is perfectly fine AFAICT.

> I am not sure that a "shoot-out" would be very illuminating with cache-
> line loads dominating the numbers like this. However, a function that
> may be even more important to overall performance, multi-threaded heap
> management, might be more revealing and productive to persue as a
> shoot-out. So I think I might start a new topic, with a description
> of my MT heap management (not GC), and see where it goes...

Are you implementing PDR? Or, are you implementing a memory allocator? Or a
combination of both? Sorry for all the questions, I am just interested in
what you came up with. I have implemented all kinds of PDR algorihtms and
memory allocators and I might be able to give you some constructive ideas. I
suggest you look up some of the discussions between Dmitriy V'jukov and
myself, and also don't forget to lookup posts by Joe Seigh; here is his
project:

http://atomic-ptr-plus.sourceforge.net

Here is my older AppCore project, which is still not complete!:

http://webpages.charter.net/appcore

which has been referenced by Intel, Sun and a couple of Professors in their
lectures.

I am working on another project called vZOOM now. It does indeed contain a
patented algorihtm, but I am really thinking hard about granting free
licenses for non-commercial uses only; just to give my hard work more
exposure. The experimental pre-pre-pre-alpha version, lol, was a finalist in
the CoolThreads contest:

https://coolthreads.dev.java.net

Pretty cool... Well, at least I got a free T2000 out of it.

;^)

Chris Cochran

unread,

Feb 14, 2009, 10:53:14 PM2/14/09

to

Chris,

Thank you for your inquistive curiosity. I suspected that referring
to "help wanted" might imply work-pushing, rather than work-pulling.
But you can't pull work from a thread that requires fifo order. So it
seems reasonable for each client thread to be able to establish its
own ordering requirements, without the producer threads necessarily
having to know that information.

Am I using PDR? I don't know, am I? For some strange reason, I am
having a mental block on exactly what that acronym is for (Physician's
Desk Reference maybe?). Could you please illuminate what PDR and PDR
pointer actually are? As a matter of fact, I have been enjoying your
conversations with Dmitriy on these topics, and look forward to more.

What I have built is an all-purpose multi-threaded heap management
service: wait-free, non-atomic, per-thread local allocator, free
blocks from any other allocator by any thread, and always provides
ideal alignments. It operates at three scales: a multi-threaded
managed large virtual memory block allocator and recycler, a power of
2 allocator within virtual memory blocks, and a suballocator for all
block sizes below 4k and zero fragmentation. It maintains a separate
free list for each block size. It's performance is usually around 4x
the standard new/delete, often much better than that. It is fully
self contained and makes no system calls except for the large virtual
memory block management.

This service will successfully delete blocks allocated by other
threads, even after the originating thread has terminated. This
capability is not available from the standard new/delete, nor from any
other heap implementation I know of--they all just crash, mine does
not.

I have been writing allocators since the 8-bit days, so at some point,
we just may have ourselves a shoot-out...

Regards,

Chris Cochran

Chris M. Thomasson

unread,

Feb 15, 2009, 12:53:15 AM2/15/09

to

"Chris Cochran" <ch...@megabasic.com> wrote in message
news:7059d875-49be-40cf...@x6g2000pre.googlegroups.com...

> Chris,
>
> Thank you for your inquistive curiosity. I suspected that referring
> to "help wanted" might imply work-pushing, rather than work-pulling.
> But you can't pull work from a thread that requires fifo order. So it
> seems reasonable for each client thread to be able to establish its
> own ordering requirements, without the producer threads necessarily
> having to know that information.

Okay. Humm, need to think some more on this.

> Am I using PDR? I don't know, am I? For some strange reason, I am
> having a mental block on exactly what that acronym is for (Physician's
> Desk Reference maybe?). Could you please illuminate what PDR and PDR
> pointer actually are? As a matter of fact, I have been enjoying your
> conversations with Dmitriy on these topics, and look forward to more.

Yikes! Sorry about that. Anyway, PDR is a term coined by Joe Seigh in the
following post:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/82376b7295e6ce1a

It stands for `Partial Copy-On-Write Deferred Reclamation'. Algorithms such
as RCU/vZOOM/ROP/SMR/SMR+RCU, ect... can be categorized as PDR algorithms.
PDR algorithms can be used to manage memory in all sorts of non-blocking
algorihtms.

> What I have built is an all-purpose multi-threaded heap management
> service: wait-free, non-atomic, per-thread local allocator, free
> blocks from any other allocator by any thread, and always provides
> ideal alignments.

You don't use a common maximum alignment for each allocation? I have did
something like that here:

http://groups.google.com/group/comp.lang.c++/browse_frm/thread/f62bd4db8de1aad2

One question related to per-thread allocator instance based designs. How do
you handle the common memory leak that is basically inherent with these
types of general purpose allocators? In case your not familiar with it, goes
something like this:

thread_a allocates shi%load of objects and passes them to thread_b.
thread_a does not make any more allocations for the rest of its life.
thread_a only finishes when the application finishes.
thread_b frees objects which in turn passes them back to thread_a.
BANG! all objects are leaked.

I know of some methods that attempt to get around this, but there not all
that pretty. How do you do it? There is a post on the Intel forum by Dimitry
which raises the issue. I can't seem to find it right now. I will post a
link when I do. This memory leak effects existing allocators like the one in
TBB.

> It operates at three scales: a multi-threaded
> managed large virtual memory block allocator and recycler, a power of
> 2 allocator within virtual memory blocks, and a suballocator for all
> block sizes below 4k and zero fragmentation. It maintains a separate
> free list for each block size. It's performance is usually around 4x
> the standard new/delete, often much better than that. It is fully
> self contained and makes no system calls except for the large virtual
> memory block management.

How do you maintain wait-free properties on remote deallocations? I have
experimented with a couple of different methods, but I don't quite see their
advantages over using atomic CAS in remote free case. Well, I do see the
advantage of caching blocks in remote threads, which my vz stuff does, but
that's not really relevant to the method used to pass blocks back to their
owner thread. Perhaps you have a technique I have not seen yet; I am
definitely interested.

Here is a post which might be of interest to you:

http://software.intel.com/en-us/forums/threading-on-intel-parallel-architectures/topic/61145
(read all)

The algorithm in the initial post refers to a generic algorihtm I have in
vZOOM that can transform an existing single-threaded allocator into a
multi-threaded one. The caveats are that an allocator instance must be able
to be dynamically created and destroyed, and that it uses no static
variables. The Windows Heap API with `HEAP_NO_SERIALIZE' is an example of
such a allocator and I give example on how to transform it into a
multi-threaded one.

> This service will successfully delete blocks allocated by other
> threads, even after the originating thread has terminated.

You mean forcefully terminated? Like `TerminateThread()'? Or do you mean
joined? Please clarify.

> This
> capability is not available from the standard new/delete, nor from any
> other heap implementation I know of--they all just crash, mine does
> not.

Humm... Please correct me if I am totally misunderstanding you... But, are
you saying that the following program will crash if the allocator is
standard `malloc()/free()' or `new/delete':
______________________________________________________________
#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>

void* thread_a(void* state) {
void* mem = malloc(123);
printf("allocated %p in thread_a\n", mem);
return mem;
}

int main(void) {
void* mem;
pthread_t tid;
pthread_create(&tid, NULL, thread_a, NULL);
pthread_join(tid, &mem);
puts("thread_a has been joined with");
free(mem);
printf("freed %p from main thread\n", mem);
fflush(stdout);
return 0;
}
______________________________________________________________

AFAICT, every multi-threaded allocator on a _hosted_ C platform I have ever
used allows memory allocated in thread a to be freed in thread b even if
thread a has already stopped running (e.g., has been joined) before thread b
got a chance to actually free it.

> I have been writing allocators since the 8-bit days, so at some point,
> we just may have ourselves a shoot-out...

Have you benchmarked against some of the existing state-of-the-art
allocators like StreamFlow, or TBB?

http://people.cs.vt.edu/~scschnei/streamflow

http://www.threadingbuildingblocks.org

What about Hoard? They all use distributed per-thread allocation techniques.
All but Hoard uses non-blocking algorihtm for remote free and local
reclamation; last time I checked, Hoard uses per-thread lock.

I have been privately building and experimenting with such distributed
allocators for a long time. I made vague pseudo-code post in 2005 here in
order to release the general idea into public domain:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/8245f4b48591fc69

which formed the basis of my closed source general purpose vZOOM allocator.
The original algorithm has many short comings. One, it was not general
purpose, and it did an unnecessary iteration of reclaimed memory for
reference counting purposes, ect... Oh well, this was old algorihtm. BTW, if
you read the thread I linked to on the Intel forum, you will come across a
post that shows how I fix one of the problems. Dimitry was right on the
money!

Here is a very crude and quick distributed region allocator I did as an
example and posted here:

http://webpages.charter.net/appcore/vzoom/malloc/sergey_vzmem_thread.html
(only compiled with MSVC 8 or higher...)

BTW, this is _not_ the algorihtm I use in commercial vZOOM. Also, the reason
why the name Sergey is in the link name is because I benchmarked the
allocator against his in the following post:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/4038a7a8a4f5a7cb

He creates an allocator per-thread but does not allow memory in one thread
to be freed in another. I wanted to show him how to create truly
multi-threaded allocator. It has some fairly major caveats, but they were
not really relevant to the discussion.

I am out of time and cannot elaborate anymore right now; sorry about that. I
needed more time to make a proper response.

Chris M. Thomasson

unread,

Feb 15, 2009, 1:05:44 AM2/15/09

to

"Chris Cochran" <ch...@megabasic.com> wrote in message
news:7059d875-49be-40cf...@x6g2000pre.googlegroups.com...
> Chris,
>
> [...]

>
> What I have built is an all-purpose multi-threaded heap management
> service: wait-free, non-atomic, per-thread local allocator, free
> blocks from any other allocator by any thread, and always provides
> ideal alignments.

AFAICT, in order to provide ideal alignment, one would need to know the type
of the object being allocated. Are you doing this in C++?

> [...]

Chris Cochran

unread,

Feb 15, 2009, 4:26:01 AM2/15/09

to

Chris,

You are lucky this is the weekend, I have to get back to client work
soon and may not be able to keep up with your pace in this
discussion. Here's a few answers...

Ideal alignment: Not quite C++ object alignment (sorry, I didn't
define ideal). I don't use a common maximum alignment for each
allocation. It determines alignment from the requested allocation
size, using the largest power of 2 factor in the size, subject to some
limits. Ask for 11, gets 16; ask for 300, it gives 320, ask for 999,
gets 1024, and so on. I also do that to reduce the actual number of
different allocation sizes down to a dull roar, for a little smaller
working set. Since you only need C++ object alignment for arrays of
such objects, and no one expects to access successive allocations as
computed array elements, what exactly do you need C++ object alignment
for, anyway?

Freeing memory: If a block does not belong to the current allocator,
I look up its address in a virtual memory map (one ptr per megabyte x
4k), to find its home heap. I then CAS it to its remote free stack
(no ABA required), where it stays until the owner thread notices the
nonempty stack, pulls the list with an XCHG with zero, and properly
frees all its free nodes, lock-free, wait-free.

new/delete/malloc/free: I had continual problems with rare crashes
when deleting cross-threaded blocks allocated the normal way--always a
non-trivial situation with many allocations. Each time, I solved the
problem by replacing the allocation method for the offending object
type with my allocator (and nothing else), and each time the problem
was solved. It occurred enough times for me to get the message. I
didn't try to diagnose the standard allocator any further, afterall,
why bother--it is far slower than mine and I don't need it anymore.
It is self-sufficient enough to work in any x86 virtual operating
system.

thread_a allocates shi%load of objects and passes them to thread_b.
thread_a does not make any more allocations for the rest of its life.
thread_a only finishes when the application finishes.
thread_b frees objects which in turn passes them back to thread_a.
BANG! all objects are leaked.

I have not seen this scenerio before, but it's a problem only because
the allocation never taken, is the only way to release the remote
freelist. Possible solutions include:

- Post a periodic event to each thread holding a remote freelist,
to process the list
- Build remote freelist clean-up into calls to WaitForObject( ),
Sleep( ), OpenFile( ), and other unavoidable operations.
- Prominently document the weakness. Never allocating anything
again is a pretty pathological case--and easy to avoid.
- Force C++ to use my allocator for all uses, making it even less
likely to be able to invoke this situation.

Note that checking the remote freelist is a non-atomic, wait-free
test.

wait-free properties on remote deallocations: I use CAS to push free
nodes onto a stack. Since remote deallocation is rare compared to
local deallocation, contention is not normally an issue, making lock-
free "nearly wait-free".

Generic multi-threaded allocator from single-threaded: I was aware my
method could be used in this manner, back when I dreamed it up. But
you get more speed out of concepts designed to work together, than by
tacking together a pile of separate pieces. Heap management is one of
several very special application services that have a profound effect
upon overall application performance, justifying every reasonable
effort and small advantage to improve it, and then some.

PDR: Thanks for clearing that up. The only garbage collector I have
ever liked, is the real one that shows up at 6:00 AM to hauI it all
away. I have never liked GC, it's too slow and complicated, touches
too much memory, doesn't respond well to widely dynamically changing
requirements, difficult to compare different strategies, and a lot of
wantabees like to talk about GC to sound like know-it-alls (I am sure
you know the type). It does have some advantages in assisting ABA
avoidance, i.e. to help you avoid thinking too hard about a real
solution, the IT version of wishful thinking. I am probably just too
harsh, but if that is its only real draw, it's not enough to hold my
interest.

That's enough for me right now. I'll likely have more tomorrow...

Regards,

Chris Cochran

unread,

Feb 15, 2009, 4:41:19 AM2/15/09

to

Chris,

One more thing... TerminateThread( ) is a dirty word, only needed for
applications that haven't thought out how to terminate properly. I
don't think I have any references to it. It is often a sign of a bad
design. I should have said "application exit"...

Regards,

Chris Cochran

Chris M. Thomasson

unread,

Feb 15, 2009, 9:05:02 AM2/15/09

to

"Chris M. Thomasson" <n...@spam.invalid> wrote in message

news:kgOll.1759$xK6....@newsfe12.iad...

Hey why not use some expert advise here:

http://www.ddj.com/cpp/211601363?pgno=4

Is that a load of crap? ;^(...

Why do you think the experts fail to consult us poor weaklings here on
Usenet before they post prior established art as if its something new?

RESEARCH! ARGH!

Chris M. Thomasson

unread,

Feb 15, 2009, 9:11:19 AM2/15/09

to

"Chris Cochran" <ch...@megabasic.com> wrote in message

news:909c456f-b397-4985...@z28g2000prd.googlegroups.com...
> Chris,
> [...]

> wait-free properties on remote deallocations: I use CAS to push free
> nodes onto a stack. Since remote deallocation is rare compared to
> local deallocation, contention is not normally an issue, making lock-
> free "nearly wait-free".

Rare? What about producer/consumer pattern???

Chris M. Thomasson

unread,

Feb 15, 2009, 9:12:54 AM2/15/09

to

"Chris Cochran" <ch...@megabasic.com> wrote in message

news:1dc714c9-6abb-40d9...@n33g2000pri.googlegroups.com...

I only referenced the nasty term it because you explicitly wrote:

"This service will successfully delete blocks allocated by other
threads, even after the originating thread has terminated. This
capability is not available from the standard new/delete, nor from any
other heap implementation I know of--they all just crash, mine does
not."

I thought you were referencing a method to simply destroy a thread in the
middle of doing something useful.

If not, and ___please___ CORRECT me if I am misunderstanding you, but it
sure sound like your claiming that you never used a general purpose memory
allocator that could allocate in thread_a, and free in thread_b after
thread_a finished and returned to the OS for whatever reason (e.g.,
terminated).

I need your clarification and correction.

Chris M. Thomasson

unread,

Feb 15, 2009, 9:14:22 AM2/15/09

to

"Chris M. Thomasson" <n...@spam.invalid> wrote in message

news:ynVll.1154$fM1...@newsfe14.iad...

Nearly wait-free is _not_ wait-free in any way, shape or form. Do you run
your algorithms on a medical device? Every darn function needs to have a
wait-free property!

Chris M. Thomasson

unread,

Feb 15, 2009, 9:15:48 AM2/15/09

to

"Chris M. Thomasson" <n...@spam.invalid> wrote in message

news:oqVll.1158$fM1....@newsfe14.iad...

If this property is broken, well patient is dead or hospital backup
generator is malfunctioning and local device backup is dead...

Chris Cochran

unread,

Feb 15, 2009, 3:23:15 PM2/15/09

to

OK, Ok, ok... I have a wait-free, non-atomic local memory allocate/
dealloc, lock-free remote memory dealloc among producers, wait-free
for the consumer, with all threads serviced through lock-free queues.
Threads block when they have nothing else to do. Instrumentation in
my framework includes the count of all CAS fail/retry cycles, to see
in real time the actual (not imagined) amount of non-blocking wait
cycles that arise in any application runtime context. Other
statistics shown by thread include: latency, wake-ups, events
processed, queue depth, and others.

I would be happy to trust my life to a medical device that has been
properly designed from stochastic processes--we do it all the time.
It is a human frailty to distrust probabilistic systems that have
acceptable risk, while simultaneously accepting all sorts of high-
probability risks arising everywhere you look. That said, your right,
you're right mathematically, you're right technically, and I'm going
to have to watch my French around here (I hope you're not French...).
We do need to be precise about these things, if for no other reason
than to protect their definitions. I stand corrected.

If not, and ___please___ CORRECT me if I am misunderstanding you,

but it sure sounds like your claiming that you never used a general

purpose memory allocator that could allocate in thread_a, and free in
thread_b after thread_a finished and returned to the OS for whatever
reason (e.g., terminated).

Now you are overreaching Chris, I never said anything of the sort. I
have not tried this with very many allocators, I am not generalizing,
I didn't distilled it down to a primitive example, and my comments are
specifically about the standard new/delete provided in Microsoft
Windows, from NT4 to Win2K, and maybe WinXP (i.e. in msvcrxx.dlls).
Between its bad performance and the really obvious inability to ALWAYS
succeed at cross-threaded memory releases, my patience wore thin.
When you don't have to rely on someone else's code for such things,
you can only catch it in the act so many times before confidence
disappears and you move on. OBVIOUSLY others have figured out how to
perform cross-thread deallocation correctly too, and do, but when my
memory allocator blows the msvcrt away--why bother with it, it's a no
brainer. If I learned that Microsoft fixed this problem, I would not
go back unless I had to, and they probably have fixed it. I avoid,
because I can.

Dr.Dobbs, October 29, 2008
Writing a Generalized Concurrent Queue

What to you mean?? This looks like a fine article, for 1998, they
just got the date wrong. ;-)

And now that we have thoroughly and completely taken over Jon Harrop's
thread for this, there you have it.

thank you much,

Chris Cochran

Message has been deleted

Chris Cochran

unread,

Feb 15, 2009, 7:20:24 PM2/15/09

to

On Feb 14, 9:53 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> One question related to per-thread allocator instance based designs. How do
> you handle the common memory leak that is basically inherent with these
> types of general purpose allocators? In case your not familiar with it, goes
> something like this:
>
> thread_a allocates shi%load of objects and passes them to thread_b.
> thread_a does not make any more allocations for the rest of its life.
> thread_a only finishes when the application finishes.
> thread_b frees objects which in turn passes them back to thread_a.
> BANG! all objects are leaked.

Chris,

Because of this remote memory leak problem, I have added the
reasonable measure to my framework, to finalize( ) deallocations as
needed ahead of any entry of its thread into a blocking state, no
matter how brief. Any thread creating objects and passing them to
other threads, is very likely to also be processing events, blocking
and then calling finalize( ). It is now much more difficult for a
reasonable scenerio to have the conditions for causing this leak. My
memory heap finalize ( ) call is non-atomic when no action is
required, and always wait free for the consuming thread.

Thank you for the heads up on that...

Chris Cochran

Chris M. Thomasson

unread,

Feb 16, 2009, 3:03:46 PM2/16/09

to

"Chris Cochran" <ch...@megabasic.com> wrote in message

news:0ac30931-8681-4d75...@b40g2000pri.googlegroups.com...

> OK, Ok, ok... I have a wait-free, non-atomic local memory allocate/
> dealloc, lock-free remote memory dealloc among producers, wait-free
> for the consumer, with all threads serviced through lock-free queues.
> Threads block when they have nothing else to do. Instrumentation in
> my framework includes the count of all CAS fail/retry cycles, to see
> in real time the actual (not imagined) amount of non-blocking wait
> cycles that arise in any application runtime context. Other
> statistics shown by thread include: latency, wake-ups, events
> processed, queue depth, and others.

Sounds good to me.

> I would be happy to trust my life to a medical device that has been
> properly designed from stochastic processes--we do it all the time.
> It is a human frailty to distrust probabilistic systems that have
> acceptable risk, while simultaneously accepting all sorts of high-
> probability risks arising everywhere you look. That said, your right,
> you're right mathematically, you're right technically, and I'm going
> to have to watch my French around here (I hope you're not French...).
> We do need to be precise about these things, if for no other reason
> than to protect their definitions. I stand corrected.

Sorry for coming across as hostile.

>> "Chris M. Thomasson wrote:"
>> If not, and ___please___ CORRECT me if I am misunderstanding you,
>> but it sure sounds like your claiming that you never used a general
>> purpose memory allocator that could allocate in thread_a, and free in
>> thread_b after thread_a finished and returned to the OS for whatever
>> reason (e.g., terminated).
>
> Now you are overreaching Chris, I never said anything of the sort. I
> have not tried this with very many allocators, I am not generalizing,
> I didn't distilled it down to a primitive example, and my comments are
> specifically about the standard new/delete provided in Microsoft
> Windows, from NT4 to Win2K, and maybe WinXP (i.e. in msvcrxx.dlls).

Well, it reads as if you wrote something to that effect; how am I
misunderstanding the following quote:

> "Chris Cochran wrote:"
> "This service will successfully delete blocks allocated by other
> threads, even after the originating thread has terminated. This
> capability is not available from the standard new/delete, nor from any
> other heap implementation I know of--they all just crash, mine does
> not."

> Between its bad performance and the really obvious inability to ALWAYS
> succeed at cross-threaded memory releases, my patience wore thin.

I have never seen the standard memory allocator (e.g., `malloc()/free()'
`new/delete') in NT4 or Win2k crash on cross-thread memory frees. Can you
show me some example code? I simple cannot reproduce the behavior your
describing...

> When you don't have to rely on someone else's code for such things,
> you can only catch it in the act so many times before confidence
> disappears and you move on. OBVIOUSLY others have figured out how to
> perform cross-thread deallocation correctly too, and do, but when my
> memory allocator blows the msvcrt away--why bother with it, it's a no
> brainer. If I learned that Microsoft fixed this problem, I would not
> go back unless I had to, and they probably have fixed it. I avoid,
> because I can.

AFAICT, the standard allocator on NT4 and Win2k have no problem with
allocating memory in thread_a and freeing it in thread_b after thread_a has
finished its execution. Can you create a standalone example program that all
of us can test?

> Dr.Dobbs, October 29, 2008
> Writing a Generalized Concurrent Queue
>
> What to you mean?? This looks like a fine article, for 1998, they
> just got the date wrong. ;-)
>
> And now that we have thoroughly and completely taken over Jon Harrop's
> thread for this, there you have it.

AFAICT, Herb seems to be using nasty spinlocks. I would just use OS provided
mutexs for head and tail. Code like the following code, taken from the
article:

bool Consume( T& result ) {
while( consumerLock.exchange(true) )
{ } // acquire exclusivity

makes be cringe. There is not even a backoff mechanism. No PAUSE
instruction, nothing. Anyway, he seems to be writing that the reason he is
using spinlocks is because he can put them on there own cache-lines. You can
most certainly do that with `CRITICAL_SECTION's and a most implementations
of `pthread_mutex_t' as well. Also, he fails to properly align
data-structures on L2 boundaries. This is very important:

http://groups.google.com/group/comp.programming.threads/msg/8036dc8a380718ad

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/7030f8ac1610a709

I would setup the two-lock queue data-structure on windows as follows.

________________________________________________________________
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <stdlib.h>
#include <stddef.h>
#include <assert.h>

#define ALIGN(mp_this, mp_type, mp_align) ((mp_type)( \
(((ptrdiff_t)(mp_this)) + (mp_align) - 1) \
& (-(ptrdiff_t)(mp_align)) \
))

#define L2_CACHE_LINE 128

union CRITICAL_SECTION_PAD {
CRITICAL_SECTION self;
char l2pad[L2_CACHE_LINE];
};

struct queue_node {
struct queue_node* next;
void* payload;
void* base_mem;
};

union queue_node_pad {
struct queue_node self;
char l2pad[L2_CACHE_LINE];
};

union queue_node_ptr_pad {
struct queue_node* ptr;
char l2pad[L2_CACHE_LINE];
};

struct queue {
union queue_node_ptr_pad head;
union CRITICAL_SECTION_PAD head_lock;
union queue_node_ptr_pad tail;
union CRITICAL_SECTION_PAD tail_lock;
};

typedef char static_assert[
sizeof(struct queue) == L2_CACHE_LINE * 4
];

struct queue_node*
queue_node_alloc(void* payload) {
void* base_mem =
malloc(sizeof(union queue_node_ptr_pad) + L2_CACHE_LINE - 1);
if (base_mem) {
struct queue_node* self =
ALIGN(base_mem, struct queue_node*, L2_CACHE_LINE);
self->payload = payload;
self->base_mem = base_mem;
self->next = NULL;
return self;
}
return NULL;
}

void
queue_node_free(struct queue_node* self) {
free(self->base_mem);
}

static unsigned char g_queue_raw_buf[
sizeof(struct queue) + L2_CACHE_LINE - 1
] = { '\0' };

static struct queue* g_queue = NULL;

int main(void) {
g_queue = ALIGN(g_queue_raw_buf, struct queue*, L2_CACHE_LINE);

InitializeCriticalSection(&g_queue->head_lock.self);
InitializeCriticalSection(&g_queue->tail_lock.self);

g_queue->head.ptr = g_queue->tail.ptr = queue_node_alloc(NULL);

if (g_queue->head.ptr) {
/* [whatever] */

assert(g_queue->head.ptr == g_queue->tail.ptr);

queue_node_free(g_queue->head.ptr);
}

DeleteCriticalSection(&g_queue->tail_lock.self);
DeleteCriticalSection(&g_queue->head_lock.self);

return 0;
}

________________________________________________________________

It should compile fine. Everything is padded to L2 and everything is aligned
on L2 boundary. No false-sharing.

What do you think?

Chris M. Thomasson

unread,

Feb 16, 2009, 3:05:52 PM2/16/09

to

"Chris Cochran" <ch...@megabasic.com> wrote in message

news:5dbe14cd-a20d-4339...@b40g2000pri.googlegroups.com...
> Chris,
>
> I sent you a whole big response, got the "Post successful", but it
> appears to have never gotten there, and I don't have another copy...

Damn; I hate what that non-sense happens!

> I'll have to come back another time and dream it all up all over
> again...

I am interested, so when you get some free time can you please retype it and
post it again? That sucks that the post got lost.

;^(...

Chris M. Thomasson

unread,

Feb 16, 2009, 3:14:59 PM2/16/09

to

"Chris Cochran" <ch...@megabasic.com> wrote in message

news:b241a0c5-751d-4d67...@u1g2000pre.googlegroups.com...

> Chris,

You could add some documentation for threads that fit within the pattern of
`thread_a' in the above sequence I laid out. Something simple like:

Threads which fit into [this category], should periodically invoke
`finalize()' in order to prevent memory leaks.

Humm, that should solve the problem. I think Dimitry suggested the same
solution over on Intel TBB forum.

> Thank you for the heads up on that...

No problem.

Chris M. Thomasson

unread,

Feb 16, 2009, 3:25:17 PM2/16/09

to

"Chris M. Thomasson" <n...@spam.invalid> wrote in message

news:oqVll.1158$fM1....@newsfe14.iad...

> "Chris M. Thomasson" <n...@spam.invalid> wrote in message
> news:ynVll.1154$fM1...@newsfe14.iad...
>> "Chris Cochran" <ch...@megabasic.com> wrote in message
>> news:909c456f-b397-4985...@z28g2000prd.googlegroups.com...
>>> Chris,
>>> [...]
>>> wait-free properties on remote deallocations: I use CAS to push free
>>> nodes onto a stack. Since remote deallocation is rare compared to
>>> local deallocation, contention is not normally an issue, making lock-
>>> free "nearly wait-free".
>>
>> Rare? What about producer/consumer pattern???
>

> Nearly wait-free is _not_ wait-free in any way, shape or form. [...]

<off-topic rant>

Have you ported your allocator over to the SPARC? One thing that bugs me is
that once you do that, your code is no longer wait-free. The SPARC forces
you to use CAS loops for everything. They even depreciated the SWAP
function! What a load of crap that is:

http://groups.google.com/group/comp.arch/browse_frm/thread/cde6fa5ab9d22067
(follow contained links as well...)

ARGH!

</off-topic rant>

Chris Cochran

unread,

Feb 16, 2009, 7:39:20 PM2/16/09

to

On Feb 16, 12:25 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> "Chris M. Thomasson" <n...@spam.invalid> wrote in messagenews:oqVll.1158$fM1....@newsfe14.iad...

>
> > "Chris M. Thomasson" <n...@spam.invalid> wrote in message
> >news:ynVll.1154$fM1...@newsfe14.iad...
> >> "Chris Cochran" <ch...@megabasic.com> wrote in message
> >>news:909c456f-b397-4985...@z28g2000prd.googlegroups.com...
>

> Have you ported your allocator over to the SPARC? One thing that bugs me is
> that once you do that, your code is no longer wait-free. The SPARC forces
> you to use CAS loops for everything. They even depreciated the SWAP
> function! What a load of crap that is:
>

> http://groups.google.com/group/comp.arch/browse_frm/thread/cde6fa5ab9...

> (follow contained links as well...)
>
> ARGH!

So, have you ported to machine XYZ yet? Let me tell you how it
sucks... (that's quite a sales pitch.. ;-)
It seems you will have to put up with merely lock-free afterall--you
better not need health services anytime soon...

Seriously, it sounds like all my code needs for SPARC is:

(1) Implement my safeswap( ) with versioned CAS (ugly, but a
version counter will fit in).
(2) Replace the Windows virtual memory management calls
(3) Use a C++ implementation for the internal allocsize<--->index
mapping I currently have in x86 asm

Also, the cache architecture awareness of this allocator is fairly
weak, so it may have even more performance to give, with further
attention there.

Chris M. Thomasson

unread,

Feb 16, 2009, 8:01:13 PM2/16/09

to

"Chris Cochran" <ch...@megabasic.com> wrote in message

news:d202b53b-d35b-46bf...@s1g2000prg.googlegroups.com...

On Feb 16, 12:25 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> > "Chris M. Thomasson" <n...@spam.invalid> wrote in
> > messagenews:oqVll.1158$fM1....@newsfe14.iad...
> >
> > > "Chris M. Thomasson" <n...@spam.invalid> wrote in message
> > >news:ynVll.1154$fM1...@newsfe14.iad...
> > >> "Chris Cochran" <ch...@megabasic.com> wrote in message
> > >>news:909c456f-b397-4985...@z28g2000prd.googlegroups.com...
> >
> > Have you ported your allocator over to the SPARC? One thing that bugs me
> > is
> > that once you do that, your code is no longer wait-free. The SPARC
> > forces
> > you to use CAS loops for everything. They even depreciated the SWAP
> > function! What a load of crap that is:
> >
> > http://groups.google.com/group/comp.arch/browse_frm/thread/cde6fa5ab9...
> > (follow contained links as well...)
> >
> > ARGH!

> So, have you ported to machine XYZ yet? Let me tell you how it
> sucks... (that's quite a sales pitch.. ;-)

lol. :^D

> It seems you will have to put up with merely lock-free afterall--you
> better not need health services anytime soon...

Yikes! ;^D

> Seriously, it sounds like all my code needs for SPARC is:

> (1) Implement my safeswap( ) with versioned CAS (ugly, but a
> version counter will fit in).

32-bit SPARC has the `CASX' instruction, which is double width. However, on
64-bit CASX is only 64-bit wide. This does not make any sense to me. Oh
well. You would need to use an alignment or offset trick in order to cram a
version counter in there.

You use double width compare-and-swap (e.g., DWCAS) in your allocator? I
know that its not really required for the logic in remote deallocations. So
you must be using it in for something else; perhaps in the large block
allocator?

Chris Cochran

unread,

Feb 16, 2009, 9:09:32 PM2/16/09

to

Chris,

I do not use DWCAS in my allocator (no ABA issues). But I do use an
xchg reg,mem, which it sounds like I would have to emulate using a
DWCAS.

Chris M. Thomasson

unread,

Feb 16, 2009, 11:55:38 PM2/16/09

to

"Chris Cochran" <ch...@megabasic.com> wrote in message

news:a976cf41-7ed1-4e3a...@k36g2000pri.googlegroups.com...

you can use `CASX', which is not DWCAS, on SPARC64 to emulate a normal SWAP
that can operate on two pointers. It most certainly sucks, but its what you
have to do if you indeed decide to port to this particular arch; what a
shame. It turns your nice elegant non-blocking wait-free code into lock-free
CAS-loops. ARGH!