
Memory manager for a lock-free queue


Andy Venikov

Dec 8, 2009, 6:10:51 PM
Hi,

There's recently been a proposal to add a set of lock-free algorithms to
the Boost C++ libraries.

http://thread.gmane.org/gmane.comp.lib.boost.devel/196862
http://thread.gmane.org/gmane.comp.lib.boost.devel/197400

In light of that, I wanted to ask all you knowledgeable dwellers of
c.p.t. a few questions.

The algorithm of choice for the FIFO queue was the M. Michael/M. Scott
queue. Do you think it was a good choice?

Also, in order to be portable to as many platforms as possible, we need
to support systems that don't have DWCAS (just a plain CAS) and where
you can't steal bits from the malloc'd pointer to use as an ABA tag.
One way to circumvent this problem is to use pre-allocated arrays of
nodes and use node indexes instead of node pointers. Assuming that the
size of the index is sufficiently smaller than the size of a pointer, we
could use the remaining bits for the ABA tag. On the other hand, this
approach might mean that the queue can't grow in size.
I've been thinking of a way to circumvent that, and here's what I've
come up with and wanted to hear your opinion on:

Instead of having a single array of nodes (a single node will contain a
full user object plus an index to a next node), we should have an array
of arrays. Meaning that there will be a pre-allocated ArrayTable, which
is an array of pointers to arrays of nodes. Only the first NodeArray
will be pre-allocated. If the first NodeArray is exhausted, the "enqueue"
function would allocate another NodeArray, and so on. The index of a node
will have to be split in two parts, where the first part will be an index
into the ArrayTable and the second part an index into a NodeArray. Every
NodeArray is initialized as a free-list so that the same lock-free
algorithm could be used to allocate/free nodes from this free list. To
prevent false sharing, NodeArray initialization will setup "next"
pointers (or rather indices) to point not to a next node in the array,
but to a node that is <cache_size> from the current node with wrapping
at the end of the NodeArray. For example, instead of element 0 pointing
to element 1, element 1->2, 2->3 and so forth, the free list should be
initialized from the NodeArray such that element 0 points to element 0
plus cache size and so forth. The first element of the last cache line
of the NodeArray will point to the second element of the first cache line.
This way we can address as many nodes as can be addressed with index's
bits, but we can grow the amount of nodes dynamically.
There's still a problem with this approach and the problem is that the
amount of preallocated memory is high for small queues. For example, on
32 bit systems, if we chose to use 8 bits for the ABA tag and the rest
for the index, we'll have 24 bits to address nodes. Furthermore, if we
decide to split 24 bits into 10 bits to address nodes in the NodeArrays
(effectively making the size of NodeArray 1024 elements) and 14 bits to
address the pointers to NodeArrays in the ArrayTable, every queue would
have to preallocate: 1) an ArrayTable of <pointer_size> * 16384; 2) at
least one NodeArray with 1024 elements.
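As a concrete sketch of that 8/14/10 layout (the function and constant names here are mine, purely hypothetical), the packing and unpacking is just shifts and masks:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 32-bit node reference layout from the example above:
   [ 8-bit ABA tag | 14-bit ArrayTable index | 10-bit NodeArray index ] */
enum { NODE_BITS = 10, TABLE_BITS = 14, TAG_BITS = 8 };

static uint32_t pack_ref(uint32_t tag, uint32_t table, uint32_t node)
{
    return (tag << (TABLE_BITS + NODE_BITS)) | (table << NODE_BITS) | node;
}

static uint32_t ref_tag(uint32_t r)   { return r >> (TABLE_BITS + NODE_BITS); }
static uint32_t ref_table(uint32_t r) { return (r >> NODE_BITS) & ((1u << TABLE_BITS) - 1u); }
static uint32_t ref_node(uint32_t r)  { return r & ((1u << NODE_BITS) - 1u); }
```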

What if the split between bits that address nodes in the NodeArray and
bits that address NodeArray in the ArrayTable is not static?

We could say that if the highest bit in the index is 0, then we can use
all remaining 23 bits as an index into a NodeArray. If, on the other
hand, the highest two bits are "10", then we'll use remaining 22 bits as
an index into a NodeArray. If the highest 3 bits are "110", then 21 bits
are used and so on until some minimum value for the amount of bits.

For example, if we say that NodeArray will contain at minimum 16
elements, then an index into the first NodeArray will be encoded
"11111111111111111110xxxx", where xxxx is the actual index. Index into
the second NodeArray will be encoded "1111111111111111110xxxxx". The
size of the second NodeArray will be twice as big as that of the first.
The size of the ArrayTable in this case will only be 20 elements
(pointers), with first element representing a pointer to a NodeArray
containing Nodes addressable with "11111111111111111110xxxx" and the
last entry in the ArrayTable representing a pointer to a NodeArray
containing Nodes addressable with "10xxxxxxxxxxxxxxxxxxxxxx".
The only drawback of this solution is that the total number of
addressable nodes will be less. Whereas before we could address 2^24
nodes, with this approach we can only address 2^23 + 2^22 + 2^21 ....
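One way to realize this encoding (a sketch; `encode`/`decode` and the convention of indexing the ArrayTable by the leading-ones count are my own choices, not from the proposal) is to count leading one-bits and treat whatever follows the terminating zero as the offset:

```c
#include <assert.h>
#include <stdint.h>

enum { INDEX_BITS = 24, MIN_BITS = 4 };  /* smallest NodeArray = 16 nodes */

typedef struct { uint32_t table; uint32_t offset; } node_ref;

/* k leading one-bits, one terminating zero, then a (23 - k)-bit offset.
   Table entry k = 19 is the 16-node array, entry k = 0 the 2^23-node one. */
static uint32_t encode(uint32_t k, uint32_t offset)
{
    uint32_t ones = ((1u << k) - 1u) << (INDEX_BITS - k);
    return ones | offset;
}

static node_ref decode(uint32_t idx)
{
    uint32_t k = 0;
    uint32_t max_k = INDEX_BITS - 1 - MIN_BITS;            /* 19 */
    while (k < max_k && (idx & (1u << (INDEX_BITS - 1 - k))))
        k++;                                               /* count leading ones */
    node_ref r = { k, idx & ((1u << (INDEX_BITS - 1 - k)) - 1u) };
    return r;
}
```

The total capacity is a geometric sum, 2^23 + 2^22 + ... + 2^4 = 2^24 - 16, i.e. only 16 nodes short of the static scheme's 2^24.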


I hope I wasn't too messy with my description.
Anyway, I wanted to know if something like this would be a good fit and
also if there are any better solutions. I also hope that this doesn't
infringe on anybody's patent as it seems that the area of lock-free
algorithms is riddled with patents and you can never know if somebody
has already come up with a given idea.


Thanks a lot,
Andy.

Chris M. Thomasson

unread,
Dec 8, 2009, 9:33:54 PM12/8/09
to
"Andy Venikov" <swojch...@gmail.com> wrote in message
news:hfmmdt$odc$1...@news.eternal-september.org...

> Hi,
>
> there's recently been a proposal for adding a lock-free set of algorithms
> to boost's free C++ libraries.
>
> http://thread.gmane.org/gmane.comp.lib.boost.devel/196862
> http://thread.gmane.org/gmane.comp.lib.boost.devel/197400
>
> In light of that, I wanted to ask all you knowledgeable dwellers of c.p.t
> few questions.

I don't have time right now to give a complete answer, so I will make this
very brief.


> The algorithm of choice for the FIFO queue was the M. Michael/M. Scott queue. Do
> you think it was a good choice?

IMHO, that queue does not really perform all that well. If you really want a
good multi-producer/consumer (MPMC) queue I would go with:

http://groups.google.com/group/lock-free/browse_frm/thread/c8e3201da4a6a300

This queue is actually blocking. However, it's a prime example of a blocking
algorithm which can outperform a lock-free algorithm.


http://groups.google.com/group/lock-free/browse_frm/thread/3173c1a68a49f890

http://lock-free.googlegroups.com/web/mpmc_block_queue.zip?gda=UfzyikYAAABqzD8afv_uZdoI8aPWkA5RCgFT0yWczmtUwROFEeSbmxhWmFmbiAW_oTQFowHpDftJ7bz9tedoI9cRbqyUDqyTE-Ea7GxYMt0t6nY0uV5FIQ


BTW, you really need to offer more than a MPMC queue. There are many
combinations to choose from:

http://software.intel.com/en-us/forums/threading-on-intel-parallel-architectures/topic/70339/reply/103496


Also, I believe that you need to read the anchors of the Michael and Scott
queue atomically. I don't think the implementation by Tim Blechmann does
that. You can use the `MOVQ' instruction or a dummy `LOCK CMPXCHG8B' to get
the job done.


> Also, in order to be portable to as many platforms as possible, we need to
> support systems that don't have DWCAS (just a plain CAS) and where you
> can't steal bits from the malloc'd pointer to use as an ABA tag.

You need to be sure to steal enough bits (e.g., 24 or 32) because the
version-counter solution to the ABA problem is basically race-condition
management that is NOT guaranteed to work. The ABA counter can wrap, and
screw things up in certain scenarios:

http://publibz.boulder.ibm.com/epubs/pdf/dz9zr002.pdf
(Appendix A / Multiprogramming and Multiprocessing Examples / Free-Pool
Manipulation)


At the time of this writing, the paper mentions that it would take on the
order of days to overflow a 32-bit counter. Things have changed since
then...

;^)
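To put numbers on that: at a made-up rate of ten million increments per second (modern hardware can plausibly exceed this), the wrap times shrink dramatically with the tag width:

```c
#include <assert.h>

/* Seconds until an n-bit ABA counter wraps at a given increment rate.
   The rate is hypothetical; real rates depend on workload and hardware. */
static double wrap_seconds(unsigned bits, double incs_per_sec)
{
    return (double)(1ULL << bits) / incs_per_sec;
}
```

At that rate a 32-bit counter wraps in roughly seven minutes and a 16-bit one in milliseconds - minutes rather than the days the IBM document assumed.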


> One way to circumvent this problem is to use pre-allocated arrays of nodes
> and use node indexes instead of node pointers. Assuming that the size of
> the index is sufficiently smaller than a size of a pointer, we could use
> the remaining bits for the ABA tag. On the other hand, this approach might
> mean that the queue can't grow in size.
> I've been thinking of a way to circumvent that, and here's what I've come
> up with and wanted to hear your opinion on:
>

[...]

You are not going to be able to deallocate any memory unless you are using a
PDR algorithm, SEH, or catching seg-faults. Anyway, why not just allocate
virtual storage up to 4 GB and use the 32-bit offset as a pointer? Then you
can add this to the base to get the 64-bit pointer? This technique works
fine and has been discussed on this group before.
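The reserve-plus-offset idea can be sketched like this (a POSIX-only illustration using `mmap`; the pool size, names, and the missing node allocator are my simplifications - a Windows version would use `VirtualAlloc`):

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Grab one contiguous anonymous mapping up front; a node is then named
   by its 32-bit byte offset from the base, leaving the other half of a
   64-bit CAS word free for the ABA tag. */
static unsigned char *pool_base;

static int pool_init(size_t bytes)
{
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    pool_base = p;
    return 0;
}

static void *ptr_of(uint32_t off) { return pool_base + off; }

static uint32_t off_of(const void *p)
{
    return (uint32_t)((const unsigned char *)p - pool_base);
}
```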


A Partial Copy-On-Write Deferred Reclamation (PDR) algorithm will allow you
to get around ABA, use normal word-sized atomic instructions, and still be
able to allocate and free memory to/from the OS:


http://groups.google.com/group/comp.programming.threads/browse_frm/thread/82376b7295e6ce1a


There have been several clever proxy garbage collector algorithms posted to
this group. This is an interesting one:

http://groups.google.com/group/relacy/browse_frm/thread/b6b2e35b993705be/a435b979d834e749

http://groups.google.com/group/lock-free/browse_frm/thread/f20a631b3b4c3813/34854cb674932f00

http://groups.google.com/group/lock-free/browse_frm/thread/18a9f43b93a6b3f0/d7f7778b160501f4


Dmitriy V'jukov used my original proxy collector:


http://webpages.charter.net/appcore/misc/pc_sample_c.html

http://webpages.charter.net/appcore/misc/pc_sample_h.html


as a base for the excellent algorithms he developed. There is a simple
amortization technique that can reduce the number of atomic operations used
in application threads:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/7cd7151f469336a9/4f9cc6b0a0e54aa9


There are other PDR methods that incur virtually zero-overheads. I have one
called Virtually Zero-Overhead Object Management (vZOOM), but it has a
patent. I am still thinking about freely licensing it for non-commercial use
only.


Joe Seigh also has an RCU+SMR algorithm here:

http://atomic-ptr-plus.sourceforge.net

that has excellent performance properties.


But RCU and SMR are both patented by IBM.


http://groups.google.com/group/comp.programming.threads/browse_frm/thread/1ff48ea953a3ec0e/4c379dacde88a487

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/438bebe31c902eeb/8a948d273f3b24b3


> I hope I wasn't too messy with my description.
> Anyway, I wanted to know if something like this would be a good fit and
> also if there are any better solutions. I also hope that this doesn't
> infringe on anybody's patent as it seems that the area of lock-free
> algorithms is riddled with patents and you can never know if somebody has
> already come up with a given idea.

Do a patent search/feasibility check. When I was working with my Lawyer for
the vZOOM patent I made sure to give him reference to every piece of
prior-art and patents I could possibly find.

I will try to give a more detailed answer when I get time.


Thanks.

frege

Dec 8, 2009, 11:58:17 PM
On Dec 8, 6:10 pm, Andy Venikov <swojchelo...@gmail.com> wrote:
>
> Instead of having a single array of nodes (a single node will contain a
> full user object plus an index to a next node), we should have an array
> of arrays.>
> What if the split between bits that address nodes in the NodeArray and
> bits that address NodeArray in the ArrayTable is not static?
>
> The only drawback of this solution is that the total number of
> addressable nodes will be less. Whereas before we could address 2^24
> nodes, with this approach we can only address 2^23 + 2^22 + 2^21 ....
>

isn't 2^23 + 2^22 + 2^21 + ... == 2^24 - 1 ?


I've been thinking about something similar for a while now. If you
want to both grow AND shrink, then you need a usage count on each
NodeArray in the ArrayTable, and (I think) the count has to be atomic
with the pointer to the NodeArray - i.e. 64 bits (or 96/128 on 64-bit
systems), which of course is exactly what you are trying to avoid.
If you only want to grow, then no problem. For mine, I'm thinking of
going with { count+pointer }. I suppose it could fall back to not
shrinking on systems that don't have DCAS.

As for the size of the ArrayTable vs. the NodeArray(s), you could also
make them template parameters - let the client decide.

Tony

Andy Venikov

Dec 9, 2009, 12:31:28 AM
frege wrote:
> On Dec 8, 6:10 pm, Andy Venikov <swojchelo...@gmail.com> wrote:
>> Instead of having a single array of nodes (a single node will contain a
>> full user object plus an index to a next node), we should have an array
>> of arrays.>
>> What if the split between bits that address nodes in the NodeArray and
>> bits that address NodeArray in the ArrayTable is not static?
>>
>> The only drawback of this solution is that the total number of
>> addressable nodes will be less. Whereas before we could address 2^24
>> nodes, with this approach we can only address 2^23 + 2^22 + 2^21 ....
>>
>
> isn't 2^23 + 2^22 + ... 2^21 + ... == 2^24 - 1 ?

Of course it is :-)
I was typing too fast - had no time to think :-)


>
>
> I've been thinking about something similar for a while now. If you
> want to both grow AND shrink, then you need a usage count on each
> NodeArray in the ArrayTable, and (I think) the count has to be atomic
> with the pointer to the NodeArray - ie 64bits (or 96/128 on 64bit
> systems), which of course is completely what you are trying to avoid.
> If you only want to grow, then no problem. For mine, I'm thinking of
> going with { count+pointer }. I suppose it could fall back to not
> shrinking on systems that don't have DCAS.

Yes, I've tried not to use DCAS.


>
> As for the sizeof the ArrayTable vs the NodeArray(s), you could also

> make them template parameters - let the client decide.

True. But then the type of the queue will be different. I don't know if
that's a problem though. I mean - look at shared_ptr, they could've used
a lot of template parameters, just like Alexandrescu's smart_ptr, but
decided not to.


>
> Tony

Andy Venikov

Dec 9, 2009, 12:41:25 AM
Chris M. Thomasson wrote:

>
>> The algorithm of choice for the FIFO queue was the M. Michael/M. Scott queue.
>> Do you think it was a good choice?
>
> IMHO, that queue does not really perform all that well. If you really
> want a good multi-producer/consumer (MPMC) queue I would go with:
>
> http://groups.google.com/group/lock-free/browse_frm/thread/c8e3201da4a6a300
>
> This queue is actually blocking. However, it's a prime example of a
> blocking algorithm which can outperform a lock-free algorithm.
>
>
> http://groups.google.com/group/lock-free/browse_frm/thread/3173c1a68a49f890
>
> http://lock-free.googlegroups.com/web/mpmc_block_queue.zip?gda=UfzyikYAAABqzD8afv_uZdoI8aPWkA5RCgFT0yWczmtUwROFEeSbmxhWmFmbiAW_oTQFowHpDftJ7bz9tedoI9cRbqyUDqyTE-Ea7GxYMt0t6nY0uV5FIQ
>
>
>
> BTW, you really need to offer more than a MPMC queue. There are many
> combinations to choose from:
>
> http://software.intel.com/en-us/forums/threading-on-intel-parallel-architectures/topic/70339/reply/103496
>
>

<snip>


> http://publibz.boulder.ibm.com/epubs/pdf/dz9zr002.pdf
> (Appendix A / Multiprogramming and Multiprocessing Examples / Free-Pool
> Manipulation)
>

<snip>
> http://groups.google.com/group/comp.programming.threads/browse_frm/thread/82376b7295e6ce1a
>
<snip>

<snip>
> http://webpages.charter.net/appcore/misc/pc_sample_c.html
> http://webpages.charter.net/appcore/misc/pc_sample_h.html

> http://groups.google.com/group/comp.programming.threads/browse_frm/thread/7cd7151f469336a9/4f9cc6b0a0e54aa9
>
> http://atomic-ptr-plus.sourceforge.net
>

>
> http://groups.google.com/group/comp.programming.threads/browse_frm/thread/1ff48ea953a3ec0e/4c379dacde88a487
>
>
> http://groups.google.com/group/comp.programming.threads/browse_frm/thread/438bebe31c902eeb/8a948d273f3b24b3

Wow, there sure is a lot of information in the links that you gave and
I'm grateful for every one of them. It's gonna take some time to
read it all though... ;-)

>
>
>
>
>
>> I hope I wasn't too messy with my description.
>> Anyway, I wanted to know if something like this would be a good fit
>> and also if there are any better solutions. I also hope that this
>> doesn't infringe on anybody's patent as it seems that the area of
>> lock-free algorithms is riddled with patents and you can never know if
>> somebody has already come up with a given idea.
>
> Do a patent search/feasibility check. When I was working with my Lawyer
> for the vZOOM patent I made sure to give him reference to every piece of
> prior-art and patents I could possibly find.

Yes, doing that is always a part of algorithm development. Unfortunately
boost isn't a commercial enterprise, so there isn't (as far as I know) a
dedicated "legal" department that could sort these issues out and the
task of doing that lies on the shoulders of the poor developers...

> I will try to give a more detailed answer when I get time.

Thank you.

>
> Thanks.

Andy.

Andy Venikov

Dec 10, 2009, 4:58:58 PM
Well, as promised, I did some reading on these links.

Chris M. Thomasson wrote:
> "Andy Venikov" <swojch...@gmail.com> wrote in message

> IMHO, that queue does not really perform all that well. If you really
> want a good multi-producer/consumer (MPMC) queue I would go with:
>
> http://groups.google.com/group/lock-free/browse_frm/thread/c8e3201da4a6a300
>
> This queue is actually blocking. However, it's a prime example of a
> blocking algorithm which can outperform a lock-free algorithm.

Interesting and simple design.
Unfortunately, as I understand it, if one producer gets terminated
between swinging the head pointer and updating "previous" to point to
the newly inserted element, then the whole queue gets stuck, even if
other producers are still active?

I'm not sure if it's a big problem or not, since terminating threads
should NOT be an expected behaviour....
Still, with the M-S queue, even if one producer/consumer is terminated at
an inopportune time, the algorithm as a whole is guaranteed to continue to
work.

>
>
> http://groups.google.com/group/lock-free/browse_frm/thread/3173c1a68a49f890

I'm still going through the source code. The source code is rather
big/complex. Is there a simplified boiled-down version of this
algorithm? It seems to be using a lot of extra stuff, like event counts
and proxy threads and also contains some debug code that takes time to
sort out.


>
> BTW, you really need to offer more than a MPMC queue. There are many
> combinations to choose from:
>
> http://software.intel.com/en-us/forums/threading-on-intel-parallel-architectures/topic/70339/reply/103496

I agree. I think the final version of the library will have to offer
different kinds of queues.


> Also, I believe that you need to read the anchors of the Michael and
> Scott queue atomically. I don't think the implementation by Tim
> Blechmann does that. You can use the `MOVQ' instruction or a dummy `LOCK
> CMPXCHG8B' to get the job done.

Hmm, I'm not sure what you mean by Anchors? If you mean head and tail,
then both are word-sized and assuming they are aligned, the read should
happen atomically. No?

> You need to be sure to steal enough bits (e.g., 24 or 32) because
> version counter solution to the ABA problem is basically race-condition
> management that is NOT guaranteed to work. The ABA counter can wrap, and
> screw things up in certain scenarios:
>
> http://publibz.boulder.ibm.com/epubs/pdf/dz9zr002.pdf
> (Appendix A / Multiprogramming and Multiprocessing Examples / Free-Pool
> Manipulation)

I read this document. It seems to be just showing a simple example of an
ABA problem. They chose to use 32 bit tag explaining that it would
take on the order of days for the tag to roll-over. The jury is still
out as to the safety of smaller tags. What's the probability of 16bit
tag rolling over? 8 bit? Yes, it's definitely greater than with a 32-bit
tag, but maybe it's still within acceptable limits? Someone needs to do a
complete statistical analysis, and it would have to be done for every
different system/architecture. Maybe using an ABA tag isn't a good
approach in the first place? I mean, even with 32 bit tag, it's possible
to get ABA'd.

> [...]
>
> You are not going to be able to deallocate any memory unless you are
> using a PDR algorithm or some SEH or catch seg-faults.

Yeah, I wasn't going to give the memory back to the system, that's why I
don't think we really need any SMR. The "usable" memory would
exponentially grow as needed. Much like std::vector does.

> Anyway, why not
> just allocate virtual storage up to 4 GB and use the 32-bit offset as a
> pointer? Then you can add this to the base to get the 64-bit pointer?
> This technique works fine and has been discussed on this group before.

Well, my primary concern was about 32 bit systems. But even on a 64 bit
system, how can we be sure that 32 bits will always be available? Even
in the future, when new versions of libc might potentially use the bits
for their own uses.


> A Partial Copy-On-Write Deferred Reclamation (PDR) algorithm will allow
> you to get around ABA, use normal word-sized atomic instruction and
> still be able to allocate and free memory to/from the OS:
>
>
> http://groups.google.com/group/comp.programming.threads/browse_frm/thread/82376b7295e6ce1a

This is another one I haven't read yet. But it sounds very promising!!!


Ok, so back to reading....

Thanks,
Andy.

Chris M. Thomasson

Dec 10, 2009, 8:40:51 PM
"Andy Venikov" <swojch...@gmail.com> wrote in message
news:hfrqv4$rh4$1...@news.eternal-september.org...

> Well, as promised, I did some reading on these links.
>
> Chris M. Thomasson wrote:
>> "Andy Venikov" <swojch...@gmail.com> wrote in message
>
>
>> IMHO, that queue does not really perform all that well. If you really
>> want a good multi-producer/consumer (MPMC) queue I would go with:
>>
>> http://groups.google.com/group/lock-free/browse_frm/thread/c8e3201da4a6a300
>>
>> This queue is actually blocking. However, it's a prime example of a
>> blocking algorithm which can outperform a lock-free algorithm.
>
> Interesting and simple design.

Indeed.


> Unfortunately, as I understand it, if one producer gets terminated between
> swinging the head pointer and updating "previous" to point to the newly
> inserted element, then the whole queue gets stuck, even if other producers
> are still active?

You are correct because the algorithm's behavior as a whole is not lock-free
by definition. Here is some more information on that particular aspect:

http://groups.google.com/group/comp.programming.threads/msg/2556874a844a706e

http://groups.google.com/group/comp.programming.threads/msg/ff46fd1c76412a51


> I'm not sure if it's a big problem or not, since terminating threads
> should NOT be an expected behaviour....

Agreed!


> Still, with M-S's queue, even if one producer/consumer is terminated in an
> opportune time, the algorithm on a whole is guaranteed to continue to
> work.

Granted. However, the M-S queue has a fairly expensive push operation while
Dmitriy's algorithm has excellent performance overall in the sense that it
only has a single atomic SWAP and a release barrier for push. It's a
definite tradeoff and if you simply require lock-free semantics then
Dmitriy's queue is not going to work out.


>> http://groups.google.com/group/lock-free/browse_frm/thread/3173c1a68a49f890
>
> I'm still going through the source code. The source code is rather
> big/complex. Is there a simplified boiled-down version of this algorithm?
> It seems to be using a lot of extra stuff, like event counts and proxy
> threads and also contains some debug code that takes time to sort out.

I don't think Dmitriy has posted simplified pseudo-code for this algorithm.


>> BTW, you really need to offer more than a MPMC queue. There are many
>> combinations to choose from:
>>
>> http://software.intel.com/en-us/forums/threading-on-intel-parallel-architectures/topic/70339/reply/103496
>
> I agree. I think the final version of the library will have to offer
> different kinds of queues.

Right. There can be many combinations, but it is definitely worth the extra
work.


>> Also, I believe that you need to read the anchors of the Michael and
>> Scott queue atomically. I don't think the implementation by Tim Blechmann
>> does that. You can use the `MOVQ' instruction or a dummy `LOCK CMPXCHG8B'
>> to get the job done.
>
> Hmm, I'm not sure what you mean by Anchors? If you mean head and tail,
> then both are word-sized and assuming they are aligned, the read should
> happen atomically. No?

AFAICT, in the code by Tim Blechmann, the anchor is typedef'ed as:
____________________________________________________________
typedef tagged_ptr<node> atomic_node_ptr;
____________________________________________________________


and `tagged_ptr<node>' is double-wide when `BOOST_LOCKFREE_PTR_COMPRESSION'
is not defined. Tim loads the head/tail anchor in his implementation of the
MS algorithm like:
____________________________________________________________
atomic_node_ptr tail (tail_);
____________________________________________________________


and the copy constructor of the double-wide version of `tagged_ptr' is:
____________________________________________________________
/** copy constructor */
tagged_ptr(tagged_ptr const & p)//: ptr(0), tag(0)
{
set(p);
}
____________________________________________________________


and the `set()' procedure is:
____________________________________________________________
void set(tagged_ptr const & p)
{
ptr = p.ptr;
tag = p.tag;
}
____________________________________________________________


Which will not be atomic on a 32-bit system. AFAICT, this breaks the
algorithm according to the MS queue pseudo-code. So, Tim should probably
change this behavior and call `atomic_set()' instead. Of course this will
only make the performance much worse! You could use the `MOVQ' instruction
on the x86, but that would use MMX. This is why I pointed you to the
blocking queue which does not need to do this. Yes, it's blocking but it's a
MUCH simpler algorithm and has far better performance characteristics.
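For illustration, here is one way (a sketch using GCC/Clang `__atomic` builtins; the struct and names are mine, not Tim's code) to load a 64-bit pointer+tag anchor in one atomic operation on a 32-bit target, which on x86 lowers to MOVQ or a LOCK CMPXCHG8B as suggested above:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copying the two halves with two plain loads, as the quoted copy
   constructor does, can observe a torn anchor; pulling both halves
   through a single 64-bit atomic load cannot. */
typedef struct __attribute__((aligned(8))) {
    uint32_t ptr;   /* compressed pointer or index */
    uint32_t tag;   /* ABA version counter */
} anchor;

static anchor anchor_load(const anchor *a)
{
    anchor out;
    uint64_t raw = __atomic_load_n((const uint64_t *)(const void *)a,
                                   __ATOMIC_ACQUIRE);
    memcpy(&out, &raw, sizeof out);  /* re-widen without aliasing games */
    return out;
}
```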


>> You need to be sure to steal enough bits (e.g., 24 or 32) because version
>> counter solution to the ABA problem is basically race-condition
>> management that is NOT guaranteed to work. The ABA counter can wrap, and
>> screw things up in certain scenarios:
>>
>> http://publibz.boulder.ibm.com/epubs/pdf/dz9zr002.pdf
>> (Appendix A / Multiprogramming and Multiprocessing Examples / Free-Pool
>> Manipulation)
>
> I read this document. It seems to be just showing a simple example of an
> ABA problem. They chose to use 32 bit tag explaining that it would take on
> the order of days for the tag to roll-over. The jury is still out as to
> the safety of smaller tags. What's the probability of 16bit tag rolling
> over?

There is more than one way to implement this stack algorithm. The usual way
is to increment the version counter on both push and pop. So, if you apply
load on this then the 16-bit counter will overflow once you hit 32768 pushes
and 32768 pops. That's not that many.

If you study the algorithm, you will realize that the push operation is
immune to ABA and does not need to increment the counter, or even use DWCAS
for that matter. Therefore a 16-bit counter will overflow once you hit 65536
pops, still that's not very much.

I should point out a caveat for the latter implementation technique. If you
don't increment ABA and don't use DWCAS for the push operation then the pop
operation MUST read the version counter BEFORE the head pointer. Here is
why:

http://groups.google.com/group/comp.programming.threads/msg/c5004dea42e42aaa
(brief description of the race-condition)

http://groups.google.com/group/comp.programming.threads/msg/95706ec49f3cbc0c
(detailed description of the race-condition and presents several solutions)

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/1e166dc04c3df118/d5c074ce23435ae8
(read all, because it's very interesting thread, well, IMHO of course!)

;^)


> 8 bit? Yes, it's definitely greater than 32 bit, but maybe it's
> still within acceptable limits? Someone needs to do a complete
> statistical analysis, and it would have to be done for every different
> system/architecture. Maybe using an ABA tag isn't a good approach in the
> first place? I mean, even with 32 bit tag, it's possible to get ABA'd.

Well, the shi% can hit the fan, and the version counter does not prevent
that. It only reduces the chances. So, I guess it's up to the user if that
is okay or not. Some people can live with that possibility, and some
cannot...

:^o


>> You are not going to be able to deallocate any memory unless you are
>> using a PDR algorithm or some SEH or catch sef-faults.
>
> Yeah, I wasn't going to give the memory back to the system, that's why I
> don't think we really need any SMR. The "usable" memory would
> exponentially grow as needed. Much like std::vector does.

That might not be acceptable for every usage case. Especially in robust
systems that handle cases in which the system is overloaded by reducing
cached data, forcefully dropping connections, freeing anything that is not
needed at the time, etc... The system might get pissed off if it can't do
something because the stack is hogging too much memory. Of course this is
contrived, but it is something to think about.

;^)


>> Anyway, why not
>> just allocate virtual storage up to 4 GB and use the 32-bit offset as a
>> pointer? Then you can add this to the base to get the 64-bit pointer?
>> This technique works fine and has been discussed on this group before.
>
> Well, my primary concern was about 32 bit systems. But even on a 64 bit
> system, how can we be sure that 32 bits will always be available? Even in
> the future, when new versions of libc might potentially use the bits for
> their own uses.

Need to think about this. Humm, well if the call to `VirtualAlloc()' or
`mmap()' succeeds in allocating a pool such that a 32-bit value is
sufficient to map into it, then you are set. However, if `VirtualAlloc()' or
`mmap()' fails then you are screwed and have to fall back to an index
solution that you described.


>> A Partial Copy-On-Write Deferred Reclamation (PDR) algorithm will allow
>> you to get around ABA, use normal word-sized atomic instruction and still
>> be able to allocate and free memory to/from the OS:
>>
>>
>> http://groups.google.com/group/comp.programming.threads/browse_frm/thread/82376b7295e6ce1a
>
> This is another one I haven't read yet. But it sounds very promising!!!

Yes. PDR algorithms are a decent solution to the ABA problem. However, they
can be overkill if you use them just for that problem alone. PDR is mainly
used to solve the reader/writer problem such that a reader thread can be
iterating over a shared data-structure while writer threads are concurrently
adding/removing and freeing nodes. Think along the lines of RCU.

Another solution for ABA is to simply catch the error, think SEH or handling
SIGSEGV. This is what Windows does in their SList implementation. Also, some
STM algorithms use this technique.


> Ok, so back to reading....

:^D

Andy Venikov

Dec 11, 2009, 10:02:54 AM

Yup, indeed.
Good catch!

Thanks,
Andy.

Chris M. Thomasson

Dec 12, 2009, 6:22:30 AM

"Chris M. Thomasson" <n...@spam.invalid> wrote in message
news:EphUm.18007$_b5....@newsfe22.iad...

> "Andy Venikov" <swojch...@gmail.com> wrote in message
> news:hfrqv4$rh4$1...@news.eternal-september.org...

[...]


>>> http://publibz.boulder.ibm.com/epubs/pdf/dz9zr002.pdf
>>> (Appendix A / Multiprogramming and Multiprocessing Examples / Free-Pool
>>> Manipulation)
>>
>> I read this document. It seems to be just showing a simple example of an
>> ABA problem. They chose to use 32 bit tag explaining that it would take
>> on
>> the order of days for the tag to roll-over. The jury is still out as to
>> the safety of smaller tags. What's the probability of 16bit tag rolling
>> over?
>
> There is more than one way to implement this stack algorithm. The usual
> way is to increment the version counter on both push and pop. So, if you
> apply load on this then the 16-bit counter will overflow once you hit
> 32768 pushes and 32768 pops. That's not that many.
>
> If you study the algorithm, you will realize that the push operation is
> immune to ABA and does not need to increment the counter, or even use
> DWCAS for that matter. Therefore a 16-bit counter will overflow once you
> hit 65536 pops, still that's not very much.

[...]

I forgot to mention that if you keep the version counter on a per-node basis
you would have to push the _same_ object 65536 times in order to roll a
counter. That would give you some "better odds" to work with. Also, if you
use PDR to manage memory you can use the version counter to determine when
to defer a node. For instance, you could reuse the node in the lock-free
algorithm until its version counter is just about to roll. For a 16-bit
number if the counter is equal to 0xFFFF then I would queue it in PDR in
order to ensure that the ABA race condition will never happen. This will
take some pressure off the PDR mechanism because you are not queuing nodes
for deferment on every pop operation.
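The pop-side check could be sketched as follows (names are placeholders of mine; `defer_to_pdr` stands for whatever PDR deferment stage is in use):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Per-node version counter: recycle a popped node straight onto the
   free list until its 16-bit counter is about to roll, then hand it to
   the PDR deferment stage instead, which resets the counter once the
   node is quiescent. */
struct node { uint16_t version; /* ... payload, next link, etc. ... */ };

static bool must_defer(const struct node *n)
{
    return n->version == UINT16_MAX;  /* one more push would wrap it */
}

static void retire(struct node *n,
                   void (*reuse)(struct node *),
                   void (*defer_to_pdr)(struct node *))
{
    if (must_defer(n))
        defer_to_pdr(n);  /* slow path: full ABA-safe deferment */
    else
        reuse(n);         /* fast path: no pressure on the PDR stage */
}
```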

Chris M. Thomasson

Dec 12, 2009, 7:47:50 AM
"Chris M. Thomasson" <n...@spam.invalid> wrote in message
news:21LUm.56684$ky1....@newsfe14.iad...

[...]

> I forgot to mention that if you keep the version counter on a per-node
> basis you would have to push the _same_ object 65536 times in order to
> roll a counter. That would give you some "better odds" to work with.

[...]

Something like:


<pseudo-code in news reader>
______________________________________________________________
struct node
{
    struct node* next;
    uint32_t version;
};


struct stack
{
    uint32_t version;
    struct node* head; /* = NULL */
};


void
stack_push(struct stack* const self,
           struct node* node)
{
    struct stack cmp, xchg;

    xchg.version = ++node->version;
    xchg.head = node;

    cmp.version = self->version;
    cmp.head = self->head;

    do
    {
        node->next = cmp.head;
    }
    while (! DWCAS_RELEASE(self, &cmp, &xchg));
}


struct node*
stack_pop(struct stack* const self)
{
    struct stack cmp, xchg;

    cmp.version = self->version;
    cmp.head = LOAD_ACQUIRE_DEPENDS(&self->head);

    do
    {
        if (! cmp.head) return NULL;

        /* memory hazard 1 */
        xchg.head = LOAD_ACQUIRE_DEPENDS(&cmp.head->next);

        if (xchg.head)
        {
            /* memory hazard 2 */
            xchg.version = xchg.head->version;
        }
        else
        {
            xchg.version = 0;
        }
    }
    while (! DWCAS_ACQUIRE(self, &cmp, &xchg));

    return cmp.head;
}
______________________________________________________________


There is a caveat: the version count must remain stable for the node's
"lifetime". More precisely, you can only reset the version count when a node
is in a persistent quiescent state wrt its participation in any lock-free
algorithm that depends on it. For instance, you can set the version counter
to anything you want after it comes out on the other side of the PDR
deferment process.

Chris M. Thomasson

Dec 12, 2009, 10:41:54 PM
"Chris M. Thomasson" <n...@spam.invalid> wrote in message
news:EphUm.18007$_b5....@newsfe22.iad...
[...]

> Another solution for ABA is to simply catch the error, think SEH or
> handling SIGSEGV. This is what Windows does in their SList implementation.
> Also, some STM algorithms use this technique.

I made a mistake. The first sentence should read as:


Another solution for _lock-free memory management_ is to simply catch the
error, think SEH or handling SIGSEGV.


There is a caveat, the access to the node which causes the seg-fault should
be read-only.

Sorry Andy!

;^(...
