AtomicPtr ABA Test /w a Lock-Free FIFO Queue here...

SenderX

unread,

May 14, 2003, 3:33:15 AM5/14/03

to

Whoops, when I cut-and-pasted this test lib together I put the wrong
AtomicPer version in there.

Reload the zip file if you already downloaded it:

http://appcore.home.attbi.com/zips/AtomicPtrABATest.zip

The old code suffered from the following problem:

http://groups.google.com/groups?dq=&hl=en&lr=&ie=UTF-8&oe=UTF-8&threadm=3E94
A3B5.39B8C20A%40xemaps.com&prev=/groups%3Fdq%3D%26num%3D25%26hl%3Den%26lr%3D
%26ie%3DUTF-8%26oe%3DUTF-8%26group%3Dcomp.programming.threads%26start%3D75

The problem doesn't come up under the test even thought it posts 123,456
test messages and cycles them for ten-minutes, I thought it needed to be
updated with the fix anyway =)

--
The designer of the SMP and HyperThread friendly, AppCore library.

http://AppCore.home.attbi.com

Alexander Terekhov

unread,

May 14, 2003, 4:05:34 AM5/14/03

to

SenderX wrote:
>
> Here is my test code for Joe Seighs AtomicPtr, and the ABA code I added to
> it.

Know it's time to write a docu (usage and impl spec.)... ;-)

Please.

regards,
alexander.

SenderX

unread,

May 14, 2003, 3:26:43 PM5/14/03

to

> Know it's time to write a docu (usage and impl spec.)... ;-)

=)

So far... The code will works on Intel's " General Purpose " & " System "
instruction set. Which means all IA-32 processors can run it. You can also
use LL/CS with instruction barriers ( sync opcode ) to the atomics on a
PowerPC:

http://cvs.grame.fr/cgi-bin/midishare-cvs/src/common/Headers/lflifo.h?rev=1.
5&co

You can also use my ABA method in " Lock-Free " Algos that don't use Joe's
great AtomicPtr at all.

AtomicPtr by it self is great, but just NOT good for " Lock-Free " algos
that need a version stamp per CAS. It simply won't work for it. That is why
I added the ability for AtomicPtr to avoid the ABA problem. =)

AtomicPtr is the best and " only " atomic reference counter out there ( it
doesn't need DCAS ). Now, coupled with my ABA solution. It is a work of art!
;)

I have been running 24-hour tests on my ABA solution /w Joes code. They
haven't failed yet.

The next test will be using my ABA solution in the IBM Freelist Algo coupled
with some other " stuff " of mine. I will also post the IBM Freelist test
code when it is ready.

Now, my ABA stuff alone doesn't need instruction barriers. I believe
AtomicPtr does need a couple of instruction syncs. A few well placed mfence
opcodes should do the trick for the IA-32 port. PowerPC would use a couple
of sync instructions to do it.

If my ABA stuff works out ( it looks like it will ) I will do a write up on
it.

It basically says that:

1. Each thread WILL create a unique id under 32-bits.

2. A thread WILL include its unique id in ANY compare-and-swap operations it
does on ANY shared pointer.

If you do this, each thread will know if ANY OTHER thread has touched the
pointer concurrently, and will drop out of CAS to try again or do other
stuff. This avoids the ABA problem when updating shared pointers.

I should patent this, if it works out!

;)

Joseph Seigh

unread,

May 15, 2003, 9:16:25 AM5/15/03

to

SenderX wrote:
>
>
> If my ABA stuff works out ( it looks like it will ) I will do a write up on
> it.
>
> It basically says that:
>
> 1. Each thread WILL create a unique id under 32-bits.
>
> 2. A thread WILL include its unique id in ANY compare-and-swap operations it
> does on ANY shared pointer.
>
> If you do this, each thread will know if ANY OTHER thread has touched the
> pointer concurrently, and will drop out of CAS to try again or do other
> stuff. This avoids the ABA problem when updating shared pointers.
>

I took a look at the code and there might be some problems. I think an examination
of the original IBM solution to the ABA problem and how it relates to LL/SC solutions
will clarify things somewhat.

The original solution is basically 3 steps.

1) load pointer and change count
2) do something
3) update w/ new pointer value and change count if no change has occurred since step 1.
i.e. old change count equal to change count value copied in step 1.

The LL/SC solution looks like (and I'll use IBM mnemonics rather than Alpha, load locked
is somewhat misleading)

1) load pointer reserved
2) do something
3) update w/ new pointer value if reserve still holds, and release reserve.

You can see that they're basically equivalent. Reserve is basically a way of detecting
whether a change has occurred.

The way this would have worked in atomic_ptr would be to have a change count as part of
the pointer which would be incremented every time that pointer was stored into. You'd have
a special reserved pointer class that would be used as the comparand in the CAS method.
When the reserved pointer got stored/copied into, the change count would be copied from
the source, not incremented as in a a normal pointer. So the logic is again basically
the same.

1) load reserved pointer object with pointer and change count
2) do something
3) update w/ new pointer value and change count if no change has occurred since step 1.
i.e. old change count equal to change count value copied in step 1.

The only problem with this (beside the fact that there are a heck of a lot of atomic ops
going on under the covers) is the change count would be too small to reliably detect
changes in some implementations.

The thread id solution would work in the following manner. atomic_ptr would have a reserve
id. field. This field would be set to 0 on every store into it, 0 being an invalid thread id
(thread id being a unique fixed number, not the posix opaque thread id). The reserved pointer
class used as comparand in the CAS method would the same local scope restriction that local_ptr
has, but when you copied into reserved pointer, the source pointer reserve id would be set with
the thread id. So the logic would go like

1) load reserved pointer with source pointer and set reserved id in source pointer with current
thread id.
2) do something
3) update w/ new pointer and set reserve id to 0 if reserve id equals current thread id.
(if not equal, some other thread has or is attempting to update also).

I suspect that this is how some hardware may handle it, so I don't think we came up with
anything new here and patents are really expensive to execute, plus big companies can
ignore your patent because their lawyers are bigger then your lawyers.

It would be interesting to look at ways of dealing with reservation conflicts. Perhaps some
really cool maitre-d' backoff function. If we can have bakery algorithms then we can have
restaurant algorithms, too.

The problem with your current implementation of setting the thread id to that of the last
thread storing into it, as far as I can tell, is that the same thread could have pushed the
same element back onto the stack between the time you first looked at the stack pointer
and the time when you attempt to update it, so you could still have the ABA problem
I think. But I'd have to study it a little more.

Joe Seigh

SenderX

unread,

May 15, 2003, 12:02:45 PM5/15/03

to

> The problem with your current implementation of setting the thread id to
that of the last
> thread storing into it, as far as I can tell, is that the same thread
could have pushed the
> same element back onto the stack between the time you first looked at the
stack pointer
> and the time when you attempt to update it, so you could still have the
ABA problem
> I think. But I'd have to study it a little more.

Ahhhh yes... The reason it works is because of the unique thread id mixed
with the ephemeral count change. It simulates a version count. I have yet to
break the FIFO queue though... Please, try to break it. =)

> It would be interesting to look at ways of dealing with reservation
conflicts.

Indeed!

In fact...

I was really digging around in some IBM research docs. Turns out that there
is no " rock-solid " solution to the ABA problem anyway. The version count
increment per-cas is meant to make the possibility of ABA extremely remote!
So, if we can minimize the time an ABA can happen between a load and store
( cas ), that would be good.

My brain whipped up another very good idea that will simulate LL/SC with
CAS32 and CAS64! =)

Really think about this pseudo-code which I think will make ABA 100%
impossible:

/* A node */
typedef struct S_Node
{
struct S_Node *pNode;
LPVOID *pObj;

} NODE, *LPNODE;

/* A lifo stack */
typedef union U_Stack
{
unsigned __int64 Value64;

struct
{
struct S_Node *pFront;
LONG lAba;
};

} STACK, *LPSTACK;

/* Global stack */
STACK SharedStack;

/* A lock-free lifo push function */
VOID Push( LPVOID pObj )
{
LPNODE pNewNode = AllocNode( pObj );

struct S_Node *pFront;

LONG lAbaReadId
= GetCurrentThreadAbaReadId()

LONG lAbaWriteId
= GetCurrentThreadAbaWriteId()

do
{
/* PSEUDO LOAD-LINKED ( reserve ) */
InterlockedExchange
( &SharedStack.lAba,
lAbaReadId );

/* Read the front */
pFront = SharedStack.pFront;

/* Modify the new node */
pNewNode->pNext = pFront;
}

/* PSEUDO STORE-CONDITIONAL ( aba proof ) =) */

/* Try and push */
while ( ! CAS_64_BIT
( &SharedStack,

/* Compare front along with our read reservation */
pFront,
lAbaReadId,

/* Exchange the new node with our write reservation */
pNewNode,
lAbaWriteId ) );
}

What do ya think of this one! =)

I think this will work to simulate an ABA proof LL/SC with CAS...

Sean Burke

unread,

May 15, 2003, 3:22:20 PM5/15/03

to

"SenderX" <x...@xxx.com> writes:

> In fact...
>
> I was really digging around in some IBM research docs. Turns out that there
> is no " rock-solid " solution to the ABA problem anyway. The version count
> increment per-cas is meant to make the possibility of ABA extremely remote!
> So, if we can minimize the time an ABA can happen between a load and store
> ( cas ), that would be good.
>
> My brain whipped up another very good idea that will simulate LL/SC with
> CAS32 and CAS64! =)

Looks OK to me, but I'm not that worried about the odds of the version-count
solution failing either. It would be nice to have a lock-free solution that
would work with CAS64 in a 64-bit address space. :-{

I don't see why separate readId and writeId are needed though.
Presumably, the pop operation would also set its read id prior
to reading the pFront pointer.

-SEan

SenderX

unread,

May 15, 2003, 4:32:03 PM5/15/03

to

> Looks OK to me, but I'm not that worried about the odds of the
version-count
> solution failing either. It would be nice to have a lock-free solution
that
> would work with CAS64 in a 64-bit address space. :-{

Joe Seigh said that the 64-bit Merced / Itanium processors have a
cmp8bxchg16b opcode! You could use this for 64-bit lock-free for sure.

If the 64-bit platform had LL/SC you could use the lock-free algos. mixed
with full-instruction syncs ( like mfence ).

My read / write id per-thread would work for the LL/SC too.

> I don't see why separate readId and writeId are needed though.
> Presumably, the pop operation would also set its read id prior
> to reading the pFront pointer.

It was to see how many threads contend with reads, and how many threads
contend with writes. I basically used it to dual-proof ( eliminate ) the ABA
problem and to get needed info for an exponential backoff system including a
mixture of pause & nop instructions. If the contention for a CAS critical
section becomes to great, it will do a Sleep( 0 ).

By the way, I have a working AtomicPtr coded in C on my site:

http://appcore.home.attbi.com/src/AtomicPtr.c

It lacks my rock-solid solution for ABA, but I will add it soon and post it
=)

Joseph Seigh

unread,

May 15, 2003, 4:49:52 PM5/15/03

to

Sean Burke wrote:

>
> "SenderX" <x...@xxx.com> writes:
>
> >
> > My brain whipped up another very good idea that will simulate LL/SC with
> > CAS32 and CAS64! =)

I hope you are using appropiate brain coolants, SenderX. :)

>
> Looks OK to me, but I'm not that worried about the odds of the version-count
> solution failing either. It would be nice to have a lock-free solution that
> would work with CAS64 in a 64-bit address space. :-{

Actually it will work if you use a (in win32/64 terms) private heap that is
less than 4 GB and use 32 bit offsets instead of pointers. This has been
discussed before.

>
> I don't see why separate readId and writeId are needed though.

writeId should be the same as readId (which should be the thread id)
or zero (not a valid thread id). And you don't need an interlocked
exchange. A simple store followed by a membar will do. So that part
should be just

lAbaId = GetCurrentThreadAbaId();

do {
&SharedStack.lAba = lAbaId; // atomic

membar(); // #storeload ?

pFront = SharedStack.pFront; // atomic mostly

pNewNode->pNext = pFront;
}

while (CAS_64_BIT(
&SharedStack,
// comparand values
pFront, // only required is compare is this wide (see Itanium note)
lAbaId,
// exchange values
pNewNode,
lAbaId)); // or zero would work also

Itanium note:
On the Itanium you can use cmp8xchg16 to just compare the lAbaId and
not have to include pFront, so you can get 64 bit pointers supported
directly. So it would look like
while (CMP8XCHNG_XX(
&SharedStack,
// 8 byte comparand value
lAbaId,
// 16 byte exchange values
pNewNode,
lAbaId)); // or zero would work also

Appropiate structure sizes and layout assumed.

> Presumably, the pop operation would also set its read id prior
> to reading the pFront pointer.

Yes, its all LL/SC kind of logic that has to be used.

Joe Seigh

>

SenderX

unread,

May 16, 2003, 12:26:19 AM5/16/03

to

> I hope you are using appropiate brain coolants, SenderX. :)

=)

> Actually it will work if you use a (in win32/64 terms) private heap that
is
> less than 4 GB and use 32 bit offsets instead of pointers. This has been
> discussed before.

There should be a way to implement AtomicPtr with cmp8xchg16 as well.
Thinking, thinking... Hummm...

> A simple store followed by a membar will do. So that part
> should be just

Good point, an mfence would be fine.

> Itanium note:
> On the Itanium you can use cmp8xchg16 to just compare the lAbaId and
> not have to include pFront, so you can get 64 bit pointers supported
> directly.

Wow!

My aba solution actually helped in using cmp8xchg16 effectively... Cool!!!

=)

cmp8xchg16 dosen't seem as wierd as it should now. =)

No we need to put my aba stuff in AtomicPtr!

Joseph Seigh

unread,

May 16, 2003, 1:06:30 PM5/16/03

to

Actually, this just requires a unique token to work. Thread
local (stack) addresses are unique w.r.t the process address
space. So this would work also

VOID Push( LPVOID pObj )
{
LPNODE pNewNode = AllocNode( pObj );
struct S_Node *pFront;

char *unique = &pNewNode; // ... or &pFront or any local variable

do {
&SharedStack.lAba = unique; // atomic

membar(); // #storeload ?

pFront = SharedStack.pFront; // atomic

pNewNode->pNext = pFront;
}

while (CAS_64_BIT(
&SharedStack,
// comparand values
pFront, // only required is compare is this wide (see Itanium note)

unique,
// exchange values
pNewNode,
0)); // reset lAba to zero

}

Also, the address of any allocated memory would also be unique up until the moment
you pushed it onto the stack, so you could use it as a unique token also. This
wouldn't help you for popping from the stack though since you have no allocated memory
in that case.

Remember, this wouldn't work for an intra process shared stack/queue since the thread
local stack addresses would not be unique.

Joe Seigh

mov...@googlemail.com

unread,

Mar 24, 2020, 7:55:05 AM3/24/20

to

Could you upload this example http://appcore.home.attbi.com/zips/AtomicPtrABATest.zip again, the link no longer works ... maybe the participants have this code and have the opportunity to help, thanks in advance.

среда, 14 мая 2003 г., 9:33:15 UTC+2 пользователь SenderX написал: