Using only an 8-bit version count to overcome ABA... ;)

SenderX

unread,

May 27, 2003, 1:04:21 AM5/27/03

to

I believe I have a way to implement a lock-free inter-thread / process
communication system capable of posting hundreds-of-thousands of concurrent
messages, using only an 8-bit counter for the ABA problem.

The 8-bit counter is important because it allows one to use the normal
InterlockedCompareExchange API function for virtually ABA proof algo's,
which makes lock-free super portable on windows platforms.

It involves Joes idea of using offsets, or indexes in my case, as
pointers... Some other tricks I thought up ties my new solution all
together. ;)

/* My New ABA Solution */

Now, the ABA problem arises from another thread popping and pushing the SAME
node, all INSIDE another contending threads CAS section, BEFORE it gets a
chance to compare its local comprand. This corrupts the entire collection by
allowing another thread to complete its CAS section when it should not, thus
crashing the application as a whole. ;(

So, the goal is to make other threads aware of each other changes so their
CAS sections will fail accordingly. This is where a version count per.
successful CAS comes into play.

However, if you really break down the problem... It is the " rapid " reuse
of pointers that simply destroys a contending threads CAS section in the
first place.

So, what if we did not reuse nodes? Well, that would kill the ABA problem
right off! =)

But we don't want to do that...

What if we could reuse nodes in a way that did not release freed pointers to
other threads immediately?

Say we gave the lock-free system some time for all concurrent CAS sections
to fail ( which is quick ), and we coupled that with a small version-count.

Say the in order for the lock-free algo to reuse nodes, it hade to go
through over 65,535+ of them.

And lets add my sub-queue scheme to this to make the number of Front and
Back pointers more fine-grain thus making the ABA problem extremely more
remote.

In order for the ABA problem to happen using all of the above, another
thread would have to go through 65,535+ nodes, overflow an 8-bit version
counter, and happen to get the EXACT sub-queue, ALL INSIDE another threads
CAS section in order for the ABA problem to happen! This seems as remote as
a 32-bit version count rolling over inside another CAS section at the same
time.

( Microsoft SList API on a 32-bit system uses 16-bit version counts, mixed
with a node counter. )

I believe that is virtually impossible on current hardware, even on a
64-processor giant, and the faster the hardware gets, the tighter the algo
is cause contending threads will fail their CAS sections quicker and avoid
the ABA problem.

The way to implement this is simple, its kinda based on the IBM FreeList
memory scheme...

Node Delete Rules:

When a lock-free algo needs to delete a node, it will NOT LET other threads
reuse the node right away. Instead it will store the node in the back of a
static FIFO queue, thus blocking other threads from receiving the node and
virtually preventing the ABA problem in the first place.

Node Alloc Rules:

When a lock-free algo needs to alloc a node it will pop the front off a
static FIFO queue, it will NOT immediately reuse nodes from other threads
cause they queue on the back.

Node CAS Rules:

You will use the InterlockedCompareExchange, or a compatible OS API, to
atomically work on memory the size of the system pointer.

You WILL compare and increment an 8-byte version number on EVERY CAS.

This idea should implement a virtually 100% ABA proof, normal CAS
compatible, lock-free thread / process communication system.

I will present working code in a couple of days, my tests ( 4-processor
Xeon ) haven't crashed at all.

Please review my solution and tell me what ya think!

Thank you.

=)

--
The designer of the SMP and HyperThread friendly, AppCore library.

http://AppCore.home.attbi.com

Joseph Seigh

unread,

May 27, 2003, 9:48:43 AM5/27/03

to

How exactly are you managing that static array?

Joe Seigh

John Hickin

unread,

May 27, 2003, 9:58:16 AM5/27/03

to

If you are looking for a Windows solution, you might try based pointers.
Just a though off the top of my head.

Regards, John.

"SenderX" <x...@xxx.com> wrote in message
news:plCAa.475147$Si4.4...@rwcrnsc51.ops.asp.att.net...

SenderX

unread,

May 27, 2003, 9:38:55 PM5/27/03

to

Good thought, but based pointers won't work here because they use 32-bits as
pointers on a 32-bit system, and 64-bit pointers on a 64-bit system.

On a 32-bit system I need 16-bit pointers, and on a 64-bit system I need
16-bit or 32-bit pointers.

A 32bit pointer for my system looks like this:

typedef union U_NodePtr
{
__int32 Value32;

struct
{
unsigned __int16 Ptr;
unsigned __int8 Nodes;
unsigned __int8 Aba;
};

} NODE_PTR, *LPNODE_PTR;

Based pointers are too big to work here, damn...

;)

SenderX

unread,

May 27, 2003, 10:08:31 PM5/27/03

to

> How exactly are you managing that static array?

The static ( a little over 3.9 MB's for over 300,000 nodes ) node system
goes like this:

/* Static node storage */

NODE Nodes1[65536];
NODE Nodes2[65536];
NODE Nodes3[65536];
NODE Nodes4[65536];
NODE Nodes5[65536];

#define NULL_NODE 65535

/* Static node access */

LPNODE NodeAccess[5];

NodeAccess[0] = Nodes1;
NodeAccess[1] = Nodes2;
NodeAccess[2] = Nodes3;
NodeAccess[3] = Nodes4;
NodeAccess[4] = Nodes5;

FIFO_QUEUE NodeQueue;

The queue has a number of sub-queues it uses for storage.

/* A pointer to a static node */

typedef union U_NodePtr
{
__int32 Value32;

struct
{
unsigned __int16 Ptr;
unsigned __int8 Nodes;
unsigned __int8 Aba;
};

} NODE_PTR, *LPNODE_PTR;

/* A node itself */

typedef struct S_Node
{
NODE_PTR Next;
unsigned __int8 MyNodes;
unsigned __int16 MyPtr;
LPVOID pUserObj;

} NODE, *LPNODE;

You would access a node like this:

NODE_PTR PtrToNode;

LPNODE pNode = &NodeAccess[PtrToNode.Nodes][PtrToNode.Ptr];

pNode is now pointing to PtrToNode's node.

I'll post some code later, but at init time ( inside my acStartup ) the
system stores the static nodes in a series of FIFO queues. ( lock-free
sub-queues ).

When a thread needs to get a node, the node system chooses a lock-free
sub-queue from a pool, and pops the back off off it.

When a thread needs to remove a node, the node system chooses a lock-free
sub-queue from a pool, and pushes the node on the front.

This ensures that nodes are not released to the system or the application,
but have to go through a sub-queue system to get reused.

Joseph Seigh

unread,

May 28, 2003, 10:20:11 AM5/28/03

to

SenderX wrote:
>
> > How exactly are you managing that static array?
>
> The static ( a little over 3.9 MB's for over 300,000 nodes ) node system
> goes like this:
>

(snip)

I was wondering what algorithm you were using for the FIFO queue. If you
are using M.Michael's lock-free, AFAICT it's totally broken. Not just ABA
problems at *both* ends.

Joe Seigh

SenderX

unread,

May 28, 2003, 11:45:19 AM5/28/03

to

I have implemented it using both of the following FIFO lock-free algo's:

http://www.cs.rochester.edu/u/scott/papers/1996_PODC_queues.pdf

http://perso.wanadoo.fr/gmem/evenements/jim2002/articles/L17_Fober.pdf

The first paper cited is an older work by Maged M. Michael and Michael L.
Scott.

The second paper is by three French computer scientists who's FIFO queue
claims it runs a little faster than the mike and scott stuff, although I
cannot measure hardly any difference. ;)

> AFAICT it's totally broken.

Humm, either one haven't crashed yet. =)

I will post the code using the M. Michael and M. Scott algo to see if people
can break it.

> Not just ABA
> problems at *both* ends.

Actually, pushing and popping concurrently from the mike and scott algo ( or
the other one ) doesn't contend each other at all.

The ONLY time they will try to contend is if the first CAS fails, and they
need to advance the back to the next.

Even then, in order for ABA to happen they have to reuse the next pointer,
which won't be able to happen in a quick enough time. Since the a thread
would have to keep popping from the front in order to get at the back's next
node.

Exactly where in the algo do you think something cas go wrong?

How would I be able to force this bug to happen?

David Schwartz

unread,

May 28, 2003, 12:40:36 PM5/28/03

to

"SenderX" <x...@xxx.com> wrote in message

news:jQ4Ba.31110$_t5....@rwcrnsc52.ops.asp.att.net...

> > AFAICT it's totally broken.
>
> Humm, either one haven't crashed yet. =)

You're probably testing on x86. The x86 platform has stronger ordering
and tends to make it very hard for race conditions to show up in testing.

DS

SenderX

unread,

May 28, 2003, 1:06:46 PM5/28/03

to

> You're probably testing on x86. The x86 platform has stronger ordering
> and tends to make it very hard for race conditions to show up in testing.

Weak ordering can be overcome with acquire / release barriers, that is well
known.

Both algos have no race conditions, look at them and you will see that
everything takes place at atomic meeting points. Were you talking about
ABA???

I'm testing this on a 4-way Xeon hyperthreaded box.

I doub't the mike and scott lock-free FIFO is broken.

Both the algo's work good.

In fact, the mike and scott algo has been testing on a friends dual xeon box
with 256 time-critical threads for a couple of weeks now.

He has not called me to say it crashed yet.

;)

Alexander Terekhov

unread,

May 28, 2003, 1:12:33 PM5/28/03

to

SenderX wrote:

[... hyperthreaded box ...]

http://testdrive.hp.com/current.shtml

regards,
alexander.

SenderX

unread,

May 28, 2003, 1:33:48 PM5/28/03

to

> http://testdrive.hp.com/current.shtml

Thank you for that, I was unaware of that gold mine.

It is really cool of HP to do that! Wow. =)

However...

I was looking through the Test Drive policies and came across this:

> The following are discouraged in the Test Drive Program:
> Long running, CPU-bound processes.

The program I wish to test drive, is EXACTLY that.

I will contact them and ask about this issue.

My code will use up the entire system because I test using time-critical
priority threads.

Any ideas to use less processor, but still give my algo a good test???

I would LOVE to test my code on an Itanium 2!!!!!!!

Joseph Seigh

unread,

May 28, 2003, 5:34:28 PM5/28/03

to

SenderX wrote:
>
> Exactly where in the algo do you think something cas go wrong?

I misremembered the algorithm. I reread it and it seems to be ok.

Joe Seigh

SenderX

unread,

May 28, 2003, 8:18:17 PM5/28/03

to

> I misremembered the algorithm. I reread it and it seems to be ok.

Great!

Thanks for taking the time to look man.

;)