
double-word CMPXCHG equivalent on AMD64-CPUs


trottelv...@xxx.com

Sep 9, 2003, 12:08:19 AM
Does anyone know if there is a double-word (two times the size of a pointer)
cmpxchg equivalent on AMD64, i.e. a cmpxchg instruction that operates on a
16-byte memory operand? I only found the cmpxchg8b instruction in the AMD64
manuals, and if there's definitely nothing like the instruction I'm asking
for, lock-free algorithms wouldn't be possible on AMD64 systems in long mode
(or only in the lower 4GB of the address space).

SenderX

Sep 9, 2003, 2:13:46 AM
> lock-free algorithms wouldn't be possible with AMD64
> -systems in long-mode (or only in the lower 4GB of the address-space).

They are possible on AMD64.

=)


Lock-free collections that allow for ( lock-free && concurrent ) iterations,
pushes, and pops will work on systems if the target processor has the proper
( if needed ) memory barriers, and an LL/SC and/or CAS that works on memory
sizes >= sizeof( void* )... Lock-free algos will work with 64-bit
processors. No problem.


Some tips to get things going:

Replace pointers with indexes.

Use a proxy garbage collector.

Use a lock-free read-write mutex.

Use atomic_ptr with indexes, and/or mapped memory with offsets as pointers.

There are many solutions to your problem...


Take a look ( disassemble ) at the SList API for AMD64. It is lock-free, and
it uses the stable IBM FreeList algo. How does it avoid the ABA problem? It's
all about ABA...

;)

--
The designer of the experimental, SMP and HyperThread friendly, AppCore
library.

http://AppCore.home.comcast.net


Joe Seigh

Sep 9, 2003, 6:35:00 AM

There isn't. Use storage pools of 4GB or less and offsets into those pools.

Joe Seigh
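Joe's pool-plus-offsets idea can be sketched like this with C11 atomics (a minimal sketch; the pool size, the NIL marker, and the tag layout are invented for illustration):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Nodes live in a pool of at most 4GB, so a 32-bit offset names a node.
   Offset + 32-bit version tag pack into one 64-bit word, which a plain
   64-bit CAS (cmpxchg8b on IA-32, a native CAS in long mode) can swap. */
#define NIL 0xFFFFFFFFu

typedef struct Node { uint32_t next; int value; } Node;

static Node pool[1024];              /* the <4GB storage pool */
static _Atomic uint64_t head = NIL;  /* tag = 0, offset = NIL */

static uint64_t pack(uint32_t tag, uint32_t off) {
    return ((uint64_t)tag << 32) | off;
}

void push(uint32_t off) {
    uint64_t old = atomic_load(&head);
    do {
        pool[off].next = (uint32_t)old;   /* link to the current front */
    } while (!atomic_compare_exchange_weak(&head, &old,
                 pack((uint32_t)(old >> 32) + 1, off)));
}

uint32_t pop(void) {
    uint64_t old = atomic_load(&head);
    uint32_t off;
    do {
        off = (uint32_t)old;
        if (off == NIL)
            return NIL;                   /* empty stack */
    } while (!atomic_compare_exchange_weak(&head, &old,
                 pack((uint32_t)(old >> 32) + 1, pool[off].next)));
    return off;
}
```

The tag makes each successful swap produce a value not seen before (until wrap), which is what stands in for the missing 16-byte CAS.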

Neill Clift [MSFT]

Sep 9, 2003, 11:11:45 AM
"SenderX" <x...@xxx.xxx> wrote in message
news:uce7b.11334$mp....@rwcrnsc51.ops.asp.att.net...

> > lock-free algorithms wouldn't be possible with AMD64
> > -systems in long-mode (or only in the lower 4GB of the address-space).
>
> They are possible on AMD64.
>
> =)
>
>
> Lock-Free collections that allow for ( lock-free && concurrent ) iterations,
> pushes, and pops will work on systems if the target processor has proper (
> if needed ) memory barriers,and a LL/SC and/or CAS that works on memory
> sizes >= sizeof( void* )... Lock-free algos will work with 64-bit
> processors. No problem.
>
>
> Some tips to get things going:
>
> Replace pointers, with index's.
>
> Use a proxy garbage collector.
>
> Use a lock-free read-write mutex.
>
> Use atomic_ptr with index's, and/or mapped memory with offsets as pointers.
>
> There are many solutions to your problem...
>
>
> Take a look ( disassemble ) at the SList API for AMD64. It is lock-free, and
> it uses the stable IBM FreeList algo. How does it avoid the ABA problem? Its
> all about ABA...

I assume you're talking about the win32 implementation. I have never heard it
referred to by this name. Can you give details on this?
Anyway, win32 tries to avoid ABA via pointer compression and sequence
numbers.
So it does suffer from sequence number wrap.

>
> ;)
>
> --
> The designer of the experimental, SMP and HyperThread friendly, AppCore
> library.
>
> http://AppCore.home.comcast.net
>
>

--
This posting is provided "AS IS" with no warranties, and confers no rights.


SenderX

Sep 9, 2003, 5:49:23 PM
> And AMD64 doesn't have a 16-byte CAS. So lock-free stuff is only possible
> in a very limited way in the lower 4GB of the address-space.

Limited?

If a 64-bit processor has a CAS and/or LL/SC that works on 64-bit pointers,
you can implement lock-free semaphores, mutexes, read-write mutexes, stacks,
queues, and linked lists.

The ABA problem is eliminated when the lock-free algo in question is used
with a normal or lock-free garbage collector. No need for sequence numbers,
and the algo is now dynamic.

Therefore, a normal CAS is all that is needed. That makes lock-free algos
very portable on 64-bit boxes.

Also... You can simply replace pointers with indexes/offsets.

SenderX

Sep 9, 2003, 6:03:39 PM
> There isn't. Use storage pools of 4GB or less and offsets into those pools.

Does AMD64 have a CAS or LL/SC that operate on 8-bytes of memory?

If so...

SMR or other proxy garbage collectors can be used to eliminate the need for
ABA counts. That would make lock-free algos compatible with AMD64.

I am working on a proxy garbage collector that uses normal CAS instead of a
wide CAS.

SenderX

Sep 9, 2003, 6:26:38 PM
> Take a look ( disassemble ) at the SList API for AMD64. It is lock-free, and
> it uses the stable IBM FreeList algo. How does it avoid the ABA problem? Its
> all about ABA...

I meant to say that the IA-32 version of the SList API uses the IBM FreeList
algo.

I am not sure what you are using on AMD64...


> I assume you're talking about the win32 implementation. I have never heard it
> referred to by this name. Can you give details on this?

I say SList API, because it is shorter than the InterlockedPushEntrySList and
InterlockedPopEntrySList functions. ;)


> Anyway win32 tries to avoid ABA via pointer compression

Please elaborate on this. This must be a trick to get pointers and sequence
numbers to both fit in a normal CAS?


> So it does suffer from sequence number wrap.

The ABA danger of Sequence wrap can be significantly reduced by delaying the
reuse of pointers into collections.

Neill Clift [MSFT]

Sep 9, 2003, 6:39:20 PM

"SenderX" <x...@xxx.xxx> wrote in message
news:yss7b.406195$Ho3.61365@sccrnsc03...

> > Take a look ( disassemble ) at the SList API for AMD64. It is lock-free, and
> > it uses the stable IBM FreeList algo. How does it avoid the ABA problem? Its
> > all about ABA...
>
> I meant to say that the IA-32 version of the SList API uses the IBM FreeList
> algo.
>
> I am not sure what you are using on AMD64...
>

We use the same algorithm on x86, AMD64 and IA64. Can you give me
a pointer to this IBM FreeList stuff? I am interested in seeing the origins
of the algorithm if it's indeed the same.

>
> > I assume you're talking about the win32 implementation. I have never heard it
> > referred to by this name. Can you give details on this?
>
> I say SList API, because it is shorter than InterlockedPushEntrySList and
> InterlockedPopEntrySList functions. ;)
>
>
> > Anyway win32 tries to avoid ABA via pointer compression
>
> Please elaborate on this. This must be a trick to get pointers and sequence
> numbers to both fit in a normal CAS?
>

Basically the SList code is tied to memory management. It knows what bits are
unused by the h/w and memory manager. This frees up bits that we use for a
16-bit depth field (arguably a mistake, but it's useful for lookaside lists)
and an 8- or 9-bit sequence number (16 on x86). Obviously this stuff can
change and is different from platform to platform.

So you're right, it's a trick to get everything to fit into 64 bits on the
64-bit platforms, as that's the only primitive we have on AMD64 and IA64.
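For illustration only, that kind of bit packing might look like the following in C. The field widths (39-bit shifted pointer, 9-bit sequence, 16-bit depth) are invented here; as noted, the real layout varies by platform:

```c
#include <assert.h>
#include <stdint.h>

/* One possible 64-bit SList header layout (illustrative only):
   low 39 bits: the pointer bits the h/w actually uses, shifted down
   by 4 because nodes are assumed 16-byte aligned; next 9 bits: the
   sequence number; top 16 bits: the depth count. */
typedef uint64_t Header;

#define PTR_BITS 39
#define SEQ_BITS 9

static Header make(uint64_t ptr, unsigned seq, unsigned depth) {
    return (ptr >> 4)                              /* drop alignment bits */
         | ((uint64_t)(seq & 0x1FF) << PTR_BITS)
         | ((uint64_t)depth << (PTR_BITS + SEQ_BITS));
}
static uint64_t hdr_ptr(Header h)   { return (h & ((1ull << PTR_BITS) - 1)) << 4; }
static unsigned hdr_seq(Header h)   { return (unsigned)((h >> PTR_BITS) & 0x1FF); }
static unsigned hdr_depth(Header h) { return (unsigned)(h >> (PTR_BITS + SEQ_BITS)); }
```

The whole header then still fits the single 64-bit interlocked swap available on the 64-bit platforms.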

>
> > So it does suffer from sequence number wrap.
>
> The ABA danger of Sequence wrap can be significantly reduced by delaying the
> reuse of pointers into collections.
>

There are lots of things about SLists I don't like. For example the need to
handle bad memory references chasing the forward pointers, and sequence
number wrap. Other schemes seem to have lots of overhead registering
pointers etc. in the system though. This is exported stuff so I doubt it's
likely to change.

I have thought about using something that's mostly lock-free to achieve the
same thing. Sort of how we do fast referencing with process tokens etc.
Most of the time we can be lock-free, but rarely we have to obtain a lock.
This seems to work well in practice.

> --
> The designer of the experimental, SMP and HyperThread friendly, AppCore
> library.
>
> http://AppCore.home.comcast.net
>
>

--

Joe Seigh

Sep 9, 2003, 7:31:27 PM

"Neill Clift [MSFT]" wrote:
>
> "SenderX" <x...@xxx.xxx> wrote in message
> news:yss7b.406195$Ho3.61365@sccrnsc03...
> > > Take a look ( disassemble ) at the SList API for AMD64. It is lock-free, and
> > > it uses the stable IBM FreeList algo. How does it avoid the ABA problem? Its
> > > all about ABA...
> >
> > I meant to say that the IA-32 version of the SList API uses the IBM FreeList
> > algo.
> >
> > I am not sure what you are using on AMD64...
> >
>
> We use the same algorithm on x86, AMD64 and IA64. Can you give me
> a pointer to this IBM FreeList stuff? I am interested in seeing the origins
> of the algorithm if its indeed the same.

Here http://www-1.ibm.com/servers/eserver/zseries/zos/bkserv/zswpdf/zarchpops.html
in appendix A under multi-programming examples. It's been there since at least
system 370.

Also, somebody recently patented it in patent 6,178,473. I wonder if they
have figured out yet that their patent is no good.

...


> There are lots of things about SLists I don't like. For example the need to
> handle bad memory references chasing the forward pointers and sequence number
> wrap. Other schemes seem to have lots of overhead registering pointers etc. in
> the system though. This is exported stuff so I doubt it's likely to change.
>
> I have thought about using something thats mostly lock free to achieve the
> same thing. Sort of how we do fast referencing with process tokens etc. Most of
> the time we can be lock free but rarely we have to obtain a lock. This seems to
> work well in practice.

SLists use intrusive (embedded) link structures. If you use those you have to
use some sort of hashing to compress the addresses, as you have mentioned.
What you can do is use an external link structure transparently allocated by
the SList routines. If you allocate these out of a <4GB memory pool you can
use 32-bit offsets instead of 64-bit addresses for the link fields, and a
64-bit pointer to the original object being queued. So you end up with 32
bits available for the sequence/version number, which is probably adequate
for ABA detection.

Joe Seigh
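A minimal sketch of the external-link layout Joe describes (names invented); the point is that the list head still fits a single 64-bit CAS while queued objects keep full 64-bit pointers:

```c
#include <assert.h>
#include <stdint.h>

/* External (non-intrusive) links allocated from a <4GB pool, so a
   32-bit offset reaches any link, while the queued object itself
   keeps a full 64-bit pointer. */
typedef struct Link {
    uint32_t next;     /* 32-bit offset of the next Link in the pool */
    void    *object;   /* 64-bit pointer to the original queued object */
} Link;

/* The list head: 32-bit offset of the first link plus a 32-bit
   sequence/version number, swapped as one 64-bit unit. */
typedef union Head {
    uint64_t whole;    /* what the 64-bit CAS actually compares/swaps */
    struct { uint32_t first; uint32_t seq; } parts;
} Head;
```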

SenderX

Sep 9, 2003, 7:41:19 PM
> We use the same algorithm on x86, AMD64 and IA64. Can you give me
> a pointer to this IBM FreeList stuff? I am interested in seeing the origins
> of the algorithm if its indeed the same.

IBM System/370 Extended Architecture, Principles of Operation.

IBM Publication No. SA22-7085, 1983

You guys tweaked it to allow for an atomic depth counter.

Here is another paper that cites the IBM FreeList algo:

www.research.ibm.com/people/m/michael/podc-2002.pdf


Here is my current implementation of the IBM FreeList algo for IA-32:


LockFree_IA32.h
------------------

#ifndef LockFree_IA32H
#define LockFree_IA32H


/* Pointer to a lock-free node */
typedef struct SLfNodePtr_
{
    volatile struct SLfNode_* pNode;
    int iAba;

} volatile SLfNodePtr;


/* Lock-free node object */
typedef struct SLfNode_
{
    SLfNodePtr Next;
    const void* pObj;

} volatile SLfNode;


/* Lock-free stack */
typedef struct SLfStack_
{
    SLfNodePtr Front;

} volatile SLfStack;


/* Lock-free proxy garbage */
typedef struct SLfProxyGarbage_
{
    SLfStack Nodes;
    int iRefs;

} volatile SLfProxyGarbage;


/* Pushes a node onto a stack (32-bit CAS) */
extern void acLfStackPush( SLfStack* pStack, SLfNode* pNode );


/* Pops a node from a stack (64-bit CAS) */
extern SLfNode* acLfStackPopAba( SLfStack* pStack );


#endif

LockFree_IA32.asm
------------------


.586
.MODEL FLAT, C
.CODE


; 32-bit Lock-Free Stack Push
; extern void acLfStackPush( SLfStack* pStack, SLfNode* pNode );
acLfStackPush PROC

push ebp
mov ebp, esp

; Load the front
mov edx, [ebp + 8]
mov eax, [edx]

; Load the new node
mov ecx, [ebp + 12]


acLfStackPush_Retry:

; Prepare the Cas
mov [ecx], eax

; Try and link the new node (lock prefix needed for SMP atomicity)
lock cmpxchg [edx], ecx

jnz acLfStackPush_Retry

pop ebp
ret

acLfStackPush ENDP


; 64-bit Lock-Free Stack Pop
; extern SLfNode* acLfStackPopAba( SLfStack* pStack );
acLfStackPopAba PROC

push ebp
mov ebp, esp

; Preserve
push esi
push ebx

; Load the front
mov esi, [ebp + 8]
mov eax, [esi]
mov edx, [esi + 4]


acLfStackPopAba_Retry:

test eax, eax

jz acLfStackPopAba_Fail

; Prepare the Cas
mov ebx, [eax]
mov ecx, edx
inc ecx

; Try and unlink the front (lock prefix needed for SMP atomicity)
lock cmpxchg8b qword ptr [esi]

jnz acLfStackPopAba_Retry


acLfStackPopAba_Fail:

; Restore
pop ebx
pop esi

pop ebp

ret

acLfStackPopAba ENDP


END


You can also hook this stack up to a garbage collector to allow for node
deletion.

SenderX

Sep 9, 2003, 8:05:01 PM
> So you right its a trick to get everything to fit into 64 bits on the 64 bit
> platforms as the only primitive we have on AMD64 and IA64.

IA-64 has cmp8xchg16. You can use that to implement the SList API.

Neill Clift [MSFT]

Sep 9, 2003, 10:52:41 PM

"SenderX" <x...@xxx.xxx> wrote in message

news:zyt7b.409245$uu5.73990@sccrnsc04...


> > We use the same algorithm on x86, AMD64 and IA64. Can you give me
> > a pointer to this IBM FreeList stuff? I am interested in seeing the origins
> > of the algorithm if its indeed the same.
>
> IBM System/370 Extended Architecture, Principles of Operation.
>
> IBM Publication No. SA22-7085, 1983
>
> You guys tweaked it to allow for an atomic depth counter.
>
> Here is a another paper that cites the IBM FreeList algo:
>
> www.research.ibm.com/people/m/michael/podc-2002.pdf
>
>


Thanks for this reference. It's nice to see where these things come from.

Neill Clift [MSFT]

Sep 9, 2003, 11:26:41 PM

"SenderX" <x...@xxx.xxx> wrote in message

news:zyt7b.409245$uu5.73990@sccrnsc04...

Sorry, I didn't notice your code till I replied. Maybe I am missing
something obvious, but your code seems to have a race condition.

Let's say the list looks like this:

Head: <x,1> followed by elements y and z. 1 is the sequence number
here.

You load eax with the value x.
Now before you do anything else another thread pops off x,
converting the list to this:

Head: <y,2> followed by z

You read the sequence number into edx = 2. You are effectively looking
forward in time with this read.

Now your code continues and reads garbage from x->next, as it's been
reallocated.

Now if the thread that popped x pushes it back on the list you get this:

Head: <x,2> followed by y and z

Your original thread now does its CAS and succeeds, since both
the sequence number and the front pointer match, but you updated
the head with garbage.

You need to read the sequence number first, right?
Neill.
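That interleaving can be replayed deterministically in a single thread; here is a sketch with invented names, where a plain function stands in for the CAS:

```c
#include <assert.h>
#include <stdint.h>

/* Single-threaded replay of the interleaving described above.
   One thread is enough to demonstrate the bad load ordering. */
typedef struct { uintptr_t ptr; uint32_t seq; } Head;

static int cas(Head *h, Head expect, Head desired) {
    if (h->ptr == expect.ptr && h->seq == expect.seq) { *h = desired; return 1; }
    return 0;
}

uintptr_t buggy_replay(void) {
    uintptr_t x = 1, y = 2;
    Head head = { x, 1 };                  /* Head: <x,1> followed by y, z */
    uintptr_t seen_ptr = head.ptr;         /* victim loads the pointer...  */
    head = (Head){ y, 2 };                 /* other thread pops x: <y,2>   */
    uint32_t seen_seq = head.seq;          /* ...then loads the seq: 2     */
    uintptr_t stale_next = 0xDEAD;         /* victim reads garbage x->next */
    head = (Head){ x, 2 };                 /* x pushed back: <x,2>         */
    /* victim's CAS succeeds: pointer and sequence number both match... */
    if (cas(&head, (Head){ seen_ptr, seen_seq }, (Head){ stale_next, seen_seq }))
        return head.ptr;                   /* ...and the head is garbage   */
    return 0;
}
```

The CAS succeeds because both compared fields match, even though the value being installed came from a stale read of x->next.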

...
...

--

SenderX

Sep 10, 2003, 1:57:52 AM
> Sorry, I didn't notice your code till I replied.

Here is basically the same lock-free algo in C++:

http://intel.forums.liveworld.com/thread.jsp?forum=242&thread=6721

This has the lock-free stack hooked up to my proxy garbage collector algo,
and a fast-pathed semaphore.

The C++ is clearer than my rusty asm. ;)

My C++ version and all my other lock-free collections increment their
sequence counts on pop and push.

That's good, because all of them would have the EXACT same race condition
you found if they didn't.

Anyway, the current asm version does not increment ABA on push, so the ABA
race condition due to reading the sequence number last is real!

The reason I have not seen this race in the asm code yet is because I was
testing it with a gc algo I am tinkering with. So ABA was a non-issue
altogether.


> Maybe I am missing
> something obvious but your code seems to have a race condition.

Thank you for taking the time to say something!!!

=)


P.S.

Will there need to be a memory barrier after the initial sequence load on
processors with weak memory ordering?

Neill Clift [MSFT]

Sep 10, 2003, 11:42:16 AM
"SenderX" <x...@xxx.xxx> wrote in message
news:A3z7b.407532$Ho3.63708@sccrnsc03...

> > Sorry didn't notice you code till I replied.
>
> Here is basically the same lock-free algo in C++:
>
> http://intel.forums.liveworld.com/thread.jsp?forum=242&thread=6721
>
> This has the lock-free stack hooked up to my proxy garbage collector algo,
> and a fast-pathed semaphore.
>
> The C++ is clearer than my rusty asm. ;)

I don't think I have the time to look at a large chunk of code.

>
> My C++ version and all my other lock-free collections increment their
> sequence counts on pop, and push.

It's not necessary to do this. I made this change in windows a few years
ago. We only increment the SN on pushes. Pick your side and analyse the code
in detail. I didn't change one side to a 4-byte cmpxchg as you have done
though, as the intel manuals specifically call out not to have differing
lengths for interlocked ops. I think it works just fine if you do though.

Windows had at least one bug related to SN use that I found a few years
ago. Specifically, the flush code (pop entire list) set the whole head to
zero. That's a bug, as flushes with pushes and pops give failure paths much
shorter than you would find with SN overflow. You need to either leave the
sequence number alone or increment it, depending on where you have your SN
inc.

This stuff isn't as simple as it seems. I ran our current algorithm through
Leslie Lamport's TLA+ to give me a high degree of confidence in it.
I was able to see that incrementing the SN only on one side does increase
the failure path length (SN overflow).
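The flush fix being described might look like this with C11 atomics; the layout is invented (low 32 bits for the list pointer/offset, high 32 bits for the SN):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* The head packs a 32-bit "pointer" and a 32-bit sequence number.
   A flush must not store 0 over the whole head: that would reset the
   sequence number and reopen the ABA window. Instead, CAS in a null
   pointer while leaving (or bumping) the SN. */
static _Atomic uint64_t head;

static uint64_t pack(uint32_t ptr, uint32_t seq) {
    return ((uint64_t)seq << 32) | ptr;
}

uint32_t flush(void) {
    uint64_t old = atomic_load(&head);
    while (!atomic_compare_exchange_weak(&head, &old,
               pack(0, (uint32_t)(old >> 32))))  /* keep the SN, null the list */
        ;
    return (uint32_t)old;                        /* former front of the list */
}
```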

>
> Thats good because all of them would have EXACT same race-condition you
> found if they didn't.
>

I was beginning to doubt myself, as every other post in the newsgroups tells
me how stupid people at MS are supposed to be :-).

[snip]


>
> Thank you for taking the time to say something!!!
>
> =)
>
>
> P.S.
>
> Will there need to be a memory barrier after the initial sequence load on
> processors with weak memory orders?
>

On the P4 they have load and store barriers, so two writes or two
reads can get reordered. One of the loads is dependent on the other
(you can't read the next pointer until you have read the head), but I think
they often call out cases and say they can be reordered (hard to see how).
Presumably the sequence load and the head ptr load can get reordered.
We don't set any of these relaxed options in windows as far as I know.

We have barriers in our code for the ia64 versions, for example.
Neill.

SenderX

Sep 10, 2003, 8:43:10 PM
> I don't think I have the time to look at a large chunk of code.

;)


> > My C++ version and all my other lock-free collections increment their
> > sequence counts on pop, and push.
>
> It's not necessary to do this.

You're correct.


> We only increment the SN on pushes. Pick your side and analyse the code
> in detail.

You still need to include the SN in the pop function's CAS, correct?

If you increment on pop, you can omit the SN from the push function's CAS.


> I didn't change one side to a 4 byte cmpexch as you have done

You can only use a 4-byte CAS in the push function.


> though as the intel manuals specificaly call out not to have differing
> lengths for interlocked ops. I think its works just fine if you do though.

I've never had a problem with cmpxchg's and cmpxchg8b's executing
concurrently on the same data.


> Thats a bug as flushes with pushes and pops and failure paths much shorter
> that you would find with SN overflow. You need to either leave the sequence
> number alone or incremenet it depending on where you have your SN inc.

Yeah, just CAS the head with a NULL along with an updated SN.


> I was able to see that incrementing the SN only on one side does increase
> the failure path length (SN overflow).

Yep.


> I was beggining to doubt myself as every other post in the newsgroups tells
> me how stupid people at MS are supposed to be :-).

Jealousy?

=P


> On the P4 they have Load and Store barriers so two writes or two
> reads can get reordered.

You mean lfence, sfence, && mfence?

Neill Clift [MSFT]

Sep 11, 2003, 12:32:59 PM
"SenderX" <x...@xxx.xxx> wrote in message
news:xyP7b.411628$o%2.187567@sccrnsc02...

> > I don't think I have the time to look at a large chunk of code.
>
> ;)
>
>
> > > My C++ version and all my other lock-free collections increment their
> > > sequence counts on pop, and push.
> >
> > It's not necessary to do this.
>
> Your correct.
>
>
> > We only increment the SN on pushes. Pick your side and analyse the code
> > in detail.
>
> You still need to include the SN in the pop functions CAS, correct?

No, we don't do this. We leave the SN alone for flush and pop. We
increment the SN on push and pushlist. We do 8-byte swaps for all operations,
and we read the SN first in the pop operation. I believe this to be solid.

>
> If you increment on pop, you can omit the SN from the push functions CAS.
>
>
> > I didn't change one side to a 4 byte cmpexch as you have done
>
> You can only use a 4-byte CAS in the push function.
>

This sounds right as the pop has to fail if the SN changes and you
can only see that with an 8b swap.

>
> > though as the intel manuals specificaly call out not to have differing
> > lengths for interlocked ops. I think its works just fine if you do though.
>
> I've never had a problem with cmpxchg's and cmpxchg8b's executing
> concurrently on the same data.
>
>
> > Thats a bug as flushes with pushes and pops and failure paths much shorter
> > that you would find with SN overflow. You need to either leave the sequence
> > number alone or incremenet it depending on where you have your SN inc.
>
> Yeah, just cas the head with a NULL along with a updated SN.
>
>
> > I was able to see that incrementing the SN only on one side does increase
> > the failure path length (SN overflow).
>
> Yep.
>
>
> > I was beggining to doubt myself as every other post in the newsgroups tells
> > me how stupid people at MS are supposed to be :-).
>
> Jealousy?
>
> =P
>
>
> > On the P4 they have Load and Store barriers so two writes or two
> > reads can get reordered.
>
> You mean lfence, sfence, && mfence?
>

yes.

> --
> The designer of the experimental, SMP and HyperThread friendly, AppCore
> library.
>
> http://AppCore.home.comcast.net
>
>

SenderX

Sep 13, 2003, 4:00:10 AM
>> You still need to include the SN in the pop functions CAS, correct?
> No we don't do this.

I meant, you have to compare the SN in the pop operation...

;)


> We leave the SN alone for flush and pop. We increment the SN on push and
> pushlist. We do 8-byte swaps for all operations, and we read the SN first
> in the pop operation. I believe this to be solid.

For sure.


> > You can only use a 4-byte CAS in the push function.
> >
>
> This sounds right as the pop has to fail if the SN changes and you
> can only see that with an 8b swap.

The push function is inherently immune to the ABA problem...

SenderX

Sep 13, 2003, 4:27:05 AM
> We have barriers in our code for the ia64 versions for example.

IA-64 has acquire and release built into cmpxchg && cmp8xchg16...

Do you release on pop and acquire on push?

Do you have an acquire after the pop functions initial SN load?

Neill Clift [MSFT]

Sep 13, 2003, 12:37:22 PM
"SenderX" <x...@xxx.xxx> wrote in message
news:txA8b.328463$cF.99684@rwcrnsc53...

> > We have barriers in our code for the ia64 versions for example.
>
> IA-64 has acquire and release built into cmpxchg && cmp8xchg16...

This is not an area I know real well.

>
> Do you release on pop and acquire on push?

They are all release cmpxchg

>
> Do you have an acquire after the pop functions initial SN load?

We have a normal load and a release for the cmpxchg. Obviously
we only have one load on the 64 bit platforms.

>
> --
> The designer of the experimental, SMP and HyperThread friendly, AppCore
> library.
>
> http://AppCore.home.comcast.net
>
>

Neill Clift [MSFT]

Sep 15, 2003, 11:04:35 PM
"Oliver S." <Foll...@gmx.net> wrote in message
news:Xns93F8328D13CF...@130.133.1.4...

> > There are lots of things about SLists I don't like. For example the
> > need to handle bad memory references chasing the forward pointers
>
> With Windows you can catch these bad memory-references with
> an SEH-handler. Check the following code; it provides similar
> functionality to your SList-API, but you don't need to care
> about any list-headers which may have been deallocated:
>
...
...

The thing I don't like about the exception is that it can have side
effects. For example, if the memory that was in the list is freed
and reused as a guard page for a stack, the reference would blow
this. There is also the possibility of having device registers mapped
into the user mode address space, and those refs have side effects.

Sure, it's very low probability, but I like zero probability better.

SenderX

Sep 16, 2003, 12:39:56 AM
<Crappy advise>
Please study the source I supplied. It provides the same as your
SLists, but the pages which host the items can be deallocated without
any precautions. As I read your words, I asked myself why I documented
my code so verbosely.
</Crappy advise>


Listen, Oliver...

I have had to repeat the obvious to you... Over and over again!

;(


Your exception handling schema will seem to work, but is flawed in a most
serious way...

You rely on exceptions!!!

Damn...

How many times do I have to explain this to you, Oliver????


Joes atomic_ptr, my proxy garbage collector code, SMR, and many other
schemas are used to avoid the exceptions in the first place!

Your solution seems to think that exceptions in lock-free algos are the
norm!

WRONG!


MS would have to be truly "brain-dead" to adopt your "foolish" strategy...

My little, X-No-Archive: yes, friend...

=)


P.S.

Oliver...

All the algos that bypass the need to rely on exceptions (YUCK!) are posted
to this newsgroup by Joe and me, or you can google them...


Please try to learn:

I remember when you said dynamic lock-free FIFO queues were, quote,
"impossible"...

This is of course laughable!!

Get a CLUE!

Neill Clift [MSFT]

Sep 16, 2003, 1:05:57 AM
"Oliver S." <Foll...@gmx.net> wrote in message
news:Xns93F83DE11C67...@130.133.1.4...

> > For example if the memory that was in the list is freed and reused
> > as a guard page for a stack. ...

>
> Please study the source I supplied. It provides the same as your
> SLists, but the pages which host the items can be deallocated without
> any precautions. As I read your words, I asked myself why I documented
> my code so verbosely.

I didn't read your code. I just read enough to see you're using exception
handling. If you need to use SEH you probably have side effects, and
that's bad in very obscure cases.
I am not saying you can't catch the exception. I am saying that the act
of getting an exception screws something up outside of this code.

You don't seem to have the A-B-A issues the other poster had with his
list code.

Neill Clift [MSFT]

Sep 16, 2003, 1:15:21 AM
"SenderX" <x...@xxx.xxx> wrote in message
news:wuw9b.468660$o%2.210440@sccrnsc02...
...
...

>
>
> MS would have to be truly "brain-dead" to adopt your "foolish" strategy...
>
...
...

Much as I would like to say that we got this right, we didn't. We use
exception handling as well. I think it's got problems with side effects
from the bad references that only kernel trap support can address (don't
trigger guard page conversion etc). Even then, mapped device registers
would be an issue.

Of course you will never see these obscure situations, as you need whole
pages to go away and the pages get reused for stacks etc. You would
need to delete your heaps etc.
No excuse though; I think it's poor.

Don't be hard on the guy. His code didn't have the A-B-A problem :-)

SenderX

Sep 16, 2003, 1:45:31 AM
> Much as I would like to say that we got this right we didn't. We use
> exception
> handling as well.

You can delete nodes from lock-free algos, without using exceptions.


> Don't be hard on the guy. His code didn't have the A-B-A problem :-)

Only if he reuses nodes.

;)

Exceptions should be avoided. There are VERY well known algos that allow
lock-free algos to delete nodes.

SenderX

Sep 16, 2003, 1:48:07 AM
> You don't seem to have the A-B-A issues the other poster had with his
> list code.

His code relies on exceptions to avoid the ABA problem.

His code will suffer from ABA, if he tries to reuse memory.


P.S.

Memory reuse is a good thing! Exceptions are BAD...

Very BAD!

SenderX

Sep 16, 2003, 1:52:15 AM
> But these x86-fences are only needed for the weakly-ordered SSE
> -operations: MOVNTDQ, MOVNTI, MOVNTPD, MOVNTPS, MOVNTQ and in some
> cases MASKMOVEDQU and MASKMOVQ.

I believe you are correct.

lfence, sfence, && mfence do not need to be used for lock-free algos.


SenderX

Sep 16, 2003, 1:53:49 AM
> Only if he reuses nodes.

DOH...

I meant:

Only if "does NOT reuse" nodes!


Joe Seigh

Sep 16, 2003, 6:54:55 AM

Depends. The interlocked instructions which impose memory barriers may
be enough for some situations but not all lock-free algorithms use just
interlocked instructions. DCL for instance.

Joe Seigh

Joe Seigh

Sep 16, 2003, 7:00:01 AM

You should only use this algorithm for storage pools over which you
control the virtual memory space allocation, to prevent seg faults
or remapping to memory that has side effects. Otherwise you do have
to employ some sort of GC to track memory usage.

Joe Seigh

Neill Clift [MSFT]

Sep 16, 2003, 11:27:39 AM
"SenderX" <x...@xxx.xxx> wrote in message
news:rux9b.363497$Oz4.142606@rwcrnsc54...

> > You don't seem to have the A-B-A issues the other poster had with his
> > list code.
>
> His code relies on exceptions to avoid the ABA problem.
>
> His code will suffer from ABA, if he tries to reuse memory.
>
>

No I don't think it will. His only A-B-A issue will be sequence overflow.
Please correct my ignorance and post the A-B-A sequence like I did
in an earlier message.

>
>
> P.S.
>
> Memory reuse is a good thing! Exceptions are BAD../
>
> Very BAD!

Not arguing there.

>
> --
> The designer of the experimental, SMP and HyperThread friendly, AppCore
> library.
>
> http://AppCore.home.comcast.net
>
>

SenderX

Sep 16, 2003, 2:10:03 PM
> Depends. The interlocked instructions which impose memory barriers may
> be enough for some situations but not all lock-free algorithms use just
> interlocked instructions. DCL for instance.

Ahhh....

SenderX

Sep 16, 2003, 2:17:25 PM
> No I don't think it will.

You are right. I am wrong:

Stupid me missed the ABA increment on Pool::IA32Push.


> His only A-B-A issue will be sequence overflow.

Yes.


> Please correct my ignorance and post the A-B-A sequence like I did
> in an earlier message.

Here is a 100% experimental algo that I am creating to help avoid ABA:

http://groups.google.com/groups?&selm=plCAa.475147%24Si4.412103%40rwcrnsc51.ops.asp.att.net&rnum=1

I can post the current C++ implementation if you want.

Joe Seigh

Sep 16, 2003, 2:34:19 PM

"Neill Clift [MSFT]" wrote:
>
> "SenderX" <x...@xxx.xxx> wrote in message
> news:rux9b.363497$Oz4.142606@rwcrnsc54...
> > > You don't seem to have the A-B-A issues the other poster had with his
> > > list code.
> >
> > His code relies on exceptions to avoid the ABA problem.
> >
> > His code will suffer from ABA, if he tries to reuse memory.
> >
> >
>
> No I don't think it will. His only A-B-A issue will be sequence overflow.

Sequence wrap actually (modulo 2**32). Is a 32-bit sequence number enough?
Are the time slices so far apart in windows that a 32-bit counter could
wrap in that interval?

Of course a 64 bit counter is better which is what Itanium has with their
cmp8xchg16 instruction. But Itanium arch is a little schizophrenic. They
also have MWAIT/MONITOR but no load locked/store conditional which would
have also solved the ABA problem.

Joe Seigh

Neill Clift [MSFT]

unread,
Sep 16, 2003, 3:28:23 PM9/16/03
to
"Joe Seigh" <jsei...@xemaps.com> wrote in message
news:3F675A22...@xemaps.com...

>
>
> "Neill Clift [MSFT]" wrote:
> >
> > "SenderX" <x...@xxx.xxx> wrote in message
> > news:rux9b.363497$Oz4.142606@rwcrnsc54...
> > > > You don't seem to have the A-B-A issues the other poster had with
> > > > his list code.
> > >
> > > His code relies on exceptions to avoid the ABA problem.
> > >
> > > His code will suffer from ABA, if he tries to reuse memory.
> > >
> > >
> >
> > No I don't think it will. His only A-B-A issue will be sequence overflow.
>
> Sequence wrap actually (modulo 2**32). Is a 32 bit sequence number enough?
> Are the time slices that far apart in Windows that a 32 bit counter could
> wrap in that interval?

Yes, you're right. It's better described as sequence number wrap.
I am sure it's no real problem for x86 but the space for sequence numbers
is smaller on the 64 bit platforms with the 8 byte swaps.


>
> Of course a 64 bit counter is better which is what Itanium has with their
> cmp8xchg16 instruction. But Itanium arch is a little schizophrenic. They
> also have MWAIT/MONITOR but no load locked/store conditional which would
> have also solved the ABA problem.
>
> Joe Seigh


SenderX

unread,
Sep 16, 2003, 3:28:51 PM9/16/03
to
> __declspec(naked)
> Pool::Item *__fastcall
> Pool::IA32Push( IA32ListHeader volatile *phdr, Item *pitmPush )
> {
> __asm
> {


Is it safe to use __declspec(naked) && __fastcall with the inline assembler?

SenderX

unread,
Sep 16, 2003, 8:04:04 PM9/16/03
to
> What I do through exception-handling here has absolultely nothing
> to do with the aba-problem; the aba-prevention is simply as usual.

You are right. I was wrong:

Stupid me missed the ABA increment on Pool::IA32Push.


SenderX

unread,
Sep 16, 2003, 8:24:04 PM9/16/03
to
> > There are VERY well known algos that allow lock-free algos to delete
> > nodes.
>
> But they're not very efficient when compared to the trivial lf-algos
> which don't allow deallocation of the underlying pages; that's because
> they need additional data-structures (like hazard-ptrs) to be swapped
> between the caches of the participating CPUs.

They avoid exceptions, and avoid the need for ABA counts. There is a version
of atomic_ptr that I tweaked that has ABA counts built in, so you can use
the atomic_ptr interface to directly implement the lock-free algos...


> But they're not very efficient

Think of a lock-free proxy garbage collector that allows for node reuse
before decrementing to a zero ref-count... This prevents the garbage
overflow problem inherent in many proxy gc's.


I have working code that implements the above schema. It is "extremely
efficient". It can also be configured to reuse memory on a global scale...

BTW...

My lock-free garbage collector code can be used with virtually all existing
lock-free collection algos!

For instance, it's 100% compatible with the SList API.

It can also be used to ensure that the lock-free pool code that you posted
never gets an exception, without changing a single line of code in your
algo.

The current version uses Cas64 on x86.

The current experimental version uses normal Cas on x86, which makes it very
portable to 64-bit processors.

=)

Eric Dumazet

unread,
Sep 17, 2003, 2:46:05 AM9/17/03
to
Hello all

"Oliver S." <Foll...@gmx.net> wrote in message
news:Xns93F8328D13CF...@130.133.1.4...


> __declspec(naked)
> Pool::Item *__fastcall
> Pool::IA32Push( IA32ListHeader volatile *phdr, Item *pitmPush )
> {
> __asm
> {

> push ebx // preserve ebx
> push edi // preserve edi
> mov edi, ecx // phdr -> edi - to free ecx for cmpxchg8b
> mov ebx, edx // pitmPush -> ebx - to be cmpxchg8b-conformant
>
> mov eax, [edi] // current phdr->pitmTop -> eax
> mov edx, [edi + 4] // current phdr->dwAbaSequence -> edx
> xchgLoop:
> mov [ebx], eax // pitmPush->pitmNext = phdr->pitmTop
> lea ecx, [edx + 1] // new phdr->dwAbaSequence -> ecx
> lock cmpxchg8b [edi]
> jne xchgLoop
>
> pop edi // restore edi
> pop ebx // restore ebx
> ret
> }
> }

I believe there is a bug here:

mov eax, [edi] // current phdr->pitmTop -> eax
{ other cpu can change [edi] AND [edi+4] }
mov edx, [edi + 4] // current phdr->dwAbaSequence -> edx

So you should swap the two loads to:
mov edx, [edi + 4] // current phdr->dwAbaSequence -> edx

mov eax, [edi] // current phdr->pitmTop -> eax

I have never yet met a CPU that reorders the two loads.
If it happens one day, then we shall use a cmpxchg8b to force an atomic load
of the two words (with a compare that is forced to fail).

Eric Dumazet


Joe Seigh

unread,
Sep 17, 2003, 2:04:35 PM9/17/03
to

SenderX wrote:
>
>
> > But they're not very efficient
>
> Think of a lock-free proxy garbage collector that allows for node reuse
> before decrementing to a zero ref-count... This prevents the garbage
> overflow problem inherent in many proxy gc's.
>
> I have working code that implements the above schema. It is "extremely
> efficient". It can also be configured to reuse memory on a global scale...

I am getting around to adding recyclable object pools to atomic_ptr. I'm
using the standard lock-free lifo queue but you can implement any object
pool class you want. You'll have to wait a little bit because I'm doing
it as part of a win32 condvar (again). Got past the hard part though.
That was ditching the pthread naming conventions and going with win32
conventions. Though doing pseudo win32 synchronization objects poses
interesting challenges.

Joe Seigh

Neill Clift [MSFT]

unread,
Sep 17, 2003, 5:16:08 PM9/17/03
to

"Oliver S." <Foll...@gmx.net> wrote in message
news:Xns93F9A483F5C...@130.133.1.4...

> > Yes, you're right. It's better described as sequence number wrap.
> > I am sure it's no real problem for x86 but the space for sequence
> > numbers is smaller on the 64 bit platforms with the 8 byte swaps.
>
> And how are the SLists implemented on Win64/AMD64 platforms ?

The same as x86, just with pointer compression and a small seq. num.
I have said this in other posts.

Neill Clift [MSFT]

unread,
Sep 17, 2003, 5:21:28 PM9/17/03
to
"Oliver S." <Foll...@gmx.net> wrote in message
news:Xns93F95BC51BFA...@130.133.1.4...

> > I believe there is a bug here:
> >
> > mov eax, [edi] // current phdr->pitmTop -> eax
> > { other cpu can change [edi] AND [edi+4] }
>
> No, there isn't a bug; the cmpxchg8b would simply fail
> under the above circumstances.

I don't think there is a bug and when this code was posted I
checked in detail. The reason I checked is that the seq. number
read is sort of going forward in time. What I mean by this is
that the seq. number can be for a later header than the header
read (forward pointer).
I could not find an issue with this code, so I think it's solid.
Neill.

SenderX

unread,
Sep 18, 2003, 6:42:01 PM9/18/03
to
> I am getting around to adding recyclable object pools to atomic_ptr.

I have had Great success with recycling the atomic_ptr_ref objects:

I store them in a per-thread stack, like 10 per thread.

If the per-thread stack hits a high-watermark( 10 ), they go into a global
set of pools.


> using the standard lock-free lifo queue but you can implement any object
> pool class you want.

Nice.


> You'll have to wait a little bit because I'm doing
> it as part of a win32 condvar (again).

For reusing HANDLE's?


> That was ditching the pthread naming conventions and going with win32
> conventions. Though doing pseudo win32 synchronization objects poses
> interesting challenges.

;)

Joe Seigh

unread,
Sep 18, 2003, 8:11:53 PM9/18/03
to

SenderX wrote:
>
> > I am getting around to adding recyclable object pools to atomic_ptr.
>
> I have had Great success with recycling the atomic_ptr_ref objects:

Yeah, basically. I'm recycling both the atomic_ptr_ref and the
object it points to together instead of separately. Still deciding
on the api. C++ basically sucks as far as object allocation and
deallocation abstraction are concerned. new/delete are in the
wrong place. You can't explicitly invoke object constructors.
etc, etc...

> > You'll have to wait a little bit because I'm doing
> > it as part of a win32 condvar (again).
>
> For reusing HANDLE's?

Yeah, ref counting with atomic_ptr this time. But that alone wouldn't
justify a rewrite. There's some other stuff that's the main reason for
the rewrite.

>
> > That was ditching the pthread naming conventions and going with win32
> > conventions. Though doing pseudo win32 synchronization objects poses
> > interesting challenges.
>
> ;)

Well, more for Microsoft. They just don't realize it yet.

Joe Seigh

SenderX

unread,
Sep 22, 2003, 6:17:26 PM9/22/03
to
> The same as x86 just with pointer compression and small seq. num.


Could a user-space application implement "pointer-compression"?

Can you compress 32-bit pointers?
