
Asynchronous queue without synchronization primitives


Shimshon Duvdevan

May 30, 2003, 6:12:35 PM

Hello,

I have implemented an asynchronous queue for communication between two
threads/processes in a shared-memory model. The implementation doesn't use any
synchronization primitives, such as mutexes.

The C++ sources (fairly short) are here:
http://www.geocities.com/shuki_duv/asynch_queue.h.txt
http://www.geocities.com/shuki_duv/queue.C.txt
(the first is header, the latter is usage example; strip the .txt
extension...)

I would be happy to hear comments about this approach, especially whether
the assumptions I make fail to hold in some circumstances.

I assume that reading from a memory word (an int) during a write to the same
location will not result in a garbled value. Also, I assume that the C++
implementation of the volatile qualifier has certain execution-order semantics.

The queue is implemented as a circular queue, with the tail pointer writable by
the queue reader and the head pointer writable by the queue writer.
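
In outline, it looks something like this (a condensed sketch of the idea, not
the posted source - see the URLs above for that; the names are illustrative):

template <typename T, unsigned N>
class asynch_queue {
    T buf[N];
    volatile unsigned head;  // next free slot; written only by the writer
    volatile unsigned tail;  // next unread slot; written only by the reader
public:
    asynch_queue() : head(0), tail(0) {}
    bool enqueue(const T &x) {      // called by the writer only
        unsigned h = head;
        if ((h + 1) % N == tail)    // full: one slot is kept empty
            return false;
        buf[h] = x;
        head = (h + 1) % N;         // publish the element
        return true;
    }
    bool dequeue(T &x) {            // called by the reader only
        unsigned t = tail;
        if (t == head)              // empty
            return false;
        x = buf[t];
        tail = (t + 1) % N;         // release the slot
        return true;
    }
};

// This leans on the two assumptions above: word-sized loads/stores don't
// tear, and volatile keeps the compiler from reordering the accesses.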

I have tested it on multiple-processor Sun Sparc, SGI Origin and Intel Xeon
workstations, and it worked as expected.

Shuki

Joseph Seigh

May 30, 2003, 7:21:36 PM

Shimshon Duvdevan wrote:
>
> Hello,
>
> I have implemented an asynchronous queue for communication between two
> threads/processes in a shared-memory model. The implementation doesn't use any
> synchronization primitives, such as mutexes.
>
> The C++ sources (fairly short) are here:
> http://www.geocities.com/shuki_duv/asynch_queue.h.txt
> http://www.geocities.com/shuki_duv/queue.C.txt
> (the first is header, the latter is usage example; strip the .txt
> extension...)
>
> I would be happy to hear comments about this approach, especially whether
> the assumptions I make fail to hold in some circumstances.
>
> I assume that reading from a memory word (an int) during a write to the same
> location will not result in a garbled value. Also, I assume that the C++
> implementation of the volatile qualifier has certain execution-order semantics.

Only as far as the compiler is concerned. It doesn't affect memory reordering by the
processor. You need store memory barriers before you increment the read and write pointers
and fetch memory barriers after reading the other thread's pointer.
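
Roughly this (a sketch; wmb()/rmb() stand for whatever your platform provides):

/* writer side */
buf[h] = x;
wmb();               /* store barrier: the element must become visible
                        before the new head pointer does */
head = (h + 1) % N;

/* reader side */
h = head;            /* read the other thread's pointer */
rmb();               /* fetch barrier: don't use a stale view of buf[]
                        relative to the head value just loaded */
x = buf[t];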


>
> The queue is implemented as a circular queue, with the tail pointer writable by
> the queue reader and the head pointer writable by the queue writer.

A basic distributed-algorithm version of consumer/producer. I'm familiar with it.


Joe Seigh

SenderX

May 31, 2003, 7:34:26 AM

What FIFO queue algo are you using?

Post some code. ;)

Some of us have already posted very good, full-source lock-free algos here.


> I would be happy to hear comments about this approach, especially whether
> the assumptions I make don't hold in some circumstances.


How well does your algo scale under hundreds of thousands of queues, with
64+ threads pushing and popping them all?


I have two super fast, working lock-free collections on my site
( full-source ); please look at the FreeQueue.c code and tell me if yours is
similar.


Also, I am almost ready to post an update to my lock-free collections that
allows one to use a normal CAS instead of a wide-CAS.


Here is a link to a new algo I developed, that has been working great under
load:

http://groups.google.com/groups?dq=&hl=en&lr=&ie=UTF-8&oe=UTF-8&threadm=bavqsh%24t6o%241%40zcars0v6.ca.nortel.com&prev=/groups%3Fhl%3Den%26lr%3D%26ie%3DUTF-8%26oe%3DUTF-8%26group%3Dcomp.programming.threads


What do ya think of them apples?!


It's great that this group is getting to experience state-of-the-art
lock-free algos in runnable code.

;)


--
The designer of the SMP- and HyperThread-friendly AppCore library.

http://AppCore.home.attbi.com


SenderX

May 31, 2003, 6:51:43 PM

> What FIFO queue algo are you using?
>
> Post some code. ;)

Doh!

I guess I should have looked at the header first.

I personally don't like putting code in headers.

;)

It seems that you only provide a function that dequeues all the items.

Can I dequeue a single item at a time?

It also seems that you allow only 2 threads per queue, is that correct?

Shimshon Duvdevan

Jun 1, 2003, 11:56:20 AM

"SenderX" <x...@xxx.xxx> wrote in message news:<3maCa.148$DV....@rwcrnsc52.ops.asp.att.net>...

> I personally don't like putting code in headers.

Well, me neither, but one doesn't have a lot of choice when templatizing
classes in C++...

> It seems that you only provide a function that dequeues all the items.
>
> Can I dequeue a single item at a time?

Yes, I just wanted to minimize tests for success (the enqueue function
initially didn't return anything either). Since the queue is
asynchronous, I don't see much use in dequeuing elements one by one;
perhaps a maximum on the number of dequeue operations.

> It also seems that you allow only 2 threads per-queue, is that correct?

Yes, I don't think it's possible to have a really lock-free
implementation for more threads. I have looked at your library; the
lock-free algorithms lock the bus - I know this is assumed, but what
happens if there are several buses?

Your code is quite specific, too - Windows on P4. I am more interested
in having lots of asynchronous P2P communication on high-end
architecture, such as Cray or CC-NUMA, which is the reason I
implemented this queue. I'm afraid locking the shared memory on these
architectures (if at all possible) will be much less efficient than
using more traditional synchronization.

Shimshon Duvdevan

Jun 1, 2003, 12:01:25 PM

Joseph Seigh <jsei...@xemaps.com> wrote in message news:<3ED7E8C2...@xemaps.com>...

> Shimshon Duvdevan wrote:
> > I assume that reading from a memory word (an int) during a write to the same
> > location will not result in a garbled value. Also, I assume that the C++
> > implementation of the volatile qualifier has certain execution-order semantics.
>
> Only as far as the compiler is concerned. It doesn't affect memory reordering
> by the processor. You need store memory barriers before you increment the read
> and write pointers and fetch memory barriers after reading the other thread's
> pointer.

The reordering by the processor is very interesting, I didn't think
about that. Are you sure these techniques (like superscalar execution)
are applied to anything other than register references, though? When I
write assembly code, I certainly expect the processor to keep the
execution semantics in all conceivable circumstances, including
accesses to the same memory by multiple CPUs.

Of course, there is latency before a write by one processor becomes
visible to another, but I also expect that two writes become visible
in the same order.

Can you point me to an architecture where these assumptions do not
hold?

Thanks,
Shuki

SenderX

Jun 1, 2003, 12:10:18 PM

> Your code is quite specific, too - Windows on P4.

Yes, it was to show off a correct, working lock-free algo using a wide-cas.

I have new lock-free stuff on my site now that uses the normal CAS, or a
normal LL/SC that all processors have.

So instead of " Windows on P4 ", it can be " Windows on Any Processor ".

So that solves the processor limits, but still has the Windows-only
factor... How to get past that?

I need to list different OS API's that access the normal CAS, or LL/SC.

Does Linux have an API that accesses the processor's CAS, like the
InterlockedCompareExchange API?

If so, then you could port my algos to different OS's.

> I am more interested
> in having lots of asynchronous P2P communication on high-end
> architecture, such as Cray or CC-Numa, which is the reason I
> implemented this queue. I'm afraid locking the shared memory on these
> architectures (if at all possible) will be much less efficient than
> using more traditional synchronization.

Hell yeah, your queue is a LOT more efficient than using OS provided sync!

;)

Now, do you need to #ifdef the correct memory barriers for the processor you
are compiling for?

David Schwartz

Jun 1, 2003, 1:27:32 PM

"Shimshon Duvdevan" <shuk...@yahoo.com> wrote in message
news:25a93df7.03060...@posting.google.com...

> The reordering by the processor is very interesting, I didn't think
> about that. Are you sure these techniques (like superscalar execution)
> are applied to anything other than register references, though?

Yes. They most certainly are.

> When I
> write assembly code, I certainly expect the processor to keep the
> execution semantics in all conceivable circumstances, including
> accesses to the same memory by multiple CPUs.

They do. It's just that those semantics include the necessity to control
visibility between CPUs. It would be really stupid if the CPU had to force
strict ordering for all memory stores and reads just because of the
occasional case where it might matter to another CPU.

> Of course, there is latency before a write by one processor becomes
> visible to another, but I also expect that two writes become visible
> in the same order.

This expectation is not guaranteed.
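
The standard illustration (hypothetical code, two CPUs sharing data and
flag, both initially zero):

/* CPU 1 */
data = 42;
flag = 1;

/* CPU 2 */
while (flag == 0)
    ;
r = data;   /* without a store barrier between CPU 1's two writes and a
               fetch barrier here, r may legally be the old value */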

DS


Shimshon Duvdevan

Jun 3, 2003, 9:19:50 PM

"SenderX" <x...@xxx.xxx> wrote in message news:<KzpCa.9328$DV....@rwcrnsc52.ops.asp.att.net>...

> Does Linux have an API that accesses the processors CAS, like
> InterlockedCompareExchange API?

Seems like system.h and atomic.h in the include/asm directory of the
kernel provide the necessary macros.

> Now, do you need to #ifdef the correct memory barriers for the processor you
> are compiling for?

Looks like that's what I will end up with - I need a Store/Store barrier
before the head update in enqueue, and a Load/Store barrier (which is
a general barrier everywhere except Sparc) before the tail update in
dequeueAll.
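
In code terms, the dequeueAll side would be shaped something like this
(a sketch; out[] and the names are illustrative):

unsigned t = tail, h = head, n = 0;
rmb();                  /* see the writer's stores to buf[] before using h */
while (t != h) {
    out[n++] = buf[t];
    t = (t + 1) % N;
}
mb();                   /* Load/Store: the reads of buf[] must complete
                           before the writer can reuse the slots via the
                           new tail */
tail = t;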

Here's the summary for the more recent processors I came up with,
mainly by pondering the linux kernel include/asm/system.h, and by
reading architecture docs from CPU manufacturers' sites - if anyone can
add to this list, I will appreciate it.

UltraSparc
----------

mb: "membar #LoadLoad | #LoadStore | #StoreLoad | #StoreStore"
rmb: "membar #LoadLoad"
wmb: "membar #StoreStore"

MIPS32/64
---------

mb: ".set push", ".set noreorder", "sync", ".set pop"
rmb: as above
wmb: as above

Alpha
-----

mb: "mb"
rmb: "mb"
wmb: "wmb"

IA32, AMD64
-----------

mb:  "mfence"                  - if SSE2 present
     "lock; addl $0,0(%%esp)"  - otherwise

rmb: "lfence"                  - if SSE2 present
     "lock; addl $0,0(%%esp)"  - otherwise

wmb: "sfence"                  - if SSE present; only necessary for
                                 unordered SSE store instructions on
                                 intel/amd
     "lock; addl $0,0(%%esp)"  - if no SSE, and cpu is non-intel
                                 supporting out-of-order store
     nop                       - otherwise

IA64
----

mb: "mf"
rmb: as above
wmb: as above

ESA390
------

mb: "bcr 15,0"
rmb: as above
wmb: as above

PPC
---

mb: "sync"
rmb: "sync"
wmb: "eieio"

PPC64
-----

mb: "sync"
rmb: "lwsync"
wmb: "eieio"

EPPC (Book E)
-------------

mb: "mbar"
rmb: as above
wmb: as above
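
So the #ifdef'ed wrappers will probably end up as something like this
(GCC-style sketch, only a few of the above shown; the macro names are mine):

#if defined(__sparc__)
# define mb()  __asm__ __volatile__ \
       ("membar #LoadLoad | #LoadStore | #StoreLoad | #StoreStore" ::: "memory")
# define rmb() __asm__ __volatile__ ("membar #LoadLoad" ::: "memory")
# define wmb() __asm__ __volatile__ ("membar #StoreStore" ::: "memory")
#elif defined(__alpha__)
# define mb()  __asm__ __volatile__ ("mb" ::: "memory")
# define rmb() __asm__ __volatile__ ("mb" ::: "memory")
# define wmb() __asm__ __volatile__ ("wmb" ::: "memory")
#elif defined(__i386__)   /* pre-SSE2 fallback variant */
# define mb()  __asm__ __volatile__ ("lock; addl $0,0(%%esp)" ::: "memory")
# define rmb() mb()
# define wmb() __asm__ __volatile__ ("" ::: "memory") /* plain IA32 stores
                                                          stay ordered */
#else
# error "no barriers defined for this target"
#endif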


Shuki.

David Schwartz

Jun 3, 2003, 11:54:37 PM

"Shimshon Duvdevan" <shuk...@yahoo.com> wrote in message
news:25a93df7.03060...@posting.google.com...

> Seems like system.h and atomic.h in the include/asm directory of the
> kernel provide the necessary macros.

It's extremely bad karma to use these files in user programs. As just
one example of what can go wrong, consider the following:

1) You compile on a UP machine, hence CONFIG_SMP is not declared. (In
fact, even on my RedHat 7.3 SMP machine, CONFIG_SMP is not declared in
/usr/include/linux/config.h!)

2) Because CONFIG_SMP is not declared, the LOCK_PREFIX becomes empty
(see asm/bitops.h).

3) Any program I compile this way is dangerous on any SMP machine,
including the machine it was compiled on!

The machine determines whether to set CONFIG_SMP or not based upon which
kernel it booted up with! So if I happened to boot with a UP kernel, I'll
compile code that fails in interesting, unpredictable ways if I boot with an
SMP kernel.
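
The offending header logic is approximately this (paraphrased from the
2.4-era x86 headers, not quoted verbatim):

#ifdef CONFIG_SMP
#define LOCK_PREFIX "lock ; "
#else
#define LOCK_PREFIX ""        /* what a UP-configured build gets */
#endif

so every "atomic" operation in the resulting binary is compiled without
the lock prefix, and is no longer atomic with respect to other CPUs.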

Lovely.

DS


SenderX

Jun 4, 2003, 4:54:13 AM

Thank you so much for those!

Very helpful of you to post those.

Now all I need are all the CAS, or LL(reserved)/SC opcodes for those
processors.

And I will be able to port my lock-free thread/process comm. system to each
one.


CAS and LL/SC instructions:


IA-32

cmpxchg
cmpxchg8b


IA-64

cmpxchg
cmp8xchg16b ( not 100% sure on spelling )


POWER PC

lwarx (LL) ( reserved )
stwcx (SC)


ALPHA

ldq_l (LL) ( reserved )
ldl_l (LL) ( reserved )

stq_c (SC)
stl_c (SC)

That's all I know offhand.

;)

David Schwartz

Jun 4, 2003, 5:07:07 AM

"SenderX" <x...@xxx.xxx> wrote in message
news:VsiDa.565168$Si4.5...@rwcrnsc51.ops.asp.att.net...

> IA-32
>
> cmpxchg
> cmpxchg8b

I presume you mean 'lock cmpxchg'? Otherwise, the memory accesses aren't
atomic on SMP machines.
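
That is, in GCC inline asm terms, something like this (a sketch, not from
anyone's posted source):

/* returns nonzero if *p contained oldv and was replaced by newv */
static inline int cas32(volatile int *p, int oldv, int newv)
{
    int prev;
    __asm__ __volatile__("lock; cmpxchgl %2,%1"
                         : "=a" (prev), "+m" (*p)
                         : "r" (newv), "0" (oldv)
                         : "memory");
    return prev == oldv;
}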

DS


Joseph Seigh

Jun 4, 2003, 8:15:41 AM

Shimshon Duvdevan wrote:

...


>
> Here's the summary for the more recent processors I came up with,
> mainly by pondering in linux kernel include/adm/system.h, and by
> reading architecture docs from CPU manufacturers sites - if anyone can
> add to this list, I will appreciate.

(snip)

Depending on what you or others may have in mind, those memory barriers may
or may not do what you or those others want. There are different memory models
and what works on one doesn't necessarily work on another.

I've worked with ESA/370, now Z architecture. If you do DCL on that architecture,
you don't need explicit memory barriers other than the ones provided by the lock.
Does that mean you don't need some kind of memory barriers defined? No, but it
depends on what you are trying to accomplish. Some uses of "rmb" and "wmb" may
require "BCR 15,0" or they may require nothing to be done.

Now, multiply that by all the other now and future memory models. You think mb,
rmb, and wmb is going to make your code portable? You think spelling them
differently, i.e. pthread_mb, pthread_rmb, and pthread_wmb, will work?

Joe Seigh

Alexander Terekhov

Jun 4, 2003, 8:53:43 AM

Joseph Seigh wrote:
[...]

> Now, multiply that by all the other now and future memory models. You think mb,
> rmb, and wmb is going to make your code portable? You think spelling them
> differently, i.e. pthread_mb, pthread_rmb, and pthread_wmb, will work?

It surely will work... something a la <iohw.h> macros for atomic access
with various "built-in" barriers (C++ atomic<>s aside for a moment).
BTW, "rmb" and "wmb" aren't really what you need; hoist/sink load/store
barriers combined with the operations is The Right Thing, I believe.

regards,
alexander.

Joseph Seigh

Jun 4, 2003, 9:45:38 AM

Defining them would be nice. Mnemonics and/or syntax is not a definition. Neither
is a meta implementation, unless you subscribe to "working as coded" since by definition
the meta implementation is always right.

Joe Seigh

Shimshon Duvdevan

Jun 4, 2003, 3:24:05 PM

Joseph Seigh <jsei...@xemaps.com> wrote in message news:<3EDDE43F...@xemaps.com>...

> I've worked with ESA/370, now Z architecture. If you do DCL on that
> architecture, you don't need explicit memory barriers other than the ones
> provided by the lock. Does that mean you don't need some kind of memory
> barriers defined? No, but it depends on what you are trying to accomplish.
> Some uses of "rmb" and "wmb" may require "BCR 15,0" or they may require
> nothing to be done.

In other words, using the BCR instruction does ensure the memory barrier
semantics?

> Now, multiply that by all the other now and future memory models. You think
> mb, rmb, and wmb is going to make your code portable? You think spelling them
> differently, i.e. pthread_mb, pthread_rmb, and pthread_wmb, will work?

Definitely. mb, rmb and wmb have well-defined semantics. For example,
a StoreStore barrier (wmb) means that before any of the following writes
becomes visible to any other architecture component, all previous
writes are already visible to those components.

Shuki.

Shimshon Duvdevan

Jun 4, 2003, 3:37:32 PM

"David Schwartz" <dav...@webmaster.com> wrote in message news:<bbkcrq$h5j$1...@nntp.webmaster.com>...

This is not so much an issue of atomic memory access as of the
atomicity of the sequential read and write. Unlike the "xchg" instruction,
the "cmpxchg" variants don't have an implicit "lock" prefix, which I guess
was what the original poster assumed.

SenderX

Jun 4, 2003, 3:59:03 PM

> > I presume you mean 'lock cmpxchg'? Otherwise, the memory accesses
> > aren't atomic on SMP machines.

Of course, I assumed people would know that already:

lock cmpxchg
lock cmpxchg8b
lock cmp8xchg16b

Joseph Seigh

Jun 4, 2003, 4:02:46 PM

Shimshon Duvdevan wrote:
>
> Joseph Seigh <jsei...@xemaps.com> wrote in message news:<3EDDE43F...@xemaps.com>...

> > Some uses of "rmb" and "wmb" may require "BCR 15,0" or they may require
> > nothing to be done.
>
> In other words, using the BCR instruction does ensure the memory barrier
> semantics?

Yes, but it's fairly expensive. It also does hardware checkpoint synchronization.


>
> > Now, multiply that by all the other now and future memory models. You think
> > mb, rmb, and wmb is going to make your code portable? You think spelling them
> > differently, i.e. pthread_mb, pthread_rmb, and pthread_wmb, will work?
>
> Definitely. mb, rmb and wmb have well-defined semantics. For example,
> a StoreStore barrier (wmb) means that before any of the following writes
> becomes visible to any other architecture component, all previous
> writes are already visible to those components.
>

There are memory models where that may do what you want it to do, but there
are memory models where it will not do whatever you think it does.

Joe Seigh

SenderX

Jun 4, 2003, 5:26:03 PM

> There are memory models where that may do what you want it to do, but
> there are memory models where it will not do whatever you think it does.

We need the OS makers to create portable API's that can access the
processor's atomic ops and barriers. Make the OS people worry about making
their API's comply with the hardware.

Microsoft has done exactly that by exposing the Interlocked API's. They are
guaranteed by Microsoft to work, regardless of the hardware. If Windows
runs, the Interlocked API's run.

Alex's atomic<T> should be the answer for portable atomics that are not
tied to the WinAPI.

By the way, is M$ the only OS with portable atomic API's?

Joseph Seigh

Jun 4, 2003, 6:38:04 PM

SenderX wrote:
>

> We need the OS makers to create portable API's that can access the
> processor's atomic ops and barriers. Make the OS people worry about making
> their API's comply with the hardware.
>
> Microsoft has done exactly that by exposing the Interlocked API's. They are
> guaranteed by Microsoft to work, regardless of the hardware. If Windows
> runs, the Interlocked API's run.

Maybe. They suffer from being on Intel, which adds synchronization semantics
on the fly. You're never quite sure what the old semantics really were when
new ones come out. You have to make a lot of assumptions when using them.
Particularly with memory visibility.

>
> Alex's atomic<T> should be the answer for portable atomics that are not
> tied to the WinAPI.

It's not clear what his semantics are. I haven't published them (mainly
because it appears that nobody actually cares whether semantics are defined
or not), but atomic_ptr and what will probably be called atomic also (different
namespace) will have formal semantics. This will allow anyone doing an
implementation to know precisely what memory synchronization is required
and where it is required, no matter what platform you implement it on.

I find the whole process of doing a formal definition useful. I get a
better understanding of what the memory synchronization really means.
The memory barriers I will end up using are different than the ones I
thought would be sufficient before I went through that exercise.

>
> By the way, is M$ the only OS with portable atomic API's?

AIX maybe. I don't know what they have currently. The last I used it
was 3.x

Joe Seigh

SenderX

Jun 4, 2003, 7:10:30 PM

> I haven't published them (mainly because it appears that nobody actually
> cares whether semantics are defined or not), but atomic_ptr and what will
> probably be called atomic also (different namespace) will have formal
> semantics.

If you publish a paper on atomic_ptr, are you going to include the offset
pointer solution so one can implement the ABA-resistant atomic_ptr::cas that
I added to it?

atomic_ptr is great on its own, but you can't use it to make lock-free algos
dynamic. ;(

The ABA solution I added to atomic_ptr should really be talked about in any
paper about atomic_ptr, just so people know how to use it for lock-free
algos.

;)

There has to be a way to implement the ABA atomic_ptr::cas using cmp8xchg16b
without using a pointer offset scheme, humm...

It makes atomic_ptr very useful.

Joseph Seigh

Jun 4, 2003, 8:10:17 PM

SenderX wrote:
>
> > I haven't published them (mainly because it appears that nobody actually
> > cares whether semantics are defined or not), but atomic_ptr and what will
> > probably be called atomic also (different namespace) will have formal
> > semantics.
>
> If you publish a paper on atomic_ptr, are you going to include the offset
> pointer solution so one can implement the ABA-resistant atomic_ptr::cas that
> I added to it?

The semantics wouldn't be in a paper. It would be in the comments/documentation.
I'd probably do the offset solution when I got a 64-bit machine, which is not
in the near future.

You already know how it's done in principle. I didn't mention it in the Intel
threads forum, but you can do a lock-free interlocked slist in 64-bit Windows
on an Opteron using the offset. You just ignore the embedded SLIST_ENTRY and
use a pool of your own using offsets. The free pool will also be an slist.
Just be careful of using offset zero. You can use the interlocked initialization
from the condvar posted earlier.
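
Something along these lines (a sketch; the type and macro names are mine,
not from any posted source):

/* Pack a 32-bit pool offset and a 32-bit version count into one 64-bit
   word, so a plain 64-bit interlocked CAS can swap both on a 64-bit box. */
typedef union {
    LONGLONG whole;       /* the value the 64-bit CAS operates on */
    struct {
        ULONG offset;     /* offset into your own node pool, not a pointer;
                             offset zero is reserved as "null" */
        ULONG aba;        /* version/ABA count */
    } s;
} Anchor;

#define NODE_AT(pool, off) ((Node *)((char *)(pool) + (off)))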

>
> atomic_ptr is great on its own, but you can't use it to make lock-free algos
> dynamic. ;(

:) I wouldn't be too sure of that. I'm working on something that is very very
cool, but I can't talk about it since I'm deciding whether or not to patent it.
I don't like the idea of patents but economic conditions dictate otherwise.
There's no point in producing a superfast library of lock-free containers if
anybody could just reverse engineer it. I can't give any hints except it has
nothing to do with ABA solutions and is in a category of lock-free solutions that
has not really been discussed in c.p.t.

Joe Seigh

SenderX

Jun 4, 2003, 8:33:12 PM

> You already know how it's done in principle.

Yep. ;)

> I didn't mention it in the Intel threads forum, but you can do a lock-free
> interlocked slist in 64-bit Windows on an Opteron using the offset.

I can use a normal CAS to implement my new lock-free system, which you can
see on my site.

So I would use InterlockedCompareExchange for my algos on a 32- or 64-bit
system. That's pretty cool!

If I did use cmpxchg8b for my new algos, it would allow me to compare and
swap two nodes, and their aba counters at the same time. Very nice for
lock-free double-linked lists!

If I used cmp8bxchg16b, I could compare 2 nodes, and swap 4 nodes including
their ABA couters.

I have actually implemented atomic_ptr using InterlockedCompareExchange, and
it has not leaked or crashed yet! But that's another story...

> > atomic_ptr is great on its own, but you can't use it to make lock-free
> > algos dynamic. ;(
>
> :) I wouldn't be too sure of that.

Nice!

> I can't give any hints except it has nothing to do with ABA solutions
> and is in a category of lock-free solutions that has not really been
> discussed in c.p.t.

It avoids ABA huh...

Will it work on IA-32?

Can I use it to do the Mike and Scott FIFO, and the IBM FreeList API?

Those questions should not be too prying.


By the way... if you do document the ABA solution that I added to it, post a
link to my site please.

;)

Joseph Seigh

Jun 4, 2003, 9:06:24 PM

SenderX wrote:
>
> > I can't give any hints except it has nothing to do with ABA solutions
> > and is in a category of lock-free solutions that has not really been
> > discussed in c.p.t.
>
> It avoids ABA huh...
>
> Will it work on IA-32?
>
> Can I use it to do the Mike and Scott FIFO, and the IBM FreeList API?
>
> Those questions should not be too prying.

Different category. Think AWTEventMulticaster or RCU (Read, Copy, Update).

Joe Seigh

Shimshon Duvdevan

Jun 4, 2003, 10:35:00 PM

Joseph Seigh <jsei...@xemaps.com> wrote in message news:<3EDE51B9...@xemaps.com>...

> Shimshon Duvdevan wrote:
> >
> > Joseph Seigh <jsei...@xemaps.com> wrote in message news:<3EDDE43F...@xemaps.com>...
> > > Some uses of "rmb" and "wmb" may require "BCR 15,0" or they may require
> > > nothing to be done.
> >
> > In other words, using the BCR instruction does ensure the memory barrier
> > semantics?
>
> Yes, but it's fairly expensive. It also does hardware checkpoint
> synchronization.

Well, from reading ESA/390 and z/Arch POPs, it's unclear whether
memory barriers (which control visibility order, not visibility
itself) are necessary at all (3-2 and 5-87 in ESA/390 POP). So I find
it's a documentation problem.

> > > Now, multiply that by all the other now and future memory models. You
> > > think mb, rmb, and wmb is going to make your code portable? You think
> > > spelling them differently, i.e. pthread_mb, pthread_rmb, and pthread_wmb,
> > > will work?
> >
> > Definitely. mb, rmb and wmb have well-defined semantics. For example,
> > a StoreStore barrier (wmb) means that before any of the following writes
> > becomes visible to any other architecture component, all previous
> > writes are already visible to those components.
> >
>
> There are memory models where that may do what you want it to do, but there
> are memory models where it will not do whatever you think it does.

Do you mind giving a reference to such memory models? Because if it's
a speculation on what future technology may bring us, I don't find it
particularly interesting when writing code whose purpose is to
optimize the speed of communication between processors on existing
hardware.

Shuki

David Schwartz

Jun 4, 2003, 11:04:35 PM

"SenderX" <x...@xxx.xxx> wrote in message
news:bcsDa.569617$Si4.5...@rwcrnsc51.ops.asp.att.net...

> > > I presume you mean 'lock cmpxchg'? Otherwise, the memory accesses
> > > aren't
> > > atomic on SMP machines.
>
> Of course, I assumed people would know that already:
>
> lock cmpxchg
> lock cmpxchg8b
> lock cmp8xchg16b

By the way, these instructions take on the order of 200 cycles on
Pentium 4's.

DS


SenderX

Jun 5, 2003, 3:24:17 AM

> By the way, these instructions take on the order of 200 cycles on
> Pentium 4's.

How you gonna get by with lock-free without using the CAS opcode?

Would you suggest I use a mutex, or critical section?

SenderX

Jun 5, 2003, 4:53:56 AM

> Different category. Think AWTEventMulticaster or RCU(Read, Copy, Update).

Ah yes. RCU is interesting...

It seems that RCU simply delays the freeing of old pointers ( old copies ),
in order to prevent another processor from accessing a freed pointer?

I read that there will be about a 25-millisecond delay between the time that
you call ( call_rcu ) and the time the callback is fired.

Where did they get the 25 millisecond delay... Is that enough to bring the
processors through their " quiescent state "?

It does seem like it should work; hazard pointer algos also seem to delay
pointers being freed.


" If a given process removes all globally-accessible references to a
particular data structure and waits till all the CPUs pass through a
quiescent state, it can then manipulate that data structure without any
concern of interference from other CPUs. "


That seems to say it all...


Do you really need to know when context-switches happen in order to
implement RCU?

It seems like RCU can maybe be used to free pointers in dynamic lock-free
algos, and bypass the need for atomic_ptr?

Joseph Seigh

Jun 5, 2003, 6:36:27 AM

SenderX wrote:
>
> > Different category. Think AWTEventMulticaster or RCU(Read, Copy, Update).
>
> Ah yes. RCU is interesting...
>
> It seems that RCU simply delays the freeing of old pointers ( old copies ),
> in order to prevent another processor from accessing a freed pointer?
>
> I read that there will be about a 25-millisecond delay between the time that
> you call ( call_rcu ) and the time the callback is fired.
>
> Where did they get the 25 millisecond delay... Is that enough to bring the
> processors through their " quiescent state "?

Probably. It's a polling interval, not a fixed delay in releasing the resources.
Depending on the actual history of processors going through the quiesce state,
more than one polling interval may occur before the resource is released.
I think the polling interval on VM was 1/10 of a second.


>
> It does seem like it should work; hazard pointer algos also seem to delay
> pointers being freed.
>
> " If a given process removes all globally-accessible references to a
> particular data structure and waits till all the CPUs pass through a
> quiescent state, it can then manipulate that data structure without any
> concern of interference from other CPUs. "
>
> That seems to say it all...
>
> Do you really need to know when context-switches happen in order to
> implement RCU?

No. The quiesce points can be arbitrary but they should be a point that
is frequently executed (the dispatcher meets this criteria), and that
programmers are well aware of (the dispatcher in a non preemptive kernel
meets this criteria). It also lets you equate threads to processors
by considering a suspended thread as being in a thread quiesce state
so you only have to track processors. The polling in RCU is linear
and has the same kind of scalability problems that hazard pointers have.
So it's more likely you'll see RCU in non-preemptive kernels like VM, Linux,
and such.


>
> It seems like RCU can maybe be used to free pointers in dynamic lock-free
> algos, and bypass the need for atomic_ptr?
>

If you have RCU, yes.

Joe Seigh

Joseph Seigh

Jun 5, 2003, 7:11:06 AM

Shimshon Duvdevan wrote:
>
>
> Well, from reading ESA/390 and z/Arch POPs, it's unclear whether
> memory barriers (which control visibility order, not visibility
> itself) are necessary at all (3-2 and 5-87 in ESA/390 POP). So I find
> it's a documentation problem.

In 370 or Z architecture, loads are in order and stores are in order so
you do not need rmb and wmb as you defined them to do anything. You
need bcr 15,0 to order stores followed by loads. You don't need a
memory barrier for loads followed by stores, since stores can't be early
and loads can't be late.


> >
> > There are memory models where that may do what you want it to do, but there
> > are memory models where it will not do whatever you think it does.
>
> Do you mind giving a reference to such memory models? Because if it's
> a speculation on what future technology may bring us, I don't find it
> particularly interesting when writing code whose purpose is to
> optimize speed of communication between processors on existing
> hardware.
>

Well, if you had a memory architecture that only had loadload, storestore for
rmb and wmb respectively, and now you have something like sparc with
loadload, loadstore, storestore, and storeload, you have to revisit the
prior architectures and analyse what was really going on there.

So, on that architecture with only loadload and storestore, rmb was
just loadload because loadstore was implicit in the memory model.
Now there are uses of rmb that just require loadload and not
loadstore but you don't know if all uses do. So to be on the safe
side you'd have to define rmb in sparc as

membar #loadload | #loadstore

at the very least.


So from my point of view, the present scheme is untrustworthy, since I'd
have to check the definitions for every new memory architecture to verify
they actually did what I thought they should do. Either that or
use mb for everything to be on the safe side.

What I am doing to define memory barrier semantics is to tie them to
mutex "acquire/release" memory semantics, since those are presumably
well understood (I already have a formal definition for those anyway), and
if any future architecture breaks defining memory barriers this way,
it will break locking as well.

Joe Seigh

SenderX

Jun 5, 2003, 7:16:31 AM

> Probably. It's a polling interval, not a fixed delay in releasing the
> resources.

Ahh... I see.

> The polling in RCU is linear
> and has the same kind of scalability problems that hazard pointers have.

Yep.

> > It seems like RCU can maybe be used to free pointers in dynamic
> > lock-free algos, and bypass the need for atomic_ptr?
> >
>
> If you have RCU, yes.

I am thinking of another way of delaying pointers, that can scale.


Try this on for size... ;)

Scheme for freeing lock-free nodes:

Every node has a reference counter.

When a lock-free algo wants to use a node, it increments its count.

When a lock-free algo is finished with a node, it decrements its count.


Now... This will suffer from a race-condition.

A thread can decrement the reference counter to zero and free the pointer,
when another thread is using the node inside its CAS section!!!


The problem is freeing the pointer " right " after the count goes to zero.

We need to either delay the free, or use atomic_ptr.


So, if we freed the node into a " to-be-freed " lock-free FIFO queue that
was consumed by a " single low-priority " garbage collector every so
often... We could provide a substantial delay.


Remember, the reference counting race-condition means that other threads can
be using the pointer inside their CAS sections. The time it takes for a
CAS section to fail its CAS, loop around, and retry gets shorter with newer
hardware. So a delay will not have to be long at all...


The garbage collector would check its FIFO queue and make sure it hit a
high-watermark, and pop a few off the back every so often. The garbage
collector will then free the nodes. That will provide a lot of time for
threads to fail their CAS sections.


And, the more load the algo goes under... The more stable it gets, because
the garbage queue will grow. The space between the front and back nodes will
expand, and the delay to free nodes will be further extended.


The garbage collector is free to delete nodes from its lock-free queue,
because it's the only consumer to service it.


Also, by filtering " to-be-freed " pointers through a FIFO queue, we don't
reuse pointers until they have been completely through the FIFO. This will
make ABA virtually impossible. ;)


This should make it so I don't need atomic_ptr for dynamic lock-free
algos...


Does this sound like it will work?


;)

Alexander Terekhov

Jun 5, 2003, 7:49:31 AM

Joseph Seigh wrote:
[...]

> What I am doing to define memory barrier semantics is to tie them to
> mutex "acquire/release" memory semantics, since those are presumably
> well understood ....

Right.

op."acquire" == op."hoist-load-barrier + hoist-store-barrier"
op."release" == op."sink-load-barrier + sink-store-barrier"

http://www.terekhov.de/pthread_refcount_t/experimental/refcount.cpp

Note that it's still "incomplete"; just an "illustration".

"....
struct msync {
enum hlb_t { hlb }; // hoist-load barrier
enum ddhlb_t { ddhlb }; // hoist-load barrier with data-dependency "hint"
enum hsb_t { hsb }; // hoist-store barrier
enum slb_t { slb }; // sink-load barrier
enum ddslb_t { ddslb }; // sink-load barrier with data-dependency "hint"
enum ssb_t { ssb }; // sink-store barrier
enum acq_t { acq }; // hoist-load + hoist-store barrier
enum rel_t { rel }; // sink-load + sink-store barrier
enum none_t { none }; // naked
};

template<class T>
struct atomic { //
atomic(T n) : t(n) { }
T load(msync::none_t) const { return t;}
T load(msync::hlb_t) const { return t; }
T load(msync::ddhlb_t) const { return t; }
void store(T n, msync::none_t) { t = n; }
void store(T n, msync::ssb_t) { t = n; }
void store(T n, msync::acq_t) { t = n; }
void store(T n, msync::rel_t) { t = n; }
bool attempt_update(T o,T n, msync::none_t) { return (t == o) ? (t=n, true) : false; }
bool attempt_update(T o,T n, msync::ssb_t) { return (t == o) ? (t=n, true) : false; }
bool attempt_update(T o,T n, msync::acq_t) { return (t == o) ? (t=n, true) : false; }
bool attempt_update(T o,T n, msync::rel_t) { return (t == o) ? (t=n, true) : false; }
T t;
};

enum thread_safety { unsafe, basic }; // strong aside for a moment
...."

regards,
alexander.

Joseph Seigh

Jun 5, 2003, 7:57:04 AM

SenderX wrote:
>
>
> Try this on for size... ;)
>
> Scheme for freeing lock-free nodes:
>

(snip)


>
> Does this sound like it will work?

Delay based schemes have been proposed before. They're
problematic. For a start, you'd need a real-time
scheduler or better to give you forward progress guarantees.
Even that may not be enough. Maybe some other scheduling tricks
too. But still, you'd need to be real careful.

Joe Seigh

Joseph Seigh

Jun 5, 2003, 8:43:09 AM

Alexander Terekhov wrote:
>
> Joseph Seigh wrote:
> [...]
> > What I am doing to define memory barrier semantics is to tie them to
> > mutex "acquire/release" memory semantics, since those are presumably
> > well understood ....
>
> Right.
>
> op."acquire" == op."hoist-load-barrier + hoist-store-barrier"
> op."release" == op."sink-load-barrier + sink-store-barrier"
>

If that is how the locks implement it. If they implement it
differently for a different architecture, then you'd have to do it
differently also.

Unless of course, you tied your definitions to a particular implementation.
Then you're screwed when a new memory architecture comes out.

Joe Seigh

Alexander Terekhov

Jun 5, 2003, 9:28:36 AM

Joseph Seigh wrote:
>
> Alexander Terekhov wrote:
> >
> > Joseph Seigh wrote:
> > [...]
> > > What I am doing to define memory barrier semantics is to tie them to
> > > mutex "acquire/release" memory semantics, since those are presumably
> > > well understood ....
> >
> > Right.
> >
> > op."acquire" == op."hoist-load-barrier + hoist-store-barrier"
> > op."release" == op."sink-load-barrier + sink-store-barrier"
> >
>
> If that is how the locks implement it. ...

That has really nothing to do with the implementation. I'm just
sort of splitting "acquire/release" memory semantics into load-
and store-related semantics. A >>read-write<< lock, for example,
has rdlock()/rdunlock() operations that do NOT need full
"acquire/release" memory semantics.

rdlock() needs "hoist-load-barrier" semantics
rdunlock() needs "sink-load-barrier" semantics

Do you now understand it?

regards,
alexander.

Joseph Seigh

Jun 5, 2003, 12:56:57 PM


No. It's still implicitly modeled on some sort of memory architecture.
I doubt you'll find anything in the Posix documentation that states
memory barriers are required, or something in those terms. Unfortunately,
Posix probably does not have any formal semantics to borrow from either,
but you can tie directly into Posix semantics in the following manner,
using atomic<> types as an example.

For atomic<> variables with store.release and load.acquire semantics,
given

    T x;          // x is local only, not shared
    atomic<T> y;  // shared

load.acquire semantics for "x = y;" is

    lock();
    x = y;
    unlock();

store.release semantics for "y = x;" is

    lock();
    y = x;
    unlock();


Now you're memory-model independent.

Joe Seigh

Shimshon Duvdevan

Jun 5, 2003, 1:16:24 PM

Joseph Seigh <jsei...@xemaps.com> wrote in message news:<3EDF269F...@xemaps.com>...

> Well, if you had a memory architecture that only had loadload, storestore for
> rmb and wmb respectively, and now you have something like sparc with
> loadload, loadstore, storestore, and storeload, you have to revisit the
> prior architectures and analyse what was really going on there.
>
> So, on that architecture with only loadload and storestore, rmb was
> just loadload because loadstore was implicit in the memory model.
> Now there are uses of rmb that just require loadload and not
> loadstore but you don't know if all uses do. So to be on the safe
> side you'd have to define rmb in sparc as
>
> membar #loadload | #loadstore
>
> at the very least.

I disagree! I think that Sparc just increases the granularity of memory
barriers. I guess that on every other architecture, the cost of any
barrier which is not LoadLoad or StoreStore is equal or very close to
the cost of a total memory barrier, so these architectures don't provide
the granularity. This may be different on Sparc, or it may not, and
the membar variants are due to Sparc being an open specification which
anyone is free to implement.

In any case, I am quite sure that no LoadLoad (rmb) barrier on any
architecture had an implicit LoadStore or similar, which the designers
didn't think of (similar for StoreStore (wmb)). If you claim
otherwise, please provide architecture details, so that the issue
ceases to be ephemeral.

> So from my point of view, the present scheme is unstrustworthy since I'd
> have to verify the definitions for every new memory architecture to verify
> they actually did what I thought they should do. Either that or
> use mb for everything to be on the safe side.

That's the regular porting process. I just don't see any real problem
that can arise with the introduction of a "new memory architecture". Not
sure about the original superscalar or MP architecture in which memory
barriers became necessary, but let's say it's Alpha, with "mb" and
"wmb". If you wrote a program that used these, you would have had no
problem redefining "mb" and "wmb" for any subsequent architecture that
arised - at most you would experience a penalty for not using newer
features (say rmb), but this is a problem you can't avoid in general.

So I don't see much basis to a "new shiny memory model with new
paradigm" becoming a problem.

Shuki

Sean Burke

Jun 5, 2003, 1:57:30 PM

"David Schwartz" <dav...@webmaster.com> writes:

This sounds a lot like the L2 miss penalty,
so I wonder if this number assumes an L2 miss
(likely in the case of a contended datum), or
if CAS itself is implementation-wise equivalent
to handling an L2 miss.

-Sean

--
Remove the pi to email me.

Joseph Seigh

Jun 5, 2003, 1:58:58 PM

Shimshon Duvdevan wrote:
>
> Joseph Seigh <jsei...@xemaps.com> wrote in message news:<3EDF269F...@xemaps.com>...
>

..


> In any case, I am quite sure that no LoadLoad (rmb) barrier on any
> architecture had an implicit LoadStore or similar, which the designers
> didn't think of (similar for StoreStore (wmb)). If you claim
> otherwise, please provide architecture details, so that the issue
> ceases to be ephemeral.

I just did. On an architecture that doesn't have an explicit LoadStore
barrier, it's implicit since that architecture does not allow stores to
move forward of loads. And obviously since it's implicit, it's not
mentioned. Just like on IBM Z architecture where there is no LoadLoad
barrier since LoadLoad is implicit in the memory model. It's be
stupid to extrapolate that no memory barrier on other architectures
require LoadLoad since IBM's did not require LoadLoad. Yet that's
exactly what you did when you defined rmb for Sparc. You're saying
that since other architectures do not explicitly require LoadStore in
their rmb, therefore Sparc does not require it.


>
> > So from my point of view, the present scheme is untrustworthy, since I'd
> > have to check the definitions for every new memory architecture to verify
> > they actually did what I thought they should do. Either that or
> > use mb for everything to be on the safe side.
>
> That's the regular porting process. I just don't see any real problem
> that can arise with the introduction of a "new memory architecture". Not
> sure about the original superscalar or MP architecture in which memory
> barriers became necessary, but let's say it's Alpha, with "mb" and
> "wmb". If you wrote a program that used these, you would have had no
> problem redefining "mb" and "wmb" for any subsequent architecture that
> arose - at most you would experience a penalty for not using newer
> features (say rmb), but this is a problem you can't avoid in general.

I'm more worried that if I used these memory barriers in a program,
and later on these memory barriers are ported to a new architecture
(not by me), the new memory barriers might not do what I thought
they should do. I'd have to restrict the program to platforms that
I knew about when I wrote the program, which kind of limits
portability.


Joe Seigh

SenderX

Jun 5, 2003, 3:12:43 PM

> But still, you'd need to be real careful.

I will implement this idea as a test in my library, and post it. So we can
see it in action.

=)


In the scheme I'm thinking up, the pointer delay would have to be a little
per-node delay based on the number of threads that were concurrently inside
the CAS section that popped the pointer. This would make the algo even
tighter.

A LIFO Pop function that uses my idea does this:


VOID Pop( LPVOID *ppObj )
{
    LONG lPeriod;

    /* Local " to-be-freed " list */
    LPNODE pLocalNodes = NULL, pNode;

    /* Front is a { pNode, lAba } anchor, declared elsewhere */

    /* Enter a quiescent period */
    InterlockedIncrement( &pStack->lPops );

    /* This thread is in its quiescent period */

    *ppObj = NULL;

    /* LIFO Pop CAS Section */
    for(;;)
    {
        Front = pStack->Front;

        /* Increment the references */
        InterlockedIncrement( &Front.pNode->lRefs );

        /* Try and pop; leave the loop on success */
        if ( CAS2( &pStack->Front,
                   Front.pNode,
                   Front.lAba,
                   Front.pNode->pNext,
                   Front.lAba + 1 ) )
        {
            break;
        }

        /* CAS failed: decrement the references and retry */
        if ( InterlockedDecrement( &Front.pNode->lRefs ) == 0 )
        {
            /* Add to our local CAS section list */
            Front.pNode->pNext = pLocalNodes;
            pLocalNodes = Front.pNode;
        }
    }

    *ppObj = Front.pNode->pObj;

    /* Decrement the references */
    if ( InterlockedDecrement( &Front.pNode->lRefs ) == 0 )
    {
        /* Add to our local CAS section list */
        Front.pNode->pNext = pLocalNodes;
        pLocalNodes = Front.pNode;
    }

    /* Leave our quiescent period */
    lPeriod = InterlockedDecrement( &pStack->lPops );

    /* This thread is not in its quiescent period */

    /* Free our local list */
    while ( pLocalNodes )
    {
        pNode = pLocalNodes;
        pLocalNodes = pNode->pNext;

        /* Record how many threads were inside the CAS section,
           at the end of the previous period. */
        pNode->lPeriod = lPeriod;

        /* This value will be used for a per-node delay */

        /* Release to the " to-be-freed " FIFO queue */
        sys_acReleaseNode( Front.pNode );
    }
}


The " to-be-freed " queue can ensure that nodes don't even get a chance to
be freed until the garbage collector cycles through 65535+ ( can be a dynamic
number, based on load ) of them. By the time all those nodes have been
dequeued by a low-priority thread that has some Sleeps in it and per-node
delays, a boat load of threads will have failed their CAS sections, making
the nodes in the free-queue safe.


I think this will work, and scale better than RCU.


When I post test code, I hope somebody tests it.

SenderX

Jun 5, 2003, 3:22:24 PM

> /* Release to the " to-be-freed " FIFO queue */
> sys_acReleaseNode( Front.pNode );


Oops, I meant:

sys_acReleaseNode( pNode );

=)

David Schwartz

Jun 5, 2003, 5:34:19 PM

"SenderX" <x...@xxx.xxx> wrote in message
news:BeCDa.1134099$S_4.1168711@rwcrnsc53...

> > By the way, these instructions take on the order of 200 cycles on
> > Pentium 4's.

> How you gonna get by with lock-free without using the CAS opcode?

You aren't.

You know my opinion of lock-free algorithms. They're often more
expensive in the real world than corresponding lockless implementations.
Having more than one CPU operating on the same data concurrently just means
lots of cache ping ponging. Worse, lockless algorithms tend to need more
serializing instructions than just acquiring a lock would.

DS


David Schwartz

Jun 5, 2003, 5:42:40 PM

"Sean Burke" <burke_...@pacbell.net> wrote in message
news:x7d6hsh...@bolo.xenadyne.com...

> > > Of course, I assumed people would know that already:

> > > lock cmpxchg
> > > lock cmpxchg8b
> > > lock cmp8xchg16b

> > By the way, these instructions take on the order of 200 cycles on
> > Pentium 4's.

> This sounds a lot like the L2 miss penalty,
> so I wonder if this number assumes an L2 miss
> (likely in the case of a contended datum), or
> if CAS itself is implementation-wise equivalent
> to handling an L2 miss.

I'm afraid I really don't know. That number is an average from a real
world use on a spinlock's acquire function. Note that the spinlock doesn't
actually spin on that function, it spins on a read of the variable in a
pause loop.

If you acquire your spinlocks like this (pseudo-code):

#define FREE 0
#define LOCKED 1
int spinlock = FREE;

top:
    int temp = LOCKED;
    temp <-> spinlock;           /* atomic exchange (xchg) */
    if (temp == FREE) return;
    while (spinlock != FREE)
        pause();
    goto top;

Then your 'temp<->spinlock' (which is xchg, locked implicitly) will take
on the order of 200 clock cycles longer on a P4 than if you replace the
exchange with a dummy instruction (temp=FREE). Part of the penalty is
emptying the pipelines before the exchange. Part of it is filling the
pipelines back up afterwards. But I think you're right that the L2 miss is
probably most of it.

(Yes, spinlock was aligned at the beginning of its own 128-byte line.)

DS


Joseph Seigh

Jun 5, 2003, 6:13:55 PM

Well, you tell Linus Torvalds, who has a no-nonsense reputation, that he was
some kind of fool for allowing RCU (which is a lock-free algorithm) into
the Linux kernel. I'll just stand back out of spatter range. :)

http://lse.sourceforge.net/locking/rcupdate.html

Joe Seigh

SenderX

Jun 5, 2003, 7:16:49 PM

The scheme seems to work.

But there is one problem...


The node deletion delay is actually too great.


So if you post a couple hundred thousand nodes, and cycle them between
producer and consumer threads... The memory will rise faster than the
garbage collector can free nodes!


Can RCU suffer from this, under heavy copying and updating?

Joseph Seigh

Jun 5, 2003, 7:51:10 PM

SenderX wrote:
>
> The scheme seems to work.
>
> But there is one problem...
>
> The node deletion delay is actually too great.
>
> So if you post a couple hundred thousand nodes, and cycle them between
> producer and consumer threads... The memory will rise faster than the
> garbage collector can free nodes!
>
> Can RCU suffer from this, under heavy copying and updating?

It basically works out to a fixed delay, usually 2 to 3 polling
intervals, before releasing the resource. RCU isn't meant for
frequently updated data. It's an optimized reader/writer with
the reader really optimized. Zero synchronization overhead for
readers.

But as a GC scheme of sorts, it's probably as efficient as any out there,
maybe even more so.

Joe Seigh

SenderX

Jun 5, 2003, 8:06:06 PM

> You know my opinion of lock-free algorithms.

You should write to Microsoft and tell them to take the SList API out of
Windows, because it's lock-free, which makes it perform badly under
real-world situations.

Microsoft would have to take SLists out of their kernels too.

What a shame...

;)

> They're often more
> expensive in the real world than corresponding lockless implementations.

The LIFO && FIFO lock-free algos were written specifically to overcome the
problems / overhead of using locks, for protecting high-contention
collections.

Lock-Free is very elegant sync.

Also... Lock-Free is portable...


Look at my new lock-free system:

http://AppCore.home.attbi.com/src/LockFree.c

It uses a normal CAS, or LL (reserved/locked)/SC!

That means it can port to all the popular processors.

=)

SenderX

Jun 5, 2003, 8:47:35 PM

> But as a GC scheme of sorts, it's probably as efficient as any out there,
> maybe even more so.

RCU, or my scheme? ;)

I have made the garbage collector release a certain amount of nodes that can be
freed, to a global FIFO queue for reuse. It helps out a bit in the memory
rise due to delayed freeing, under heavy load.

This still won't do for collections under heavy contention.

Studying the RCU docs makes me appreciate atomic_ptr w/ ABA-resistant CAS
more and more!

Why would you want to read, copy, register a callback, and then finally
update pointers after a delay, when you could just do this:

atomic_ptr< C_Object > g_pSharedPtr = something;

atomic_ptr enhanced RCU:

/* Read */
local_ptr< C_Object > pLocalPtr( g_pSharedPtr );

/* Copy */
local_ptr< C_Object > pCopyPtr( new C_Object );
*pCopyPtr.get() = *pLocalPtr.get();

/* Update */
if ( ! g_pSharedPtr.cas( pLocalPtr, pCopyPtr ) )
{
    /* New copy freed */
    pCopyPtr = NULL;
}
else
{
    /* Old copy freed */
    pLocalPtr = NULL;
}


That would be much better than RCU?

SenderX

Jun 6, 2003, 4:57:45 AM

Take a look at this PowerPoint presentation:

http://www.microsoft.com/taiwan/events/slides/serverdevcon/14-5-Building%20Scalable%20Applications%20Part%202.ppt

Microsoft is pushing developers of scalable servers to use lock-free,
whenever possible.

Joseph Seigh

Jun 6, 2003, 6:31:30 AM

SenderX wrote:
>
> > But as a GC scheme of sorts, it's probably as efficient as any out there,
> > maybe even more so.
>
> RCU, or my scheme? ;)
>
> I have made the garbage collector release a certain amount of nodes that can be
> freed, to a global FIFO queue for reuse. It helps out a bit in the memory
> rise due to delayed freeing, under heavy load.
>
> This still won't do for collections under heavy contention.
>
> Studying the RCU docs makes me appreciate atomic_ptr w/ ABA-resistant CAS
> more and more!
>
> Why would you want to read, copy, register a callback, and then finally
> update pointers after a delay, when you could just do this:
>

...

That's not quite right. It's update pointers, delay, then deallocate. The
general rule of thumb I gave in VM was, update (remove objects by making them
unreachable), delay (wait for temporary references to complete), then deallocate
objects made unreachable (GC).

http://lse.sourceforge.net/locking/rcupdate.html


It's a reader/writer solution where you don't have really frequent updates, but
they are sensitive to any blocking delays (usually the case in an operating system),
and you have either relatively long-duration reader access, frequent reader access
that is on a really performance-critical path, or reader access that has to be
lock-free (e.g., interrupt handlers). Conventional reader/writer solutions are
suboptimal here.

Joe Seigh

Joseph Seigh

Jun 6, 2003, 6:39:52 AM

I wrote:
>
> For atomic<> variables with store.release and load.acquire semantics,
> given
>
>     T x;          // x is local only, not shared
>     atomic<T> y;  // shared
>
> load.acquire semantics for "x = y;" is
>
>     lock();
>     x = y;
>     unlock();
>
> store.release semantics for "y = x;" is
>
>     lock();
>     y = x;
>     unlock();
>

Actually, I like this. I think I'll use it as an alternate definition
of acquire/release semantics. I'll still do the formal definition
because it's nice to have it explicitly defined and it's useful for
doing formal program proofs.

Joe Seigh

Alexander Terekhov

Jun 6, 2003, 6:54:28 AM

Joseph Seigh wrote:
>
> I wrote:
> >
> > For atomic<> variables with store.release and load.acquire semantics,
> > given
> >
> >     T x;          // x is local only, not shared
> >     atomic<T> y;  // shared
> >
> > load.acquire semantics for "x = y;" is
> >
> >     lock();
> >     x = y;
> >     unlock();
> >
> > store.release semantics for "y = x;" is
> >
> >     lock();
> >     y = x;
> >     unlock();
> >
>
> Actually, I like this. ...

I also kinda like it. ;-) Your "x = y" has a *hoist-load-barrier* plus
a *hoist-store-barrier* plus a *sink-load-barrier* plus a *sink-store-
barrier* semantics. Oh, and your "y = x" is absolutely the same thing.

regards,
alexander.

Joseph Seigh

Jun 6, 2003, 7:37:45 AM

While the locks have memory barriers on both sides, the use of local
variables in the definitions limits inferable visibility in one of the
directions, hence the different "acquire" and "release".

The assignment of two global variables is equivalent to the assignment to
a temporary local variable for the purposes of semantics and to avoid
deadlock (even though there are no real locks. locks are associated with
the atomic variable itself).

So if y and z are global, and temp is local, "y = z" decomposes to
"temp = z; y = temp;" which further resolves as above.

So, for example, to use an atomic variable to implement DCL (aka DCCI) correctly, let
globalp be an atomic global pointer and localp be a thread local pointer.
We have

localp = globalp;
if (localp == NULL) {
    lock(somethingctl);
    if (globalp == NULL) {
        localp = new Something(); // new probably returns via a local temp, but this
                                  // avoids the issue of whether assigning directly
                                  // to globalp is safe.
        globalp = localp;
    }
    unlock(somethingctl);
}

which is semantically equivalent to

lock(globalp.mutex);
localp = globalp;
unlock(globalp.mutex);
if (localp == NULL) {
    lock(somethingctl);
    if (globalp == NULL) {
        localp = new Something();
        lock(globalp.mutex);
        globalp = localp;
        unlock(globalp.mutex);
    }
    unlock(somethingctl);
}

So if the second solution is correct and the atomic<> implementation is correct, then the
first solution is correct.
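
For comparison, here is a sketch of the same shape with the barriers
made explicit instead of implied by a per-variable lock. std::atomic
and the memory_order names are stand-ins for whatever atomic<>
implementation is actually in play:

#include <atomic>
#include <mutex>

struct Something { };

std::atomic<Something*> globalp{nullptr};
std::mutex somethingctl;

Something* get_instance()
{
    // First check is a load.acquire, so a non-NULL result is safe to use.
    Something* localp = globalp.load(std::memory_order_acquire);
    if (localp == nullptr) {
        std::lock_guard<std::mutex> guard(somethingctl);
        localp = globalp.load(std::memory_order_relaxed); // ordered by the mutex
        if (localp == nullptr) {
            localp = new Something();
            // store.release: publish only the fully constructed object.
            globalp.store(localp, std::memory_order_release);
        }
    }
    return localp;
}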

Joe Seigh

Alexander Terekhov

unread,
Jun 6, 2003, 1:07:18 PM6/6/03
to

Joseph Seigh wrote:
[...]

> > > > load.acquire semantics for "x = y;" is
> > > > lock();
> > > > x = y;
> > > > unlock();
> > > >
> > > > store.release semantics for "y = x;" is
> > > > lock();
> > > > y = x;
> > > > unlock();
> > > >
> > >
> > > Actually, I like this. ...
> >
> > I also kinda like it. ;-) Your "x = y" has a *hoist-load-barrier* plus
> > a *hoist-store-barrier* plus a *sink-load-barrier* plus a *sink-store-
> > barrier* semantics. Oh, and your "y = x" is absolutely the same thing.
> >
>
> While the locks have memory barriers on both sides, the use of local
> variables in the definitions limits inferable visibility in one of the
> directions, hence the different "acquire" and "release".

Well, okay, I'll try again. How about:

The informal semantics of acquire/release memory synchronization
operations can be defined as: (1) before an acquire access or an
ordinary access is allowed to perform with respect to any other
processor, all preceding acquire accesses must be performed (in the
program order), and (2) before a release access is allowed to perform
with respect to any other processor, all preceding acquire accesses,
release accesses, and all ordinary accesses must be performed.

An act of acquiring mutex ownership can be viewed as performing an
acquire operation. An act of releasing mutex ownership can be viewed
as performing a release operation.

An acquire operation prevents all subsequent load and store accesses
from being performed before the acquire operation -- it can be viewed
as a store or load operation that is performed in conjunction with the
"hoist-load" and "hoist-store" reordering constraints (barriers)
applied with respect to the subsequent load and store accesses.

A release operation prevents all preceding load and store accesses
from being performed after the release operation -- it can be viewed
as a store or load operation that is performed in conjunction with the
"sink-load" and "sink-store" reordering constraints/barriers applied
with respect to the preceding load and store accesses.

An act of acquiring a read lock on a read-write lock can be viewed as
an operation that is performed in conjunction with the "hoist-load"
barrier only, without the "hoist-store" barrier that is needed for an
acquire operation on a mutex. An act of releasing a read lock on a
read-write lock can be viewed as an operation that is performed in
conjunction with the "sink-load" barrier only, without the
"sink-store" barrier that is needed for a release operation on a mutex.

[...]


> So, for example, to use an atomic variable to implement DCL (aka DCCI) correctly, let
> globalp be an atomic global pointer and localp be a thread local pointer.
> We have
>
> localp = globalp;
> if (localp == NULL) {
> lock(somethingctl);

Sorry. That's DCSI, not DCCI.

regards,
alexander.

Sean Burke

unread,
Jun 6, 2003, 2:14:18 PM6/6/03
to

Joseph Seigh <jsei...@xemaps.com> writes:

Dave's assertion - "often more expensive in the real world"
is defensible, especially since `often' doesn't mean `always'.

When I last looked at a lockless algorithm for a doubly-linked
list, I found that it introduced a "sentinel" node between
every live node.

In academic circles, it is still acceptable to predict an
algorithm's performance based on instruction counts, but in
the real world, performance is determined by cache misses.
So my expectation would be that doubling the number of nodes
in a list would double the number of cache misses needed to
traverse it, even before taking into consideration the
increased exposure to cache-line ping-pong under severe contention.

As for RCU, my understanding is that it is only practical
for use in kernel-ish environments, where the code has
strong guarantees about (or control over) scheduling.

For user-mode code, practical applications of lockless
algorithms appear to consist of FIFOs, reference counting,
and similar "tiny" algorithms.

Even there, lockless algorithms come with restrictions.
For example, I've looked at using lockless push/pop operations
on a free-node list, and it certainly is a win versus the
mutex. But, under mutex control you can prune my list from
time to time to free excess nodes, and I would have to give
this up in the lockless code(*).

-Sean

(* The nodes are allocated in blocks, so pruning requires
traversing the list to gather a complete block of nodes
that are unused, so that the block can be freed.)

SenderX

unread,
Jun 6, 2003, 3:43:54 PM6/6/03
to
> When I last looked at a lockless algorithm for a doubly-linked
> list, I found that it introduced a "sentinel" node between
> every live node.

You were looking at one that could only compare and swap one node at a time,
correct?

If you use my new lock-free system with cmpxchg8b, or cmpxchg on 64-bit
systems, you can compare and swap two nodes at a time.

So, you can atomically change the next and prev pointers of a lock-free node
in a single CAS64.

If you use cmp8xchg16b you can compare two nodes, and swap four.

This will work for fast lock-free doubly-linked lists.
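
The mechanism here is packing both links into one CAS-able word. A
minimal sketch, assuming 32-bit pointers so next and prev fit in a
single 64-bit compare-and-swap -- this shows the primitive only, not a
full doubly-linked-list algorithm:

#include <atomic>
#include <cstdint>
#include <cstring>

// Both links of a node in one 64-bit word, so a single CAS64
// (cmpxchg8b on IA-32) compares and swaps next and prev together.
struct Links {
    std::uint32_t next; // a node pointer, stored as 32 bits
    std::uint32_t prev; // a node pointer, stored as 32 bits
};

bool swing_links(std::atomic<std::uint64_t>& links,
                 Links expected, Links desired)
{
    std::uint64_t exp, des;
    std::memcpy(&exp, &expected, sizeof exp); // pack structs into raw words
    std::memcpy(&des, &desired, sizeof des);
    return links.compare_exchange_strong(exp, des);
}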


I remember another paper that presented an algo that does this:

1. Atomically set a "to be freed" flag on the next pointer.

2. Swing the next pointer to its successor.

Other threads will still be able to traverse the "to be freed" node.

I can find and cite the docs if you want.

> Even there, lockless algorithms come with restrictions.
> For example, I've looked at using lockless push/pop operations
> on a free-node list, and it certainly is a win versus the
> mutex.

Oh yeah.

> But, under mutex control you can prune my list from
> time to time to free excess nodes, and I would have to give
> this up in the lockless code(*).

Not really. I am working on a memory pool API using LIFOs that use my new
lock-free system.

You can free objects, not nodes (hint), when the pool hits a
high-watermark.
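
The usual shape of such a pool is a Treiber-style LIFO. A minimal
sketch -- note that the pop side glosses over the ABA problem and the
hazard of reading t->next after another thread frees t, which is
exactly where a version counter, the wide CAS, or atomic_ptr comes in:

#include <atomic>

struct FreeNode { FreeNode* next; };

class FreePool {
    std::atomic<FreeNode*> top{nullptr};
public:
    void push(FreeNode* n) {
        FreeNode* t = top.load();
        do {
            n->next = t;                            // link above current top
        } while (!top.compare_exchange_weak(t, n)); // retry on interference
    }
    FreeNode* pop() {
        FreeNode* t = top.load();
        // Unsafe as written if nodes are freed concurrently: reading
        // t->next races with reclamation, and the CAS alone is ABA-prone.
        while (t && !top.compare_exchange_weak(t, t->next)) { }
        return t;
    }
};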


Also, you can just use atomic_ptr w/ the ABA-resistant CAS that I added
to do lock-free dynamic algos.

;)


Another use for lock-free:

I have posted a working "fast" semaphore using CAS64 in this group.


Disregard the first code, and look at the second.


I only posted the first algo to show off the race condition it has.

The second algo was to show how CAS64 can beat the race condition.


It is much faster than an OS-provided semaphore. I will post working test
code on my site to show off this new algo.
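
The general shape of such a "fast" semaphore is a single atomic count
with a kernel wait taken only under contention. A benaphore-style
sketch, using atomic add rather than the CAS64 of the posted algorithm;
os_wait/os_wake are placeholders for whatever OS primitive backs the
slow path:

#include <atomic>

void os_wait(); // placeholder: block on a kernel semaphore/event
void os_wake(); // placeholder: release one blocked waiter

class FastSemaphore {
    std::atomic<long> count; // > 0: free permits; < 0: blocked waiters
public:
    explicit FastSemaphore(long n) : count(n) {}
    void wait() {
        if (count.fetch_sub(1) <= 0) // old value: was a permit available?
            os_wait();               // no -- take the kernel slow path
    }
    void post() {
        if (count.fetch_add(1) < 0)  // old value < 0: someone is blocked
            os_wake();
    }
};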

Joseph Seigh

unread,
Jun 7, 2003, 12:40:46 PM6/7/03
to

Sean Burke wrote:
>
> As for RCU, my understanding is that it is only practical
> for use in kernel-ish environments, where the code has
> strong guarantees about (or control over) scheduling.

Non-preemptive kernels are a natural for RCU. I wouldn't
say that necessarily means that other environments are not
suitable. There are criteria for evaluating environments
for exploiting RCU. RCU requires a runtime, usually a polling
thread. But it's a lot less onerous than the runtime required
for Boehm-style GC.

I think the reason you see relatively few implementations of RCU
is due more to the fact that few people who are aware of how to
implement it are in situations where it can be exploited.
I can state that for a fact, actually.

>
> For user-mode code, practical applications of lockless
> algorithms appear to consist of FIFOs, reference counting,
> and similar "tiny" algorithms.
>
> Even there, lockless algorithms come with restrictions.
> For example, I've looked at using lockless push/pop operations
> on a free-node list, and it certainly is a win versus the
> mutex. But, under mutex control you can prune my list from
> time to time to free excess nodes, and I would have to give
> this up in the lockless code(*).
>
> -Sean
>
> (* The nodes are allocated in blocks, so pruning requires
> traversing the list to gather a complete block of nodes
> that are unused, so that the block can be freed.)
>

There is a lock-free algorithm that does this, but unfortunately
it requires a non-existent hardware instruction.

There are other ways to do that also. I may have actually
done something like that once with free-pools but it was a long
time ago. But you have to be more attuned to lock-free to
be aware of where it can be exploited.

Joe Seigh

SenderX

unread,
Jun 7, 2003, 3:22:18 PM6/7/03
to
> store.release semantics for "y = x;" is
> lock();
> y = x;
> unlock();

Like this:


atomic_whatever<T> Shared;


/* Load */

__asm { lfence }; /* Acquire? */

Local = Shared.LoadPtr();

__asm { sfence }; /* Release? */

/* Store */

__asm { lfence }; /* Acquire? */

Shared.StorePtr( Local );

__asm { sfence }; /* Release? */

Is that even close?

Joseph Seigh

unread,
Jun 8, 2003, 1:07:57 PM6/8/03
to

SenderX wrote:
>
> > store.release semantics for "y = x;" is
> > lock();
> > y = x;
> > unlock();
>
> Like this:
>
> atomic_whatever<T> Shared;
>
> /* Load */
>
> __asm { lfence }; /* Acquire? */
>
> Local = Shared.LoadPtr();
>
> __asm { sfence }; /* Release? */
>
> /* Store */
>
> __asm { lfence }; /* Acquire? */
>
> Shared.StorePtr( Local );
>
> __asm { sfence }; /* Release? */
>
> Is that even close?
>

More like

/* Load */

Local = Shared.LoadPtr();

__asm { lfence }; /* Acquire */

/* Store */

__asm { sfence }; /* Release */

Shared.StorePtr( Local );

depending on the actual memory architecture. But that's an implementation.
It's more of a question of how you define the semantics at this point.
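
To see why that placement matters, consider the classic
producer/consumer hand-off, written with C++-style fences for
concreteness (the lfence/sfence mapping above is the implementation
detail):

#include <atomic>

int payload = 0;                  // ordinary, non-atomic shared data
std::atomic<bool> ready{false};

void producer() {
    payload = 42;
    // Release: fence *before* the store, so the payload write
    // cannot sink below the flag write.
    std::atomic_thread_fence(std::memory_order_release);
    ready.store(true, std::memory_order_relaxed);
}

void consumer() {
    while (!ready.load(std::memory_order_relaxed)) { /* spin */ }
    // Acquire: fence *after* the load, so the payload read
    // cannot hoist above the flag read.
    std::atomic_thread_fence(std::memory_order_acquire);
    int v = payload; // guaranteed to observe 42
    (void)v;
}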

Joe Seigh

Joseph Seigh

unread,
Jun 9, 2003, 6:32:23 AM6/9/03
to

Alexander Terekhov wrote:
>
> Joseph Seigh wrote:
> [...]
> > > > > load.acquire semantics for "x = y;" is
> > > > > lock();
> > > > > x = y;
> > > > > unlock();
> > > > >
> > > > > store.release semantics for "y = x;" is
> > > > > lock();
> > > > > y = x;
> > > > > unlock();
> > > > >
> > > >
> > > > Actually, I like this. ...
> > >
> > > I also kinda like it. ;-) Your "x = y" has a *hoist-load-barrier* plus
> > > a *hoist-store-barrier* plus a *sink-load-barrier* plus a *sink-store-
> > > barrier* semantics. Oh, and your "y = x" is absolutely the same thing.
> > >
> >
> > While the locks have memory barriers on both sides, the use of local
> > variables in the definitions limits inferable visibility in one of the
> > directions, hence the different "acquire" and "release".
>
> Well, okay, I'll try again. How about:
>

(snip)

Informal? The formal definition is probably a bit briefer than that.

To complicate things, there are at least 3 memory barrier definitions you
could apply to atomic<T>: the acquire/release semantics mentioned here;
semantics for immutable objects (DCL-initialized objects, for example),
which are what most people are thinking of; and atomic with no memory
barriers at all.

What kind of semantics you use matters depending on what your usage is.
But I don't think the fact that there are different kinds will matter
to most people. They'll just use whatever passes itself off as a
memory barrier whether or not it is appropriate, or even consistent
across platforms.

For instance, if you are using RCU in the Linux kernel, you may need any
of the 3 types I mentioned. I don't know whether Linux has all three
defined or, if it has only one, which one it is.

Joe Seigh

Alexander Terekhov

unread,
Jun 10, 2003, 5:14:47 AM6/10/03
to

Joseph Seigh wrote:
[...]

> For instance, if you are using RCU in the Linux kernel, you may need any
> of the 3 types I mentioned. I don't know whether Linux has all three
> defined or, if it has only one, which one it is.

Frankly, I don't much care what Linux does currently. I'm pretty sure that
what's really needed is actually something along the lines of:

http://www.terekhov.de/pthread_refcount_t/experimental/refcount.cpp

struct msync {
    enum hlb_t   { hlb };   // hoist-load barrier
    enum ddhlb_t { ddhlb }; // hoist-load barrier with data-dependency "hint"
    enum hsb_t   { hsb };   // hoist-store barrier
    enum slb_t   { slb };   // sink-load barrier
    enum ddslb_t { ddslb }; // sink-load barrier with data-dependency "hint"
    enum ssb_t   { ssb };   // sink-store barrier
    enum acq_t   { acq };   // hoist-load + hoist-store barrier
    enum rel_t   { rel };   // sink-load + sink-store barrier
    enum none_t  { none };  // naked
};

I mean naked atomic operations and atomic ops with various combinations
of "hoist" and "sink" barriers (with and without data-dependency "hints")
that would impose reordering restrictions on both compilers and hardware.
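
A sketch of how those tags could select the barrier at the call site,
by overloading on the tag types above. Declarations only; the
signatures are illustrative, not a real library:

// Overloads selected by msync tag, making the barrier explicit and
// visible at every access (declarations only).
template<typename T>
struct atomic {
    T load(msync::acq_t) const;   // load + hoist-load/hoist-store barriers
    T load(msync::ddhlb_t) const; // load ordered via data dependency only
    T load(msync::none_t) const;  // naked load
    void store(T, msync::rel_t);  // sink-load/sink-store barriers + store
    void store(T, msync::none_t); // naked store
};

// Usage:
//   atomic<Node*> head;
//   Node* n = head.load(msync::ddhlb); // RCU-style dependent read
//   head.store(n2, msync::rel);        // publish with release semantics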

regards,
alexander.

--
http://listman.redhat.com/archives/phil-list/2003-June/msg00018.html
(Subject: Re: nptl 0.45)
