
Lock-free binary trees


Joe Seigh

Jul 2, 2006, 12:05:09 PM
Well, c.p.t is getting boring and Slashdot has become the daytime TV of the internet,
so I suppose I will post some code I did prototyping lock-free binary trees, since it's
been long enough since I've done anything on it and I still haven't figured out a use for it.
Also, posting it might be a way to have fun with Sun Microsystems Research again
later on (inside joke, I won't explain here).

It's actually just a partial prototype which I wrote to see roughly what the code would
look like. It doesn't have any of the base methods that Java and STL collections have;
those are really composite collections that include list/array collections. Basically your
tree collection is really a tree collection and a list collection side by side under the
covers.

Some of the stuff in it, like the height logic, isn't actually used since I haven't figured
out whether to do height-balanced or weight-balanced trees, and it's not complete
or correct. It's just there to see what it would look like and where it might go.

It's only reader lock-free. The write side has to use conventional locking or synchronization.
Now for the basic techniques.

Adding a node is straightforward: just insert it as a leaf node. I'm assuming JSR-133
volatile semantics have fixed the memory visibility problems.
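
In outline, a writer initializes the new leaf with plain stores and then
publishes it through one of the volatile links; JSR-133 gives the volatile
store release semantics (a condensed sketch of what addNode in the followup
post does):

    Node leaf = new Node(o);   // plain stores; node not yet visible to readers

    parent.ltree = leaf;       // volatile store publishes the fully built node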

Removing a node is also pretty straightforward using PCOW (partial copy on write),
replacing part of the subtree of the removed node. See my previous posting on this
http://groups.google.com/groups?selm=opsock48geqm36vk%40grunion
for details.

Balancing a tree is a little tricky. If you move nodes from one subtree to another
subtree, there's a risk that a concurrent lookup might miss finding a node
as it's moved out from underneath it. It's an error for a lookup to not find a
node that is in the tree collection. The trick is to have a rotate method that
adds the node to be moved to the destination subtree before it removes the
node from the source subtree. That way the new location of the node will
be atomically visible when the old location is removed. I only implemented
rotateRight in this prototype, and I didn't implement the balancing
heuristics yet either. You obviously don't want to do too many overlapping
PCOW operations. The current algorithms for red-black trees or AVL trees
probably won't be too useful as is, since they weren't written with atomic
operations or PCOW in mind.
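
In outline, rotateRight does the add first and the PCOW copy second (a
condensed sketch of the method in the followup post, null checks omitted):

    Node rotateRight(Node p) {
        // 1. make the pivot's value visible in the destination subtree
        addNode(p.rtree, p.value);

        // 2. PCOW copy of the left subtree minus its max node; publishing
        // the new subtree root removes the old location of the value
        return new Node(maxNode(p.ltree).value,
                        pcloneMinusMaxNode(p.ltree),
                        p.rtree);
    }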

Source code in a followup post. I am deleting some debug methods which aren't
directly applicable. Hopefully I didn't delete too much.

--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.

Joe Seigh

Jul 2, 2006, 12:12:25 PM
Partial and probably not correct prototype below

import java.util.*;

public class LFTreeSet
{
    class Sequence {
        int value;
        synchronized int next() {
            value++;
            return value;
        }
    }

    Sequence sequence = new Sequence();

    class Node {

        Object value;
        int height;    // tree height
        int lheight;   // left subtree height
        int rheight;   // right subtree height
        volatile Node ltree;   // left subtree
        volatile Node rtree;   // right subtree

        int unique;    // unique node id
        int cost;      // node cost
        int weight;    // node weight

        void init(Object o, Node lnode, Node rnode) {
            value = o;
            if (lnode != null) {
                ltree = lnode;
                lheight = lnode.height;
            }
            else {
                ltree = null;
                lheight = 0;
            }
            if (rnode != null) {
                rtree = rnode;
                rheight = rnode.height;
            }
            else {
                rtree = null;
                rheight = 0;
            }
            height = maximum(lheight, rheight) + 1;

            unique = sequence.next();   // assign unique sequence number
        }

        Node(Object o) {
            init(o, null, null);
        }

        Node(Object o, Node lnode, Node rnode) {
            init(o, lnode, rnode);
        }

    }

    Comparator comparator;   // comparator
    volatile Node root;      // tree root node

    //=========================================================================
    // Private Methods
    //=========================================================================

    //-------------------------------------------------------------------------
    // maximum --
    //-------------------------------------------------------------------------
    static int maximum(int x, int y) {
        if (x > y) return x; else return y;
    }

    //-------------------------------------------------------------------------
    // addNode --
    //
    // returns:
    //   the root of the updated subtree, or the subtree unchanged
    //   if it already contains the value
    //
    //-------------------------------------------------------------------------
    Node addNode(Node z, Object o)
        throws ClassCastException {
        int rc;

        if (z == null) {
            comparator.compare(o, o);   // force ClassCastException if wrong class
            return new Node(o);
        }

        rc = comparator.compare(o, z.value);
        if (rc < 0) {
            z.ltree = addNode(z.ltree, o);
            z.lheight = z.ltree.height;
        }

        else if (rc > 0) {
            z.rtree = addNode(z.rtree, o);
            z.rheight = z.rtree.height;
        }

        // recalculate height
        z.height = maximum(z.lheight, z.rheight) + 1;

        return z;
    }


    //-------------------------------------------------------------------------
    // maxNode --
    //-------------------------------------------------------------------------
    static Node maxNode(Node z) {
        while (z.rtree != null)
            z = z.rtree;
        return z;
    }


    //-------------------------------------------------------------------------
    // pcloneMinusMaxNode -- partial copy, dropping the maximum node
    //
    // static won't work?
    //-------------------------------------------------------------------------
    Node pcloneMinusMaxNode(Node z) {
        if (z.rtree == null)
            return z.ltree;
        else
            return new Node(z.value,
                            z.ltree,
                            pcloneMinusMaxNode(z.rtree));
    }


    //-------------------------------------------------------------------------
    // rotateRight --
    //
    // adds the pivot value to the destination subtree before the PCOW copy
    // removes it from the source, so a concurrent lookup never misses it
    //-------------------------------------------------------------------------
    Node rotateRight(Node p) {
        if (p == null)
            return null;

        if (p.ltree == null)
            return p;

        if (p.rtree == null)
            p.rtree = new Node(p.value);
        else
            addNode(p.rtree, p.value);

        return new Node(maxNode(p.ltree).value,
                        pcloneMinusMaxNode(p.ltree),
                        p.rtree);
    }


    //=========================================================================
    // Public Constructors and Methods
    //=========================================================================

    //-------------------------------------------------------------------------
    // LFTreeSet -- ctor
    //-------------------------------------------------------------------------
    public LFTreeSet(Comparator comparator) {
        this.comparator = comparator;
        root = null;
    }

    //-------------------------------------------------------------------------
    // add --
    //-------------------------------------------------------------------------
    public boolean add(Object o)
        throws ClassCastException {

        if (contains(o)) {
            return false;
        }

        else {
            root = addNode(root, o);
            return true;
        }

    }


    //-------------------------------------------------------------------------
    // remove --
    //-------------------------------------------------------------------------
    public boolean remove(Object o)
        throws ClassCastException {
        Node p, p2, p3;
        int n;

        // find the node and its parent

        p2 = null;
        p = root;
        while (p != null && ((n = comparator.compare(o, p.value)) != 0)) {
            p2 = p;
            if (n < 0)
                p = p.ltree;
            else
                p = p.rtree;
        }

        if (p == null)
            return false;   // not found

        // leaf node
        if (p.ltree == null && p.rtree == null)
            p3 = null;

        // single subtree
        else if (p.ltree == null)
            p3 = p.rtree;

        // single subtree
        else if (p.rtree == null)
            p3 = p.ltree;

        // PCOW: new subtree with root node replaced by max node from left subtree
        else
            p3 = new Node(maxNode(p.ltree).value,
                          pcloneMinusMaxNode(p.ltree),
                          p.rtree);

        //
        // store subtree (p3 is null when a leaf was removed)
        //
        if (p2 == null)
            root = p3;
        else if (p2.ltree == p) {
            p2.ltree = p3;
            p2.lheight = (p3 == null) ? 0 : p3.height;
            p2.height = maximum(p2.lheight, p2.rheight) + 1;
        }
        else {
            p2.rtree = p3;
            p2.rheight = (p3 == null) ? 0 : p3.height;
            p2.height = maximum(p2.lheight, p2.rheight) + 1;
        }

        return true;
    }


    //-------------------------------------------------------------------------
    // contains --
    //-------------------------------------------------------------------------
    public boolean contains(Object o)
        throws ClassCastException {
        Node p;
        int n;

        p = root;
        while (p != null && ((n = comparator.compare(o, p.value)) != 0)) {
            if (n < 0) p = p.ltree; else p = p.rtree;
        }

        return (p != null);
    }


    //-------------------------------------------------------------------------
    // comparator --
    //-------------------------------------------------------------------------
    public Comparator comparator() {
        return comparator;
    }


    //=========================================================================
    // Debug Methods
    //=========================================================================

    //-------------------------------------------------------------------------
    //-------------------------------------------------------------------------
    int nodeCost(Node p) {
        if (p == null) return 0;
        return p.cost = ((nodeCost(p.ltree) + nodeCost(p.rtree))*2 + 1);
    }

    //-------------------------------------------------------------------------
    //-------------------------------------------------------------------------
    int nodeWeight(Node p) {
        if (p == null) return 0;
        return p.weight = (nodeWeight(p.ltree) + nodeWeight(p.rtree) + 1);
    }


    //-------------------------------------------------------------------------
    //-------------------------------------------------------------------------
    public void rotateTreeRight() {
        root = rotateRight(root);
    }


}

//--//

Chris Uppal

Jul 3, 2006, 5:09:24 AM
Joe Seigh wrote:

> It's only reader lock-free. The write side has to use conventional
> locking or synchronization. Now for the basic techniques.
>
> Adding a node is straight forward. Just insert it as a leaf node. I'm
> assuming JSR-133 volatile semantics has fixed the memory visibility
> problems.

Rather a dangerous assumption. As I read the code, you need more
synchronisation than you have (almost none). There doesn't seem to be anything
to ensure that the height values are propagated correctly (I saw your comment
that they aren't used), nor the costs and weights, nor the "value" of the node,
nor the memory pointed /to/ by value.

Incidentally, the stuff in java.util.concurrent.atomic might be useful if you
don't already know about it.

-- chris

Joe Seigh

Jul 3, 2006, 7:29:08 AM
Chris Uppal wrote:
> Joe Seigh wrote:
>
>
>>It's only reader lock-free. The write side has to use conventional
>>locking or synchronization. Now for the basic techniques.
>>
>>Adding a node is straight forward. Just insert it as a leaf node. I'm
>>assuming JSR-133 volatile semantics has fixed the memory visibility
>>problems.
>
>
> Rather a dangerous assumption. As I read the code, you need more
> synchronisation than you have (almost none). There doesn't seem to be anything
> to ensure that the height values are propagated correctly (I saw your comment
> that they aren't used), nor the costs and weights, nor the "value" of the node,
> nor the memory pointed /to/ by value.

The height/weight stuff was only partly implemented. I didn't take it
out as I didn't want to do any major editing. Just ignore it.

AFAIK, JSR-133 added acquire and release semantics to the loads and stores
of volatile variables respectively.

There isn't any synchronization. Read access doesn't require it, and for
write access it needs to be explicitly added if required. E.g., if you have a
single writer then you don't need it.

Not requiring synchronization is sort of the whole point of lock-free. You
do know what lock-free is and how it works, don't you?

>
> Incidentally, the stuff in java.util.concurrent.atomic might be useful if you
> don't already know about it.
>

From the java.util.concurrent.atomic description

The memory effects for accesses and updates of atomics generally follow the rules for volatiles:

* get has the memory effects of reading a volatile variable.
* set has the memory effects of writing (assigning) a volatile variable.
* weakCompareAndSet atomically reads and conditionally writes a variable,
is ordered with respect to other memory operations on that variable,
but otherwise acts as an ordinary non-volatile memory operation.
* compareAndSet and all other read-and-update operations such as
getAndIncrement have the memory effects of both reading and
writing volatile variables.

I don't need compareAndSet so volatile is sufficient.
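
In other words, for what the prototype needs, a plain volatile field and an
AtomicReference are interchangeable (illustration only, Object used for
brevity):

    import java.util.concurrent.atomic.AtomicReference;

    class Holder {
        volatile Object v;
        final AtomicReference r = new AtomicReference();

        void publish(Object o) { v = o; }   // same memory effects as r.set(o)
        Object read() { return v; }         // same memory effects as r.get()
    }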

You might want to take a look at the source for the lock-free classes in
java.util.concurrent.

Chris Uppal

Jul 4, 2006, 5:53:31 AM
Joe Seigh wrote:

[me:]


> > Rather a dangerous assumption. As I read the code, you need more
> > synchronisation than you have (almost none). There doesn't seem to be
> > anything
> > to ensure that the height values are propagated correctly (I saw your
> > comment
> > that they aren't used), nor the costs and weights, nor the "value" of
> > the node,
> > nor the memory pointed /to/ by value.
>
> The height/weight stuff was was only partly implemented. I didn't take it
> out as I didn't want to do any major editting. Just ignore it.
>
> AFAIK, JSR-133 added acquire and release semantics to the loads and stores
> of volatile variables respectively.

I believe so too, but that JSR is no longer relevant -- this stuff is now part
of the language spec. See JLS3 for details (which I confess I haven't
followed).

But -- as far as I know -- reads /through/ a volatile variable are not
transitively given the "special" quality that makes volatile
not-entirely-useless for, say, integer-valued variables.

For instance, given:

class A
{
    volatile int[] array; // NB: not final

    A()
    {
        array = new int[4];
        for (int i = 0; i < array.length; i++)
            array[i] = i;
    }

    int get(int i) { return array[i]; }
}

and in one thread:

someA = new A();

and later, but in a different thread;

int x = someA.get(2);

then that can return either 0 or 2. All that the 'volatile' qualifier does is
to make it impossible (in this case) for get() to throw an NPE. The value of
"array" itself is passed correctly from thread to thread, but that (unlike
synchronisation) doesn't imply that memory transitively reachable via that
value is also resynchronised.
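
(For contrast, as I understand the new model, a variant that fills the array
/before/ the volatile store would be safely published:

    class A2
    {
        volatile int[] array;
        A2()
        {
            int[] a = new int[4];   // plain stores to a local array
            for (int i = 0; i < a.length; i++)
                a[i] = i;
            array = a;              // the volatile write publishes the contents
        }

        int get(int i) { return array[i]; }
    }

Here the element writes happen-before the volatile write, so a reader that
sees the new array through the volatile field must also see its contents.)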

If I'm misreading the spec in this, then I'd appreciate correction and a
reference (from anyone).

-- chris

Chris Thomasson

Jul 7, 2006, 2:42:03 PM
"Chris Uppal" <chris...@metagnostic.REMOVE-THIS.org> wrote in message
news:44a8dddb$2$661$bed6...@news.gradwell.net...

> Joe Seigh wrote:
>
>> It's only reader lock-free. The write side has to use conventional
>> locking or synchronization. Now for the basic techniques.
>>
>> Adding a node is straight forward. Just insert it as a leaf node. I'm
>> assuming JSR-133 volatile semantics has fixed the memory visibility
>> problems.
>
> Rather a dangerous assumption. As I read the code, you need more
> synchronisation than you have (almost none).

I disagree. I would argue that the synchronization properties are perhaps
too *strong*...

He is relying on implicit release semantics for writer threads. A release
barrier is used within the writer thread's critical section AFTER you
initialize the new node's links/etc... and BEFORE you store a pointer to the
new node into a location that is accessible to reader threads.


AFAIK, Java volatiles have release semantics for stores:

JavaVolatileStore(...) {
    membar #LoadStore|#StoreStore
    /* Perform the actual store */
}


This will work for lock-free reader patterns in general:

https://coolthreads.dev.java.net/servlets/ProjectForumMessageView?forumID=1797&messageID=11068


* Very Strong Sync Properties

"Iteration" over lock-free read data-structures is an expensive process in
Java; the volatile qualifier seems to force a release barrier for every store
and, AFAIK, an acquire barrier for every load. Every node visited
during a lock-free iteration would be executing full-blown memory barriers
when, clearly, simple dependent load ordering is all that is required...

Then you have to factor in the hefty garbage collector that is used in
Java... It will be generating a lot more overhead under load (thread
suspension/stack introspection, etc...) than a lock-free proxy garbage
collector would...


Joe Seigh

Jul 7, 2006, 6:58:50 PM
To illustrate the rotate operations a little more
take the following tree for a rotate right operation

      x
     / \
    /   \
   A     B
    \
     y

The first step is to duplicate the root node by adding
it to the right subtree B.

      x
     / \
    /   \
   A     B
    \   /
     y x

The second step is to delete the root node x and replace it
with the maximum node y from the left subtree A.

      y
     / \
    /   \
   A     B
        /
       x


Even though the add and delete operations are two separate
atomic operations, the effect of the rotate operation is
atomic, i.e. two nodes appear to be moved atomically.

Chris Uppal

Jul 8, 2006, 7:17:06 AM
Chris Thomasson wrote:

> I disagree. I would argue that the synchronization properties are perhaps,
> too *strong...
>
> He is relying on implicit release semantics for writer threads. A release
> barrier is used within the writer threads critical section AFTER you
> initialize the new nodes links/ect... and BEFORE you store a pointer to
> the
> new node into a location that is assessable to reader
> threads.

I don't know what a "release barrier" is. I see that this is x-posted to
c.p.threads, where presumably the term is well known, but I'm reading in
c.l.j.p where the term has no meaning (unless we want to talk about a specific
implementation of the JVM spec, or provide an explanation in terms of the JVMs
semantics). Anyway...

It's not clear to me whether the synchronised operation included in a Node's
constructor is intended to be part of the "real" code, or is just another bit
of left-over "this is incorrect, but ignore it" code like the cost and weight
fields (note that the 'sequence' field itself is not thread-safe). Also it's
not clear whether the use of a synchronised method is intended to invoke the
full synchronisation semantics, or is just an implementation detail --
something that might better be implemented (without synchronisation) by using a
java.util.concurrent.atomic.AtomicInteger.

Still, assuming all that, I agree that the synchronised Sequence.next() does,
after all, provide the necessary happens-before relationship between assignment
to the Node's value field, and uses of it by the Comparator. (And assuming that
the value objects don't change significantly once added, but that seems fair
even without threading ;-)

-- chris


Thomas Hawtin

Jul 8, 2006, 11:17:49 AM
Chris Uppal wrote:
>
> I don't know what a "release barrier" is. I see that this is x-posted to

I think he means "write barrier". In the new Java Memory Model, the
semantics for releasing a lock are not that strong.

> It's not clear to me that the synchronised operation included in a Node's
> constructor is intended to be part of the "real" code, or is just another bit
> of left-over "this is incorrect, but ignore it" code like the cost and weight
> fields (note that the 'sequence' field itself is not thread-safe). Also it's
> not clear whether the use of a synchronised method is intended to invoke the
> full synchronisation semantics, or is just an implementation detail --
> something that might better be implemented (without synchronisation) by using a
> java.util.concurrent.atomic.AtomicInteger.

As there is only a single writer thread, the synchronisation does
nothing and can therefore be removed by the JVM.

> Still, assuming all that, I agree that the synchronised Sequence.next() does,
> after all, provide the necessary happens-before relationship between assignment
> to the Nodes value field, and uses of it by the Comparator. (And assuming that
> the value objects don't change significantly once added, but that seems fair
> even without threading ;-)

Giving the code a quick once-over, it looks as if the volatiles ensure that
the bits of the code that are supposed to work do what they are supposed to.

I don't know how well it will perform. Reading a volatile is apparently
very cheap. The main cost may be chucking away register values, but that
won't matter after the first read. If writing is really infrequent, then
an immutable tree could be used, with a single volatile holding it
(memory allocation is cheap).
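
A rough sketch of what I mean, with made-up names -- path-copying writes,
one volatile root, readers touch nothing else:

    import java.util.Comparator;

    final class ImmutableTreeSet {
        static final class Node {
            final Object value; final Node left, right;
            Node(Object v, Node l, Node r) { value = v; left = l; right = r; }
        }

        private final Comparator cmp;
        private volatile Node root;   // the only mutable field

        ImmutableTreeSet(Comparator cmp) { this.cmp = cmp; }

        // readers: plain traversal, just the one volatile read of root
        boolean contains(Object o) {
            Node p = root;
            while (p != null) {
                int n = cmp.compare(o, p.value);
                if (n == 0) return true;
                p = (n < 0) ? p.left : p.right;
            }
            return false;
        }

        // writer: copy the path to the insertion point, then publish the
        // new root; readers see the old tree or the new one, never a mix
        synchronized void add(Object o) { root = insert(root, o); }

        private Node insert(Node p, Object o) {
            if (p == null) return new Node(o, null, null);
            int n = cmp.compare(o, p.value);
            if (n == 0) return p;
            if (n < 0) return new Node(p.value, insert(p.left, o), p.right);
            return new Node(p.value, p.left, insert(p.right, o));
        }
    }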

Tom Hawtin
--
Unemployed English Java programmer
http://jroller.com/page/tackline/

Joe Seigh

Jul 8, 2006, 11:23:49 AM
Chris Uppal wrote:
> Chris Thomasson wrote:
>
>
>>I disagree. I would argue that the synchronization properties are perhaps,
>>too *strong...
>>
>>He is relying on implicit release semantics for writer threads. A release
>>barrier is used within the writer threads critical section AFTER you
>>initialize the new nodes links/ect... and BEFORE you store a pointer to
>>the
>>new node into a location that is assessable to reader
>>threads.
>
>
> I don't know what a "release barrier" is. I see that this is x-posted to
> c.p.threads, where presumably the term is well known, but I'm reading in
> c.l.j.p where the term has no meaning (unless we want to talk about a specific
> implementation of the JVM spec, or provide an explanation in terms of the JVMs
> semantics). Anyway...
>

http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html#volatile

might answer some questions.

gottlo...@gmail.com

Jul 10, 2006, 7:57:16 AM

Joe Seigh wrote:
> Well, c.p.t is getting boring and slashdot has become the daytime tv of the internet
> so I suppose I will post some code I did prototyping lock-free binary trees since it's
> been long enough since I've done anything on it and I still haven't figured out a use for it.
>
...

> It's only reader lock-free. The write side has to use conventional locking or synchronization.

....


>
> Balancing a tree is a little tricky.

...
>
> --
> Joe Seigh
>

Why not do a skip-list instead? A skip-list
- has the same performance characteristics as a tree (O(log N) lookup, etc.)
- is easier to code
- can be both reader and writer lock-free

Tony

Chris Thomasson

Jul 10, 2006, 5:52:13 PM
<gottlo...@gmail.com> wrote in message
news:1152532636....@m79g2000cwm.googlegroups.com...

>
> Joe Seigh wrote:
>> Well, c.p.t is getting boring and slashdot has become the daytime tv of
>> the internet
>> so I suppose I will post some code I did prototyping lock-free binary
>> trees since it's
>> been long enough since I've done anything on it and I still haven't
>> figured out a use for it.
>>
> ...
>
>> It's only reader lock-free. The write side has to use conventional
>> locking or synchronization.
> ....
>>
>> Balancing a tree is a little tricky.
> ...
>>
>> --
>> Joe Seigh
[...]

> - can be both reader and writer lock free

Is there a lock-free skip-list writer algorithm that has less overhead than
an uncontended mutex? I don't ever remember seeing one that used fewer than
two atomic operations and/or memory barriers. The basic point that I'm getting
at here is that if an uncontended lock-free writer algorithm has two or more
atomic operations and/or memory barriers, the overhead will be equal to or
greater than an uncontended mutex! For instance, the well-known lock-free
queue by Michael and Scott forces producer threads to execute at least two,
or three under contention, CAS /w StoreLoad or LoadStore operations;
uncontended fast-pathed mutex access will perform equal to or better than
their enqueue operations. Therefore, IMHO, using a hashed locking scheme for
the writer side of a lock-free reader pattern is usually ideal...

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/e0c011baf08844c4/3ca11e0c3dcf762c?q=multi-mutex&rnum=1#3ca11e0c3dcf762c
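
By a hashed locking scheme I mean something along these lines (an
illustrative sketch, names made up):

    final class MultiMutex {
        private final Object[] locks = new Object[64];   // power of two

        MultiMutex() {
            for (int i = 0; i < locks.length; i++)
                locks[i] = new Object();
        }

        // pick a lock by hashing the guarded object's identity
        Object lockFor(Object o) {
            int h = System.identityHashCode(o);
            return locks[(h ^ (h >>> 16)) & (locks.length - 1)];
        }
    }

    // writer side only:
    //   synchronized (mm.lockFor(collection)) { ...mutate... }

Independent writers usually hit different locks, and readers never touch any
of them.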

Any thoughts?


gottlo...@gmail.com

Jul 11, 2006, 12:47:10 PM

Chris Thomasson wrote:
>
> Is there a lock-free skip-list writer algorithm that has less overhead than
> an uncontented mutex? I don't ever remember seeing one that used fewer than
> two atomic operations and/or memory barriers. The basic point that I getting
> at here is that if an uncontended lock-free writer algorithm has two or more
> atomic operations and/or memory barriers, the overhead will be equal to or
> greater than an uncontended mutex! For instance, the well known lock-free
> queue by Michael and Scott forces producer threads to execute at least two,
> or three under contention, CAS /w StoreLoad or LoadStore operations;
> uncontended fast-pathed mutex access will perform equal to or better than
> their enqueue operations. Therefore, IMHO using a hashed locking scheme for
> the writer side of a lock-free reader pattern is usually ideal...
>
> http://groups.google.com/group/comp.programming.threads/browse_frm/thread/e0c011baf08844c4/3ca11e0c3dcf762c?q=multi-mutex&rnum=1#3ca11e0c3dcf762c
>
> Any thoughts?

I haven't looked at the skip lists very closely. One day it dawned on
me that lock-free skip lists would be possible, and so I searched the
internet and found they already existed (only in the last few years).
So I was "too late" and didn't bother looking at them much further.

Anyhow, it's the *contended* mutex path that takes longer :-). If you
don't expect many writers, then yeah, the extra work probably isn't
worth it. But if you need multiple writers, maybe those extra barriers
are still faster than waiting for a writer?
Also, by the same logic, an uncontended mutex on reading might not be
too high a price to pay (even if it was slower than lock-free, at least
the code will be simpler), but then we wouldn't be talking about
lock-free at all, would we?

And remember, it is not just the number of atomics/barriers, it is the
number MORE than what you had to do anyhow for the sake of the
lock-free readers.

Tony

Chris Thomasson

Jul 11, 2006, 7:26:20 PM
<gottlo...@gmail.com> wrote in message
news:1152636430.5...@m73g2000cwd.googlegroups.com...
>
> Chris Thomasson wrote:
>>
>> [...]

>> Therefore, IMHO using a hashed locking scheme for
>> the writer side of a lock-free reader pattern is usually ideal...
>>
>> http://groups.google.com/group/comp.programming.threads/browse_frm/thread/e0c011baf08844c4/3ca11e0c3dcf762c?q=multi-mutex&rnum=1#3ca11e0c3dcf762c
>>
>> Any thoughts?
>
[...]

>
> Anyhow, it's the *contended* mutex path that takes longer :-).

Well, yeah... However, you kind of have to take adaptive mutexes into
account; a contended case under moderate load can usually still hit a
fast-path...


> If you don't expect many writers, then yeah, the extra work probably isn't
> worth it. But if you need multiple writers, maybe those extra barriers
> are still faster than waiting for a writer?

Maybe... You have to know if the contended case for a locking scheme based
on adaptive mutexes was found to frequently park writer threads. A well
designed locking scheme can be "contention-less"... Think of the Solaris
memory allocator...


> Also, by the same logic, an uncontended mutex on reading might not be
> too high a price to pay

I have to disagree here. Lock-free reader patterns are all about "completely
removing the need for atomic-ops/membars". IMO, it's very hard to try to
compare that to mutexes, or "any other" algorithm that uses
atomics/membars...

http://groups.google.com/group/comp.programming.threads/msg/ec103670e323111a

http://groups.google.com/group/comp.programming.threads/msg/fdc665e616176dce


For instance, I know that the StoreLoad barrier in SMR can "drastically
reduce" per-thread reads-per-second. This is why RCU-SMR and vZOOM were
developed...


> (even if it was slower than lock-free, at least
> the code will be simpler), but then we wouldn't be talking about
> lock-free at all, would we?

;)


> And remember, it is not just the number of atomics/barriers, it is the
> number MORE than what you had to do anyhow for the sake of the
> lock-free readers.

The general idea for lock-free reader patterns is to completely eliminate
all of the atomics/barriers for read-only accesses:


http://groups.google.com/group/comp.programming.threads/msg/5c0dbc2ab7fc46c3

http://groups.google.com/group/comp.programming.threads/msg/e0c599d20481fab8


Joe Seigh and I have both developed user-space solutions' for this kind of
stuff...


gottlo...@gmail.com

Jul 12, 2006, 5:36:57 PM

Chris Thomasson wrote:
>
> The general idea for lock-free reader patterns is to completely eliminate
> all of the atomics/barriers for read-only accesses:
>
>

So let's say I'm at least(!) one step behind on all this stuff - how do
you ensure the read thread sees up-to-date data without a read barrier?

Tony

Joe Seigh

Jul 12, 2006, 6:11:37 PM

Dependent load ordering, for the most part. You get it for free on most
platforms. Or, more accurately, you can't avoid it. The other part is
avoiding it in whatever you synchronize the read-access critical sections
with. vZOOM avoids it, if I understand it correctly. RCU avoids it.
RCU+SMR avoids it. And some GC implementations avoid it.

Chris Thomasson

Jul 12, 2006, 7:09:42 PM
<gottlo...@gmail.com> wrote in message
news:1152740217.6...@h48g2000cwc.googlegroups.com...

Dependent loads:

<pseudo>


#if ! defined (ArchAlpha)
#define dependant_load(x) (*x)
#else
dependant_load(x) {
    l = *x;
    rmb();
    return l;
}
#endif


void* vzAtomicLoadPtr_mbDepends(void* volatile* d) {
    return dependant_load(d);
}


Chris Thomasson

Jul 12, 2006, 7:19:41 PM
[...]

> void* vzAtomicLoadPtr_mbDepends(void* volatile* d) {
> return dependant_load(d);
> }

DOH. I should have omitted volatile:

void* vzAtomicLoadPtr_mbDepends(void** d) {
    return dependant_load(d);
}


Some C/C++ compilers actually inject full-blown acquire/release barriers on
volatile access...!


<volatile rant>

I think a Microsoft compiler for Itanium does this sort of nonsense...

Volatile should be defined as an optimization hint to the compiler ONLY!

It should have absolutely NO effect on any hardware ordering!

</volatile rant>

gottlo...@gmail.com

Jul 12, 2006, 9:02:09 PM

Chris Thomasson wrote:
> [...]
>
> > void* vzAtomicLoadPtr_mbDepends(void* volatile* d) {
> > return dependant_load(d);
> > }
>
> DOH. I should of omitted volatile:
>
> void* vzAtomicLoadPtr_mbDepends(void** d) {
> return dependant_load(d);
> }
>
>
> Some C/C++ compilers actually inject full-blown acquire/release barriers on
> volatile access...!
>

I think this is an option in xcode/gcc as well. I know they have an
option for thread safe function-local statics.

gottlo...@gmail.com

Jul 12, 2006, 9:07:06 PM

Chris Thomasson wrote:

> Dependant loads:
>
> <psuedo>
>
>
> #if ! defined (ArchAlpha)
> #define dependant_load(x) (*x)
> #else
> dependant_load(x) {
> l = *x;
> rmb();
> return l;
> }
> #endif
>
>
> void* vzAtomicLoadPtr_mbDepends(void* volatile* d) {
> return dependant_load(d);
> }

so for the typical case of one-time init / DCLP:

if (!initted)         // [1]
{
    Lock lock;
    init(foo);
}

// use foo:

foo->doSomething();   // [2]

you need (or get for free) a dependent load on the init flag at [1]?
and/or when accessing foo, at [2]?

I thought Itanium also didn't quite do this?

Tony

David Hopwood

Jul 13, 2006, 12:24:37 PM
Chris Thomasson wrote:
> [...]
>
>>void* vzAtomicLoadPtr_mbDepends(void* volatile* d) {
>> return dependant_load(d);
>>}
>
> DOH. I should of omitted volatile:
>
> void* vzAtomicLoadPtr_mbDepends(void** d) {
> return dependant_load(d);
> }

In that case, how do you know that the compiler isn't optimizing away the
call (which it should, assuming we're not on an Alpha machine), and then
reordering the load relative to things that it shouldn't be reordered
relative to?

Answer: you don't.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Chris Thomasson

Jul 13, 2006, 12:48:33 PM
"David Hopwood" <david.nosp...@blueyonder.co.uk> wrote in message
news:9tutg.61169$181....@fe3.news.blueyonder.co.uk...

Well, the call would ideally be an external assembled function:

http://groups.google.com/group/comp.programming.threads/msg/423df394a0370fa6


David Hopwood

Jul 13, 2006, 4:49:02 PM
Chris Thomasson wrote:

> "David Hopwood" <david.nosp...@blueyonder.co.uk> wrote:
>>Chris Thomasson wrote:
>>
>>>[...]
>>>
>>>>void* vzAtomicLoadPtr_mbDepends(void* volatile* d) {
>>>> return dependant_load(d);
>>>>}
>>>
>>>DOH. I should of omitted volatile:
>>>
>>>void* vzAtomicLoadPtr_mbDepends(void** d) {
>>> return dependant_load(d);
>>>}
>>
>>In that case, how do you know that the compiler isn't optimizing away the
>>call (which it should, assuming we're not on an Alpha machine), and then
>>reordering the load relative to things that it shouldn't be reordered
>>relative to?
>>
>>Answer: you don't.
>
> Well, the call would ideally be an external assembled function:
>
> http://groups.google.com/group/comp.programming.threads/msg/423df394a0370fa6

The problem with this is that you're still relying on the assumption that your
compiler does not perform certain kinds of valid optimization. This is already
false for some compilers (e.g. llvm-gcc), and will probably be false for future
main-line versions of gcc that will do link-time optimization.

<http://llvm.org>
<http://gcc.gnu.org/ml/gcc/2005-11/msg00888.html>

(Although the specific proposal for integrating LLVM into gcc has not been agreed
and seems to be at least temporarily stalled, there was consensus on the general
idea of supporting link-time inter-procedural optimizations in gcc.)


Using inline asm with a "clobbers memory" directive before and after the load
should solve this particular problem. However, I am skeptical of reliance on
dependent load ordering in general. I've never seen any rigorous definition of
it (as a property of a memory model, rather than in discussions of processor
optimizations), or any explicit statement in the architecture manuals of commonly
used processor lines saying that it is supported. In fact, I wonder if the idea
that there is an implicit commitment for future versions of these processors
to guarantee dependent load ordering is not just wishful thinking.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Joe Seigh

Jul 13, 2006, 7:06:40 PM
David Hopwood wrote:

> Chris Thomasson wrote:
>>Well, the call would ideally be an external assembled function:
>>
>>http://groups.google.com/group/comp.programming.threads/msg/423df394a0370fa6
>
>
> The problem with this is that you're still relying on the assumption that your
> compiler does not perform certain kinds of valid optimization. This is already
> false for some compilers (e.g. llvm-gcc), and will probably be false for future
> main-line versions of gcc that will do link-time optimization.
>
> <http://llvm.org>
> <http://gcc.gnu.org/ml/gcc/2005-11/msg00888.html>
>
> (Although the specific proposal for integrating LLVM into gcc has not been agreed
> and seems to be at least temporarily stalled, there was concensus on the general
> idea of supporting link-time inter-procedural optimizations in gcc.)
>

I suppose you could change gcc so as to break any threading api, except Posix
of course. But then you'd have to go into the witless protection program to
protect you from the wrath of Linus and all the other Linux kernel developers.
:)

David Hopwood

Jul 13, 2006, 8:40:25 PM

<http://gcc.gnu.org/onlinedocs/gcc-3.2.3/gcc/i386-and-x86-64-Options.html>
# These -m switches are supported in addition to the above on AMD x86-64
# processors in 64-bit environments. [...]
# -mcmodel=kernel
# Generate code for the kernel code model. The kernel runs in the
# negative 2 GB of the address space. This model has to be used for
# Linux kernel code.

<http://gcc.gnu.org/onlinedocs/gcc/S_002f390-and-zSeries-Options.html>
# [...] In order to build a linux kernel use -msoft-float.

<http://gcc.gnu.org/ml/gcc-patches/2004-01/msg01619.html>
# > People run more and more into problems with kernel modules on x86-64
# > and the new gcc because they don't specify the -mno-red-zone parameter.
# > With the old gcc 3.2 they got away, but with the new one it breaks.

<http://www.faqs.org/docs/kernel/x204.html>
# Kernel modules need to be compiled with certain gcc options to make
# them work. [...]
# * -O2: The kernel makes extensive use of inline functions, so modules
# must be compiled with the optimization flag turned on. Without
# optimization, some of the assembler macros calls will be mistaken
# by the compiler for function calls. This will cause loading the
# module to fail, since insmod won't find those functions in the kernel.

<http://www.luv.asn.au/overheads/compile.html>
# From /usr/doc/gcc/README.Debian.gz in gcc-2.95.2-0pre2:

[an old version, but I found it amusing considering the previous quote:]

# As of gcc-2.95, optimisation at level 2 (-O2) and higher includes an
# optimisation which breaks C code that does not adhere to the C standard.
# Such code occurs in the Linux kernel, and probably other places.


Anyway, the point is that the Linux kernel is a law unto itself, which may
or may not work with any particular combination of gcc version and options
(it certainly won't work with other compilers, except maybe icc with a patch).

If you are writing a general purpose library or application, you want it to
be less compiler-version/option-dependent than that; preferably, you want
something that works for a documented reason, rather than just by coincidence
of the compiler not doing some optimizations. As I said earlier,

Joe Seigh

Jul 14, 2006, 6:15:03 AM
David Hopwood wrote:
...

> Anyway, the point is that the Linux kernel is a law unto itself, which may
> or may not work with any particular combination of gcc version and options
> (it certainly won't work with other compilers, except maybe icc with a patch).
>
> If you are writing a general purpose library or application, you want it to
> be less compiler-version/option-dependent than that; preferably, you want
> something that works for a documented reason, rather than just by coincidence
> of the compiler not doing some optimizations. As I said earlier,
>
> Using inline asm with a "clobbers memory" directive before and after the load
> should solve this particular problem. However, I am skeptical of reliance on
> dependent load ordering in general. I've never seen any rigorous definition of
> it (as a property of a memory model, rather than in discussions of processor
> optimizations), or any explicit statement in the architecture manuals of commonly
> used processor lines saying that it is supported. In fact, I wonder if the idea
> that there is an implicit commitment for future versions of these processors
> to guarantee dependent load ordering is not just wishful thinking.
>

Hypothetically, anything is possible. So gcc could do something extremely stupid.
Especially since the current C standard does not recognise threads. However,
C++0x is discussing thread support, so unless there is a complete break between
C and C++, something will have to be worked out.

Worst case is that gcc eliminates itself as a research tool for multi-threading, and
putting support for any new threading features in C will become as tedious
as the JSR process, except that the place where the new features came from will
be a lot smaller. Think about it. The prototypes for the stuff that was put in JSR-166
came from somewhere. Some of it may have been assembler, but I doubt all of it was.

The commercial compiler vendors would be happy with gcc doing that. Intel provides
a library that uses lock-free algorithms. That you'd have to buy a compiler
from them as well would make their day.

David Hopwood

Jul 14, 2006, 11:29:07 AM
Joe Seigh wrote:
> David Hopwood wrote:
> ...
>
>> Anyway, the point is that the Linux kernel is a law unto itself, which
>> may or may not work with any particular combination of gcc version and
>> options (it certainly won't work with other compilers, except maybe icc
>> with a patch).
>>
>> If you are writing a general purpose library or application, you want
>> it to be less compiler-version/option-dependent than that; preferably,
>> you want something that works for a documented reason, rather than just
>> by coincidence of the compiler not doing some optimizations. As I said
>> earlier,
>>
>> Using inline asm with a "clobbers memory" directive before and after
>> the load should solve this particular problem. However, I am skeptical
>> of reliance on dependent load ordering in general. I've never seen any
>> rigorous definition of it (as a property of a memory model, rather than
>> in discussions of processor optimizations), or any explicit statement
>> in the architecture manuals of commonly used processor lines saying
>> that it is supported. In fact, I wonder if the idea that there is an
>> implicit commitment for future versions of these processors to guarantee
>> dependent load ordering is not just wishful thinking.
>
> Hypothetically, anything is possible. So gcc could do something
> extremely stupid.

Link-time optimization is *not* extremely stupid. Relying on it not being
done is arguably stupid, when there are alternatives that rely only on
documented (if architecture-dependent) features.

> Especially since the current C standard does not recognise threads.
> However C++0x is discussing threaded support so unless there is a complete
> break between C and C++, something will have to be worked out.

I have no confidence in the C++0x committee to produce anything usable.
I will be pleasantly surprised if they don't make things worse.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Joe Seigh

Jul 15, 2006, 10:58:55 AM
David Hopwood wrote:

> Joe Seigh wrote:
>>
>>Hypothetically, anything is possible. So gcc could do something
>>extremely stupid.
>
>
> Link-time optimization is *not* extremely stupid. Relying on it not being
> done is arguably stupid, when there are alternatives that rely only on
> documented (if architecture-dependent) features.

Not that. I was referring to some gcc developer completely destroying
gcc's ability to tolerate multi-threading, since the standards would
allow them to do that.

>
>
>>Especially since the current C standard does not recognise threads.
>>However C++0x is discussing threaded support so unless there is a complete
>>break between C and C++, something will have to be worked out.
>
>
> I have no confidence in the C++0x committee to produce anything usable.
> I will be pleasantly surprised if they don't make things worse.
>

Hell freezes over happens-before cpp-threads figures out how to
define multi-threading semantics.

To slightly change the topic from my alleged claim to 100% accurately
predict the future with respect to what will or will not be in C/C++,
let's assume for the sake of argument that C/C++ will no longer be around.
The question becomes how would you develop and test new concurrent
programming facilities and constructs? With C/C++, you could
prototype and compare against alternative techniques. Without C/C++,
you'd have to add support for it in a language and modify the compiler
for that language. And you'd test it against what? Not having
the facility? You could test it against another language, but that
would be an apples-and-oranges comparison due to the number of
overall differences between the languages. All you'd get would
be a flame war. There must be some great flame wars out there
on Java vs. Erlang vs. Occam. None of which conclusively proved
anything.

Chris Thomasson

Jul 15, 2006, 4:32:55 PM
"David Hopwood" <david.nosp...@blueyonder.co.uk> wrote in message
news:2lytg.61623$181....@fe3.news.blueyonder.co.uk...

http://groups.google.com/group/linux.kernel/msg/e10f88ada7cd04f4

Humm...


> Using inline asm with a "clobbers memory" directive before and after the
> load
> should solve this particular problem. However, I am skeptical of reliance
> on
> dependent load ordering in general. I've never seen any rigorous
> definition of
> it (as a property of a memory model, rather than in discussions of
> processor
> optimizations), or any explicit statement in the architecture manuals of
> commonly
> used processor lines saying that it is supported.

https://coolthreads.dev.java.net/servlets/ProjectForumMessageView?forumID=1797&messageID=11068


- I received a request from SUN to talk to the "press" about the CoolThreads
contest, and basically vZOOM techniques'/coolthreads technology/"uses" in
general...


I actually "may have a chance" to bring all of this "stuff" up, and perhaps
get it published in a "respectable medium"...


> In fact, I wonder if the idea
> that there is an implicit commitment for future versions of these
> processors
> to guarantee dependent load ordering is not just wishful thinking.

I will address this "issue" in detail...


Chris Thomasson

Jul 15, 2006, 8:41:08 PM
"Chris Thomasson" <cri...@comcast.net> wrote in message
news:PPKdnU-XmNaC0iTZ...@comcast.com...

> "David Hopwood" <david.nosp...@blueyonder.co.uk> wrote in message
> news:2lytg.61623$181....@fe3.news.blueyonder.co.uk...
>> Chris Thomasson wrote:
>>> "David Hopwood" <david.nosp...@blueyonder.co.uk> wrote:
>>>>Chris Thomasson wrote:
>>>>
>>>>>[...]
>>>>>
[...]

>> In fact, I wonder if the idea
>> that there is an implicit commitment for future versions of these
>> processors
>> to guarantee dependent load ordering is not just wishful thinking.

An atomic API abstraction should provide the means for loads with #LoadLoad
barrier semantics. So, if dependent loads are no longer supported, we switch
to the #LoadLoad version.... Simple...

;)


David Hopwood

Jul 16, 2006, 11:16:35 AM
[c.l.j.programmer removed from Newsgroups]

Chris Thomasson wrote:


> "Chris Thomasson" <cri...@comcast.net> wrote:
>> "David Hopwood" <david.nosp...@blueyonder.co.uk> wrote:
>
> [...]
>
>>> In fact, I wonder if the idea that there is an implicit commitment
>>> for future versions of these processors to guarantee dependent load
>>> ordering is not just wishful thinking.
>
> An atomic api abstraction should provide the means for loads with #LoadLoad
> barrier semantics. So, if dependant loads are no longer supported, we switch
> to the #LoadLoad version.... Simple...

In that case why should the API user not specify #LoadLoad (which has more
clearly specified semantics) to start with?

Distinguishing dependent load barriers from #LoadLoad only makes sense if there
are platforms on which a dependent load barrier can be implemented as a nop,
but a #LoadLoad cannot. Are there any such platforms?

--
David Hopwood <david.nosp...@blueyonder.co.uk>

David Hopwood

Jul 16, 2006, 11:19:22 AM
David Hopwood wrote:
> Chris Thomasson wrote:
>>"Chris Thomasson" <cri...@comcast.net> wrote:
>>>"David Hopwood" <david.nosp...@blueyonder.co.uk> wrote:
>>
>>[...]
>>
>>>>In fact, I wonder if the idea that there is an implicit commitment
>>>>for future versions of these processors to guarantee dependent load
>>>>ordering is not just wishful thinking.
>>
>>An atomic api abstraction should provide the means for loads with #LoadLoad
>>barrier semantics. So, if dependant loads are no longer supported, we switch
>>to the #LoadLoad version.... Simple...
>
> In that case why should the API user not specify #LoadLoad (which has more
> clearly specified semantics) to start with?
>
> Distinguishing dependent load barriers from #LoadLoad only makes sense if there
> are platforms on which a dependent load barrier can be implemented as a nop,
> but a #LoadLoad cannot.

For ordinary loads from normally cached memory, I mean.

Joe Seigh

Jul 16, 2006, 12:25:44 PM

Almost all of them that need it. Some early Alpha processors were the exception.
Dependent load ordering is heavily used by RCU in the Linux kernel,
since using #LoadLoad would nullify any advantage RCU had. The
big problem with dependent loads is that they're not part of the
formal memory model, so they're a little harder to verify. The good
news is that since it's used by Linux, any hardware vendor that breaks
it makes themselves a serious contender for the corporate Darwin
awards. The bad news is that since test-and-set spinlocks are the
most advanced form of synchronization known to hardware engineers,
they probably won't have a clue how to improve memory performance
without breaking RCU.

Chris Thomasson

Jul 16, 2006, 3:42:03 PM
"Joe Seigh" <jsei...@xemaps.com> wrote in message
news:SdCdnX-o5cvg-ifZ...@comcast.com...

> David Hopwood wrote:
>> David Hopwood wrote:
>>
>>>Chris Thomasson wrote:
>>>In that case why should the API user not specify #LoadLoad (which has
>>>more
>>>clearly specified semantics) to start with?
>>>
>>>Distinguishing dependent load barriers from #LoadLoad only makes sense if
>>>there
>>>are platforms on which a dependent load barrier can be implemented as a
>>>nop,
>>>but a #LoadLoad cannot.
>>
>>
>> For ordinary loads from normally cached memory, I mean.
>>
>>
>>>Are there any such platforms?
>>
>
> Almost all of them that need it. Some early alpha processors were the
> exception.
> Dependent load ordering is heavily used by RCU in the Linux kernel
> since using #LoadLoad would nullify any advantage RCU had.

We could possibly amortize the cost of #LoadLoad /w dependency hint by
simply "batching up" the writes 'and' reads.... Something like this:


static node_t *node_batch = 0;


// synchronized writers
1. setup/init "batch" of nodes
3. #StoreStore
2. node_batch = nodes


// lock-free readers
1. batch = node_batch;
2. /* #LoadLoad /w dependency-hint */
3. if (batch) {
       do_something(batch);
   }


Chris Thomasson

Jul 16, 2006, 4:54:46 PM
<gottlo...@gmail.com> wrote in message
news:1152752826.7...@s13g2000cwa.googlegroups.com...

>
> Chris Thomasson wrote:
>
>> Dependant loads:
>>
>> <psuedo>
>>
>>
>> #if ! defined (ArchAlpha)
>> #define dependant_load(x) (*x)
>> #else
>> dependant_load(x) {
>> l = *x;
>> rmb();
>> return l;
>> }
>> #endif
>>
>>
>> void* vzAtomicLoadPtr_mbDepends(void* volatile* d) {
>> return dependant_load(d);
>> }
>
> so for the typical case of one time init / DCLP:
>
> if (!initted) // [1]
> {
> Lock lock;
> init(foo);
> }
>
> // use foo:
>
> foo->doSomething(); // [2]
>
> you need (or get for free) a dependant load on the init flag at [1]?

No. You need #LoadLoad after loading the init flag, and a #StoreStore before
setting the flag. You have introduced a level of indirection via the "init
flag", so you need #LoadLoad... Dependent ordering is not strong enough
here...


> and/or when accessing foo, at [2]?

Dependent ordering applies to the load of foo, but not the load of the init
flag...


Chris Thomasson

Jul 16, 2006, 4:58:21 PM
> // synchronized writers
> 1. setup/init "batch" of nodes
> 3. #StoreStore
> 2. node_batch = nodes
>

1. setup/init "batch" of nodes

2. #StoreStore
3. node_batch = nodes

Had to fix the numbering so nobody gets confused...


Joe Seigh

Jul 16, 2006, 5:17:00 PM
Chris Thomasson wrote:
> "Joe Seigh" <jsei...@xemaps.com> wrote in message
> news:SdCdnX-o5cvg-ifZ...@comcast.com...
>>>
>>Almost all of them that need it. Some early alpha processors were the
>>exception.
>>Dependent load ordering is heavily used by RCU in the Linux kernel
>>since using #LoadLoad would nullify any advantage RCU had.
>
>
> We could possibly amortize the cost of #LoadLoad /w dependency hint by
> simply "batching up" the writes 'and' reads.... Something like this:
>
>
> static node_t *node_batch = 0;
>
>
> // synchronized writers
> 1. setup/init "batch" of nodes
> 3. #StoreStore
> 2. node_batch = nodes
>
>
> // lock-free readers
> 1. batch = node_batch;
> 2. /* #LoadLoad /w dependency-hint */
> 3. if (batch) {
> do_something(batch);
> }
>
>

See patent application 20020194436, "Software implementation of synchronous memory barriers",
by Paul McKenney. I found it after the discussion of RCU+SMR on LKML, which explained
why Paul was asking about signaling at the time. I expect some continuation patents to
show up at some point.

And speaking of patent applications, check out
20060130061, "Use of rollback RCU with read-side modifications to RCU-protected data structures",
by Paul McKenney, Robert Bauer, and Paul Russel, who is "Rusty" Russel I believe. Anyway,
deja vu or what?

Joe Seigh

Jul 16, 2006, 7:11:27 PM

> And speaking of patent applications check out
> 20060130061 Use of rollback RCU with read-side modifications to
> RCU-protected data structures
> by Paul McKenney, Robert Bauer, and Paul Russel who is "Rusty" Russel I
> believe. Anyway,
> deja vu or what?
>
>
More on what starting here
http://groups.google.com/groups?selm=40BE14E5...@xemaps.com

David Hopwood

Jul 16, 2006, 8:39:09 PM
Joe Seigh wrote:
> David Hopwood wrote:
>> David Hopwood wrote:
>>> Chris Thomasson wrote:
>>>
>>> In that case why should the API user not specify #LoadLoad (which has
>>> more clearly specified semantics) to start with?
>>>
>>> Distinguishing dependent load barriers from #LoadLoad only makes
>>> sense if there are platforms on which a dependent load barrier can be
>>> implemented as a nop, but a #LoadLoad cannot.
>>
>> For ordinary loads from normally cached memory, I mean.
>>
>>> Are there any such platforms?
>
> Almost all of them that need it. Some early alpha processors were the
> exception.

I think we must be misunderstanding each other.

x86 with "processor ordering" is not an example of such a platform [*].
Same for x86-64, as far as I can tell.

SPARC TSO is not an example because #LoadLoad is a nop (the arch manuals
say this explicitly).

Alpha is not an example, because neither a dependent load nor a #LoadLoad
can be implemented as a nop.

Itanic apparently *is* an example of such a platform. See the DF:RAR rule in
section 3.3.4 of <http://www.intel.com/design/itanium/downloads/25142901.pdf>,
and also Table 9 (ignore the misleading caption of the latter).

I don't understand POWER's memory model well enough to answer this question
for POWER. (The memory model sections of its arch manuals are even more
poorly written than is usual for memory model descriptions, IMHO.)

> Dependent load ordering is heavily used by RCU in the Linux kernel
> since using #LoadLoad would nullify any advantage RCU had.

It wouldn't on x86[-64] and SPARC.

> The big problem with dependent loads are that it's not part of the
> formal memory model so it's a little harder to verify.

If it's not part of the formal memory model, how is it possible to verify
it at all?


[*] In <http://groups.google.com/group/comp.arch/msg/7200ec152c8cca0c>,
Andy Glew wrote:

Instruction execution at a single Pi proceeds as if one instruction at
a time were executed, with some interleaving of the external stores
Pi.Xk. I.e. from the point of view of the local processor,
it's loads Pi.Lj are performed in order, and in order with the local
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
stores Pi.Sj. More specifically, there can be constructed an ordering
Pi.Mx which is an interleaving of Pi.Ij (and hence Pi.Lj and Pi.Sj)
and Pi.Xk, and local processor execution is consistent with such an
ordering Pi.Mx.

but note (IA-32 Intel Architecture Software Developer's Manual Volume 3A:
System Programming Guide, Part 1 section 7.2.4)
<http://download.intel.com/design/Pentium4/manuals/25366820.pdf>:

"Despite the fact that Pentium 4, Intel Xeon, and P6 family processors
support processor ordering, Intel does not guarantee that future
processors will support this model."

(Despite the /Pentium4/ in the URL, this is the up-to-date manual for
'Core' chips according to
<http://www.intel.com/design/mobile/core/duodocumentation.htm>, so at
least they haven't broken "processor ordering" yet.)

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Joe Seigh

Jul 16, 2006, 9:31:25 PM
David Hopwood wrote:
> Joe Seigh wrote:
>
>
>>The big problem with dependent loads are that it's not part of the
>>formal memory model so it's a little harder to verify.
>
>
> If it's not part of the formal memory model, how is it possible to verify
> it at all?

Most of the info I got was from searching through the early RCU
discussions in the LKML archives.

David Hopwood

Jul 16, 2006, 9:38:14 PM
Joe Seigh wrote:
> David Hopwood wrote:
>> Joe Seigh wrote:
>>
>>> The big problem with dependent loads are that it's not part of the
>>> formal memory model so it's a little harder to verify.
>>
>> If it's not part of the formal memory model, how is it possible to verify
>> it at all?
>
> Most of the info I got was from searching through the early RCU
> discussions in the LKML archives.

OK, so it's not possible to verify.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Joe Seigh

unread,
Jul 17, 2006, 6:54:57 AM7/17/06
to
David Hopwood wrote:
> Joe Seigh wrote:
>
>>David Hopwood wrote:
>>
>>>Joe Seigh wrote:
>>>
>>>
>>>>The big problem with dependent loads is that it's not part of the
>>>>formal memory model, so it's a little harder to verify.
>>>
>>>If it's not part of the formal memory model, how is it possible to verify
>>>it at all?
>>
>>Most of the info I got was from searching through the early RCU
>>discussions in the LKML archives.
>
>
> OK, so it's not possible to verify.
>
Not without a lot of research. The LKML discussions didn't
document their research. None of the Linux synchronization
primitives is documented, e.g. atomic_read().

gottlo...@gmail.com

unread,
Jul 17, 2006, 10:14:07 PM7/17/06
to

Chris Thomasson wrote:
> >
> > so for the typical case of one time init / DCLP:
> >
> > if (!initted) // [1]
> > {
> > Lock lock;
> > init(foo);
> > }
> >
> > // use foo:
> >
> > foo->doSomething(); // [2]
> >
> > you need (or get for free) a dependant load on the init flag at [1]?
>
> No. You need #LoadLoad after loading the init flag, and a #StoreStore before
> setting the flag. You have introduced a level of indirection via. the "init
> flag", so you need #LoadLoad... Dependent ordering is not strong enough
> here...
>

OK, that's what I thought you meant.


So what if I change the code to make init an int *, i.e.

if (*init)
// etc

?

So then init is loaded correctly, but foo isn't (yet) until foo->...?


And I assume that what is really being loaded is an entire cache line?
What if foo is larger than a cache line?


And there is, in certain high load situations, a noticeable difference
between the dependent load vs read barrier?


>
> > and/or when accessing foo, at [2]?
>
> Dependent ordering applies to the load of foo, but not the load of the init
> flag...


Thanks.
Tony

Joe Seigh

unread,
Jul 18, 2006, 6:37:58 AM7/18/06
to
gottlo...@gmail.com wrote:
>
>
> OK, that's what I thought you meant.
>
>
> So what if I change the code to make init an int *, i.e.
>
> if (*init)
> // etc
>
> ?

No. foo may not be seen as initialized. You have to test foo
unless init points to foo, e.g. init->foo->...

>
> So then init is loaded correctly, but foo isn't (yet) until foo->...?
>
>
> And I assume that what is really being loaded is an entire cache line?
> What if foo is larger than a cache line?

foo has to be atomic, so it has to fit in a cache line and not straddle
a cache-line boundary.


>
>
> And there is, in certain high load situations, a noticeable difference
> between the dependent load vs read barrier?

Read barriers would likely have a higher impact no matter what. A
higher load might exacerbate that, I suppose.
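
A sketch of the distinction, again in illustrative C++11-style atomics
(Foo, initted, and gFoo are made-up names): a separate init flag needs a
real #LoadLoad (acquire) because the load of foo is not data-dependent on
the load of the flag, whereas publishing a pointer to the fully
constructed object lets the dependency chain do the ordering.

#include <atomic>

struct Foo { int x; void doSomething() {} };

std::atomic<int>  initted{0};    // separate init flag
Foo*              foo = nullptr; // set before initted, under a lock

std::atomic<Foo*> gFoo{nullptr}; // pointer-published alternative

// Flag variant: [2] is not data-dependent on [1], so without an acquire
// at [1] a weakly ordered CPU may see initted == 1 and still read a
// stale, unconstructed foo.
void use_flag() {
    if (initted.load(std::memory_order_acquire)) { // [1] #LoadLoad needed
        foo->doSomething();                        // [2]
    }
}

// Pointer variant: every access goes through the loaded pointer, so the
// reads are data-dependent on it; dependent-load ordering (spelled
// memory_order_consume here) suffices, with no full barrier.
void use_pointer() {
    Foo* f = gFoo.load(std::memory_order_consume);
    if (f) f->doSomething();
}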

Chris Thomasson

unread,
Jul 18, 2006, 4:31:46 PM7/18/06
to
<gottlo...@gmail.com> wrote in message
news:1153188847.4...@35g2000cwc.googlegroups.com...
>
> Chris Thomasson wrote:
[...]


Compiler ordering aside:

<pseudo-code>

template<typename T>
class DCLP {
T* mPtr; // init to 0

public:
T* Load() {
T* local = mPtr;
// #LoadLoad /w Dependency-Hint > Load Barrier
if ( ! local ) {
lock_guard Lock(SomeMutex);
local = new T;
#LoadStore|#StoreStore > Release Barrier
mPtr = local;
}
return local;
}
};


FWIW, DCLP functionality should be coded in 100% assembly and "externally"
assembled...


Chris Thomasson

unread,
Jul 18, 2006, 4:35:21 PM7/18/06
to
WHOA! I forgot the actual double check!!!


> template<typename T>
> class DCLP {
> T* mPtr; // init to 0
>
> public:
> T* Load() {
> T* local = mPtr;
> // #LoadLoad /w Dependency-Hint > Load Barrier
> if ( ! local ) {
> lock_guard Lock(SomeMutex);

----------

local = mPtr;
if ( local ) { return local; }

----------


> local = new T;
> #LoadStore|#StoreStore > Release Barrier
> mPtr = local;
> }
> return local;
> }
> };

Sorry!
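
Putting the corrected version together as a self-contained sketch, using
C++11 atomics and std::mutex (both later than this thread) in place of
the pseudo-barriers; treat it as illustrative, not authoritative:

#include <atomic>
#include <mutex>

template<typename T>
class DCLP {
    std::atomic<T*> mPtr{nullptr};
    std::mutex mMutex;

public:
    T* Load() {
        // First check: dependent load (#LoadLoad w/ dependency hint).
        T* local = mPtr.load(std::memory_order_consume);
        if (!local) {
            std::lock_guard<std::mutex> lock(mMutex);
            // The actual double check, under the lock.
            local = mPtr.load(std::memory_order_relaxed);
            if (!local) {
                local = new T;
                // #LoadStore|#StoreStore: publish only after T is fully
                // constructed (release barrier).
                mPtr.store(local, std::memory_order_release);
            }
        }
        return local;
    }
};

With this, a caller just declares a static DCLP<Widget> and calls
Load(); the consume load keeps the fast path barrier-free on
dependent-load hardware.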


Chris Thomasson

unread,
Jul 21, 2006, 4:52:17 PM7/21/06
to
"Joe Seigh" <jsei...@xemaps.com> wrote in message
news:-qydnQvEJ_ZfNifZ...@comcast.com...

Interesting... I just briefly scanned through the abstract; it sounds like
classic RCU. It seems to tie them to a kernel-based implementation. They
explicitly mention interrupts and influencing CPU behavior; per-cpu data...
I was messing around with signals to achieve a somewhat similar effect:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/9545d3e17806ccfe/0c892c8aac6a13e4?lnk=gst&q=pthread_kill+semantics&rnum=1#0c892c8aac6a13e4

This is a "simple way" to implement signal-safe per-thread data... Also, I
don't remember it suffering too much from the real signal problem you
described. I guess that is due to the fact that my testing applications seem
to naturally have a syscall pattern that is "frequent enough" to allow for
the signals to go through... Humm...

Anyway, I think the batching of reads and writes should improve "overall"
performance for RCU running on an Alpha... Humm...


> And speaking of patent applications check out
> 20060130061 Use of rollback RCU with read-side modifications to
> RCU-protected data structures
> by Paul McKenney, Robert Bauer, and Paul Russel who is "Rusty" Russel I
> believe. Anyway,
> deja vu or what?

LOL!

However, I wonder if this technique "may" suffer from a sort of livelock
situation caused by frequent preemptions; an actual read through a massive
linked data-structure "may" have to be restarted quite a few times...
Humm...

Kind of sounds like transactional memory... A preemption seems basically
"equal" to a transactional commit call...


Joe Seigh

unread,
Jul 21, 2006, 6:44:09 PM7/21/06
to
Chris Thomasson wrote:
> "Joe Seigh" <jsei...@xemaps.com> wrote in message
> news:-qydnQvEJ_ZfNifZ...@comcast.com...

>
>>And speaking of patent applications check out
>>20060130061 Use of rollback RCU with read-side modifications to
>>RCU-protected data structures
>>by Paul McKenney, Robert Bauer, and Paul Russel who is "Rusty" Russel I
>>believe. Anyway,
>>deja vu or what?
>
>
> LOL!
>
> However, I wonder if this technique "may" suffer from a sort of livelock
> situation caused by frequent preemptions; an actual read through a massive
> linked data-structure "may" have to be restarted quite a few times...
> Humm...
>
> Kind of sounds like transactional memory... A preemption seems basically
> "equal" to a transactional commit call...
>
>
You're right, this scheme does assume a time slice large enough to afford
forward progress. The only preemptions that would occur would be involuntary
ones, since voluntary preemptions are not allowed in an RCU critical region.
Involuntary preemptions are relatively rare except in CPU-intensive code, and
you should keep the RCU critical regions short anyway.

This is the stuff I discussed with Paul, Rusty, and either Tridge or Jeremy.
I forget which of the latter two was on that conference call. See the entire
thread starting here
http://groups.google.com/groups?selm=40BE14E5...@xemaps.com

I've been trying to figure out what is going on here: IBM is trying to
get a defensive patent in here. If somebody challenges the patent,
all that happens is that what is in the public domain gets established
and IBM keeps the remaining claims. If they didn't do this and it turned
out that the public domain part is much narrower than was thought, somebody
else could get their own patent in there and cause problems.

Anyway, it's all obsoleted by RCU+SMR so you can expect at least one defensive
patent there. It seems to take two years for the uspto to publish patent
applications, so we have another year to wait.

Alexander Terekhov

unread,
Jul 22, 2006, 5:26:28 AM7/22/06
to

Joe Seigh wrote:
[...]

> I've been trying to figure out what is going on here: IBM is trying to

C'mon, LTC guys are just making some cash. IBM's Invention Achievement
Plateau Awards and all that.

regards,
alexander.

Joe Seigh

unread,
Jul 22, 2006, 12:04:43 PM7/22/06
to

Nice work if you can get it. But don't these guys realize all this
stuff belongs to SCO?

Chris Thomasson

unread,
Jul 23, 2006, 8:12:21 PM7/23/06
to
"Joe Seigh" <jsei...@xemaps.com> wrote in message
news:duednTNO36c9ylzZ...@comcast.com...

> Chris Thomasson wrote:
>> "Joe Seigh" <jsei...@xemaps.com> wrote in message
>> news:-qydnQvEJ_ZfNifZ...@comcast.com...
>>
>>>And speaking of patent applications check out
>>>20060130061 Use of rollback RCU with read-side modifications to
>>>RCU-protected data structures
>>>by Paul McKenney, Robert Bauer, and Paul Russel who is "Rusty" Russel I
>>>believe. Anyway,
>>>deja vu or what?
>>
>>
>> LOL!
>>
>> However, I wonder if this technique "may" suffer from a sort of livelock
>> situation caused by frequent preemptions; an actual read through a
>> massive linked data-structure "may" have to be restarted quite a few
>> times... Humm...
>>
>> Kind of sounds like transactional memory... A preemption seems basically
>> "equal" to a transactional commit call...
> You're right, this scheme does assume a time slice large enough to afford
> forward progress. The only preemptions that would occur would be involuntary
> ones, since voluntary preemptions are not allowed in an RCU critical region.
> Involuntary preemptions are relatively rare except in CPU-intensive code, and
> you should keep the RCU critical regions short anyway.

This is one of the places where RCU and vZOOM part ways... Voluntary
preemptions for read-side regions are legal in vZOOM. Persistent references
are allowed to exist over multiple sync epochs... So, the read-side critical
regions can actually be fairly large...


> This is the stuff I discussed with Paul, Rusty, and either Tridge or
> Jeremy.
> I forget which of the latter two was on that conference call. See the
> entire
> thread starting here
> http://groups.google.com/groups?selm=40BE14E5...@xemaps.com
>
> I've been trying to figure out what is going on here: IBM is trying to
> get a defensive patent in here. If somebody challenges the patent,
> all that happens is that what is in the public domain gets established
> and IBM keeps the remaining claims. If they didn't do this and it turned
> out that the public domain part is much narrower than was thought, somebody
> else could get their own patent in there and cause problems.

I made sure to disassociate my collector implementation(s) from RCU and/or
SMR... No point in challenging IBM... They patent everything, including the
kitchen sink, and the design for its drain, and every other little part in
between point A and point Anywhere...

;)


> Anyway, it's all obsoleted by RCU+SMR so you can expect at least one
> defensive
> patent there. It seems to take two years for the uspto to publish patent
> applications, so we have another year to wait.

Yup...


Alexander Terekhov

unread,
Jul 24, 2006, 6:59:59 AM7/24/06
to

Joe Seigh wrote:
>
> Alexander Terekhov wrote:
> > Joe Seigh wrote:
> > [...]
> >
> >>I've been trying to figure out what is going on here: IBM is trying to
> >
> >
> > C'mon, LTC guys are just making some cash. IBM's Invention Achievement
> > Plateau Awards and all that.
> >
>
> Nice work if you can get it. But don't these guys realize all this
> stuff belongs to SCO?

Under SCO's theory, IBM owns it and merely can't disclose it (due to
nonsensical reading of AT&T contracts, ignoring side letters, etc.).

And since patenting requires disclosure... IBM just can't patent it
without owing SCO a petabuck or two in contractual damages
(copyright claim arising from subsequent use of purportedly SCO's
copyrighted material after license termination for breach aside for
a moment).

regards,
alexander.
