Why no double wide compare and swap on Sparc?


Joe Seigh

Jun 12, 2006, 1:29:39 PM
Of all the mainstream architectures, Sparc seems to be the
only one that doesn't have double wide compare and swap or
ll/sc which provides equivalent functionality as far as
the more common synchronization algorithms are concerned.

Even AMD, which shipped ia32-64 without it at first, later added
one (cmpxchg16b).

What's with the Sparc architects? I mean, Sun has a whole
group in Sun Research working on synchronization algorithms,
most of which use atomic operations not present on the Sparc.
There's a bit of a disconnect here I think.

--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.

Chris Thomasson

Jun 13, 2006, 5:06:21 PM
"Joe Seigh" <jsei...@xemaps.com> wrote in message
news:adudnTg-VNF_PhDZ...@comcast.com...

> Of all the mainstream architectures, Sparc seems to be the
> only one that doesn't have double wide compare and swap or
> ll/sc which provides equivalent functionality as far as
> the more common synchronization algorithms are concerned.
>
> Even AMD, which started out ia32-64 without it, added one
> (cmpxchg16b) in.
>
> What's with the Sparc architects?

IMHO, the major reason 32-bit architectures support DWCAS was to
be "64-bit ready"; DWCAS becomes a normal CAS on 64-bit systems. So,
if your 32-bit applications used DWCAS, then they can run on a 64-bit
arch without changes... Of course there is a caveat...

A 32-bit application cannot use DWCAS to manipulate a pointer along with
another contiguous variable and still port cleanly to 64-bit...

struct dosent_work_on_64_bit_s {
    void *a;
    whatever b;
};

IMO, the hardware architects did NOT have lock-free algorithms in mind when
they decided to put DWCAS in their 32-bit instruction sets. The fact that
128-bit CAS is not reliably supported seems to support my opinion...


> I mean, Sun has a whole
> group in Sun Research working on synchronization algorithms,
> most of which use atomic operations not present on the Sparc.

Yes. I particularly enjoyed reading through their lock-free reference
counting patent teachings:

http://groups.google.com/group/comp.programming.threads/msg/9d410f7e02201aa1

:)


> There's a bit of a disconnect here I think.

Perhaps they are designing the lock-free algorithms for an upcoming processor
design, maybe one that has hardware transactional memory... That could be
the reason they designed KCSS:

http://groups.google.com/group/comp.programming.threads/msg/390d46204899b954

Humm...


Joe Seigh

Jun 13, 2006, 6:24:19 PM
Chris Thomasson wrote:
> "Joe Seigh" <jsei...@xemaps.com> wrote
>>What's with the Sparc architects?
>
>
> IMHO, I believe the major reason 32-bit architectures support DWCAS was to
> be "64-bit ready"; DWCAS gets converted to normal CAS on 64-bit systems. So,
> if your 32-bit applications used DWCAS, then they can be run 64-bit arch
> without changes... Of course there is a caveat...
>
> A 32-bit application cannot use DWCAS to manipulate a pointer along with
> another contiguous variable...
>
> struct dosent_work_on_64_bit_s {
> void *a;
> whatever b;
> };
>
> IMO, the hardware architects' did NOT have lock-free algorithms in mind when
> they decided to put DWCAS in their 32-bit instruction sets. The fact that
> 128-bit CAS is not reliably supported seems to support my opinion...
>
IBM S370 had double wide compare and swap long before 64 bit support
became an issue.

>
>
>
>>There's a bit of a disconnect here I think.
>
>
> Perhaps they are designing the lock-free algorithms for a upcoming processor
> design. Maybe one that has hardware transactional memory... That could be
> the reason they designed KCSS:

They've done a bit on STM (software transactional memory). If they did
come out with hardware-based transactional memory, it would be after the
fact of 64-bit Sparc and wouldn't be generally available. So it would
have to be hidden behind some system or library API with alternate
implementations on non-TM platforms. That would limit its applicability.
For example, you couldn't have atomically thread-safe refcounted smart
pointers. Well, not without RCU and/or SMR. If they go with STM to
complement hw TM, it's only going to work at a lower contention level
if the STM algorithm is only obstruction-free.

It's a big problem. You can design efficient synchronization mechanisms
that are portable if you ignore Sparc. If you want portability to
Sparc you have to sacrifice performance or functionality.

If Sun has a 64 bit JVM implementation, I wonder what they do for
the AtomicMarkedReference and AtomicStampedReference implementations
for non Sparc platforms. Cripple them so they perform as poorly as
the Sparc implementations?

Andy Glew

Jun 13, 2006, 6:48:17 PM
> IMO, the hardware architects' did NOT have lock-free algorithms in mind when
> they decided to put DWCAS in their 32-bit instruction sets. The fact that
> 128-bit CAS is not reliably supported seems to support my opinion...

Actually, I am pretty sure that some, although maybe not all,
architects had lock-free algorithms in mind when they put what you call
DWCAS in the 16- and 32-bit instruction sets.

Perhaps more accurately described as "compare and swap 2 adjacent
pointer sized memory locations".

Myself, I received the DCAS lock-free indoctrination from, if my
memory serves me accurately, a Motorola 68K family application note
circa 1984-1989. (At least I think it was a 68K family app note - I
am sure I will be corrected if no 68K had this function.) I have
evangelized it to other computer architects ever since. In
particular, I recall Dave James (then of Apple) giving a FutureBus+ or
SCI presentation, talking about compare-and-swap. From the audience:
Q: what is your pointer size?
A: 64 bits
Q: then don't you need cas128?
Loud murmurs of agreement from the audience.
Dave came and talked to me about it afterwards.

I am sure that you are correct that not all architects understand
lock-free algorithms.


Although I might not assume this of the guys who left CMPXCHG16B out
of AMD-64 --- they may just not have been able to implement it in time
for the first chip, and therefore left it out. Hoping, maybe, that
software would find another way to do the same things, perhaps. Or
perhaps not.

Lots of stuff gets left out that we know should be put in, in a
perfect world. Each such leaving out is an occasion for
rejustification.


> IMHO, I believe the major reason 32-bit architectures support DWCAS was to
> be "64-bit ready"; DWCAS gets converted to normal CAS on 64-bit systems. So,
> if your 32-bit applications used DWCAS, then they can be run 64-bit arch
> without changes... Of course there is a caveat...
>
> A 32-bit application cannot use DWCAS to manipulate a pointer along with
> another contiguous variable...
>
> struct dosent_work_on_64_bit_s {
> void *a;
> whatever b;
> };

Why not? Alignment? Language embedding? etc.

People have been solving the alignment constraint for years by
mallocing extra space and then rounding.

===

By the way, yes, I am reasonably sure that Sun is doing either KCSS or
TM.

Tommy Thorn

Jun 13, 2006, 7:38:44 PM
Andy Glew wrote:
> Myself, I received the DCAS lock-free indoctrination from, if my
> memory serves me accurately, a Motorola 68K family application note
> circa 1984-1989. (At least I think it was a 68K family app note - I
> am sure I will be corrected if no 68K had this function.) I have
> evangelized it to other computer architects ever since.

I hope this is not a FAQ or OT, but how does CAS compare to LL/SC? It
would appear to me that the latter is simpler to implement (in HW) and
not much slower, if at all. What are the advantages of CAS as a primitive?

Thanks,
Tommy

Joe Seigh

Jun 13, 2006, 9:04:37 PM
Well, it's easier to emulate CAS with LL/SC than the other way
around. Given that, some people write portable libraries with CAS as
the standard primitive. So if you have CAS as a native primitive, your
machine is likely to be more efficient using these libraries.

CAS is subject to the ABA problem with certain algorithms if you don't
have double-wide CAS, whereas you don't need double-wide LL/SC to avoid
the ABA problem.

They both can have scalability problems, so you might see fetch-and-op
showing up on high processor count machines. Itanium's fetchadd was
fetch-and-op on certain memory types (vs. being an interlocked instruction).

Piotr Wyderski

Jun 14, 2006, 11:10:21 AM
Tommy Thorn wrote:

> I hope this is not a FAQ or OT, but how does CAS compare to LL/SC?

CAS is weaker. It cannot handle some situations (google: "ABA problem"),
whereas LL/SC is inherently ABA-immune and thus does not need any ABA
preventers, such as a tag field (version counter).

> What are the advantages of CAS as a primitive?

No, there is no advantage.

Best regards
Piotr Wyderski

Piotr Wyderski

Jun 14, 2006, 11:30:13 AM
Andy Glew wrote:

> Myself, I received the DCAS lock-free indoctrination from, if my
> memory serves me accurately, a Motorola 68K family application note
> circa 1984-1989. (At least I think it was a 68K family app note - I
> am sure I will be corrected if no 68K had this function.) I have
> evangelized it to other computer architects ever since.

Lock-free algorithms are very good in many situations; I use
them on a daily basis. However, they are not a silver bullet.
They have many disadvantages, e.g. the tremendous complexity
of a lock-free solution when you need to do something very
simple, e.g. a FIFO. They are very slow (but surely faster than
the lock-based versions): 80 ns is a typical time required for
an interlocked bus transaction on most architectures.
That means at most ~12.5 million atomic instructions per second.

In many cases it is possible to arrange parallel processing in
a way that does not exchange data between processors very often.
Then the performance boost can be an order of magnitude or
so, compared even to the lock-free design. Sync-free processing
is the way to go; lock-free is just a kludge. ;-)

Best regards
Piotr Wyderski

Joe Seigh

Jun 14, 2006, 12:05:21 PM
Piotr Wyderski wrote:
> Andy Glew wrote:
>
>> Myself, I received the DCAS lock-free indoctrination from, if my
>> memory serves me accurately, a Motorola 68K family application note
>> circa 1984-1989. (At least I think it was a 68K family app note - I
>> am sure I will be corrected if no 68K had this function.) I have
>> evangelized it to other computer architects ever since.
>
>
> Lock-free algorithms are very good in many situations, I use
> them on a daily basis. However, they are not the silver bullet.
> They have many disadvantages, e.g. their tremendous complexity
> of the lock-free solution when you need to do something very
> simple, e.g. a FIFO. They are very slow (but surely faster than
> the synchronized versions), 80 ns is a typical interlocked time
> required for an interlocked bus transaction on most architectures. It
> means at most ~12.5 million atomic instruction per second.

Saying the complexity of lock-free algorithms is a disadvantage is
like saying the complexity of AVL trees is a disadvantage. It
would be if you coded an AVL tree from scratch every time you needed
to use one. But most of us use a library implementation.

There are reader lock-free algorithms that don't use interlocked
instructions. Some that don't require memory barriers even.
These run very fast.

>
> In many cases it is possible to arrange parallel processing in a way
> does not exchange data between processors very often.
> Then the performance boost can be an order of magnitude or
> so, compared even to the lock-free design. Sync-free processing
> is a way to go, lock-free is just a kludge. ;-)
>

Avoiding unnecessary data exchange is always good. But if you have
to exchange data then some form of synchronization is necessary.
What form is best all depends. I always thought barrier synchronization
commonly used in parallel programming was somewhat suboptimal. Sort of
like having a traffic light on a Formula One race track to stop all the
cars once per lap.

Tommy Thorn

Jun 14, 2006, 12:39:32 PM
Piotr Wyderski wrote:
>> What are the advantages of CAS as a primitive?
>
> No, there is no advantage.

Thanks Piotr and Joe,

that agrees with my understanding, but Andy made me think I was missing
something.

LL/SC has many nice properties and low hardware complexity, though Eric
Pattison's LLE and LLC extension are desirable also.

Is there a rule of thumb as to when LL/SC starts showing scalability
issues? On the particular 7-way platform I experimented on, with all
threads sitting in tight LL/SC loops on a shared location, the
performance was great.

I wonder how much harder it would have been for x64/AMD64 to have added
LL/SC instructions.

Regards,
Tommy

Elcaro Nosille

Jun 14, 2006, 1:24:28 PM
> I mean, Sun has a whole group in Sun Research working on
> synchronization algorithms, ...

I doubt that Sun has a group of developers who deal
only with synchronization.


Joe Seigh

Jun 14, 2006, 2:29:48 PM
Tommy Thorn wrote:
> Piotr Wyderski wrote:
>
>>> What are the advantages of CAS as a primitive?
>>
>>
>> No, there is no advantage.
>
>
> Thanks Piotr and Joe,
>
> that agrees with my understanding, but Andy made me think I was missing
> something.
>
> LL/SC has many nice properties and low hardware complexity, though Eric
> Pattison's LLE and LLC extension are desirable also.
>
> Are there a rule of thumb as to when LL/SC starts showing scalability
> issues? On the particular 7 way platform I experimented on, with all
> threads sitting in tight LL/SC loops on a shared location, the
> performance was great.

It would probably be dictated by the OSes the processor was intended to
support. The LL/SC would be used in kernel synchronization primitives,
and if there was a scalability problem then either LL/SC has to be
improved, fewer processors configured, or somebody has to reduce contention
in the kernel. Applications won't have as much influence, since if they
don't work it's not as much of a show stopper as the OS not working.

The only problem I know of with LL/SC is instruction stepping in
debug mode.


>
> I wonder how much harder it would have been for x64/AMD64 to have added
> LL/SC instructions.
>

Well, Intel has problems getting MONITOR/MWAIT to perform well, though
I don't know if they use a similar mechanism to LL/SC. Bus snooping in
the former most likely, but I don't know if it's used by the latter.

Andy Glew

Jun 14, 2006, 4:42:59 PM
> > I hope this is not a FAQ or OT, but how does CAS compare to LL/SC?


It is rather common for LL/SC to be implemented in the microprocessor,
but not to be implemented (or implementable) on the platform.

E.g. Stanford Flash - or was it Dash? The MIPS processor had LL/SC,
but in order to guarantee forward progress they had to kluge together
traditional RMWs. I.e. on at least one such system you couldn't use
LL/SC for parallel programming.

LL/SC's problems can always be fixed... but fixing requires extra
hardware. Whereas explicit atomic RMWs such as CAS or LOCK INC mem
provide enough clues to make the implementation simpler.


Piotr Wyderski

Jun 14, 2006, 4:57:59 PM
Joe Seigh wrote:

> Saying the complexity of lock-free algorithms is a disadvantage is
> like saying the complexity of AVL trees is a disadvantage.

AVL trees are not complex. When implementing an AVL tree, you
don't need to study the documentation of the target machine
with extreme care. You don't need to understand its memory
ordering, cache coherence protocols, discover sequence points,
object lifetimes, etc. All you need is to implement a subtree balancer
and forget about the problem. You cannot skip these problems
when you try to design or implement something lock-free, because
the result will be an extremely nasty and hard-to-find bug.

> It would be if you coded an AVL tree from scratch everytime
> you needed to use one.

I've implemented it several times, in most cases for special purposes.
Lock-free techniques are not general and require highly skilled personnel.
Thus, we (my company and particularly myself) don't use them to solve
generic problems via a template one can pick from a library like the STL.
This tool is too sharp. If a problem is serious (poor scalability on the
critical path, long execution time, etc.), we design a problem-specific
lock-free solution. There are no standard lists, queues, stacks, etc.
-- frankly speaking, nobody needs them. The generic solutions would
be too general and too complex. The context of the particular use case
often leads to a much better, specialized design, which does extremely
well the things it should do, but is useless outside that context. If
the problem deserves the effort, we start casting spells and provide
enchanted data structures, but the key point is that "if".



> There are reader lock-free algorithms that don't use interlocked
> instructions. Some that don't require memory barriers even.

Yes, they do exist and their performance is impressive, but they are
very limited. I use them whenever possible, but I call them "sync-free"
in order to separate this particularly interesting fraction from the broad
group of lock-free algorithms. Currently the intuitive meaning of the
term "lock-free" implies some form of CASturbation, which is not
the case here.

> These run very fast.

Of course they do.

> Avoiding unnecessary data exchange is always good. But if you have
> to exchange data then some form of synchronization is necessary.

Necessary -- yes, ultimately there is some form of synchronization,
but in many cases this necessity can be reduced so far that the
overhead becomes virtually negligible. One can do this by coarsely
batching jobs (one synchronization per bunch of jobs) or by more
fine-grained variants of it, like the "push-one/grab-all" stack built
on a single-word CAS instruction (it is ABA-immune).

> What form is best all depends. I always though barrier synchronization
> commonly used in parallel programming was somewhat suboptimal. Sort of
> like having a traffic light on a formula one race track to stop all the
> cars once per lap.

As you said, it depends. Barriers are good in some situations and
can perform much better than a purely lock-free solution. Anyway,
in my case the first question I think of is "can I batch it?". If the
answer is "yes", then the choice of the synchronization algorithm
doesn't really matter.

Best regards
Piotr Wyderski

Piotr Wyderski

Jun 14, 2006, 5:03:41 PM
Tommy Thorn wrote:

> Are there a rule of thumb as to when LL/SC starts showing scalability
> issues?

IMHO there is no rule of thumb. Scalability issues are not typically
induced by LL/SC itself. The cache coherence protocol is your real
foe. A picture of a bunch of CPUs playing ping-pong with a shared
cache line should worry you, not the lock-free algorithm on top of it.

Best regards
Piotr Wyderski

Piotr Wyderski

Jun 14, 2006, 5:08:08 PM
Elcaro Nosille wrote:

> They could be even slower than syncronized operations

Yes, they could.

> because DCASes are slower than single-word
> CASes on current x86-Architectures.

That's true, but even the single-word CAS is so
slow that the difference between a CAS (~90
cycles on Pentium 4) and a DCAS (~130 cycles)
doesn't matter. You measure it in percent, not
in orders of magnitude.

Best regards
Piotr Wyderski

Chris Thomasson

Jun 14, 2006, 5:24:16 PM
"Piotr Wyderski" <wyde...@mothers.against.spam-ii.uni.wroc.pl> wrote in
message news:e6pt8l$prp$1...@news.dialog.net.pl...
> Joe Seigh wrote:
[...]

>> Avoiding unnecessary data exchange is always good. But if you have
>> to exchange data then some form of synchronization is necessary.
>
> Necessary -- yes, ultimately there is some form of synchronization,
> but in many cases this necessity can be reduced and thus the
> overhead will be virtually neglected. One can do this by coarse
> batching jobs (one synchronization per bunch of jobs) or by more
> fine-grained variants of it, like the "push-one/grab-all" stack build
> on a single-word CAS instruction (it is ABA-immune).

Indeed:

http://groups.google.ca/group/comp.programming.threads/msg/a359bfb41c68b98b

This trick can outperform the well-known lock-free queue by Michael & Scott.
Unfortunately, their implementation is saturated with atomic operations and
StoreLoad dependencies; that means StoreLoad-style memory barriers... Not
good!

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/3cb6041545e55406/0297660c07efb7e1?q=commercializing+lock-free&rnum=1#0297660c07efb7e1

http://groups.google.com/group/comp.programming.threads/msg/0297660c07efb7e1

Andy Glew

Jun 14, 2006, 5:21:49 PM
> > Are there a rule of thumb as to when LL/SC starts showing
> > scalability issues?
>
> IMHO there is no rule of thumb. Scalability issues are not typically
> induced by LL/SC itself.

I disagree.

Although maybe it's not really scalability issues. LL/SC starts
introducing deadlock / forward progress issues the moment you have
more than one processor on any interconnect that is not a simple
snoopy bus, and they just get worse as you add more, and have more
complicated interconnects.

LL/SC's problems can be generically solved in a scalable manner. But
it has a cost. A relatively minor cost; worse is that it tends to
rule out certain structural optimizations.

E.g. the straightforward way to guarantee LL/SC forward progress is to
provide one load-linked register for every processor and every thread
at every directory and snoop filter. When you have 100s of
processors/threads, that costs.

Yes, you can do better than this.

Chris Thomasson

Jun 14, 2006, 5:58:59 PM
"Andy Glew" <first...@employer.domain> wrote in message
news:peypzmgg...@PXPL8591.amr.corp.intel.com...

>> IMO, the hardware architects' did NOT have lock-free algorithms in mind
>> when
>> they decided to put DWCAS in their 32-bit instruction sets. The fact that
>> 128-bit CAS is not reliably supported seems to support my opinion...
>
> Actually, I am pretty sure that some, although maybe not all,
> architects had lock free algorithms in mindwhen they put what you call
> DWCAS in the 16 and 32 bit instruction sets.
>
> Perhaps more accurately described as "compare and swap 2 adjacent
> pointer sized memory locations".

Yes. DWCAS stands for "Double-Width Compare-And-Swap".


> Myself, I received the DCAS lock-free indoctrination from, if my
> memory serves me accurately, a Motorola 68K family application note
> circa 1984-1989. (At least I think it was a 68K family app note - I
> am sure I will be corrected if no 68K had this function.) I have
> evangelized it to other computer architects ever since. In
> particular, I recall Dave James (then of Apple) giving a FutureBus+ or
> SCI presentation, talking about compare-and-swap. From the udience:
> Q: what is your pointer size?
> A: 64 bits
> Q: then don't you need cas128?
> loud mnurmurs of agreement from audience.
> Dave came and talked to me about it afterwards.

Yup. Detlefs based a lock-free reference counting algorithm on the Motorola
DCAS instruction:

http://citeseer.ist.psu.edu/cache/papers/cs/25408/http:zSzzSzwww.sun.comzSzresearchzSzpeoplezSzdetlefszSzpaperszSz01-podc.pdf/detlefs01lockfree.pdf

Interestingly, Sun has recently adapted Detlefs' work and patented a new
lock-free reference counting algorithm. Unfortunately, the patent teachings
prove that it is identical to Joe Seigh's atomic_ptr...

http://groups.google.com/group/comp.programming.threads/msg/d9f47a280b56c909


I bet that pisses him off real good! They also use one of my lock-free
reference counting techniques in this paper:

http://groups.google.com/group/comp.programming.threads/msg/0b9072fea09fb64e


http://groups.google.com/group/comp.programming.threads/browse_frm/thread/f2c94118046142e8/0b9072fea09fb64e?q=lock-free+patent&rnum=1#0b9072fea09fb64e

I had better stop posting my lock-free techniques into public space, because
a Giant will come along, alter one small thing (or nothing at all) and patent
it! Then I can't even use my own prior art!

;)


> I am sure that you are correct that not all architects understand
> lock-free algorithms.

[...]


> Although I might not assume this of the guys who left CMPXCHG16B out
> of AMD-64 --- they may just not have been able to implement it in time
> for the first chip, and therefore left it out. Hoping, maybe, that
> software would find another way to do the same things, perhaps. Or
> perhaps not.
>
> Lots of stuff gets left out that we know should be put in, in a
> perfect world. Each such leaving out is an occasion for
> rejustification.

Indeed... What a nice place a perfect world would be!

;)


>> IMHO, I believe the major reason 32-bit architectures support DWCAS was
>> to
>> be "64-bit ready"; DWCAS gets converted to normal CAS on 64-bit systems.
>> So,
>> if your 32-bit applications used DWCAS, then they can be run 64-bit arch
>> without changes... Of course there is a caveat...
>>
>> A 32-bit application cannot use DWCAS to manipulate a pointer along with
>> another contiguous variable...
>>
>> struct dosent_work_on_64_bit_s {
>> void *a;
>> whatever b;
>> };
>
> Why not? Alignment? Language embedding? etc.
>
> People have been solving the alignment constraint for years by
> mallocing extra space and then rounding.

Yes. I have used various techniques to solve this... IMO, the most commonly
known trick is to use a simple index/offset into an array as a "pointer":

void *Pointers[SufficientDepth];

struct x64_bit_clean_anchor_s {
    uint16_t Pointer1Index;
    uint16_t Pointer2Index;
    uint32_t Whatever;
};


I used this method to create a "64-bit clean adaptor" for Joe Seigh's DWCAS
version of his neat atomic_ptr algorithm:

http://groups.google.ca/group/comp.programming.threads/msg/aae28a0e1f9b0d66

This allows atomic_ptr to run on SPARC in 64-bit mode. Another simple one
that you mentioned involves aligning the lock-free anchor on a sufficiently
sized boundary and then using the extra bits as a counter variable. One of
my lock-free proxy collectors uses this simple technique:

http://home.comcast.net/~vzoom/demos/pc_sample.c

This is a 100% 64-bit clean lock-free collector algorithm... Very simple,
however, I do think that "reliable support" of DWCAS would have made our
lives a lot easier...!

;)


Any thoughts?


Chris Thomasson

Jun 14, 2006, 6:21:01 PM
"Andy Glew" <first...@employer.domain> wrote in message
news:peyp8xnz...@PXPL8591.amr.corp.intel.com...

>> > Are there a rule of thumb as to when LL/SC starts showing
>> > scalability issues?
>>
>> IMHO there is no rule of thumb. Scalability issues are not typically
>> induced by LL/SC itself.
>
> I disagree.
>
> Although maybe it's not really scalability issues. LL/SC starts
> introducing deadlock / forward progress issues the moment you have
> more than one processor on any interconnect that is not a simple
> snoopy bus, and they just get worse as you add more, and have more
> complicated interconnects.

Something like this?:

http://groups.google.com/group/comp.arch/msg/d40cf092b679169d

Coarse reservation granularity can cause live-lock in LL/SC via simple
false sharing. I really do believe that HTM is going to suffer from the same
thing... Multiple processors can live-lock if they keep interfering with the
reservations that track the TM.

Humm... Kind of like the live-lock that is inherent in any obstruction-free
algorithm; they need to use "backoff-and-retry" schemes to maintain
forward progress...


Joe Seigh

Jun 14, 2006, 7:09:34 PM
Chris Thomasson wrote:
...

>
> Yup. Detlefs based a lock-free reference counting algorithm on the Motorola
> DCAS instruction:
>
> http://citeseer.ist.psu.edu/cache/papers/cs/25408/http:zSzzSzwww.sun.comzSzresearchzSzpeoplezSzdetlefszSzpaperszSz01-podc.pdf/detlefs01lockfree.pdf
>
They got a patent on it.
http://patft.uspto.gov/netacgi/nph-Parser?patentnumber=6,993,770

> Interestingly, SUN has recently adapted Detlefs work and patented a new
> lock-free reference counting algorithm. Unfortunately, the patent teachings
> prove that it is identical to Joe Seighs atomic_ptr...
>
> http://groups.google.com/group/comp.programming.threads/msg/d9f47a280b56c909

The application for that one is here
http://appft1.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PG01&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.html&r=1&f=G&l=50&s1=%2220060037026%22.PGNR.&OS=DN/20060037026&RS=DN/20060037026


>
>
> I bet that pisses him off real good!

You must mean the Sun guys. I think it's pretty funny.

Andy Glew

Jun 15, 2006, 12:57:23 PM
> Humm... Kind of like the live-lock that is inherent in any obstruction-free
> algorithm; they need to use "backoff-and-retry" algorithms to keep up
> forward-progress...

Yeah, I was thinking of correcting my older post to try to get to this.

It is not fundamentally LL/SC vs. CAS and other atomic RMWs.

It is more that LL/SC was originally proposed and implemented via an
"optimistic concurrency" type of address "link", whereas CAS and other
atomic RMWs are traditionally implemented via a "lock" - whether bus
lock, cache lock, or address lock.

If other people try to access the lock while the RMW is in flight they
are stalled or told to go away, come back later. This guarantees
forward progress.

Whereas if other people access the link between the LL and SC, it is
the SC that fails. In the presence of an adversary doing ordinary
stores, the LL/SC might never complete.

But these are just implementation artifacts. E.g. you can implement
CAS or an atomic RMW with LL/SC, whether in user code or microcode -
and such an RMW would have the same forward progress problems as
LL/SC. You can implement a hybrid - try the optimistic concurreny
approach first, and then try the lock approach.

Similarly, you can implement LL/SC by acquiring a lock, guaranteeing
that the SC will finish. But then you need some way of backing out,
protecting yourself against a malicious user who does the LL but never
the SC. E.g. the Intel 860 released such a lock after 32 instructions.

Andy Glew

unread,
Jun 15, 2006, 1:00:18 PM6/15/06
to
> This is a 100% 64-bit clean lock-free collector algorithm... Very simple,
> however, I do think that "reliable support" of DWCAS would have made our
> lives a lot easier...!
>
> ;)
>
>
> Any thoughts?


I'm open for discussion, but I don't know what you mean by "reliable support".

If alignment is the only issue, then

a) software can already solve the problem

b) it's really part of a larger question: should hardware support
unaligned memory accesses.

Anne & Lynn Wheeler

Jun 15, 2006, 1:27:47 PM
Andy Glew <first...@employer.domain> writes:
> Yeah, I was thinking of correcting my older post to try to get to this.
>
> It is not fundamentally LL/SC vs. CAS and other atomic RMWs.
>
> It is more that LL/SC was originally proposed and implemented via an
> "optimistic concurrency" type of address "link", whereas CAS and other
> atomic RMWs are traditionally implemented via a "lock" - whether bus
> lock, cache lock, or address lock.
>
> If other people try to access the lock while the RMW is in flight they
> are stalled or told to go away, come back later. This guarantees
> forward progress.
>
> Whereas if other people access the link between the LL and SC, it is
> the SC that fails. In the presence of an adversary doing ordinary
> stores, the LL/SC might never complete.
>
> But these are just implementation artifacts. E.g. you can implement
> CAS or an atomic RMW with LL/SC, whether in user code or microcode -
> and such an RMW would have the same forward progress problems as
> LL/SC. You can implement a hybrid - try the optimistic concurrency
> approach first, and then try the lock approach.
>
> Similarly, you can implement LL/SC by acquiring a lock, guaranteeing
> that the SC will finish. But then you need some way of backing out,
> protecting yourself against a malicious user who does the LL but never
> the SC. E.g. the Intel 860 released such a lock after 32 instructions.

the traditional locking on 360 multiprocessors had been test-and-set.

charlie was doing a lot of multiprocessor fine-grain locking work for
cp/67 at the science center
http://www.garlic.com/~lynn/subtopic.html#545tech

when he invented compare-and-swap (the name was picked to match
charlie's initials, "CAS").

a large part of the invention of compare-and-swap was because he noted
that much of the fine-grain test-and-set locking was purely for some
simple storage updates ... aka pointers and counters ... and that
compare-and-swap could accomplish both the "locking" of the earlier
test-and-set ... but the compare-and-swap semantics could also be used
to directly accomplish various storage updates w/o having to acquire
independent locking operations.
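The two styles being contrasted can be sketched in modern C11 atomics (names are illustrative; atomic_flag is used here as the closest portable analogue of test-and-set):

```c
#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;
static long counter;                 /* protected by 'lock' */
static _Atomic long counter2;        /* updated directly with CAS */

/* test-and-set style: acquire a lock just to bump a counter */
static void locked_increment(void)
{
    while (atomic_flag_test_and_set(&lock))  /* spin on test-and-set */
        ;
    counter++;
    atomic_flag_clear(&lock);
}

/* compare-and-swap style: update the word itself, no separate lock */
static void cas_increment(void)
{
    long old = atomic_load(&counter2);
    while (!atomic_compare_exchange_weak(&counter2, &old, old + 1))
        ;
}
```

The second form is exactly the "direct storage update without acquiring an independent lock" that motivated adding compare-and-swap alongside test-and-set.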

early attempts to deal with the pok 370 architecture group trying to
get compare-and-swap into 370 architecture weren't successful. they
claimed that the "mainstream" (pok) operating system group didn't see
sufficient additional smp benefit of compare-and-swap over the earlier
simple test-and-set for locking. they said that in order to get
compare-and-swap justified for 370 architecture a non-smp lock use had
to be used.

as a result, the multi-threaded application use for various storage
location updates was invented (independent of running on a single
processor or a multiprocessor). in the traditional smp kernel use, the
instruction stream is disabled for interrupts during the lock, load,
modify, store, unlock sequence ... so there aren't a lot of deadlock problems.
in multi-thread applications, the lock, update, unlock sequence could
be interrupted with another thread getting dispatched and then
deadlocking (which then requires all sorts of logic to avoid). the
atomic compare-and-swap operation significantly reduced the various
deadlock scenarios and software complexity for dealing with them.

the original rios/power/rs6000 didn't have support for shared-memory
multiprocessing, nor a compare-and-swap instruction. the
problem was that in the intervening years (between early 70s and early
90s) a number of multi-threaded applications (like large database
infrastructures) had adopted use of compare-and-swap instruction. in
order to simplify support, aix defined a software compare-and-swap
macro ... which interrupted into the kernel and had a special
fast-path for emulating compare-and-swap semantics (while disabled for
interrupts, aka approx. atomic compare-and-swap in a single processor
environment).

misc. past posts on smp, compare-and-swap, etc
http://www.garlic.com/~lynn/subtopic.html#smp

Piotr Wyderski

unread,
Jun 15, 2006, 1:41:10 PM6/15/06
to
Andy Glew wrote:

> Although maybe it's not really scalability issues.

I think so. Here the term "scalability" means how much
faster N processors perform the task relative to a single
processor. The task is a typical concurrent program, not
the worst case that someone insane can implement. :-)

On most implementations one can destroy his own LL
reservation by a non-SC write into the reservation granule,
but I haven't seen anyone who claims that it is a problem.

> LL/SC starts introducing deadlock / forward progress issues

LL/SC itself cannot cause deadlocks, there is no shared
resource to wait for. The LL and SC instructions always
eventually terminate. However, the guarantee of forward
progress is (or better: can be) an issue. LL/SC is susceptible
to livelock. But so is CAS, unless there is hardware
arbitration logic which can guarantee that every candidate
for an interlocked bus transaction initiator will eventually
become the initiator. If not, then one processor can issue
CASes one by one while the others wait for it infinitely long.
The probability of this is extremely low, but the same is true of LL/SC.

> LL/SC's problems can be generically solved in a scalable manner.

But are they _real_ problems, which real applications must face?
I doubt it. The probability of an infinite stall is low and rapidly
decreases with time. 10 million successful LL/SC pairs per
second, 100 faults requiring 2 protocol restarts per second,
one fault with 1000 restarts per year? It's fine with me.

> When you have 100s of processors/threads, that costs.

Hundreds of CPUs on a single bus? It looks more like a NUMA machine...
And on a NUMA generally it's a very bad idea to cross the locality domains
(or how should one call the CPU clusters?) with locks, at least frequently.

Best regards
Piotr Wyderski

Joe Seigh

unread,
Jun 15, 2006, 2:48:20 PM6/15/06
to


Lock-free algorithms are a lot more sensitive to hardware support
than other types of algorithms. While you can emulate some forms
of hardware support with only performance penalties, you can't
emulate interlocked instructions very well and get forward
progress and async safety guarantees that are useful.

It's important for hardware features to be ubiquitous and be
around long enough for software developers to feel it's
worthwhile to exploit them. Who, besides Sun, would waste
time developing algorithms for short-lived hardware features?
I could think of interesting things to do with MONITOR/MWAIT
and/or hyperthreading. But going past the stage of mental
exercise hardly seems worthwhile.

Andy Glew

unread,
Jun 15, 2006, 3:52:25 PM6/15/06
to
> > LL/SC starts introducing deadlock / forward progress issues
>
> LL/SC itself cannot cause deadlocks, there is no shared
> resource to wait for. The LL and SC instructions always
> eventually terminate. However, the guarantee of forward
> progress is (or better: can be) an issue. LL/SC is susceptible
> to a livelock. But so does CAS, unless there is a hardware
> arbitration logic which can guarantee that every candidate
> for an interlocked bus transaction initiator will eventually
> become the initiator. If not, then one procesor can start
> CASes one by one, when the others wait for it infinitely long.
> The probability of this is extremely low, but same is with LL/SC.

No.

Consider N processors trying to obtain a lock. With possibly M other
adversary processors that are willing to create worst case memory access
patterns to disrupt this.

The simplest implementations of LL/SC - the implementations in the
original processors that used this approach - cannot guarantee that
any of the N processors will ever acquire the lock. All of the lockers
can be disrupted by the adversary.

The simplest implementations of atomic RMWs - the implementation on
every processor that I have ever seen it implemented - guarantees that
at least 1 of the N will complete the CAS. Some of the N may be
starved forever - but at least one will eventually get work done.

As mentioned in other email, both LL/SC and atomic RMWs can be
extended in ways that provide better guarantees - such as guaranteeing
that everyone will eventually get to do a CAS. Similarly, the LL/SC
forward progress problem in the presence of adversaries can be solved.
IMHO fixing LL/SC is ugly, but certainly doable.

Terje Mathisen

unread,
Jun 15, 2006, 4:12:10 PM6/15/06
to

Indeed!

The one thing that I worry about in a (potentially) high-contention
situation is forward progress: Some system architectures can guarantee
this, others (and it can be hard to figure out which kind you're working
with) cannot.

In the OP problem of DCAS it is normally possible to re-design the
problem (i.e. data structures) in such a way that a single-width CAS is
enough.

If even this isn't enough, then I like locked fetch_and_add (Read old
value and adjust it, write back the new value and return either the old
or the new value. For the last part it really doesn't matter which
version the cpu architect chooses), i.e. a RMW that can guarantee
forward progress on every bus cycle as long as the cache protocol
supports some form of allocate_for_update, which it pretty much has to, right?
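C11 exposes precisely this primitive as atomic_fetch_add (the committee chose to return the old value); on x86 it can compile to a single LOCK XADD, with no retry loop to lose. A trivial sketch (the function name is illustrative):

```c
#include <stdatomic.h>

/* atomic_fetch_add is the unconditional RMW Terje describes: the
 * write always happens, so every executed operation makes progress --
 * there is no compare step that can fail and force a retry.
 * Handing out "tickets" is a classic use. */
static long take_ticket(_Atomic long *next_ticket)
{
    return atomic_fetch_add(next_ticket, 1);  /* old value; never fails */
}
```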

Terje

--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

Joe Seigh

unread,
Jun 15, 2006, 5:09:53 PM6/15/06
to
Terje Mathisen wrote:

> If even this isn't enough, then I like locked fetch_and_add (Read old
> value and adjust it, write back the new value and return either the old
> or the new value. For the last part it really doesn't matter which
> version the cpu architect chooses), i.e. a RMW that can guarantee
> forward progress on every bus cycle as long as the cache protocol
> support some form of allocate_for_update which it pretty much has to,
> right?
>

I have seen a mainframe checkstop because the microcode for an
interlocked instruction timed out because it couldn't make forward
progress. Yes, magic is nice but even it has problems at times.

Piotr Wyderski

unread,
Jun 15, 2006, 5:01:01 PM6/15/06
to
Terje Mathisen wrote:

> The one thing that I worry about in a (potentially) high-contention
> situation is forward progress:

Forward progress is a very strong theoretical property, but in
most practical cases the risk of unbounded starvation is not
that important, if the expected "time to progress" is almost
constant and the fluctuations are low and decrease with time.
You can achieve this with, for instance, random backoff.
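A minimal sketch of what random backoff looks like wrapped around a CAS retry loop (the delay loop, the cap, and the use of rand() are illustrative choices, not tuned values):

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Sketch: CAS retry with randomized, roughly exponential backoff --
 * the contention management Piotr alludes to.  Each failed CAS widens
 * the random delay window, spreading contenders out in time. */
static void backoff_increment(_Atomic long *p)
{
    unsigned exp = 1;                        /* current backoff exponent */
    long old = atomic_load(p);
    while (!atomic_compare_exchange_weak(p, &old, old + 1)) {
        unsigned spins = rand() % (1u << exp);  /* random delay */
        for (volatile unsigned i = 0; i < spins; i++)
            ;                                /* burn a few cycles */
        if (exp < 16)                        /* cap the window */
            exp++;
    }
}
```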

The best thing one can do in the case of high contention is
to reduce that contention via hashing, sectored data structures,
dedicated communication channels etc.

> In the OP problem of DCAS it is normally possible to re-design the
> problem (i.e. data structures) in such a way that a single-width CAS is
> enough.

It is, but it often requires fancy memory management techniques,
specialized allocators and similar things, which make the final solution
only a mental exercise, not an industrial-quality code. If the data
structure needs only a single-word CAS without such techniques,
that's great! If it requires DCAS, it's a no-go unless it has extremely
important features -- DCAS is weakly portable: even in the IA-32/64
family, AMD's processors do not support it in 64-bit mode, and IA-64
provides only the curious cmp8xchg16 instruction...

Best regards
Piotr Wyderski

Terje Mathisen

unread,
Jun 16, 2006, 10:04:38 AM6/16/06
to
Piotr Wyderski wrote:

> Terje Mathisen wrote:
>> In the OP problem of DCAS it is normally possible to re-design the
>> problem (i.e. data structures) in such a way that a single-width CAS
>> is enough.
>
> It is, but it often requires fancy memory management techniques,
> specialized allocators and similar things, which make the final solution
> only a mental exercise, not an industrial-quality code. If the data

I disagree:

Another poster showed one example of how to do it, by using a level of
indirection via an array of items that are small enough to fit in a
regular CAS.

This is still usable in "industrial-quality" code, even if it introduces
problems similar to the classic memory management "above/below the line"
issue.
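A sketch of that indirection: nodes live in a pre-allocated array and are named by 32-bit indices, so an index plus a 32-bit ABA tag fits in one ordinary 64-bit CAS word even on a 64-bit machine. All names and sizes here are illustrative:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Lock-free stack using indices instead of pointers, so a single
 * 64-bit CAS covers both the "pointer" (index) and an ABA tag. */
#define NIL 0xFFFFFFFFu

struct node { int value; uint32_t next; };
static struct node pool[1024];              /* all nodes pre-allocated */
static _Atomic uint64_t top = NIL;          /* (tag << 32) | index */

static void push(uint32_t i)
{
    uint64_t old = atomic_load(&top);
    uint64_t fresh;
    do {
        pool[i].next = (uint32_t)old;       /* link to current top */
        fresh = ((old >> 32) + 1) << 32 | i;  /* bump tag, set index */
    } while (!atomic_compare_exchange_weak(&top, &old, fresh));
}

static uint32_t pop(void)
{
    uint64_t old = atomic_load(&top);
    uint64_t fresh;
    do {
        if ((uint32_t)old == NIL)
            return NIL;                     /* empty */
        fresh = ((old >> 32) + 1) << 32 | pool[(uint32_t)old].next;
    } while (!atomic_compare_exchange_weak(&top, &old, fresh));
    return (uint32_t)old;
}
```

The fixed pool is the "above/below the line" flavor mentioned above: every node must come from this array, which is exactly the memory-management constraint the trick trades for single-width CAS.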

Alex Colvin

unread,
Jun 17, 2006, 12:47:28 AM6/17/06
to
>> foe. A picture of a bunch of CPUs playing ping-pong with a shared
>> cache line should worry you, not the lock-free algorithm on top of it.

>The one thing that I worry about in a (potentially) high-contention

>situation is forward progress: Some system architectures can guarantee
>this, others (and it can be hard to figure out which kind you're working
>with) cannot.

For example, if you use Ethernet, you should be comfortable using LL/SC.
If that's not good enough, maybe you should be using 802.4 token bus.

--
mac the naïf

Steinar Haug

unread,
Jun 17, 2006, 4:44:41 AM6/17/06
to
> >The one thing that I worry about in a (potentially) high-contention
> >situation is forward progress: Some system architectures can guarantee
> >this, others (and it can be hard to figure out which kind you're working
> >with) cannot.
>
> For example, if you use Ethernet, you should be comfortable using LL/SC.
> If that's not good enough, maybe you should be using 802.4 token bus.

I presume you're talking about the original CSMA/CD implementation.

Ethernet today is usually implemented with full duplex point to point
links. CSMA/CD doesn't really enter the picture at all.

Steinar Haug, Nethelp consulting, sth...@nethelp.no

Terje Mathisen

unread,
Jun 17, 2006, 5:56:29 AM6/17/06
to

That is _not_ a good comparison!

If you run with multiple broken Ethernet cards, i.e. cards that cheat to
get good benchmarketing numbers by disobeying the required rules about
exponential backoff in the case of collisions, then Ethernet can very
easily bog down completely.

I've seen this 10+ years ago with a vendor/card that would start to
transmit before the minimum legal interpacket gap, causing
correctly-behaving cards to never even see the startup preamble.

This caused packets to be lost to collisions that the hw couldn't
detect, which forced the 2.5 second sw timeout to take over. Performance
went away completely of course, except for the single connection between
two of the broken/illegal cards.

If LL/SC came with compulsory (hw) backoff algorithms then it would be
just as safe as Ethernet. As it is today, with a few threads sitting in
tight loops, or a single DoS-intent programmer, all bets are off.

I.e. LL/SC around a really busy memory location is really bad, while
lots of Ethernet users sending small packets as fast as possible will be
significantly better.

Piotr Wyderski

unread,
Jun 17, 2006, 6:19:08 AM6/17/06
to
Terje Mathisen wrote:

> Another poster showed one example of how to do it, by using a level of
> indirection via an array of items that are small enough to fit in a
> regular CAS.

Of course this classic trick is doable, I use it in my shared-memory
messaging library. But the question is not whether it is possible to
get rid of those problems -- it always is, as we know from the papers
of Herlihy et al. The question is whether the gain in performance and
scalability justifies the additional complexity and design effort.
Sometimes it does, sometimes it does not.

> This is still usable in "industrial-quality" code

Sure, usable, but also abusable. The lock-free algorithms (and any other
too) solve a particular problem in a particular context. One can focus on
the problem and solve it without any further question with more or less
effort, but in most cases it is worthwhile to modify the context and then
re-phrase the problem. Sometimes things get much simpler then and
much easier, less demanding lock- or even sync-free techniques can be
applied instead of the original idea.

Best regards
Piotr Wyderski

David Hopwood

unread,
Jun 17, 2006, 6:29:54 PM6/17/06
to
Alex Colvin wrote:
>
>>The one thing that I worry about in a (potentially) high-contention
>>situation is forward progress: Some system architectures can guarantee
>>this, others (and it can be hard to figure out which kind you're working
>>with) cannot.
>
> For example, if you use Ethernet, you should be comfortable using LL/SC.

Maybe I'm taking this comment more seriously than it was meant, but of
course it's an oversimplification:

- many systems have real-time requirements for some operations, but not
for their use of networking,
- if you use switched Ethernet, that doesn't imply you should be
comfortable with techniques that don't guarantee forward progress.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Jan Vorbrüggen

unread,
Jun 19, 2006, 3:19:25 AM6/19/06
to
> Consider N processors trying to obtain a lock. With possibly M other
> adversary processors that are willing to create worst case memory access
> patterns to disrupt this.

But why should I contemplate such a scenario? For user applications, there
is no need to do this - if your adversary process is part of one application,
the application needs to be fixed; if it is part of a different application,
it simply will not be allowed access to the protected data structure.

The only scenario I can envisage is this happening in the kernel - but then
again, case 1 above applies: the kernel needs to be fixed.

I don't see why such a Byzantine failure mode should be considered in an
environment which can only function at all when the participants cooperate
anyway.

Jan

Jan Vorbrüggen

unread,
Jun 19, 2006, 3:21:19 AM6/19/06
to
> Coarse reservation granularity can cause live-lock in LL/SC via. simple
> false-sharing.

This is a red herring: An application suffering from false sharing is simply
broken, deficits of the C standard (hello, Nick!) notwithstanding. AIUI, lack
of forward progress under contention is the real (only?) issue.

Jan

Jan Vorbrüggen

unread,
Jun 19, 2006, 3:47:29 AM6/19/06
to
> Interestingly, SUN has recently adapted Detlefs work and patented a new
> lock-free reference counting algorithm. Unfortunately, the patent teachings
> prove that it is identical to Joe Seighs atomic_ptr...
>
> http://groups.google.com/group/comp.programming.threads/msg/d9f47a280b56c909

If that is so, write a letter to the USPTO and inform them of the fact.
Better still, write a letter to the applicants; if they don't forward it
to the Examiner and the patent is then granted, you can fairly easily get
a re-examination and invalidation based on the fact that they withheld
such information, AIUI.

> I better stop posting my lock-free techniques into public space because a
> Giant will come along, alter one small thing, or nothing at all, and patent
> it! Then I can't even use my own prior-art!

AIUI, on the one hand your prior art would invalidate the patent; in any case
you are allowed to use it if you can prove, to the court's satisfaction, that
you were practicing it before the invention. Beware of the one year before
application grace period in the US, though.

Jan

Joe Seigh

unread,
Jun 19, 2006, 6:52:12 AM6/19/06
to
Jan Vorbrüggen wrote:
>> Interestingly, SUN has recently adapted Detlefs work and patented a
>> new lock-free reference counting algorithm. Unfortunately, the patent
>> teachings prove that it is identical to Joe Seighs atomic_ptr...
>>
>> http://groups.google.com/group/comp.programming.threads/msg/d9f47a280b56c909
>>
>
>
> If that is so, write a letter to the USPTO and inform them of the fact.
> Better still, write a letter to the applicants; if they don't forward it
> to the Examiner and the patent is then granted, you can fairly easily get
> a re-examination and invalidation based on the fact that they withheld
> such information, AIUI.

In this case some emails were exchanged. In retrospect, I probably
shouldn't have done that since it would have been interesting to see
if Sun could get a patent on it. If they got the patent, they'd
probably promote the idea since they'd have a commercial interest in
it.

>
>> I better stop posting my lock-free techniques into public space
>> because a Giant will come along, alter one small thing, or nothing at
>> all, and patent it! Then I can't event use my own prior-art!
>
>
> AIUI, on the one hand your prior art would invalidate the patent; in any
> case
> you are allowed to use it if you can prove, to the court's satisfaction,
> that
> you were practicing it before the invention. Beware of the one year before
> application grace period in the US, though.
>

Defending yourself in a patent lawsuit is prohibitively expensive. It
wouldn't be a problem for me directly anyway, only for any company
wanting to use the technique commercially. I could probably negotiate
really favorable licensing terms since suing the company I was working
for would jeopardize their patent and licensing with my employer would
reinforce the patent and probably nullify using me as a witness in any
counter suits by other companies.

John F. Carr

unread,
Jun 19, 2006, 8:58:55 AM6/19/06
to
In article <e701h0$fea$1...@pcls4.std.com>,

Ethernet and 802.11 mandate random and increasing delays
to prevent excess collisions and ensure fairness. Is this
technique used with LL/SC?

(Now I have to go off to figure out a bug where the 802.11
collision avoidance scheme seems to be doing a bad job.
But at least it isn't live- or deadlock.)

--
John Carr (j...@mit.edu)

Jan Vorbrüggen

unread,
Jun 19, 2006, 9:30:41 AM6/19/06
to
> Defending yourself in a patent lawsuit is prohibitively expensive.

In the US, yes - and you only rarely get your expenses reimbursed by the
other side if they lose. Elsewhere, things are much more reasonable.

Jan

Eric P.

unread,
Jun 20, 2006, 11:35:01 AM6/20/06
to

I was thinking that for RMWs, AtomicExchange or AtomicAdd where the
write operation is unconditional, it should be possible to set up a
hardware "queue" - a chain of processors interested in the cache
line and essentially minimize bus contention traffic.

Software could replace the bus bashing CAS spinlocks, where everyone
fights for ownership of a cache line, with much more bus and fairness
friendly FIFO spinlocks (e.g. LH Spinlock) that use AtomicExchange,
at least for high contention situations.

A processor executing an AtomicExchange performs ReadExclusive op
to acquire the cache line, invalidating all shared copies and getting
exclusive ownership.

The directory controller tracks which processor cache owns which
line. A cpu can only own 1 at a time. If another processor requests
the same line, the controller sends a message to the current owner
telling it a forwarding destination when it finishes exclusive access.
A forwarding request can arrive even before a previous read request
is satisfied, and so an ownership chain can be built up if there is
contention.

When the current owner finishes its RMW transaction it checks for
a forwarding request and sends the data and line ownership directly
on to a new owner, and immediately invalidates its copy.

The bus traffic would be only a ReadExclusive message to the
controller, a ForwardWhenDone message from controller to current
owner, and a reply message with the data to hand off to the new owner.

It looks to me that the bus traffic would be constant no matter how
many processors there were. No ping-ponging or other thrashing about.
However there could be latency issues if your processor was at the
end of a long chain of requesters.

Maybe this is how it already works, or maybe this is more trouble
than it is worth.

Eric

me

unread,
Jun 21, 2006, 9:56:27 AM6/21/06
to

Therefore, Ethernet is no longer ethernet as per Bob Metcalfe's definition
of ethernet; it's 802.11x WLAN that's now ethernet.

Bernd Paysan

unread,
Jun 21, 2006, 10:45:40 AM6/21/06
to
me wrote:

>> Ethernet today is usually implemented with full duplex point to point
>> links. CSMA/CD doesn't really enter the picture at all.
>
> Therefore, Ethernet is no longer ethernet as per Bob Metcalfe's definition
> of ethernet; it's 802.11x WLAN that's now ethernet.

Yes, and the name now makes much more sense ;-).

However, there's already some effort under the way to make WLAN a full
duplex point to point link, as well (MIMO antennas and such).

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/

Terje Mathisen

unread,
Jun 21, 2006, 1:19:30 PM6/21/06
to
me wrote:

> Steinar Haug wrote:
>> Ethernet today is usually implemented with full duplex point to point
>> links. CSMA/CD doesn't really enter the picture at all.
>
> Therefore, Ethernet is no longer ethernet as per Bob Metcalfe's definition
> of ethernet; it's 802.11x WLAN that's now ethernet.
>
Afair, with the caveat that WLANs really use more of a token passing
approach, where each packet transmission is preceded by a step to
acquire the sending token.

This is afaik the reason why you cannot get nearly as close to
theoretical throughput on WLANs as on regular Ethernet, coax or twisted
pair.

Alex Colvin

unread,
Jun 21, 2006, 5:20:20 PM6/21/06
to
>> For example, if you use Ethernet, you should be comfortable using LL/SC.
>> If that's not good enough, maybe you should be using 802.4 token bus.

Somewhat of an overstatement, but many people were opposed to ethernet
because it's nondeterministic and can't guarantee forward progress.



>I presume you're talking about the original CSMA/CD implementation.
>Ethernet today is usually implemented with full duplex point to point
>links. CSMA/CD doesn't really enter the picture at all.

There's still a similar effect when paths join at a hub. E.g., I can send
1Gbps to someone, and you can send 1Gbps to the same, but we can't both do
so if they've got a 1Gbps connection. When flows join at a switch, packets
can get dropped.

--
mac the naïf

Chris Thomasson

unread,
Jun 22, 2006, 7:16:23 PM6/22/06
to
> This is a red herring: An application suffering from false sharing is
> simply broken, deficits of the C standard (hello, Nick!) notwithstanding.
> AIUI, lack of forward progress under contention is the real (only?) issue.

Yeah... This has me thinking of how HTM is going to avoid "live-lock"... It
would have to have a backoff-and-retry algorithm, or some other form of
contention management. Humm, I also wonder how HTM will handle "large"
linked-data structures... What if commit calls end up having to check/verify
"hundreds" of reads "per-transaction"? What kind of impact will that have for
"reader threads"...


Elcaro Nosille

unread,
Jun 24, 2006, 11:27:47 PM6/24/06
to
> Therefore, Ethernet is no longer ethernet as per Bob Metcalfe's
> definition of ethernet; it's 802.11x WLAN that's now ethernet.


The 802.11x committee is currently developing "switched air" under
the name "switched ether". This enables the use of a WLAN switch. *g*

Chris Thomasson

unread,
Jun 26, 2006, 4:36:37 PM6/26/06
to
"Jan Vorbrüggen" <jvorbr...@not-mediasec.de> wrote in message
news:4fnn6tF...@individual.net...

>> Defending yourself in a patent lawsuit is prohibitively expensive.
>
> In the US, yes - and you only rarely get your expenses reimbursed by the
> other side if they loose. Elsewhere, things are much more reasonable.

Yup. In fact, this was one of the "main" reasons I patented my vZOOM proxy
collector algorithm; basically, just so I could, "actually", use the damn
thing! Commercial, private, whatever, I wanted to be able to "use something
that is my own", and has "at least some legal protection"; wrt. """USA""" of
course... I made sure to make my claims broad, and only narrowed them down
enough to get some minor details across... ;)


My personal lawyer has "advised me" to attempt to get international patent
protection, however, I think that covering the "States" should be "good
enough"... I mean, who ever heard of software and/or "technology" patents
on the other side of the pond anyway?

Just kidding!

:)


Chris Thomasson

unread,
Aug 18, 2006, 12:39:22 AM8/18/06
to
"Joe Seigh" <jsei...@xemaps.com> wrote in message
news:W7adnZTvqanypxLZ...@comcast.com...
> Chris Thomasson wrote:
>> "Joe Seigh" <jsei...@xemaps.com> wrote
>>>What's with the Sparc architects?
>>
>>
>> IMHO, I believe the major reason 32-bit architectures support DWCAS was
>> to be "64-bit ready"; DWCAS gets converted to normal CAS on 64-bit
>> systems. So, if your 32-bit applications used DWCAS, then they can be run
>> 64-bit arch without changes... Of course there is a caveat...
>>
>> A 32-bit application cannot use DWCAS to manipulate a pointer along with
>> another contiguous variable...
>>
>> struct dosent_work_on_64_bit_s {
>> void *a;
>> whatever b;
>> };
>>
>> IMO, the hardware architects' did NOT have lock-free algorithms in mind
>> when they decided to put DWCAS in their 32-bit instruction sets. The fact
>> that 128-bit CAS is not reliably supported seems to support my opinion...
>>
> IBM S370 had double wide compare and swap long before 64 bit support
> became an issue.

Good point. Do you happen to know if they implemented it so they could
successfully pop from a lock-free stack? I wonder....


Humm... Microsoft is using cmpxchg8b for the IBM FreeList algorithm with a
minor tweak... I wonder why cmpxchg16b was added as a processor feature for
x86 64... AFAIK, 64-bit x86 does not reliably support cmpxchg16b.


Also, what the heck was that cmp8xchg16 thing for Itanium all about? Humm...

;)


IIRC, Microsoft has to implement some special modifications to their
implementation in order to get their tweaked version of the IBM FreeList to
run on AMD 64... IIRC, it involves pointer compression to get a 64-bit
pointer and a monotonic version counter (for aba) to fit into a single
CAS... Kind of funny......
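A sketch of that packing trick: current x86-64 implementations use at most 48 meaningful user-space address bits (an implementation detail, not an architectural guarantee), leaving 16 bits for a version counter in a single CAS-able 64-bit word. The function names are illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Pack a 48-bit user pointer and a 16-bit ABA version counter into
 * one 64-bit word, so a plain 64-bit CAS can update both at once.
 * Assumes low-half canonical (user-space) addresses only. */
static uint64_t pack(void *p, uint16_t ver)
{
    uint64_t u = (uintptr_t)p;
    assert((u >> 48) == 0);                  /* must fit in 48 bits */
    return ((uint64_t)ver << 48) | u;
}

static void *unpack_ptr(uint64_t w)
{
    return (void *)(uintptr_t)(w & ((1ull << 48) - 1));
}

static uint16_t unpack_ver(uint64_t w)
{
    return (uint16_t)(w >> 48);
}
```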


>>>There's a bit of a disconnect here I think.
>>
>>
>> Perhaps they are designing the lock-free algorithms for a upcoming
>> processor design. Maybe one that has hardware transactional memory...
>> That could be the reason they designed KCSS:
>
> They've done a bit on STM (software transactional memory). If they did
> come out with hardware based transactional memory it would be after the
> fact of 64 bit sparc and wouldn't be generally available.

I wonder why they are messing around with KCSS software emulations that run
very poorly on SPARC64. Actually, they "really" don't need STM or HTM to
implement robust lock-free algorithms on 64-bit architectures... Not at
all... Seems that they haven't discovered all of the tricks yet, eh? Perhaps...


> So it would
> have to be hidden behind some system or library api with alternate
> implementations on non TM platforms. That would limit its applicability.
> For example, you couldn't have atomically thread-safe refcounted smart
> pointers. Well, not without RCU and/or SMR. If they go with STM to
> compliment hw TM, it's only going to work at a lower contention level
> if the STM algorihtm is only obstruction-free.

I agree. I also have some problems with obstruction-free algorithms.
Particularly, issues that deal with forward progress.

http://groups.google.com/group/comp.arch/msg/9b00fda2752966f9

http://groups.google.com/group/comp.arch/msg/335aeb22fd6fe526

I don't really trust obstruction-free algorithms. They don't seem to like to
deal with a lot of pressure (load)... The forward progress seems to fold
under the contention manager... Once you're relying on an explicit contention
manager to give you forward progress, you know you have some problems
somewhere...


> It's a big problem. You can design efficient synchronization mechanisms
> that are portable if you ignore Sparc. If you want portability to
> Sparc you have to sacrifice performance or functionality.

Well, after looking at their implementation of a LL/SC emulation... And
reading about how one of their lock-free reference counting has to depend on
it... Makes we want to totally agree with you...

:)


Humm. However, I think I have to disagree here Joe... Well, FWIW, I have my
entire vZOOM library running on 64-bit SPARC. It performs equal to or
sometimes better than its 32-bit counterpart. However, unfortunately, you
have got to be very careful when you are designing lock-free algorithms for
SPARC64; CAS-based algorithms that use pointers and a contiguous word will
break...


> If Sun has a 64 bit JVM implementation, I wonder what they do for
> the AtomicMarkedReference and AtomicStampedReference implementations
> for non Sparc platforms. Cripple them so they perform as poorly as
> the Sparc implementations?

Ditto for AMD64... Yikes!

;)

Humm... It is their fault if they rely on a lock-free algorithm that is not
portable to a 64-bit system. Luckily, for me, converting vZOOM from SPARC 32
to SPARC 64 involved changing ld/st instructions to ldx/stx, and cas to
casx. However, there was one thing that pissed me off a bit. I had to give
up my use of the swap instruction. I can't seem to find reference to a swapx
type instruction. I had to hand-craft the 64-bit version of swap with casx,
damn! That turns a call to a simple instruction, swap, into a casx-based
loop.

:O


Anne & Lynn Wheeler

unread,
Aug 18, 2006, 8:05:05 AM8/18/06
to
"Chris Thomasson" <cri...@comcast.net> writes:
> Good point. Do you happen to know if they implemented it so they could
> successfully pop from a lock-free stack? I wonder....

charlie had invented compare&swap as part of his work on fine-grain
locking (leading to some number of lock free operations) for cp67
(360/67) multiprocessor support at the science center
http://www.garlic.com/~lynn/subtopic.html#545tech

the trick then was to come up with a mnemonic that matched Charlie's
initials, CAS.

the attempt was then made to get the instruction into the up and
coming 370 architecture. working with ron smith in the pok 370
architecture group (they owned the 370 architecture "red book"), the
response was that 370 didn't need another multiprocessor specific
instruction, that the test and set from 360 was sufficient.

to get compare and swap into the 370 architecture we had to come up
with usage for compare&swap that wasn't multiprocessor specific.
thus was born some number of examples that were applicable to
multi-threaded applications that might be running enabled for
interrupts ... independent of whether the machine was single processor
or multiprocessor.

originally in the 370 principles of operation, the examples were part
of the programming notes that were part of the compare&swap
instruction. in subsequent versions of the principles of operation the
examples were moved to a section in the appendix.

also as part of this activity, compare&swap double instruction was
added in addition to compare&swap. that resulted in two instructions
for 370, compare&swap along with compare&swap double ... so the
instruction mnemonics become CS and CDS (instead of CAS ... defeating
the original objective of coming up with instruction name
compare&swap).

total topic drift ... science center was like that ... GML
http://www.garlic.com/~lynn/subtopic.html#sgml

precursor to SGML, HTML, XML, etc ... also invented at the science
center, actually are the first initials of the last names of the three
inventors (and you probably thought it stood for generalized markup
language).

misc. past posts on multiprocessor support and/or compare&swap
instruction
http://www.garlic.com/~lynn/subtopic.html#smp

esa/390 principles of operation appendix for multiprogramming
(i.e. multithread) and multiprocessing
http://publibz.boulder.ibm.com/cgi-bin/bookmgr_OS390/BOOKS/DZ9AR004/A.6?SHELF=EZ2HW125&DT=19970613131822

cs & cds appendix:
http://publibz.boulder.ibm.com/cgi-bin/bookmgr_OS390/BOOKS/DZ9AR004/A.6.2?SHELF=EZ2HW125&DT=19970613131822

bypassing post and wait
http://publibz.boulder.ibm.com/cgi-bin/bookmgr_OS390/BOOKS/DZ9AR004/A.6.3?SHELF=EZ2HW125&DT=19970613131822

lock/unlock:
http://publibz.boulder.ibm.com/cgi-bin/bookmgr_OS390/BOOKS/DZ9AR004/A.6.4?SHELF=EZ2HW125&DT=19970613131822

free pool manipulation
http://publibz.boulder.ibm.com/cgi-bin/bookmgr_OS390/BOOKS/DZ9AR004/A.6.5?SHELF=EZ2HW125&DT=19970613131822

and more recent z/architecture (64-bit) principles of operation
multiprogramming and multiprocessing examples appendix
http://publibz.boulder.ibm.com/cgi-bin/bookmgr_OS390/BOOKS/DZ9ZR003/A.6?SHELF=DZ9ZBK03&DT=20040504121320

note that the above also includes discussion of the newer PLO ... perform
locked operation instruction

Joe Seigh

unread,
Aug 18, 2006, 5:54:32 PM8/18/06
to
Chris Thomasson wrote:
> "Joe Seigh" <jsei...@xemaps.com> wrote in message
> news:W7adnZTvqanypxLZ...@comcast.com...

>>It's a big problem. You can design efficient synchronization mechanisms
>>that are portable if you ignore Sparc. If you want portability to
>>Sparc you have to sacrifice performance or functionality.
>
>
> Well, after looking at their implementation of a LL/SC emulation... And
> reading about how one of their lock-free reference counting has to depend on
> it... Makes we want to totally agree with you...
>
> :)
>
>
> Humm. However, I think I have to disagree here, Joe... Well, FWIW, I have my
> entire vZOOM library running on 64-bit SPARC. It performs equal to or
> sometimes better than its 32-bit counterpart. However, unfortunately, you
> have got to be very careful when you are designing lock-free algorithms for
> SPARC 64; CAS-based algorithms that use pointers and a contiguous word will
> break...
>
>

I suppose you can use one of obstruction-free emulations of compare
and swap here. That way you reward the architectures with efficient
synchronization primitives and penalize the ones without.

>
>
> Humm... It is their fault if they rely on a lock-free algorithm that is not
> portable to a 64-bit system. Luckily, for me, converting vZOOM from SPARC 32
> to SPARC 64 involved changing ld/sd instructions to ldx/sdx, and cas to
> casx. However, there was one thing that pissed me off a bit. I had to give
> up my use of the swap instruction. I can't seem to find a reference to a swapx
> type instruction. I had to hand-craft the 64-bit version of swap with casx,
> damn! That turns a call to a simple instruction, swap, into a casx-based
> loop.
>

According to the latest architecture manual, swap is obsolete and cas should
be used instead.

Chris Thomasson

unread,
Aug 18, 2006, 6:52:52 PM8/18/06
to
"Joe Seigh" <jsei...@xemaps.com> wrote in message
news:SKSdnX7pBt9rq3vZ...@comcast.com...

> Chris Thomasson wrote:
>> "Joe Seigh" <jsei...@xemaps.com> wrote in message
>> news:W7adnZTvqanypxLZ...@comcast.com...
>
[...]

>>
> I suppose you can use one of obstruction-free emulations of compare
> and swap here. That way you reward the architectures with efficient
> synchronization primatives and penalize the ones without.

Yikes! I can't imagine actually using one of those CAS emulation pieces of
sh^t that are out there... MCAS is one, KCSS is another...

;)


>> Humm... It is their fault if they rely on a lock-free algorithm that is
>> not portable to a 64-bit system. Luckily, for me, converting vZOOM from
>> SPARC 32 to SPARC 64 involved changing ld/sd instructions to ldx/sdx, and
>> cas to

Ooops! I meant ld/st instructions to ldx/stx...


>> casx. However, there was one thing that pissed me off a bit. I had to
>> give up my use of the swap instruction. I can't seem to find a reference to
>> a swapx type instruction. I had to hand-craft the 64-bit version of swap
>> with casx, damn! That turns a call to a simple instruction, swap, into a
>> casx-based loop.
>>
>
> According to the latest architecture manual, swap is obsolete and cas
> should
> be used instead.

Hey, I liked swap! cas is no substitute for a swap! Wow, this burns me up...

I have to convert this simple piece of 32-bit code:

! void* SWAP_ACQ(void* volatile*, void const*)
swap [%o0], %o1
membar #StoreLoad
retl
mov %o1, %o0 ! branch delay

To this 64-bit CASX loop:

! void* SWAP_ACQ(void* volatile*, void const*)
ldx [%o0], %o2
1: ! Retry
mov %o1, %o3
casx [%o0], %o2, %o3
cmp %o2, %o3
    bne,a,pn %xcc, 1b ! predict not taken; %xcc for 64-bit compare
mov %o3, %o2 ! branch delay
mov %o2, %o0
retl
membar #StoreLoad ! branch delay


BTW, IIRC, I think I read somewhere that a couple of SPARC models
(Spitfire?) had problems with putting a memory barrier in the branch delay
slot... Have you ever heard of something like this?


jacko

unread,
Aug 19, 2006, 2:37:34 PM8/19/06
to
all user functions of a variable set encapsulated by an indirect jsr

atomic write of jsr location.

each fn checks return stack for call address before exec

if this fn then do it, else reschedule, and reissue indirect jsr

very small window of remodify.

pop if check???

cheers

jacko

unread,
Aug 19, 2006, 4:12:55 PM8/19/06
to

jacko wrote:
> all user functions of a variable set encapsulated by an indirect jsr
>
> atomic write of jsr location.
>
> each fn checks return stack for call address *after* exec

after exec !!
>
> if this fn then done, else reschedule, and reissue indirect jsr


>
> small window of remodify.
>
> pop if check???
>
> cheers

although if this were written in forth, swapping the stack for another
clear stack would be a primary operation. using R> and >R for passing
params.

then just booking in and out stacks would suffice, but this would be
blocking!!

or would it when indirect sp@ (atomic) stored?? probably

globals BAD...

the merge should be done by one task, or one task per memory segment,
and if multiple inputs to a merge are required, then a single linked
list process queue to add one at a time should be an interface.

this would also be locking.

how about expiring timer locks to prevent full deadlock fail??

cheers
