
race on multi-processor solaris


Hong Chen
Nov 3, 2003, 10:07:26 AM

I ran into a race condition when I tried to run my multithreaded
program on Solaris. My main thread had a lower priority than the other
threads I created, so it should have been the last one to run. But it
looked like the main thread got running on a different processor and
created a race condition. I am wondering if there are some function
calls that I can use to assign all the threads to a specific processor,
or some locking mechanism I can use to fix this race?

Thanks for your help!

Hong Chen

Gavin Maltby
Nov 3, 2003, 10:27:32 AM

I don't think you can rely on thread priorities to perform your
locking for you, even if you bind all threads to a single
processor.

The best/usual approach is to have locks that protect your data
(eg, mutexes embedded in your structures). Second choice would
be to have global mutexes that protect critical code sections
(though that single-threads those portions of code, regardless of
whether they're working on the same object).

If it's just a case of having some/all threads wait until
conditions are right for them to run, then the common
way to do that is to have the spawning thread hold a mutex
during thread creation and, once all threads are created,
it does a condition variable broadcast. All the threads
wait for the condition to be broadcast. An alternative
is to have some sort of mailbox protocol - before starting
a thread, set its mailbox to "not yet", causing it to
spin (chews cpu!) until it sees the main thread set the
mailbox to "now do it".

The best solution will depend on the nature of your application
and the sort of race you're seeing. Tell us more.

Gavin

Hong Chen
Nov 3, 2003, 1:38:43 PM

Gavin Maltby <G_a_v_i_n....@sun.com> wrote in message news:<bo5s78$nn2$1...@new-usenet.uk.sun.com>...


Thanks for your reply. My system needs to rely on priority
scheduling. I am looking into some ways to make sure that we can have
it.

Hong

Casper H.S. Dik
Nov 3, 2003, 3:28:26 PM

hong...@usa.xerox.com (Hong Chen) writes:

>Thanks for your reply. My system needs to rely on priority
>scheduling. I am looking into some ways to make sure that we can have
>it.


Why? Locks are good (and fast); relying on priority will never work;
cache misses, page faults, and such in the high priority thread may make
it run slower than the low priority threads.

Casper
--
Expressed in this posting are my opinions. They are in no way related
to opinions held by my employer, Sun Microsystems.
Statements on Sun products included here are not gospel and may
be fiction rather than truth.

David Schwartz
Nov 5, 2003, 10:51:09 PM

"Hong Chen" <hong...@usa.xerox.com> wrote in message
news:795f8372.03110...@posting.google.com...

> Thanks for your reply. My system needs to rely on priority
> scheduling. I am looking into some ways to make sure that we can have
> it.

That won't work. Scheduling *CANNOT* prevent race conditions. What
happens if your high-priority thread hits a page fault and your lower-priority
thread runs? It is a very fundamental thing that scheduling is about
resource allocation, not exclusion. You must 100% prevent a race condition,
not predispose the system not to create one.
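
For concreteness, a minimal pthreads sketch (names are illustrative) of
what exclusion, as opposed to scheduling, means here: the mutex makes the
read-modify-write indivisible no matter which thread runs, at what
priority, or what page faults occur.

    #include <pthread.h>

    static long counter = 0;
    static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

    void *increment(void *arg)
    {
        int i;
        for (i = 0; i < 1000000; i++) {
            pthread_mutex_lock(&counter_lock);
            counter++;   /* the read-modify-write is now indivisible */
            pthread_mutex_unlock(&counter_lock);
        }
        return NULL;
    }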

DS


Anuj Goyal
Nov 12, 2003, 1:29:30 AM

Locks are not good and fast when you are running in TSO mode, right? :)

Casper, can you help me with an RMO mode question?

Casper H.S. Dik <Caspe...@Sun.COM> wrote in message news:<3fa6ba6a$0$58704$e4fe...@news.xs4all.nl>...

Gavin Maltby
Nov 12, 2003, 6:40:56 AM

Anuj Goyal wrote:
> Locks are not good and fast when you are running in TSO mode, right? :)

Well, there are about 2 zillion locks in the Solaris kernel, which runs
in TSO and scales to hundreds and thousands of processors. Even full
featured locking can be lightweight (at the very least for the common
case). If you want low featured locking (eg, no priority inheritance)
then it can be lighter still, at the expense of functionality.
Typically locking is good and fast provided you code with some
sympathy towards how scheduling and hardware work. So don't
create bottlenecks (eg, locking code sections with a global
mutex rather than locking the data being operated on). On
large multiprocessors, think of cacheline ownership and remember that
several locks can fit in a cacheline (ie, arrays of locks
may not always be as effective as you think).
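
To illustrate the cacheline point, a hedged sketch (the 64-byte line
size is an assumption; query the target platform, e.g. via sysconf, in
real code): padding each lock out to a full line keeps an array of
locks from bouncing a single cacheline between processors.

    #include <pthread.h>

    #define CACHELINE 64    /* assumed line size */

    struct padded_lock {
        pthread_mutex_t lock;
        /* Pad so that adjacent locks never share a cache line. */
        char pad[CACHELINE - (sizeof(pthread_mutex_t) % CACHELINE)];
    };

    /* Per-bucket locks that won't false-share; initialize each
       one with pthread_mutex_init() before use. */
    static struct padded_lock bucket_locks[16];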

You have little choice - as mentioned, you cannot rely on priority
to protect data. You can use the locking primitives supplied in
libthread etc, which are reasonably full featured and well debugged/
matured. You can roll your own, but then you lose those benefits
and stand a good chance of bumping into all sorts of issues that
the provided stuff takes care of.

I'd suggest that if you give a reasonable description of the application
and why you believe it needs to rely on priority scheduling, then
people can suggest alternatives. Do you need to protect data or
code sections, or just communicate between threads/processes, etc.?

Cheers

Gavin

Anuj Goyal
Nov 13, 2003, 12:25:47 AM

How can I run an executable in RMO mode? Is it a linker option that I
use during compilation? I have read this document: Part No:
816-1386-10

Linker and Libraries Guide (Solaris 9)

Gavin Maltby <g_a_v_i_n.m_a_l_t_b_y@s_u_n.com> wrote in message news:<bot689$1t$1...@new-usenet.uk.sun.com>...

SenderX
Nov 13, 2003, 6:37:15 AM

> Why? Locks are good (and fast);

:O

Tell that to an SMP or HyperThreaded box! Locks are NOT good.

A lot of research has been directly aimed at removing the need for locks
altogether in certain collections and algos.

Use ONLY when NEEDED...

;)


For instance: do NOT use a lock to protect queues, stacks && linked-lists!

Those collections do not need a single lock to protect them. Even if a
thousand threads are accessing, no locks needed.

To use a lock for such collections would introduce unnecessary overhead, and
bottlenecks...


David Butenhof
Nov 13, 2003, 8:39:06 AM

SenderX wrote:

>> Why? Locks are good (and fast);
>
> :O
>
> Tell that to an SMP or HyperThreaded box! Locks are NOT good.

More correctly, CONTENTION is not good. The more your threads contend for
shared resources, the more you'll waste time (and resources) resolving the
contention instead of solving your problem.

While CONTENTION on a lock is bad, so are contended "lock free" queues, or
trying to schedule more "simultaneous CPU units" than are available (e.g.,
3 compute-bound threads on a dual processor system). Bad lock design (like
a busy queue head controlled by an already heavily contended general
purpose mutex in a different cache line) can of course be "even more
worserer"... but that's not the point.

And as for "hyperthreading", even the most basic elements of that concept
aren't portable from one implementation to another. (Well, OK, maybe the
MOST basic element: one CPU maintains multiple stack and instruction
pointers.) Multiple instruction streams within a CPU may be executed in
parallel or essentially co-routined. They share widely divergent kind and
quantity of resources. It's quite simply impossible to even design (much
less optimize) for "hyperthreading" -- only for the particular
implementation on a particular processor generation.



> A lot of research has been directly aimed at removing the need for locks
> altogether in certain collections and algos.
>
> Use ONLY when NEEDED...

Right. Use CONTENDED resources, like shared memory -- including locked and
so called "lock free" data -- sparingly and only when necessary.

This is of course the fundamental dichotomy and contradiction of threaded
programming. If your threads didn't communicate at a fairly fine level of
granularity and with fairly high frequency, there'd be no point in using
threads. On the other hand, your threads (or any two computing entities)
will perform best when they don't communicate, and better when they
communicate less.

This is an engineering design puzzle that has no single universal answer.
You apply knowledge and tools, you evaluate, and you learn.

A good general rule, though certainly not the only rule, is to design and
code simply, and for maintainability; and optimize where actual use shows
that the resulting complication and effort is justifiable.

> For instance: do NOT use a lock to protect queues, stacks && linked-lists!
>
> Those collections do not need a single lock to protect them. Even if a
> thousand threads are accessing, no locks needed.

This is a fundamental misunderstanding of the way multiprocessor hardware
works. While you can use "lock free" to easily avoid the obvious bottleneck
at the level of user code, if you contend for shared resources there WILL
be one or more bottlenecks further down the system stack; either in the OS
software or in the hardware. Sometimes the bottleneck you can see is easier
to deal with than the ones you can't. (And sometimes more importantly, the
one you can see is often more LIKELY to be dealt with rather than ignored.)
Synchronized communication (contention bottlenecks) between processors is
avoided only when there IS no contention; and lock-free data does not
remove contention. Used carefully and properly, it can sometimes REDUCE
worst case and even average contention, but that's far from guaranteed.

> To use a lock for such collections would introduce unnecessary overhead,
> and bottlenecks...

Perhaps. But while you like to imagine this is a universal and absolute
guarantee, that's simply not the case. Instead, avoid contention when you
can: regardless of how subtly that contention may be packaged for you.

If you happen to need a list structure for which your requirements fit
within the constraints of an existing "lock free" package that you trust
sufficiently with your application's life, on all platforms you need to
support, then by all means use it. Using a correct and appropriate "lock
free" alternative to a traditional synchronization pattern will almost
never give WORSE results -- and if it does, you've almost certainly got
other problems that should be addressed. But don't ever forget that you're
still introducing contention and therefore limiting your multiprocessor
scalability over maintaining independent context.

And if you have needs that don't fit within the constraints of an existing
"lock free" package, or if you're not willing to trust (or modify and
maintain) an existing package, then ENCAPSULATE and ISOLATE an
implementation that meets your initial needs for flexibility, reliability,
and performance. If you're not sure, it's perfectly fine to start out with
the simplest (and most portable) solution. If you've properly encapsulated
and isolated that implementation, you can always spend more time optimizing
it later when you have real performance data to support that effort.

My problem with you, SenderX, is not that you're in love with "lock free",
or even that you evangelize it at every opportunity. It's that you
irresponsibly (and so far as I can tell, ignorantly) sell it as a miracle
"snake oil" cure for all problems rather than as one of many tools that
ought to be available to a threaded programmer.

Threads are NOT the solution to all problems, and neither are "lock free"
patterns. Nor are either always the BEST solution even when they can be
made to fit. And in fact they're not always the best solution even when
they'd result in the FASTEST solution, because engineers are often faced
with other concerns that are more important. Like portability and
maintainability, to name only two.

What people need is information, not marketing hype; and you're more often a
distraction than a help.

--
/--------------------[ David.B...@hp.com ]--------------------\
| Hewlett-Packard Company Tru64 UNIX & VMS Thread Architect |
| My book: http://www.awl.com/cseng/titles/0-201-63392-2/ |
\----[ http://homepage.mac.com/dbutenhof/Threads/Threads.html ]---/


Gavin Maltby
Nov 13, 2003, 8:41:03 AM

SenderX wrote:
>>Why? Locks are good (and fast);
>
>
> :O
>
> Tell that to an SMP or HyperThreaded box! Locks are NOT good.

I can't agree with such a broad statement. Locking for the
sake of locking is not good. Locking where you can simply
design an alternative is not good etc. But locking can often
be elegant, and with a little thought it need not be a bottleneck
in terms of either scheduling or bus bandwidth.

> A lot of research has been directly aimed at removing the need for locks
> altogether in certain collections and algos.

And in many other places locking is still the correct thing to do.

> Use ONLY when NEEDED...

Of course.

> ;)
>
>
> For instance: do NOT use a lock to protect queues, stacks && linked-lists!
>
> Those collections do not need a single lock to protect them. Even if a
> thousand threads are accessing, no locks needed.
>
> To use a lock for such collections would introduce unnecessary overhead, and
> bottlenecks...

Yes you can manage such things without locks, but locks are typically a much
simpler way to go. In particular cases you may require a lockless design,
but I'd suggest those cases are by far the minority.

Gavin

Rich Teer
Nov 13, 2003, 12:47:03 PM

On Thu, 13 Nov 2003, SenderX wrote:

Further to David's and Gavin's excellent replies:

> For instance: do NOT use a lock to protect queues, stacks && linked-lists!

Consider the scenario where you have two threads on an SMP machine,
one of which is trying to delete an item from a linked list, and
the other of which is trying to return the value of that same item.

What happens if, halfway through the item-retrieving code, the
removal thread runs and removes the item? When the retrieval
code runs again, the best you can hope for is a SIGSEGV so you
know that something has gone wrong. The worst thing that could
happen would be silent data corruption.

A lock on the item would prevent this from happening. (Note that I'm
not advocating a single lock on the whole linked list; I'm advocating a
lock for each item in the list.)
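
A sketch of that per-item locking (names are illustrative): each node
embeds its own mutex, so a reader and the deleter serialize on the item
rather than on the whole list.

    #include <pthread.h>

    struct node {
        pthread_mutex_t lock;   /* protects this item's value and links */
        int value;
        struct node *next;
    };

    /* Read an item's value without racing a concurrent modification. */
    int node_get_value(struct node *n)
    {
        pthread_mutex_lock(&n->lock);
        int v = n->value;
        pthread_mutex_unlock(&n->lock);
        return v;
    }

As Joe Seigh points out later in the thread, the catch is lifetime: the
lock lives inside memory that deletion may free, so this still needs
something (hand-over-hand locking, or deferred reclamation) to keep a
node alive while its lock is held.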

--
Rich Teer, SCNA, SCSA

President,
Rite Online Inc.

Voice: +1 (250) 979-1638
URL: http://www.rite-online.net

Dragan Cvetkovic
Nov 13, 2003, 1:17:48 PM

Rich Teer <rich...@rite-group.com> writes:

> On Thu, 13 Nov 2003, SenderX wrote:
>
> Further to David's and Gavin's excellent replies:
>
>> For instance: do NOT use a lock to protect queues, stacks && linked-lists!
>
> Consider the scenario where you have two threads on an SMP machine,
> one of which is trying to delete an item from a linked list, and
> the other of which is trying to return the value of that same item.

[snip]

> A lock on the item would prevent this from happening. (Note that I'm
> not advocating a single lock on the whole linked list; I'm advocating a
> lock for each item in the list.)
>

Rich, I am sure that SenderX was referring to different, more
sophisticated algorithms, like the ones mentioned in, e.g.,
http://groups.google.ca/groups?selm=3BC69872.833A237F%40informatik.uni-halle.de

However, I have no idea what they do and I am sure that, as David pointed
out, someone somewhere must hold a lock for the think to work. E.g., from
the user point of view, you don't need a lock to append to a file stream
opened in "a" mode, but we know better. :-)

Bye, Dragan

--
Dragan Cvetkovic,

To be or not to be is true. G. Boole No it isn't. L. E. J. Brouwer

!!! Sender/From address is bogus. Use reply-to one !!!

Gavin Maltby
Nov 13, 2003, 5:12:10 PM

Dragan Cvetkovic wrote:
> Rich Teer <rich...@rite-group.com> writes:
[cut]

>>A lock on the item would prevent this from happening. (Note that I'm
>>not advocating a single lock on the whole linked list; I'm advocating a
>>lock for each item in the list.)
>>
>
>
> Rich, I am sure that SenderX was refering to a different, more
> sophisticated algorithms, like ones e.g mentioned in
> http://groups.google.ca/groups?selm=3BC69872.833A237F%40informatik.uni-halle.de

That talks of lockless linked lists via compare-and-swap instructions.
They're easy to do, and easy to screw up.

But why would you want to roll your own here? After all, a mutex_enter()
(or equivalent) consists of a single atomic instruction plus a few
"did I win" instructions in the common, uncontended case. So
in terms of speed you don't save the cache ownership activity
associated with atomics. If you don't win the cas/whatever you
have to try again. With full featured locking primitives
you can decide to spin (not using atomics) until you see the
lock clear and then try again, block on the resource etc;
it sounds heavyweight but it's only for the contended case and
you should design to avoid that. If instead you try lockless
with compare and swap you're pretty much limited to the
spin (or give up) option - ie, just a subset of the former.
That's hardly the timesharing way (busy waiting). If you
spin using atomic instructions you'll melt the bus, but
not everyone knows that.
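
A sketch of that spin-then-retry idea in today's C11 notation (which
postdates this thread; type and function names are illustrative): spin
on plain loads while the lock looks held, and attempt the bus-locking
atomic only when it appears free.

    #include <stdatomic.h>

    typedef struct { atomic_int locked; } ttas_lock_t;

    void ttas_lock(ttas_lock_t *l)
    {
        for (;;) {
            /* Attempt the atomic only when the lock looks free. */
            if (!atomic_exchange_explicit(&l->locked, 1,
                                          memory_order_acquire))
                return;
            /* Otherwise spin on plain loads; no interlocked traffic. */
            while (atomic_load_explicit(&l->locked, memory_order_relaxed))
                ;
        }
    }

    void ttas_unlock(ttas_lock_t *l)
    {
        atomic_store_explicit(&l->locked, 0, memory_order_release);
    }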

Certainly the lockless way has its place and its merits, but
it is far from being the best or correct way for the majority
of applications. Its best use in the kernel, for instance,
is not for performance, but for situations where you have to avoid
deadlock.

> However, I have no idea what they do and I am sure that, as David pointed
> out, someone somewhere must hold a lock for the think to work. E.g., from
> the user point of view, you don't need a lock to append to a file stream
> opened in "a" mode, but we know better. :-)

Yes, the kernel makes liberal use of locking to make those sorts of
things work for you. But if your application is manipulating
a private linked list, for example, then the kernel offers no
help there.

Gavin

Joe Seigh
Nov 13, 2003, 5:58:15 PM

Gavin Maltby wrote:


>
> Dragan Cvetkovic wrote:
> >
> > Rich, I am sure that SenderX was refering to a different, more
> > sophisticated algorithms, like ones e.g mentioned in
> > http://groups.google.ca/groups?selm=3BC69872.833A237F%40informatik.uni-halle.de
>
> That talks of lockless linked lists via compare-and-swap instructions.
> They're easy to do, and easy to screw up.

That's easy for you to say. :)

>
> But why would you want to roll your own here? After all, a mutex_enter()
> (or equivalent) consists of a single atomic instruction plus a few
> "did I win" instructions in the common, uncontended case. So
> in terms of speed you don't save the cache ownership activity
> associated with atomics. If you don't win the cas/whatever you
> have to try again. With full featured locking primitives
> you can decide to spin (not using atomics) until you see the
> lock clear and then try again, block on the resource etc;
> it sounds heavyweight but it's only for the contended case and
> you should design to avoid that. If instead you try lockless
> with compare and swap you're pretty much limited to the
> spin (or give up) option - ie, just a subset of the former.
> That's hardly the timesharing way (busy waiting). If you
> spin using atomic instructions you'll melt the bus, but
> not everyone knows that.

Busy waiting or spin waiting is waiting for a specific value
to be set by another processor. Got that? The key points are
that "specific value" and "waiting for" that specific value.
A compare and swap loop in a lock-free algorithm is *NOT*,
I'll repeat that, *NOT* waiting for a specific value to
be set by anybody. It's not waiting, period.

When a compare and swap loop does loop in a lock-free algorithm,
it means some other thread is making forward progress. If it's
looping it's in a race with other processors and while it may
lose a few of the races, it will win soon enough because
the hardware guarantees fairness for compare and swap loops.
(if it doesn't guarantee fairness, the compare and swap
facility is basically and utterly useless and one has to wonder
why they wasted so much effort to put it in).
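
The distinction in code (a C11 sketch with illustrative names): the push
below loops only when its CAS fails, and a failed CAS means another
thread just updated the head, i.e. the system as a whole made progress;
nothing waits for a particular value to appear.

    #include <stdatomic.h>

    struct node { struct node *next; int value; };

    static _Atomic(struct node *) head;

    void push(struct node *n)
    {
        struct node *old = atomic_load(&head);
        do {
            n->next = old;
            /* On failure, old is reloaded with the current head; retry. */
        } while (!atomic_compare_exchange_weak(&head, &old, n));
    }

(Pop is the hard direction: it runs into ABA and memory reclamation,
which is exactly what the garbage collector discussion later in the
thread is about.)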

>
> Certainly the lockless way has its place and its merits, but
> it is far from being the best or correct way for the majority
> of applications. It's best use in the kernel, for instance,
> is not for performance, but for situations where you have to avoid
> deadlock.

There's a number of reasons to use lock-free. That's one of them.
More graceful degradation under contention is another. And performance
is an important point. A lock-free algorithm was key to the Linux
Scalability Effort (LSE) which allowed Linux to scale to and perform
well with more than 4 processors.

Joe Seigh

Rich Teer
Nov 13, 2003, 5:59:54 PM

On Thu, 13 Nov 2003, Gavin Maltby wrote:

> That's hardly the timesharing way (busy waiting). If you
> spin using atomic instructions you'll melt the bus, but
> not everyone knows that.

Heh. Anyone else remember the HCF (Halt and Catch Fire) instruction? :-)

Casper H.S. Dik
Nov 13, 2003, 6:12:57 PM

Joe Seigh <jsei...@xemaps.com> writes:

>When a compare and swap loop does loop in a lock-free algorithm,
>it means some other thread is making forward progress. If it's
>looping it's in a race with other processors and while it may
>lose a few of the races, it will win soon enough because
>the hardware guarantees fairness for compare and swap loops.
>(if it doesn't guarantee fairness, the compare and swap
>facility is basically and utterly useless and one has to wonder
>why they wasted so much effort to put it in).

Compare and swap does not guarantee fairness; it does not become
"utterly useless" as you assert if it isn't fair.

Locking hardly ever guarantees fairness; that includes locks implemented
in hardware. You may think it's fair because there's the one interrupt
that disrupts the one CPU and makes the other CPU win for a change;
but you can't be certain of that. Where does the vendor specify
that compare and swap is "fair"?

>There's a number of reasons to use lock-free. That's one of them.
>More graceful degradation under contention is another. And performance
>is an important point. A lock-free algorithm was key to the Linux
>Scalability Effort (LSE) which allowed Linux to scale to and perform
>well with more than 4 processors.

Considering that others have scaled considerably further than 4 CPUs
without lockless algorithms, I would contend that proves fairly little.

As for "more graceful degradation"; again I'd like to see proof rather
than assertions.

Casper

Gavin Maltby
Nov 13, 2003, 6:58:30 PM

Joe Seigh wrote:
[cut]

>
> Busy waiting or spin waiting is waiting for a specific value
> to be set by another processor. Got that? The key points are
> that "specific value" and "waiting for" that specific value.
> A compare and swap loop in a lock-free algorithm is *NOT*,
> I'll repeat that, *NOT* waiting for a specific value to
> be set by anybody. It's not waiting, period.

Yes, I understand that (but I agree my post didn't make that
clear - not sure the sarcasm was necessary :-)

I was picturing the broader use of cas. Yes you won't suffer
more than a few iterations in cas loops for things like
"attempt to atomically insert at head of list". But there
are other times when you do want to watch for the appearance
of something when an initial cas attempt fails, before
retrying with the atomic.

> When a compare and swap loop does loop in a lock-free algorithm,
> it means some other thread is making forward progress. If it's
> looping it's in a race with other processors and while it may
> lose a few of the races, it will win soon enough because
> the hardware guarantees fairness for compare and swap loops.
> (if it doesn't guarantee fairness, the compare and swap
> facility is basically and utterly useless and one has to wonder
> why they wasted so much effort to put it in).

It does not guarantee fairness. As I remember, the design calls
for some bound on waiting time, but it doesn't call for
fairness. An atomic operation involves the cache grabbing
exclusive ownership of the cacheline. The current owner
of the cacheline, eg this processor if it just performed
an atomic on it, has an unfair head start in this. If it
sits in a tight loop performing atomic updates it will
hold on to the cacheline for an unfair amount of time.

>
>>Certainly the lockless way has its place and its merits, but
>>it is far from being the best or correct way for the majority
>>of applications. It's best use in the kernel, for instance,
>>is not for performance, but for situations where you have to avoid
>>deadlock.
>
>
> There's a number of reasons to use lock-free. That's one of them.
> More graceful degradation under contention is another. And performance
> is an important point. A lock-free algorithm was key to the Linux
> Scalability Effort (LSE) which allowed Linux to scale to and perform
> well with more than 4 processors.

Fair enough, but no reason to suggest it is the best/default way for
the common application (as SenderX would have us believe, it seems).
I can think of massively scalable locked designs in Solaris
that work exceptionally well and have no need to be designed in a lockless
way. The kernel scheduler, which scales to thousands of cpus and threads,
is essentially a locked design. It's just careful in what it locks
and how. For most applications I'd argue that if degradation
under contention is an issue, or lock performance itself, then
you should consider whether you're locking the right thing
in the right places; ie avoid the contention in the first
place and use locking to guarantee correctness - in the
common uncontended case it's much the same as lockless
activity, and for the rare clash it doesn't cost much.

Gavin

SenderX
Nov 13, 2003, 7:09:07 PM

> What happens if, half way through the item retrieving code, the
> removal thread runs and removes the item?

Simple:

It goes into a lock-free garbage collector; atomic_ptr or my proxy garbage
collector will work just great. This race condition has been researched over
and over again.

It is not a problem at all.


> When the retrieval
> code runs again, the best you can hope for is a SIGSEGV so you
> know that something has gone wrong.

No way! This is 100% impossible.

The garbage collector will ensure that.


> The worst thing that could
> happen would be silent data corruption.

:)

Only if a moron implemented the lock-free linked-list.

P.S.

I can prove all of this to you if you want...

--
The designer of the experimental, SMP and HyperThread friendly, AppCore
library.

http://AppCore.home.comcast.net


SenderX
Nov 13, 2003, 7:12:32 PM

> Considering that others have scaled considerably further than 4 CPUs
> without lockless algorithms, I would contend that proves fairly little.
>
> As for "more graceful degradation"; again I'd like to see proof rather
> than assertions.

A lock-free algo has also helped Java scale a lot more:

http://java.sun.com/j2se/1.4.2/1.4.2_whitepaper.html#8

A lock-free algo will scale far beyond its lock-based counterpart.


Joe Seigh
Nov 13, 2003, 7:14:39 PM

"Casper H.S. Dik" wrote:
>
> Joe Seigh <jsei...@xemaps.com> writes:
> ...


> >the hardware guarantees fairness for compare and swap loops.
> >(if it doesn't guarantee fairness, the compare and swap
> >facility is basically and utterly useless and one has to wonder
> >why they wasted so much effort to put it in).
>
> Compare and swap does not guarantee fairness; it does not become
> "utterly useless" as you assert if it isn't fair.
>
> Locking hardly ever guarantees fairness; that includes locks implemented
> in hardware. You may think it's fair because there's the one interrupt
> that disrupts the one CPU and makes the other CPU win for a change;
> but you can't be certain of that. Where does the vendor specify
> that compare and swap is "fair"?

Well, it was in a super secret registered document that had "candy stripes".
You definitely wouldn't have a copy of that. It basically had a note
telling the hardware designers that compare and swap had to be fair.
Basically there has to be documentation somewhere that states under
what conditions the interlocked instructions are useful.


>
> >There's a number of reasons to use lock-free. That's one of them.
> >More graceful degradation under contention is another. And performance
> >is an important point. A lock-free algorithm was key to the Linux
> >Scalability Effort (LSE) which allowed Linux to scale to and perform
> >well with more than 4 processors.
>
> Considering that others have scaled considerably further than 4 CPUs
> without lockless algorithms, I would contend that proves fairly little.

In SMP? No lock-free algorithms whatsoever?


>
> As for "more graceful degradation"; again I'd like to see proof rather
> than assertions.

Compared to conventional locks, which tank when a queue forms on the lock
because the average service time of the critical section
defined by the lock is increased by thread suspend/resume latency.
Adaptive locks with queue jumping are a little better, but they still
tank nonetheless.

Joe Seigh

SenderX
Nov 13, 2003, 7:16:39 PM

> Fair enough, but no reason to suggest it is the best/default way for
> the common application (as SenderX would have us believe it seems).

I design servers.

But, lock-free algos in a common application would make it scale better than
a lock-based version.


Joe Seigh
Nov 13, 2003, 7:19:16 PM

Rich Teer wrote:
>
> On Thu, 13 Nov 2003, SenderX wrote:
>
> FUrther to David's and Gavin's excellent replies:
>
> > For instance: do NOT use a lock to protect queues, stacks && linked-lists!
>
> Consider the scenario where you have two threads on an SMP machine,
> one of which is trying to delete an item from a linked list, and
> the other that is trying return the value of that same item.
>
> What happens if, half way through the item retrieving code, the
> removal thread runs and removes the item? When the retrieval
> code runs again, the best you can hope for is a SIGSEGV so you
> know that something has gone wrong. The worst thing that could
> happen would be silent data corruption.
>
> A lock on the item would prevent this from happening. (Note that I'm
> not advocating a single lock on the whole linked list; I'm advocating a
> lock for each item in the list.)
>

Where would you keep the locks? Not in memory that can be deallocated
or reused.

Joe Seigh

SenderX
Nov 13, 2003, 7:28:06 PM

> A lock on the item would prevent this from happening. (Note that I'm
> not advocating a single lock on the whole linked list; I'm advocating a
> lock for each item in the list.)

:O I missed that!

What if you have 1,500,000+ items?

Sure you could hash the node into an index of an array of locks, but that
would STILL block some threads. That would STILL suffer from priority
inversion. That would STILL suffer from conveyor-belt syndrome.

The solution is simple: Use a lock-free GC'tor.


Gavin Maltby
Nov 13, 2003, 7:30:46 PM

Joe Seigh wrote:
>
> "Casper H.S. Dik" wrote:
[cut]

>>Considering that others have scaled considerably further than 4 CPUs
>>without lockless algorithms, I would contend that proves fairly little.
>
>
> In SMP? No lock-free algorithms what so ever?
>
[cut]

Yes, in SMP. Certainly with very few lock-free algorithms (and the
few there are aren't core to scalability).

Gavin

Gavin Maltby
Nov 13, 2003, 7:43:42 PM

In an effort to keep an open mind I'll ask: what would be your
lock-free suggestion for traversing a linked list? I have
a particular application in mind but that's unimportant.
I'm searching a linked list from its head looking for
the first match (if any) on particular structure members.
A number of other threads will be inserting/deleting from the
list or making modifications to some of its members (without
unlinking). Deleted members may be instantly reused in other
such linked lists, or they may be freed. There is a need
to minimize memory references.

Gavin

Joe Seigh
Nov 13, 2003, 8:07:09 PM

Gavin Maltby wrote:

> I was picturing the broader use of cas. Yes you won't suffer
> more than a few iterations in cas loops for things like
> "attempt to atomically insert at head of list". But there
> are other times when you do want to watch for the appearance
> of something when an initial cas attempt fails, before
> retrying with the atomic.

Okay, but that's not lock-free, which I thought we were discussing.
Yes, typically for a spin lock it will do what you describe.
Actually, that's probably not the most optimal either, since once
the watched-for value shows up, all waiting processors attempt an
interlocked instruction at once. A bakery algorithm might be
better since the interlocked updates get spread out more and
there's less contention.

...


> > There's a number of reasons to use lock-free. That's one of them.
> > More graceful degradation under contention is another. And performance
> > is an important point. A lock-free algorithm was key to the Linux
> > Scalability Effort (LSE) which allowed Linux to scale to and perform
> > well with more than 4 processors.
>
> Fair enough, but no reason to suggest it is the best/default way for
> the common application (as SenderX would have us believe it seems).
> I can think of massively scaleable locked designed in Solaris
> that work exceptionally well and have no need to design in a lockless
> way. The kernel scheduler, which scales to thousands of cpus and threads
> is essentially a locked design. It's just careful in what it locks

> and how. ...

I'm not sure a cluster can be compared to a shared memory SMP system.
It's more likely you have what is in effect a distributed system.
We're getting into an apples and oranges thing here. We're basically
talking about multi-threaded programming with contention arising *because*
of sharing. Otherwise Beowulf clusters and MPI-based supercomputers
are a fair comparison because they certainly avoid contention on shared
locks.


Joe Seigh

Joe Seigh
Nov 13, 2003, 8:21:24 PM

There are known techniques for this. See "Lock-Free Linked Lists Using
Compare-and-Swap" by Valois. Having lock-free GC helps. See
AWTEventMulticaster in Java (though technically it won't really work
correctly until JSR 133). Ok that doesn't look like a linked list because
it was written apparently by a couple of Lisp programmers. Lock-free atomic
reference counting will work. Also proxy GC techniques such as RCU
or the reference counted proxy GC I posted way back will work as well. Actually
there are tons of references to this as it seems to be the classic problem
addressed by lock-free research.

Joe Seigh

SenderX
Nov 13, 2003, 8:35:39 PM

> In an effort to keep an open mind I'll ask: what would be be your
> lock-free suggestion for traversing a linked list?

Use a lock-free garbage collector.


> I'm searching a linked list from its head looking for
> the first match (if any) on particular structure members.

OK.


> A number of other threads will be inserting/deleting from the
> list

Fine.


> or making modifications to some of its members (without
> unlinking).

Bzzt!

The lock-free linked list protects the nodes themselves, not the memory you
attach to the node:

typedef struct SLfNode_
{
    /* ... */

    /* This is user state; we don't know anything about it.
       Don't touch it at all. */
    const void *pUserState;

} SLfNode;


> Deleted members may be instantly reused in other
> such linked lists, or they may be freed.

Take a look at my 32-bit cpu garbage collector:

http://intel.forums.liveworld.com/thread.jsp?forum=242&thread=6721

I have a 64-bit cpu collector, but I may try to patent it.

This is a really simple experimental example; a real lock-free list would be
spread over multiple buckets.

But it does show how threads can traverse the list while others are pushing
and popping && freeing nodes with delete, all concurrently.


However, this is just a simple stack; if you want linked-list code that
allows for pushes and pops anywhere in the list, you need the following
x86-32 && 64-bit compatible ( /w SMR and other GC's ) IBM algo:

http://www.cs.tau.ac.il/~shanir/reading%20group/p73-michael.pdf

If you have trouble coding this, I can show you my stable C++ implementation
of the posted IBM algo.

This is one of the best IBM FreeList-compatible lock-free linked-list
implementations currently out there.


Joe Seigh has another slick lock-free linked-list that uses his almighty
atomic_ptr. Google for it; it's worth it.

SenderX
Nov 13, 2003, 8:41:47 PM

> or the reference counted proxy GC I posted way back will work as well.

I missed that one! Time to break out Google.

Your algos have enlightened my understanding of lock-free algos, thanks Joe!

;)


P.S.

Is it anything like my proxy collector?


SenderX
Nov 13, 2003, 8:56:49 PM

> Also proxy GC techniques such as RCU
> or the reference counted proxy GC I posted way back will work as well.

Is this the one you're talking about:

http://groups.google.com/groups?selm=3EF64AA8.180483F2%40xemaps.com


Joe Seigh
Nov 13, 2003, 9:51:48 PM

Yes, though people should read the whole thread to get the caveats. It's
a little too non-blocking. GC dtors are recursive and you can blow the
stack if there are too many collector objects. I haven't gotten around to
putting a limiting mechanism in place. And I should repost updated
atomic_ptr code some day.

I think this is the one that you decided to solve with one fixed GC proxy
sort of along the lines of how the K42 project did it, IIRC. You've never hit
any "writer starvation", i.e. there's always readers so the GC object never
frees the collected objects?

These proxy GC algorithms are kind of nice. So I have two algorithms out
there now, one with zero overhead non-blocking read access, and one with
2 interlocked instructions overhead non-blocking access (one before and
one after). Kind of a hard act to follow though unless I can come up
with negative overhead access costs.

Joe Seigh

SenderX
Nov 14, 2003, 2:56:19 AM

> Yes, though people should read the whole thread to get the caveats.

Yeah.


> It's
> a little too non-blocking. GC dtors are recursive and you can blow the
> stack if there are too many collector objects. I haven't gotten around to
> putting a limiting mechanism in place.

C++ will do that to ya. ;)

A limit could be as simple as an atomic inc of a per-gc-object counter on
every collection, and an exchange of that counter to zero when the gc'tor
drops to zero...

very simple pseudo-code at bottom...

I believe that this scheme would "help" a collection's GC'tor handle node
overflow. User apps would have a choice: non-blocking failure or a limited
spin on the current garbage collector count. Simple, but it helps.


> I think this is the one that you decided to solve with one fixed GC proxy

Yes. A single proxy gc for each collection.

A real-world collection would be hash-based, with many lock-free buckets.
Each bucket would have "its own" proxy garbage collector. Iterating threads
would have to run through different buckets in order to traverse the entire
list. The act of iterating a bucket and switching to another will drop the
reference on the previous bucket. The number of lock-free buckets will
increase the scalability of their proxy garbage collectors and the
collection as a whole.


> sort of along the lines of how the K42 project did it, IIRC.

I think they have to use a separate thread, is that right?

If so, our gc'tors are more efficient.


> You've never hit
> any "writer starvation", i.e. there's always readers so the GC object
never
> frees the collected objects?

Not yet. Even under massive concurrent iterations with parallel pushes and
pops.

It's due to making the collection's storage fine-grained, and using a simple
non-blocking per-garbage-collector limit; you can scale really well!


> These proxy GC algorithms are kind of nice.

They're clever.


> So I have two algorithms out
> there now, one with zero overhead non-blocking read access, and one with
> 2 interlocked instructions overhead non-blocking access (one before and
> one after).

Nice!


> Kind of a hard act to follow though unless I can come up
> with negative overhead access costs.

;)


Garbage limit pseudo-code:

// A generic proxy garbage collector
class CProxyGc
{
    // A collector object
    class CCollector
    {
        // ***
    };

public:

    // We default to a non-blocking failure
    bool AddRef( bool bWait = false, int iSpinCount = 1024 )
    {
        // Before we can take a reference,
        // we check the current garbage count (in nodes)
        if ( m_uiCount > 2048 )
        {
            // See if the app doesn't mind a little spin
            if ( bWait )
            {
                while ( m_uiCount > 2048 )
                {
                    Sleep( 0 ); // Yield the current thread

                    if ( ! ( --iSpinCount ) ) { return false; }
                }

                // The counter just got exchanged to zero;
                // ok to addref the collector
            }
            else
            {
                // Let the application do whatever "else" it wants to
                return false;
            }
        }

        atomic_inc_current_collector_refs( ... );

        return true;
    }

    void Collect( CNode *pNode )
    {
        atomic_inc( &m_uiCount );

        // Push the node onto the lock-free collector stack
    }

    void Release()
    {
        CCollector *pGc;

        // Assume the decrement returns the collector only when its
        // reference count drops to zero, and NULL otherwise
        if ( ( pGc = atomic_dec_current_collector_refs( ... ) ) )
        {
            // We dropped to zero; reset the counter
            atomic_exchange( &m_uiCount, 0 );

            pGc->RecycleSomeNodes();

            delete pGc; // Dumps remaining nodes
        }
    }

private:

    IA32CollectorRefs m_Refs; // CPU specific
    // AMD64CollectorRefs m_Refs;
    // IA64CollectorRefs m_Refs;
    // PPC64CollectorRefs m_Refs;

    volatile unsigned int m_uiCount;
};

=)

SenderX
Nov 14, 2003, 3:27:29 AM

> The lock-free linked list protects the nodes themselves, not the memory you
> attach to the node:

Actually, the node could "interact with the object"...

Yes it could...

Humm... A proxy garbage collector would work here.

Never really thought of that before...

The garbage collector communicating with the objects it's collected...

This gives me a great idea for a SUPER scalable lock-free kernel "or"
userland atomic reference counting scheme.


It will be a:

Per-Object-"Group" System-Wide Lock-Free Distrubtred Reference Count
Collection

Not anything like K42 or tornado reference counting...

If this new idea works out, K42 & Tornado will be blown away.

Better than RCU, no waiting for quiescent states and no need for a kernel.

This will be interesting!


Casper H.S. Dik
Nov 14, 2003, 4:21:05 AM

Joe Seigh <jsei...@xemaps.com> writes:

>In SMP? No lock-free algorithms whatsoever?

Well, there's a bit of "atomic reference counting" going on in
Solaris.

You can also have contention free algorithms.

Lockfree algorithms appear to be more complicated; if recent
discussions in the thread group are any indication, people do
forget about certain hardware characteristics at times.
(Some of the lockfree algorithms in Linux needed to be "amended"
for Alpha)

I'm somewhat distrustful of algorithms which claim to solve
all contention problems by replacing a simple to understand 5 line
routine with something that takes a 10 page paper to explain.

Casper

SenderX
Nov 14, 2003, 6:22:26 AM

> I'm somewhat distrustful of algorithms which claim to solve
> all contention problems by replacing a simple to understand 5 line
> routine with something that takes a 10 page paper to explain.

;)

They do not solve contention; they gracefully handle contention by always
making forward progress on every point of contention.

Normal locks just block contending threads ( yuck! ).

You know, no progress can be made when you're blocked... ;)


They also solve priority inversion, livelock, and the performance-destroying
conveyor-belt syndrome.


Gavin Maltby
Nov 14, 2003, 6:24:20 AM

SenderX wrote:
>>In an effort to keep an open mind I'll ask: what would be your
>>lock-free suggestion for traversing a linked list?
>
>
> Use a lock-free garbage collector.

I thought so (but it was late and I had little imagination left).
Not suitable for the level I'm thinking of, I think.

>
>
>
>>I'm searching a linked list from its head looking for
>>the first match (if any) on particular structure members.
>
>
> OK.
>
>
>
>>A number of other threads will be inserting/deleting from the
>>list
>
>
> Fine.
>
>
>
>>or making modifications to some of its members (without
>>unlinking).
>
>
> Bzzt!
>
> The lock-free linked list protects the nodes themselves, not the memory you
> attach to the node:
>
> typedef struct SLfNode_
> {
> /* ... */
>
> /* This is user state, we don't know anything about it.
> don't touch it at all */
> const void *pUserState;
>
> } SLfNode;
>

So what does protect the memory attached to the node? Not
a lock (obviously!). A reference count or serial number -
sounds like a lot of overhead to me.

>
>>Deleted members may be instantly reused in other
>>such linked lists, or they may be freed.
>
>
> Take a look at my 32-bit cpu garbage collector:
>
> http://intel.forums.liveworld.com/thread.jsp?forum=242&thread=6721

Will do.

> I have a 64-bit cpu collector, but I may try to patent it.
>
> This is a really simple experimental example, a real lock-free list would be
> spread over multi-buckets.
>
> But it does show how threads can traverse the list, while others are pushing
> and popping && freeing nodes with delete, all concurrently.
>
>
> However this is just a simple stack, if you want linked-list code that
> allows for pushes and pops anywhere in the list, you need the following
> x86-32 && 64-bit compatible( /w SMR and other GC's ) IBM algo:
>
> http://www.cs.tau.ac.il/~shanir/reading%20group/p73-michael.pdf


Interesting read (ok I bailed at the proof section). At first
glance I don't see why the Search algorithm (Insert/Delete are
easy and obvious) is immune to livelock. Due to list changes
occurring on other cpus there seems to be no guarantee that
the Search algorithm will ever complete since it may always
hit "list changed; try again from head" scenarios. Maybe this
is handled by hash chain ordering (which is not always
desirable). Maybe I read too quickly!

The assumption of sequentially consistent memory model is fine.
It would be nice if they indicated exactly which memory barriers
are needed where for more relaxed memory models (not trivial
to decide whether that's possible).

>
> If you have trouble coding this, I can show you my stable C++ implementation
> of the posted IBM algo.
>
> This is one of the best IBM FreeList compatible lock-free linked-list code
> currently out there.

I can manage - after some memory barrier discussions. From what I've
read this stuff will not outperform the existing implementation of
what I'm thinking of - and it will certainly add much more complexity
and likely subtle bugs. I was thinking of the Solaris virtual->physical
memory mapping hash. It is mostly a locked design now, with very low
contention. It has some lock-free aspects, however.

>
> Joe Seigh has another slick lock-free linked-list that uses his almighty
> atomic_ptr. Google for it, its worth it.

May do. Solaris has a casptr() which various things use to do this.
Mostly it is used to avoid deadlock or in places where you simply cannot
block.

Gavin

Speaking for myself

Joe Seigh
Nov 14, 2003, 7:19:49 AM

"Casper H.S. Dik" wrote:
>
> Joe Seigh <jsei...@xemaps.com> writes:
>
> >In SMP? No lock-free algorithms whatsoever?
>
> Well, there's a bit of "atomic reference counting" going on in
> Solaris.

Really? Atomic as opposed to merely thread-safe? Detlefs seems to
think it's impossible without DCAS. Boost shared_ptr is thread-safe,
not atomic. Java ptrs are atomic. You get valid values no matter how
you use them. In C++ that means no invalid values which could cause
a segfault.

>
> You can also have contention free algorithms.

Distributed algorithms like Peterson's algorithm and such?

>
> Lockfree algorithms appear to be more complicated; if recent
> discussions in the thread group are any indication, people do
> forget about certain hardware characteristics at times.
> (Some of the lockfree algorithms in Linux needed to be "amended"
> for Alpha)

The LL/SC granularity problem?

>
> I'm somewhat distrustful of algorithms which claim to solve
> all contention problems by replacing a simple to understand 5 line
> routine with something that takes a 10 page paper to explain.
>

I wouldn't say solve. More like dealing with higher contention than a mutex
usually can. And I don't use those algorithms unless they work
dramatically better - by an order of magnitude or more. For a 20% improvement,
I wouldn't waste my time with it.

Joe Seigh

Greg Menke
Nov 14, 2003, 8:26:34 AM

"SenderX" <x...@xxx.xxx> writes:

> > In an effort to keep an open mind I'll ask: what would be your
> > lock-free suggestion for traversing a linked list?
>
>
>

> > Deleted members may be instantly reused in other
> > such linked lists, or they may be freed.
>
> Take a look at my 32-bit cpu garbage collector:
>
> http://intel.forums.liveworld.com/thread.jsp?forum=242&thread=6721
>
> I have a 64-bit cpu collector, but I may try to patent it.
>

I think that would be the best way to ensure no one will ever use it.

Gregm

Gavin Maltby
Nov 14, 2003, 11:37:28 AM

Joe Seigh wrote:

>>Well, there's a bit of "atomic reference counting" going on in
>>Solaris.
>
>
> Really? Atomic as opposed to merely thread-safe? Detlefs seems to
> think it's impossible without DCAS. Boost shared_ptr is thread-safe,
> not atomic. Java ptrs are atomic. You get valid values no matter how
> you use them. In C++ that means no invalid values which could cause
> a segfault.

I'm not certain we're singing from the same glossary sheet here.
I'd call an "atomic reference count" a successful inc/dec made to a counter
using CAS, along these lines:

do {
    old = *addr;
    new = old + 1;
} while (cas(addr, old, new) fails);

Nothing to do with threads or thread safety there.
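
The same loop in runnable form, using today's C11 atomics (a sketch;
the thread predates C11):

    #include <stdatomic.h>

    void ref_inc(atomic_uint *addr)
    {
        unsigned int old = atomic_load(addr);
        /* On failure, old is reloaded with the current value; retry. */
        while (!atomic_compare_exchange_weak(addr, &old, old + 1))
            ;
    }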

Gavin

Joe Seigh
Nov 14, 2003, 1:18:46 PM

Atomic means safe with respect to other concurrent operations on the data
object even in the absence of any explicit synchronization. Thread-safe
usually means there are no hidden problems with multiple threads calling
a particular function, like a shared static buffer.

Reference counting has a problem that interlocked increment of the reference
count won't solve. That is a race condition where you fetch a pointer to
a refcounted object, but before you can increment the reference count, the
refcount goes to zero and the object is deleted and the storage possibly
reused. The value of where the refcount used to be is undefined, so you
cannot tell by looking at it that it got reused. Atomically incrementing
an invalid memory location doesn't buy you anything, just a different
instruction to segfault on or a more expensive way to clobber some other
thread's storage.
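
A C sketch of that window (names are illustrative, and the code is
deliberately broken): between loading the pointer and bumping the
count, another thread can drop the last reference and free the object.

    #include <stdatomic.h>

    struct obj {
        atomic_int refs;
        /* ... payload ... */
    };

    /* UNSAFE: the classic refcount-acquire race. */
    struct obj *acquire(struct obj *_Atomic *slot)
    {
        struct obj *p = atomic_load(slot); /* step 1: fetch the pointer    */
                                           /* <-- here another thread can  */
                                           /*     drop the last ref, free p */
        atomic_fetch_add(&p->refs, 1);     /* step 2: increments freed memory */
        return p;
    }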

This is the problem that Detlefs et al. describe in their paper
"Lock-Free Reference Counting", though Detlefs seems to think it can
only be solved with DCAS. Actually you can solve it with a doubleword
CAS or with LL/SC with a minor restriction.

The way Boost gets around this is they require you to "own" the smart pointers
that you access, which guarantees the refcount will not go to zero in the middle
of an access. But the cost is the overhead of explicitly maintaining smart
pointer ownership.

atomic_ptr doesn't have this ownership problem. The address you get from
dereferencing an atomic_ptr will always be valid, guaranteed. So it's
a lot like Java pointers except it uses reference counting instead of Boehm
style GC.

Joe Seigh

Casper H.S. Dik
Nov 16, 2003, 4:37:02 PM

Joe Seigh <jsei...@xemaps.com> writes:

>Reference counting has a problem that interlocked increment of the reference
>count won't solve. That is a race condition where you fetch a pointer to
>a refcounted object, but before you can increment the reference count, the
>refcount goes to zero and the object is deleted and the storage possibly
>reused. The value of where the refcount used to be is undefined, so you
>cannot tell by looking at it that it got reused. Atomically incrementing
>an invalid memory location doesn't buy you anything, just a different
>instruction to segfault on or a more expensive way to clobber some other
>thread's storage.

That's a bug in your application; if you have a reference to something
and something else can drop "the last reference" then obviously
the reference you had wasn't properly accounted for.

I.e., what you're describing is a non-problem when reference counting
is done what I would call properly: increment the reference count
before handing it out, not after.

Casper
--
Expressed in this posting are my opinions. They are in no way related
to opinions held by my employer, Sun Microsystems.
Statements on Sun products included here are not gospel and may
be fiction rather than truth.

Joe Seigh
Nov 16, 2003, 6:10:58 PM

"Casper H.S. Dik" wrote:
>
> That's a bug in your application; if you have a reference to something
> and something else can drop "the last reference" then obviously
> the reference you had wasn't properly accounted for.
>
> I.e., what you're describing is a non-problem when reference counting
> is done what I would call properly: increment the reference count
> before handing it out, not after.

So, exactly how would this proper scheme work?

Joe Seigh

SenderX
Nov 17, 2003, 12:27:02 AM

> I.e., what you're describing is a non-problem when reference counting
> is done what I would call properly: increment the reference count
^^^^^^^^

:/

> before handing it out, not after.

Properly? The method you're thinking of was designed to kludge around the
race condition that Joe described.

Non-Problem? This is a CLASSIC problem, indeed!

Do you even know the MASSIVE "difference" between thread-safe ref-counting
and "atomic" ref-counting?????? Really...

Look at atomic_ptr before you make any more comments.


Casper H.S. Dik
Nov 17, 2003, 5:08:48 AM

Joe Seigh <jsei...@xemaps.com> writes:

A thread which has or uses a reference must always be reflected in
the reference count. Hardly rocket science.

You can only find a reference to an object through some other object;
clearly you must "hold" that object so its reference and the references
to the objects it references cannot go away.

The problem of some other thread hitting 0 while another thread
tries to increment the reference count shouldn't occur unless the thread
which tries to up the reference count has come by the reference count
illegally.

Joe Seigh
Nov 17, 2003, 5:58:18 AM

"Casper H.S. Dik" wrote:
>
> Joe Seigh <jsei...@xemaps.com> writes:
>
> >"Casper H.S. Dik" wrote:
> >>
> >> That's a bug in your application; if you have a reference to something
> >> and something else can drop "the last reference" then obviously
> >> the reference you had wasn't properly accounted for.
> >>
> >> I.e., what you're describing is a non-problem when reference counting
> >> is done what I would call properly: increment the reference count
> >> before handing it out, not after.
>
> >So, exactly how would this proper scheme work?
>
> A thread which has or uses a reference must always be reflected in
> the reference count. Hardly rocket science.
>
> You can only find a reference to an object through some other object;
> clearly you must "hold" that object so its reference and the references
> to the objects it references cannot go away.
>
> The problem of some other thread hitting 0 while another thread
> tries to increment the reference count shouldn't occur unless the thread
> which tries to up the reference count has come by the reference count
> illegally.
>

So that is lock-free reference counting? What exactly did you think the
paper by Detlefs is about then?

Joe Seigh

Casper H.S. Dik
Nov 17, 2003, 7:54:19 AM

Joe Seigh <jsei...@xemaps.com> writes:

>So that is lock-free reference counting? What exactly did you think the
>paper by Detlefs is about then?

I mentioned lock free reference counting; you mentioned the paper;
if the paper didn't cover what I was talking about, then why did you
bring it up?

Joe Seigh
Nov 17, 2003, 11:47:14 AM

"Casper H.S. Dik" wrote:
>
> Joe Seigh <jsei...@xemaps.com> writes:
>
> >So that is lock-free reference counting? What exactly did you think the
> >paper by Detlefs is about then?
>
> I mentioned lock free reference counting; you mentioned the paper;
> if the paper didn't cover what I was talking about, then why did you
> bring it up?
>

I guess I brought it up to illustrate that even if you provide a reference
to help define some terminology being used in a discussion, nobody will
bother to check the reference; they will use whatever meaning they feel like,
without indicating that they are doing so and without providing a definition
of their own.

Joe Seigh

David Schwartz

unread,
Nov 17, 2003, 12:57:03 PM11/17/03
to

"Joe Seigh" <jsei...@xemaps.com> wrote in message
news:3FB8072E...@xemaps.com...

In order to get a reference to an object, you have to find that object
somehow. In order for you to find an object, something must already have a
reference to it. Whatever gives you the reference must do so before it
releases its own reference.
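
A minimal sketch of that discipline (hypothetical names, with C11-style
atomics as a stand-in for whatever atomic ops you have):

#include <stdatomic.h>
#include <stdlib.h>

struct obj {
    atomic_int refs;                /* starts at 1 for the creator */
    /* ... payload ... */
};

/* The caller must already hold a counted reference to o, so the
   count cannot hit 0 while we bump it for the new holder. */
struct obj *obj_acquire(struct obj *o)
{
    atomic_fetch_add(&o->refs, 1);  /* increment BEFORE handing out */
    return o;
}

void obj_release(struct obj *o)
{
    if (atomic_fetch_sub(&o->refs, 1) == 1)
        free(o);                    /* we held the last reference */
}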

DS


Alexander Terekhov

unread,
Nov 17, 2003, 5:04:05 PM11/17/03
to

Joe's atomic pointer simply allows you to copy AND MUTATE shared
smart pointer instances without any "external" synchronization.

#include <cstdio>

joe::atomic_ptr<T> p1; // shared

void run_me_concurrently() {
    p1.reset(new T(/*...*/));
    joe::atomic_ptr<T> p2(p1);
    std::printf("%s\n", p2 == p1 ? "fine" : "nice");
}

"or something like that" (without internal locks).

regards,
alexander.

SenderX

unread,
Nov 17, 2003, 8:33:30 PM11/17/03
to
> The problem of some other thread hitting 0 while another thread
> tries to increment the reference count shouldn't occur unless the thread
> which tries to up the reference count has come by the reference count
> illegally.
^^^^^^^^^^

Yes, this is illegal for thread-safe reference counting.

NOT, I repeat NOT, for "atomic" reference counting.

look at atomic_ptr please!

--
The designer of the experimental, SMP and HyperThread friendly, AppCore
library.

http://AppCore.home.comcast.net


SenderX

unread,
Nov 18, 2003, 11:36:03 AM11/18/03
to
> Joe's atomic pointer simply allows you to copy AND MUTATE shared
> smart pointer instances without any "external" synchronization.

Dave probably has no appreciation for atomic_ptr. It's lock-free, and Dave
says that lock-free algos perform worse than their lock-based counterparts,
most of the time. I have pointed Dave to atomic_ptr in the past, but it's
100% lock-free, so he must not be very interested in it. Even though it is
one of the best GC'tors out there!

;)


Joe Seigh

unread,
Nov 18, 2003, 1:25:25 PM11/18/03
to

I wouldn't worry about it. There's a philosophy that Boost subscribes
to that you don't want built-in synchronization for C++ objects because
the granularity is too small, external locking is needed anyway, and so
you're just adding unnecessary overhead. From that point of view,
Boost shared_ptr with only the reference count being "atomic", rather
than the whole ptr itself, seems reasonable - if you didn't know any
better. But you're precluding a whole range of solutions that aren't
possible with external locking. Which is fine. It just means there
is less competition in thinking of new algorithms in this area. Sun
research is sort of working on lock-free stuff but we don't have to
worry since most of it uses an instruction only available on the MC68020
and MC68030.

And we don't want to give Sun any ideas. They've already been patenting the
obvious: 5,924,098, "Method and apparatus for managing a linked-list data
structure".

Joe Seigh

SenderX

unread,
Nov 18, 2003, 3:13:25 PM11/18/03
to
> From that point of view,
> Boost shared_ptr with only the reference count being "atomic", rather
> than the whole ptr itself, seems reasonable - if you didn't know any
> better.

=)


> But you're precluding a whole range of solutions that aren't
> possible with external locking. Which is fine. It just means there
> is less competition in thinking of new algorithms in this area.

New ideas are key...

I have a couple of real-experimental, special collections...

Think of a collection that uses its buckets for lock-free atomic reference
counting.

It's neat...


> Sun
> research is sort of working on lock-free stuff but we don't have to
> worry since most of it uses an instruction only available on the MC68020
> and MC68030.

Strange...

All they need is a normal CAS/LL-SC for 32-bit or 64-bit cpus and one of
our garbage collectors...

Actually 64-bit compatible proxy collectors are not that hard to code...

;)


> And we don't want to give Sun any ideas. They've already been patenting the
> obvious: 5,924,098, "Method and apparatus for managing a linked-list data
> structure".

I wonder how many of them are thinking about changing your atomic_ptr, or
one of our proxy collectors a little and patenting it?

atomic_ptr with the ABA logic that I added, or one of our proxy collectors,
would be an asset to any OS!?

:P

Joe Seigh

unread,
Nov 19, 2003, 8:15:33 AM11/19/03
to

SenderX wrote:
>
> > From that point of view,
>

> I wonder how many of them are thinking about changing your atomic_ptr, or
> one of our proxy collectors a little and patenting it?
>

They wouldn't have to change anything, and the fact that an algorithm has
been publicly disclosed on usenet wouldn't prevent them from patenting
it and collecting license fees, given the present state of the patent system.

That won't happen with any new algorithms I come up with, since I am no longer
putting them into the "public domain" by posting them to usenet, even the
ones I consider "trivial". Not really anything to do with the above problem.
There are some larger IP issues I have concerns about which I am not going
to go into here.

Joe Seigh

SenderX

unread,
Nov 20, 2003, 9:51:10 AM11/20/03
to
> They wouldn't have to change anything, and the fact that an algorithm has
> been publicly disclosed on usenet wouldn't prevent them from patenting
> it and collecting license fees, given the present state of the patent system.

Yeah...

Especially when the algos that we posted WILL help operating systems
scale up to many processors, and servers scale to extreme load.


> That won't happen with any new algorithms I come up since I am no longer
> putting them into the "public domain" by posting them to usenet, even the
> ones I consider "trivial".

I think I should start to follow that line of thinking...

I was thinking about posting my 64-bit collector to make coding lock-free
algos on any 64-bit cpu really easy.

But... Maybe not!

Even though the collector would be a very important asset to high-user
server programmers that want easy lock-free on 64-bit cpus, all over the
world.

;(


P.S.

Did you make any $$$ with the old patent on atomic_ptr?

;)

Joe Seigh

unread,
Nov 20, 2003, 10:47:42 AM11/20/03
to

SenderX wrote:

> > That won't happen with any new algorithms I come up since I am no longer
> > putting them into the "public domain" by posting them to usenet, even the
> > ones I consider "trivial".
>
> I think I should start to follow that line of thinking...
>
> I was thinking about posting my 64-bit collector to make coding lock-free
> algos on any 64-bit cpu really easy.
>
> But... Maybe not!
>
> Even though the collector would be a very important asset to high-user
> server programmers that want easy lock-free on 64-bit cpus, all over the
> world.
>
> ;(
>

Well, somebody patenting your ideas is a potential problem but not a serious
problem. There are other issues. Basically, there are relatively few advantages
to publishing a new idea and a lot of advantages to keeping it secret. One is,
if you've implemented the idea you can patent the idea later on to keep others
from just copying your idea once you've demonstrated its usefulness. Better yet,
patent the idea, then publish a paper on it without mentioning there's a patent
on it. Collect royalties later. That happens more often than you might suppose.
Just do a USPTO patent search on the authors of some of the research papers
in this area.

> P.S.
>
> Did you make any $$$ with the old patent on atomic_ptr?
>
> ;)

Only what IBM pays for submitting patentable ideas. IBM owned the patent
and you don't get royalties or anything.

Joe Seigh

SenderX

unread,
Nov 21, 2003, 6:58:37 AM11/21/03
to
> Well, somebody patenting your ideas is a potential problem but not a serious
> problem. There are other issues. Basically, there are relatively few
> advantages to publishing a new idea and a lot of advantages to keeping it
> secret. One is, if you've implemented the idea you can patent the idea later
> on to keep others from just copying your idea once you've demonstrated its
> usefulness. Better yet, patent the idea, then publish a paper on it without
> mentioning there's a patent on it. Collect royalties later.

;)


> > Did you make any $$$ with the old patent on atomic_ptr?
> >
> > ;)
>
> Only what IBM pays for submitting patentable ideas. IBM owned the patent
> and you don't get royalties or anything.

Vampire bastards!

:P


Joe Seigh

unread,
Nov 21, 2003, 7:19:23 AM11/21/03
to

SenderX wrote:
> > > Did you make any $$$ with the old patent on atomic_ptr?
> > >
> > > ;)
> >
> > Only what IBM pays for submitting patentable ideas. IBM owned the patent
> > and you don't get royalties or anything.
>
> Vampire bastards!
>

That's pretty much the standard IP contract with any company. Though they
did allow you to use any ideas that you submitted that they said they had
no interest in. I wish I had patented a certain disk drive idea that IBM
claimed they had no interest in back in the early nineties. I'd be making
a lot of money now.

Joe Seigh

Jonathan Adams

unread,
Nov 23, 2003, 8:16:01 AM11/23/03
to
"SenderX" <x...@xxx.xxx> wrote in message news:<Q5Vsb.144191$ao4.460450@attbi_s51>...
> > Considering that others have scaled considerably further than 4 CPUs
> > without lockless algorithms I would contend that that proves fairly little.
> >
> > As for "more graceful degradation"; again I'd like to see proof rather
> > than assertions.
>
> A lock-free algo has also helped Java scale a lot more:
>
> http://java.sun.com/j2se/1.4.2/1.4.2_whitepaper.html#8

From that link:
> An example of such improvement was to make system dictionary reads
> lock-free.
>...
> It helps a lot for calls like Class.forName(), which do lookups into this data
> structure at the lowest level. Before this change, both readers and writers
> took out a lock to look at the system dictionary.

Having a heavily accessed large dictionary protected by a single lock
is a scalability disaster. From their description, they replaced a reader-
writer lock with a writer-only lock + lock-free reads, presumably
because they only delete from the table at safe points. They didn't have
to add the use of CAS, garbage collection, or any of the techniques used
in more complicated lock-free algorithms.

It's hard to go from an example like this to a statement like:

> A lock-free algo will scale far beyond its lock based cohort.

Will it? In all cases? It's a tool, not a panacea.

Solaris scales to hundreds of processors with scant reliance on lock-free
algorithms (they are mostly used for async-safety). You can see exactly
how much contention there is in a Solaris system using lockstat(1M).
Typically, it's not much -- that's one of the big reasons it scales.

Are there large-scale systems that rely exclusively on lock-free
algorithms for consistency and scalability? Especially the more
complicated algorithms?

Joe Seigh

unread,
Nov 23, 2003, 10:38:55 AM11/23/03
to

Jonathan Adams wrote:
>
> "SenderX" <x...@xxx.xxx> wrote in message news:<Q5Vsb.144191$ao4.460450@attbi_s51>...

> > A lock-free algo has also helped Java scale a lot more:
> >
> > http://java.sun.com/j2se/1.4.2/1.4.2_whitepaper.html#8
>
> From that link:
> > An example of such improvement was to make system dictionary reads
> > lock-free.
> >...
> > It helps a lot for calls like Class.forName(), which do lookups into this data
> > structure at the lowest level. Before this change, both readers and writers
> > took out a lock to look at the system dictionary.
>
> Having a heavily accessed large dictionary protected by a single lock
> is a scalability disaster. From their description, they replaced a reader-
> writer lock with a writer-only lock + lock-free reads, presumably
> because they only delete from the table at safe points. They didn't have
> to add the use of CAS, garbage collection, or any of the techniques used
> in more complicated lock-free algorithms.

Well, Java has garbage collection built in, so I'm not surprised that they
did not have to add garbage collection. But actually they are adding CAS
to Java with JSR 166, specifically for high performance lock-free algorithms,
particularly in situations where the built-in GC for Java cannot keep
up with object reclamation.

> Solaris scales to hundreds of processors with scant reliance on lock-free
> algorithms (they are mostly used for async-safety). You can see exactly
> how much contention there is in a Solaris system using lockstat(1M).
> Typically, it's not much -- that's one of the big reasons it scales.

SMP w/ uniform memory access?

>
> Are there large-scale systems that rely exclusively on lock-free
> algorithms for consistancy and scalability? Especially the more
> complicated algorithms?

Not exclusively; I don't think that's possible at this point. Sun's done
a bit of research into this area and filed a number of hw patents in this
area. You might at some point have to take Solaris off the list of OS's
that do not rely on lock-free for performance and scalability.

Joe Seigh

SenderX

unread,
Nov 23, 2003, 2:24:12 PM11/23/03
to
> Having a heavily accessed large dictionary protected by a single lock
> is a scalability disaster. From their description, they replaced a
reader-
> writer lock with a writer-only lock + lock-free reads, presumably
> because they only delete from the table at safe points.

I hope they are not limiting themselves to 1 Writer! That's crap. Their
collection should allow for concurrent reads & writes... Anyway, this is a
good example of how a lock-free algo will help a shared collection scale.


> They didn't have
> to add the use of CAS, garbage collection, or any of the techniques used
> in more complicated lock-free algorithms.

If the internal read/write lock that they use protects its internal state
with a mutex, then it is NOT lock-free at all. The paper says lock-free
reads, so they must use CAS or LL/SC to implement the read/write lock. The
paper would be totally wrong otherwise.


> Are there large-scale systems that rely exclusively on lock-free

^^^^^^^^^^^


> algorithms for consistancy and scalability? Especially the more
> complicated algorithms?

Not exclusively. But...

The Linux Scalability Effort is using lock-free RCU specifically to scale,
in a lot of places.

FUTEX is developed. It is faster than its lock-based friends.

K42 project and Tornado use lock-free algos specifically to scale.

IBM System/370 uses lock-free algos; the IBM FreeList was introduced there.

Windows uses and exposes lock-free algos.

Java uses them.

Java exposes an atomic CAS & a markable pointer, you could do lock-free
algos using java.

I wonder why Java would expose them? Could it be that Java is exposing them
specifically for high-performance lock-free collections?

Intel exposes a compare-and-swap double on 32-bit for lock-free algos...

http://eca.cx/lad/2000/Jul/0319.html

Also, read some papers on lock-free algos, and look at their testing
performance. They all scale far beyond their lock-based counterparts.

;)


Rich Teer

unread,
Nov 23, 2003, 3:15:12 PM11/23/03
to
On Sun, 23 Nov 2003, Jonathan Adams wrote:

> because they only delete from the table at safe points. They didn't have
> to add the use of CAS, garbage collection, or any of the techniques used
> in more complicated lock-free algorithms.

'Scuse my ignorance, but what is "CAS"? With a limited hardware
background, the first thing that comes to mind is "Column Address
Strobe" (as in RAMs), but I doubt that's it. Hmm, Compare And Swap?

TIA,

--
Rich Teer, SCNA, SCSA

President,
Rite Online Inc.

Voice: +1 (250) 979-1638
URL: http://www.rite-online.net

Anuj Goyal

unread,
Nov 25, 2003, 2:15:40 AM11/25/03
to
Hmm, I should have posted a better response to your question. I am
trying to run 1000 threads to do 1000 atomic increments. So basically
there is 1 process and 1000 threads and I just need to make sure that
my "int" (4 bytes in memory) is protected while I do these atomic
increments.

I think in my case, I don't want locks (just want to atomically
increment) because fairness doesn't matter to me. (priority inversion
is ok)

Would TMO atomic increments (with membar's) be less efficient than RMO
because TMO constantly flushes the write buffer of the CPUs?

On another note, both AIX and HPUX provide (non-POSIX)
atomic_increment calls that are exposed to the user, with SUN I think
I have to go into the kernel!?! is this correct?

#include <sys/atomic.h>

extern void atomic_add_32(uint32_t *target, int32_t delta);

Aren't these expensive calls if they are going into the kernel?
Does Sun provide users with a user library with atomic ops so that
users don't go through the kernel when atomically incrementing, or am I
missing something?

Thanks
---Anuj

PS: goto http://upc.nersc.gov/download/dist/gasnet/gasnet_atomicops.h

/* $%*(! Solaris has atomic functions in the kernel but refuses to expose
   them to the user... after all, what application would be interested in
   performance? */
#include <sys/atomic.h>
typedef struct { volatile uint32_t ctr; } gasneti_atomic_t;
#define gasneti_atomic_increment(p) (atomic_add_32((uint32_t *)&((p)->ctr),1))
#define gasneti_atomic_read(p) ((p)->ctr)
#define gasneti_atomic_set(p,v) ((p)->ctr = (v))
#define gasneti_atomic_init(v) { (v) }

Gavin Maltby <g_a_v_i_n.m_a_l_t_b_y@s_u_n.com> wrote in message news:<bot689$1t$1...@new-usenet.uk.sun.com>...
> Anuj Goyal wrote:
> > Locks are not good and fast when you are running in TMO mode right? :)
>
> Well there are about 2 zillion locks in the Solaris kernel which runs
> in TSO and scales to hundreds and thousands of processors. Even full
> featured locking can be lightweight (at the very least for the common
> case). If you want low featured locking (eg, no priority inheritance)
> then they can be still lighter weight at the expense of functionality.
> Typically locking is good and fast provided you code with some
> sympathy towards how scheduling and hardware works. So don't
> create bottlenecks (eg, locking code sections with a global
> mutex rather than locking the data being operated on). On
> large multiprocessors think of cacheline ownership and that
> several locks can fit in a cacheline (ie, arrays of locks
> may not always be as effective as you think).
>
> You have little choice - as mentioned you cannot rely on priority
> to protect data. You can use the locking primives supplied in
> libthread etc which are reasonably full featured and well debugged/
> matured. You can roll your own, but then you lose those benefits
> and stand a good chance of bumping into all sorts of issues that
> the provided stuff takes care of.
>
> I'd suggest that if you give a reasonable description of the application
> and why you believe it needs to rely on priority scheduling then
> people can suggest alternatives. Do you need to protect data,
> code sections, just communicate between threads/processes etc.
>
> Gavin

Casper H.S. Dik

unread,
Nov 25, 2003, 3:26:31 AM11/25/03
to
ang...@siebel.com (Anuj Goyal) writes:


>On another note, both AIX and HPUX provide (non-POSIX)
>atomic_increment calls that are exposed to the user, with SUN I think
>I have to go into the kernel!?! is this correct?

You cannot call into the kernel for these; it's just that we've
never provided a userland library; this is somewhat difficult to
do for the older SPARCs as I understand it but easy on UltraSPARC.

>Aren't these expensive calls if they are going into the kernel?
>Does Sun provides users with a user library with atomic ops so that
>users don't go through the kernel when atomically incrementing or I am
>missing something ?

You can't call them because they're in the kernel and provided only
for use in the kernel; there's no way to get there.

(But the next Solaris release may very well have them)

Casper

SenderX

unread,
Nov 25, 2003, 6:11:19 AM11/25/03
to
> 'Scuse my ignorance, but what is "CAS"? With a limited hardware
> background, the first thing that comes to mind is "Column Address
> Strobe" (as in RAMs), but I doubt that's it. Hmm, Compare And Swap?

Compare & Swap.


Gavin Maltby

unread,
Nov 25, 2003, 6:20:51 AM11/25/03
to
Hi,

Anuj Goyal wrote:
> Hmm, I should have posted a better response to your question. I am
> trying to run 1000 threads to do 1000 atomic increments. So basically
> there is 1 process and 1000 threads and I just need to make sure that
> my "int" (4 bytes in memory) is protected while I do these atomic
> increments.
>
> I think in my case, I don't want locks (just want to atomically
> increment) because fairness doesn't matter to me. (priority inversion
> is ok)

Yes I'd agree that if all you want to do is perform atomic increment
and decrement on an int then doing it without locks is probably
best. If you know (by design) that contention will be rare then
locks will likely be just as good, but if contention is going to be
a factor then all the blocking/wakeup etc will be an overhead.

That said, if contention is a factor then even without locks
you need to have sympathy for the cache state activity on
a SMP. An atomic instruction (cas, swap, ldstub on sparcv9)
requires the processor to have exclusive ownership of the
cacheline (thus allowing it to modify while nobody else
is able to read or modify). If you have all processors
furiously issuing atomic instructions to the same address
then they argue over ownership and burn up bus cycles.
So whichever model you choose, aim to minimize contention
or it will hurt.

> Would TMO atomic increments (with membar's) be less efficient than RMO
> because TMO constantly flushes the write buffer of the CPUs?

From a practical point of view on current sparc processors, no.
Solaris runs current sparc processors in total store order
(the strongest model they offer) and I don't believe there
is any way for the user to request otherwise (even just
for their application). You'd need to modify %pstate which
is a privileged register, and I don't know of any wrappers
that do it for you either. But even if you could, current
ultrasparc processors don't do a great deal of out-of-order
stuff anyway (eg, in US-III a number of the ordering membars
are just NOPs).

You only need the membar if you need to force an order on the
completion of your loads and stores. You don't need it for
atomic inc/dec alone. But if, say, you need to know that
your atomic increment is visible to all other processors
before a subsequent load you perform then you'd insert
a membar to order stores wrt loads.
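
As a rough modern analogy (not the SPARC interface; a sketch in C11
atomics, where the fence plays the role of a membar ordering stores
wrt loads):

#include <stdatomic.h>

atomic_int count;   /* the counter being incremented */
atomic_int other;   /* some other shared variable    */

int increment_then_read(void)
{
    atomic_fetch_add_explicit(&count, 1, memory_order_relaxed);
    /* like membar #StoreLoad: the store above completes before
       the load below */
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&other, memory_order_relaxed);
}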

> On another note, both AIX and HPUX provide (non-POSIX)
> atomic_increment calls that are exposed to the user, with SUN I think
> I have to go into the kernel!?! is this correct?
>
> #include <sys/atomic.h>

That's a kernel private header file, you are not able to use
functions it declares (I don't mean it's not legal to, I mean
you simply are not able to).

> extern void atomic_add_32(uint32_t *target, int32_t delta);
>
> Aren't these expensive calls if they are going into the kernel?
> Does Sun provides users with a user library with atomic ops so that
> users don't go through the kernel when atomically incrementing or I am
> missing something ?

I think I remember seeing these added as libc calls in the development
release. I don't think they form part of any standards interface, which
I imagine is why they've not been present in the past (and some of them
might not be readily/cheaply implementable on older processors). They're
trivial to do yourself, but I'd warn (unlike the link below apparently)
that they don't win instant performance and can generate a lot of
bus activity as described above (ie, atomic instructions are often best
left behind interfaces that stop people abusing them too much).

Here's an (untested, uncompiled) atomic increment of an int you should be
able to use on a sparcv9 system:

void atomic_add_32(int *ip, int delta);

cat > myatomics.s << EOM
        .seg    ".text"
        .type   atomic_add_32, #function
        .global atomic_add_32
atomic_add_32:
        ld      [%o0], %o4              ! sample the current value
1:
        add     %o4, %o1, %o5           ! compute the proposed new value
        cas     [%o0], %o4, %o5         ! swap in iff memory still holds %o4
        cmp     %o4, %o5                ! did cas see the value we sampled?
        bne,a,pn %icc, 1b               ! no: someone updated it, retry
        ld      [%o0], %o4              ! (annulled delay slot: reload)
        retl
        nop
EOM

This is easily modified to return the new value that your call
changed the int to (remembering that this may bear no relation to
the current value as others may since have changed it).

Use a negative delta to perform atomic decrement.

The above loads the "current" int value into %o4; this is a
snapshot if you like - the real value may be changing on
another cpu as we sample it here. Based on this
believed current value we calculate a new value (add the
delta to the current value). Next the compare and swap
instruction atomically checks that the integer has the
value sampled into %o4 and if it does (and it will know
for sure as it grabs sole access to the cacheline) it will
atomically install the newly computed value (which was
calculated from the base value we have now confirmed to be
there); if the cas finds the contents are not as expected
in %o4 then an update has become visible since we sampled,
and we must go round again loading the new apparent value,
calculating a new intended result etc.
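
Called from C (assuming the declaration above and linking against the
assembled myatomics.o) it would look something like:

#include <stdio.h>

void atomic_add_32(int *ip, int delta);   /* from myatomics.s */

int counter;

int main(void)
{
    atomic_add_32(&counter, 5);     /* atomic add */
    atomic_add_32(&counter, -2);    /* negative delta: atomic decrement */
    printf("counter = %d\n", counter);
    return 0;
}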

Hope that helps

Gavin

Alexander Terekhov

unread,
Nov 25, 2003, 7:14:13 AM11/25/03
to

Anuj Goyal wrote:
>
> Hmm, I should have posted a better response to your question. I am
> trying to run 1000 threads to do 1000 atomic increments. So basically
> there is 1 process and 1000 threads and I just need to make sure that
> my "int" (4 bytes in memory) is protected while I do these atomic
> increments.

1000 threads aside for a moment, do you have any shared data
dependencies with respect to your counter? Is this reference
counting or what?

regards,
alexander.

--
http://www.linuxdevices.com/files/misc/gpl-proposed-rev3.pdf
http://www.opensource.org/licenses/cpl.php

Anuj Goyal

unread,
Nov 25, 2003, 8:36:11 PM11/25/03
to
This is __only__ a proof of concept for my own personal knowledge! I
just wanted to prove to myself that atomically incrementing an integer
(from 1000 threads) would really be faster than using a mutex and
locking/unlocking (from 1000 threads).

PS: yes I know starting 1000 threads can cause system degradation, but
this is just my own personal test.

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <unistd.h>
// this is where I have defined my atomic ops (I happen to link to a
// *.o file that has been compiled from assembly code)
#include "atomic.h"

void *do_something(void *arg)
{
    AtomicIncrement((long *)arg);
    return NULL;
}

const int NUMTHREAD = 1000;

int main()
{
    pthread_t arr_pt[NUMTHREAD];
    long k = 0;

    for (int i = 0; i < NUMTHREAD; ++i) {
        // send the counter's address to do_something
        int val = pthread_create(&arr_pt[i], NULL, &do_something, (void *)&k);
        if (val != 0) {
            printf("error: pthread_create %d\n", val);
            exit(-1);
        }
    }

    for (int i = 0; i < NUMTHREAD; ++i)
        pthread_join(arr_pt[i], NULL);

    return 0;
}
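
For comparison, the mutex-based variant I'm timing against looks
something like this (same main() as above, just a different thread
function):

#include <pthread.h>

static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

void *do_something_locked(void *arg)
{
    pthread_mutex_lock(&counter_lock);
    ++*(long *)arg;              /* plain increment, protected by the lock */
    pthread_mutex_unlock(&counter_lock);
    return NULL;
}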

Alexander Terekhov <tere...@web.de> wrote in message news:<3FC34795...@web.de>...

Anuj Goyal

unread,
Nov 26, 2003, 2:09:56 AM11/26/03
to
Thanks for your reply Gavin, I have some comments/questions.

Gavin Maltby <G_a_v_i_n....@sun.com> wrote in message news:<bpve13$np0$1...@new-usenet.uk.sun.com>...


> Anuj Goyal wrote:
> > Hmm, I should have posted a better response to your question. I am
> > trying to run 1000 threads to do 1000 atomic increments. So basically
> > there is 1 process and 1000 threads and I just need to make sure that
> > my "int" (4 bytes in memory) is protected while I do these atomic
> > increments.
> >
> > I think in my case, I don't want locks (just want to atomically
> > increment) because fairness doesn't matter to me. (priority inversion
> > is ok)
>
> Yes I'd agree that if all you want to do is perform atomic increment
> and decrement on an int then doing it without locks is probably
> best. If you know (by design) that contention will be rare then
> locks will likely be just as good, but if contention is going to be
> a factor then all the blocking/wakeup etc will be an overhead.
>
> That said, if contention is a factor then even without locks
> you need to have sympathy for the cache state activity on
> a SMP. An atomic instruction (cas, swap, ldstub on sparcv9)
> requires the processor to have exclusive ownership of the
> cacheline (thus allowing it to modify while nobody else
> is able to read or modify). If you have all processors
> furiously issuing atomic instructions to the same address
> then they argue over ownership and burn up bus cycles.
> So whichever model you choose, aim to minimize contention
> or it will hurt.

I agree that one should always choose to minimize contention - I am
trying to create situations that __will__ have lots of contention so
that I can learn what happens in very very bad (highly contentious)
situations. It's challenging to me because I have to think of
contentious situations (like the thread example) and then try to think
of ways to solve the problem.

Here are some thoughts:

1. I could implement a mutex situation and just let the OS take care
of the queuing nastiness. Let's say I am the OS providing the mutex
capability... if 1000 threads are trying to access one little int,
the OS will have to put 1000 threads on a queue, right? This involves
little hardware contention, but lots of software contention (ie lots
of cycles burned implementing the queue stuff)

2. I could implement the atomic_increment and cause hardware
contention - which is also bad, but for me is a bit "better" than
software contention because hopefully less cycles (overall) will be
used. Correct me if I am wrong.

3. I could implement a try-and-enqueue approach. Try for some number
of cycles (preferably less than the # of cycles the kernel would take
to put the thread on the mutex queue) to increment the integer, and
once my cycles are up, place myself on a queue. This is basically a
spinlock, right?
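
Something like this is what I have in mind for approach 3, using only
pthread primitives (spin on trylock for a bounded number of tries, then
fall back to a blocking lock; the spin count is a made-up knob):

#include <pthread.h>

#define SPIN_TRIES 100              /* hypothetical tuning knob */

void adaptive_lock(pthread_mutex_t *m)
{
    for (int i = 0; i < SPIN_TRIES; ++i)
        if (pthread_mutex_trylock(m) == 0)
            return;                 /* got the lock while spinning */
    pthread_mutex_lock(m);          /* give up; let the kernel queue us */
}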

> > Would TMO atomic increments (with membar's) be less efficient than RMO
> > because TMO constantly flushes the write buffer of the CPUs?
>
> From a practical point of view on current sparc processors, no.
> Solaris runs current sparc processors in total store order
> (the strongest model they offer) and I don't believe there
> is any way for the user to request otherwise (even just
> for their application). You'd need to modify %pstate which
> is a privileged register, and I don't know of any wrappers
> that do it for you either. But even if you could, current
> ultrasparc processors don't do a great deal of out order
> stuff anyway (eg, in US-III a number of the ordering membars
> are just NOPs).

Thanks for the info, I did not know this. I assumed (wrongly) that an
executable on Solaris would be allowed to run in RMO, PSO, and TSO
mode by somehow setting a bit in the ELF header (ie this could be
accomplished by some compiler or linker option). When I was reading
the Sparc-V9 architecture PDF (section 8) I got fooled into thinking
that RMO might be a quick way for a large performance increase! But in
retrospect, the section was stating that a hardware implementation of
SPARC can __choose__ to implement RMO - but in reality most
implementations are TSO (for Sparc). Is this a correct assumption?
UltraSparc may or may not run in RMO - depending on vendor, OS,
licensing, etc.

Could you point me to a good link that details the differences between
Sparc and UltraSparc (I, II, III)? I googled for a bit but got tired
of searching. It's late.



> You only need the membar if you need to force an order on the
> completion of your loads and stores. You don't need it for
> atomic inc/dec alone. But if, say, you need to know that
> your atomic increment is visible to all other processors
> before a subsequent load you perform then you'd insert
> a membar to order stores wrt loads.
>
> > On another note, both AIX and HPUX provide (non-POSIX)
> > atomic_increment calls that are exposed to the user, with SUN I think
> > I have to go into the kernel!?! is this correct?
> >
> > #include <sys/atomic.h>
>
> That's a kernel private header file, you are not able to use
> functions it declares (I don't mean it's not legal to, I mean
> you simply are not able to).
>
> > extern void atomic_add_32(uint32_t *target, int32_t delta);
> >
> > Aren't these expensive calls if they are going into the kernel?
> > Does Sun provides users with a user library with atomic ops so that
> > users don't go through the kernel when atomically incrementing or I am
> > missing something ?
>
> I think I remember seeing these added as libc calls in the development
> release. I don't think they form part of any standards interface which
> I imagine is why they've not been present in the past (and some of them
> might not be readily/cheaply implementble on older processors). They're
> trivial to do yourself, but I'd warn (unlike the link below apparently)

I agree, the link does not warn about bus contention. Atomic
increments are not any better than mutexes; they are just a different
tool (with advantages/disadvantages) one can use when trying to solve
problems.

I heard that Solaris 10 will have userland atomic increment functions.

This helps a lot, I'm not an assembly guru, but I understand your
explanation.
The only thing that confuses me is the 2nd `ld' instruction? Is it an
optimization for pipelining? (perform the branch and the next
instruction - because it's already in the pipeline)

> Gavin

Casper H.S. Dik

unread,
Nov 26, 2003, 4:15:36 AM11/26/03
to
ang...@siebel.com (Anuj Goyal) writes:

>Here are some thoughts on :

>1. I could implement a mutex situation and just let the OS take care
>of the queuing nastiness. Let's say I am OS providing the mutex
>capability... if 1000 threads are trying to get access one little int,
>the OS will have to put 1000 threads on a queue right? This involves
>little hardware contention, but lots of software contention (ie lots
>of cycles burned implementing the queue stuff)

I'm not sure I'd call that "lots of cycles"; and a sleeping
thread doesn't use CPU, so CPU cycles are available to do
other things. With NUMA architectures the "lockless" algorithms
are undoubtedly even more interesting; if one CPU owns the cacheline,
will it really be fair to CPUs which are 100s of nanoseconds away?

>2. I could implement the atomic_increment and cause hardware
>contention - which is also bad, but for me is a bit "better" than
>software contention because hopefully less cycles (overall) will be
>used. Correct me if I am wrong.

It's unclear: the contention on atomic instructions jams the
bus because it uses many "RTO" cycles; a thread which is
just idle and sleeps doesn't cost any cycles. (A Solaris kernel
mutex, e.g., will not even try to spin if the thread currently
owning the mutex isn't running on another CPU; it will just
sleep instead)

>3. I could implement a try-and-enqueue approach. Try for some number
>of cycles (preferably less than the # of cycles the kernel would take
>to put the thread on the mutex queue) to increment the integer and
>once my cycles are up, place myself on a queue. This is basically a
>spinlock right?

Yes.

Note that example programs with many threads aren't really valid
scalability tests, as only as many threads as CPUs can possibly run
at a time, so contention is low anyway unless you have many CPUs.

Casper

Gavin Maltby

unread,
Nov 26, 2003, 8:06:01 AM11/26/03
to
Anuj Goyal wrote:
[cut]

>
>
> I agree that one should always choose to minimize contention - I am
> trying to create situations that __will__ have lots of contention so
> that I can learn what happens in very very bad (highly contentious)
> situations. It's challenging to me because I have to think of
> contentious situations (like the thread example) and then try to think
> of ways to solve the problem.

I suspect your results may vary substantially with the number of
CPUs present, so try a range if you can, and be careful not to
draw conclusions for "SMP systems" from experiments with a
particular configuration.

> Here are some thoughts on :
>
> 1. I could implement a mutex situation and just let the OS take care
> of the queuing nastiness. Let's say I am OS providing the mutex
> capability... if 1000 threads are trying to get access one little int,
> the OS will have to put 1000 threads on a queue right? This involves
> little hardware contention, but lots of software contention (ie lots
> of cycles burned implementing the queue stuff)

More or less true I guess.

> 2. I could implement the atomic_increment and cause hardware
> contention - which is also bad, but for me is a bit "better" than
> software contention because hopefully less cycles (overall) will be
> used. Correct me if I am wrong.

The hardware contention will get worse with more cpus (more cpus to
issue read-to-own for cachelines as they run your threads). You
still have the scheduling overhead in software; your 1000 threads
are a lot more than the number of cpus you have.

> 3. I could implement a try-and-enqueue approach. Try for some number
> of cycles (preferably less than the # of cycles the kernel would take
> to put the thread on the mutex queue) to increment the integer and
> once my cycles are up, place myself on a queue. This is basically a
> spinlock right?

I guess you'd call it a form of adaptive locking. But it's not obvious
how/if it would work. A spin lock is still a lock, and an adaptive mutex
is still a mutex. If you have some bits of code incrementing without
holding a lock while others increment under a lock then you still
have chaos.

[cut]

> Thanks for the info, I did not know this. I assumed (wrongly) that an
> executable on Solaris would be allowed to run in RMO, PSO, and TSO
> mode by somehow setting a bit in the ELF header (ie this could be
> accomplished by some compiler or linker option). When I was reading
> the Sparc-V9 architecture PDF (section 8) I got fooled into thinking
> that RMO might be a quick way for a large performance increase! But in
> retrospect, the section was stating that an hardware implementation of
> SPARC can __choose__ to implement RMO - but in reality most
> implementations are TSO (for Sparc). Is this is a correct assumption?
> UltraSparc may or may not run in RMO - depending on vendor, OS,
> licensing, etc etc.

The current UltraSPARC chips (UltraSPARC is Sun's SPARC v9 implementation)
do relatively little out-of-order stuff. I believe the current
Fujitsu sparcv9 offering does somewhat more.

> Could you point me to a good link that details the differences between
> Sparc and UltraSparc (I, II, III)? I googled for a bit but got tired
> of searching. It's late.

Afraid not. The sparc international website has some implementation
dependency docs. To avoid non-portable software it's desirable not
to have people code to the individual implementation but instead
to code to SPARC v9.

[cut]

>
> I agree, the link does not warn about bus contention. Atomic
> increments are not any better than mutexes, there are just a different
> tool (with advantages/disadvantages) one can use when trying to solve
> problems.
>

They're tricky tools, but on some systems you can use cpustat(1M) and
cputrack(1M) and friends to obtain statistics on hardware -
instructions executed, cache fetches etc etc. You may find those
useful in your investigations.

[cut]

> This helps a lot, I'm not an assembly guru, but I understand your
> explanation.
> The only thing that confuses me is the 2nd `ld' instruction? Is it an
> optimization for pipelining? (perform the branch and the next
> instruction - because it's already in the pipeline)

Yes, it allows use of the delay slot of the branch instruction. Control
transfer instructions (call, jmp, branch etc) on sparc have delay slots.

Gavin

Jonathan Adams

unread,
Nov 28, 2003, 12:21:48 AM11/28/03
to
Joe Seigh <jsei...@xemaps.com> wrote in message news:<3FC0D7D2...@xemaps.com>...

> Jonathan Adams wrote:
> >
> > "SenderX" <x...@xxx.xxx> wrote in message news:<Q5Vsb.144191$ao4.460450@attbi_s51>...
> > > A lock-free algo has also helped Java scale a lot more:
> > >
> > > http://java.sun.com/j2se/1.4.2/1.4.2_whitepaper.html#8
> >
> > From that link:
> > > An example of such improvement was to make system dictionary reads
> > > lock-free.
> > >...
> > > It helps a lot for calls like Class.forName(), which do lookups into this data
> > > structure at the lowest level. Before this change, both readers and writers
> > > took out a lock to look at the system dictionary.
> >
> > Having a heavily accessed large dictionary protected by a single lock
> > is a scalability disaster. From their description, they replaced a reader-
> > writer lock with a writer-only lock + lock-free reads, presumably
> > because they only delete from the table at safe points. They didn't have
> > to add the use of CAS, garbage collection, or any of the techniques used
> > in more complicated lock-free algorithms.
>
> Well, Java has garbage collection built in, so I'm not surprised that they
> did not have to add garbage collection. But actually they are adding CAS
> to Java with JSR 166 specifically for high performance lock-free algorithms
> particularly in situations where the built-in GC for Java cannot keep
> up with object reclamation.

I was just talking about the particular case the original poster mentioned. I
don't pretend to know much about the current state of Java.

> > Solaris scales to hundreds of processors with scant reliance on lock-free
> > algorithms (they are mostly used for async-safety). You can see exactly
> > how much contention there is in a Solaris system using lockstat(1M).
> > Typically, it's not much -- that's one of the big reasons it scales.
>
> SMP w/ uniform memory access?

Generally, though some of the more recent machines have had some
NUMA-ish characteristics that Solaris has started to optimize for.

> > Are there large-scale systems that rely exclusively on lock-free
> > algorithms for consistency and scalability? Especially the more
> > complicated algorithms?
>
> Not exclusively, I don't think that possible at this point. Sun's done
> a bit of research into this area and filed a number of hw patents in this
> area. You might at some point have to take Solaris off the list of OS's
> that do not rely on lock-free for performance and scalability.

Keep in mind that there is a large difference between Sun Labs and the
Solaris engineering group. The papers I've read on lock free algorithms
don't lead me to think that they are really ready for prime time, in terms
of both complexity and performance gain -- in particular, the fact that
there is little performance gain in the uncontended behavior and a huge
complexity bill to handle the contended case. Most locks simply don't
experience contention, in a well-designed system.

Jonathan Adams

unread,
Nov 28, 2003, 1:07:13 AM11/28/03
to
"SenderX" <x...@xxx.xxx> wrote in message news:<wP7wb.287188$Fm2.297447@attbi_s04>...

> > Having a heavily accessed large dictionary protected by a single lock
> > is a scalability disaster. From their description, they replaced a
> reader-
> > writer lock with a writer-only lock + lock-free reads, presumably
> > because they only delete from the table at safe points.
>
> I hope they are not limiting themselves to 1 Writer! That's crap.

That's what they seem to be saying. But keep in mind that loading new
classes is a reasonably rare activity which is probably *already* single-
threaded. From an engineering perspective, they got rid of a heavily
contended cache line for reads, for a small investment in complexity.

> Their
> collection should allow for concurrent reads & writes... Anyway, this is
> good example of how a lock-free algo will help a shared collection scale.

It's also a good example of how little "lock-free"ness is needed, and
about engineering trade-offs. A *fully* lock-free datastructure would
have increased complexity by a huge margin, for no performance gain,
since there's rarely multiple writers anyway.
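
A minimal sketch of that engineering trade-off (hypothetical list, C11
atomics; safe only because, as described above, nodes are never freed
except at safe points):

#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

struct node { int key; struct node *next; };

static _Atomic(struct node *) head;                   /* read lock-free */
static pthread_mutex_t write_lock = PTHREAD_MUTEX_INITIALIZER;

struct node *lookup(int key)            /* readers take no lock at all */
{
    for (struct node *n = atomic_load(&head); n != NULL; n = n->next)
        if (n->key == key)
            return n;
    return NULL;
}

void insert(int key)                    /* writers serialize on a mutex */
{
    struct node *n = malloc(sizeof *n);
    n->key = key;
    pthread_mutex_lock(&write_lock);
    n->next = atomic_load(&head);
    atomic_store(&head, n);             /* publish the fully built node */
    pthread_mutex_unlock(&write_lock);
}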

> > They didn't have
> > to add the use of CAS, garbage collection, or any of the techniques used
> > in more complicated lock-free algorithms.
>
> If the internal read/write lock that they use protects its internal state
> with a mutex, then it is NOT lock-free at all. The paper says lock-free
> reads, so they must use CAS or LL/SC to implement the read/write lock. The
> paper would be totally wrong if otherwise.

Sorry, I should have stated that as "They didn't have to add the direct use
of Compare-and-Swap (CAS)."

They changed a reader-writer lock into a mutex grabbed only for writing,
from their description. The readers don't grab any kind of lock, so
'lock-free readers' is a good description of the changes.

> > Are there large-scale systems that rely exclusively on lock-free
> ^^^^^^^^^^^
> > algorithms for consistency and scalability? Especially the more
> > complicated algorithms?
>
> Not exclusively. But...
>
> The Linux Scalability Effort is using lock-free RCU specifically to scale,
> in a lot of places.

But that's got problems getting into Linux proper, if I remember
correctly.

> FUTEX is developed. It is faster than its lock-based friends.

FUTEX is a primitive -- since its main use so far has been to
implement mutexes and wait queues, and its designers say that
its semantics are very hard to use correctly, using it as an example
of lock-freeness is tricky. Are there examples of people using
it for lock-free stuff? (I did a bit of google searching and didn't
find any.)

> K42 project and Tornado use lock-free algos specifically to scale.
>
> IBM System 370 uses lock-free algos, the IBM FreeList was introduced here.
>
> Windows uses and exposes lock-free algos.
>
> Java uses them.
>
> Java exposes an atomic CAS & a markable pointer, you could do lock-free
> algos using java.
>
> I wonder Java would expose them? Could it be that Java is exposing them
> specifically for high-performance lock-free collections?
>
> Intel exposes a compare-and-swap double on 32-bit for lock-free algos...
>
> http://eca.cx/lad/2000/Jul/0319.html
>
> Also, read some papers on lock-free algos, and look at their testing
> performance. They all scale far beyond their lock-based counterparts.

I have read some, and I don't think the more complicated ones are
solving problems in real systems. They also add a lot of complexity
to optimize what should be rare -- the contended case. In my opinion,
well designed locked datastructures can do at least as well as many
lock-free algorithms, with much easier-to-understand behavior.

In any case, my original statement stands: Solaris scales to hundreds of
processors without relying on lock-free algorithms to do so.

- jonathan

Casper H.S. Dik

unread,
Nov 28, 2003, 4:13:27 AM11/28/03
to
jonath...@ofb.net (Jonathan Adams) writes:

>Keep in mind that there is a large difference between Sun Labs and the
>Solaris engineering group. The papers I've read on lock free algorithms
>don't lead me to think that they are really ready for prime time, in terms
>of both complexity and performance gain -- in particular, the fact that
>there is little performance gain in the uncontended behavior and a huge
>complexity bill to handle the contended case. Most locks simply don't
>experience contention, in a well-designed system.

The complexity of the algorithms indicates to me that if a
system uses lockfree algorithms the system must provide the
complete lockfree primitives; people should not roll their own.

Just like the kernel/C-runtime now provides a variety of locking
and rendezvous primitives.

Casper
--
Expressed in this posting are my opinions. They are in no way related
to opinions held by my employer, Sun Microsystems.
Statements on Sun products included here are not gospel and may
be fiction rather than truth.

Joe Seigh

unread,
Nov 28, 2003, 9:34:04 AM11/28/03
to

Jonathan Adams wrote:
>
> "SenderX" <x...@xxx.xxx> wrote in message news:<wP7wb.287188$Fm2.297447@attbi_s04>...

> > FUTEX is developed. It is faster than its lock-based friends.


> FUTEX is a primitive -- since it's main use so far has been to
> implement mutexes and wait queues, and its designers say that
> its semantics are very hard to use correctly, using it as an example
> of lock-freeness is tricky. Are there examples of people using
> it for lock-free stuff (I did do a bit of google searching and didn't
> find any)
>

Not as such. Futex is for implementing fast-pathed synchronization that does
block if there's contention. Futex provides the blocking mechanism. So anything
that uses a futex would not be lock-free by definition. However, the fast
path is trying to avoid kernel calls, otherwise it wouldn't be so fast, so you
are likely to see "lock-free" like logic in the fast path.
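
For illustration, the usual shape of such a fast path (a much-simplified
sketch of the classic three-state futex mutex on Linux; error handling
omitted):

#include <stdatomic.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static atomic_int m;    /* 0 = free, 1 = locked, 2 = locked, waiters */

void fast_lock(void)
{
    int free_ = 0;
    if (atomic_compare_exchange_strong(&m, &free_, 1))
        return;                              /* fast path: no kernel call */
    while (atomic_exchange(&m, 2) != 0)      /* slow path: mark contended */
        syscall(SYS_futex, &m, FUTEX_WAIT, 2, NULL, NULL, 0);
}

void fast_unlock(void)
{
    if (atomic_exchange(&m, 0) == 2)         /* wake only if contended */
        syscall(SYS_futex, &m, FUTEX_WAKE, 1, NULL, NULL, 0);
}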

Joe Seigh

SenderX

unread,
Nov 29, 2003, 5:22:31 PM11/29/03
to
> The complexity of the algorithms indicates to me that if a
> system uses lockfree algorithms the system must provided the
> complete lockfree primitives; people should not roll their own.

You simply have to know what you're doing. Correct lock-free algos are stable.


David Schwartz

unread,
Nov 29, 2003, 8:23:09 PM11/29/03
to

"SenderX" <x...@xxx.xxx> wrote in message
news:G_8yb.159989$Dw6.637523@attbi_s02...

Surely you have to admit that it's much harder to code a stable
lock-free algorithm than a comparable (does the same job) stable
lock-based algorithm.

DS


Casper H.S. Dik

unread,
Nov 30, 2003, 4:54:02 AM11/30/03
to
"David Schwartz" <dav...@webmaster.com> writes:


Exactly; what I was trying to point out is that functions like
"mutex_lock()" hide a *lot* of complexity even though they're
built on the same "CAS" instructions as the "lock free" algorithms,
and that using them is fairly straightforward.

But implementing mutex_lock() is not quite so straightforward
even if you're content with just spinning.

A lock free algorithm requires you to implement the equivalent
of mutex_lock()+ each time; that's far beyond the reach of
most programmers; getting it nearly right is probably not too
hard; getting it exactly right is going to be more difficult.
And it's that which makes the lockfree algorithms fragile.

Contention free algorithms are much nicer.

SenderX

unread,
Nov 30, 2003, 7:52:10 PM11/30/03
to
> Contention free algorithms are much nicer.

Hell yeah!

=)

But, when you must code for contention, lock-free algos provide an elegant
solution.

They gracefully handle contention by allowing constant forward progress.
Every CAS failure means another thread moved forward through the algo. No
CAS has to wait for any specific value.


> A lock free algorithm requires you to implement the equivalent
> of mutex_lock()+ each time; that's far beyond the reach of
> most programmers;

What if the lock-free algos exist in a stable pthread compatible library,
opaque to the user?


>getting it nearly right is probably not too
> hard; getting it exactly right is going to be more difficult.

Not really...

Joe Seigh and I have posted solutions to this group that significantly
decrease the difficulty of coding stable lock-free algos... Some of my
colleagues and I have come to the conclusion that lock-free garbage collectors
are KEY, almost REQUIRED, to developing stable lock-free algos. By using a
proxy garbage collector you remove the ABA problem, which is one of the
"central" problems for CAS-based lock-free algos. LL/SC reserved does not
suffer ABA.

Now, what's left...

The ABA problem is gone, thanks to the gc, and you can rely on CAS that
works on sizeof( void* ) chunks of memory. That means that CAS2 is not
needed. A classic lock-free research problem of deleting nodes from
lock-free algos is also eliminated due to the garbage collector. The core
problems to lock-free research are now removed. This lock-free proxy garbage
collector solution can easily be implemented under the hood of a library.

I think a stable lock-free library for pthreads would be a good thing????


> And it's that which makes the lockfree algorithms fragile.

Sure, if somebody did not know what they were doing, they could crash the hell
out of their application's collections!

How is a "correct" lock-free algo fragile?

Many correct and stable lock-free algos have been posted to this group.

A lock-free semaphore and read/write lock, plus an atomic_ptr and my proxy
garbage collector come to mind...


Joe Seigh

unread,
Nov 30, 2003, 8:22:43 PM11/30/03
to

"Casper H.S. Dik" wrote:
>
> A lock free algorithm requires you to implement the equivalent
> of mutex_lock()+ each time; that's far beyond the reach of
> most programmers; getting it nearly right is probably not too
> hard; getting it exactly right is going to be more difficult.
> And it's that which makes the lockfree algorithms fragile.
>

Required? So, you're allowed to use someone else's synchronized
queue implementation only if it uses locks but not if it is lock-free?
That doesn't make a lot of sense.

With regards to the complexity or trickiness of lock-free algorithms,
there are a number of people who believe that for normal programmers
multi-threaded programming (SMP) is too complex and too easy to screw up.
Its usage is actually restricted at some companies.

And I know people who consider computers in general too complex.

So, using your logic we should not only avoid lock-free algorithms,
but multi-threading and, taking it to its logical conclusion, computers
as well.

Joe Seigh

Casper H.S. Dik

unread,
Dec 1, 2003, 4:36:09 AM12/1/03
to
Joe Seigh <jsei...@xemaps.com> writes:

>Required? So, you're allowed to use someone else's synchronized
>queue implementation only if it uses locks but not if it is lock-free?
>That doesn't make a lot of sense.

I'm saying that the system should provide the primitives and that they
should be avoided for direct use by programmers. Not that they shouldn't
be used at all.

SenderX

unread,
Dec 1, 2003, 12:51:00 PM12/1/03
to
> I'm saying that the system should provide the primitives and that they
> should be avoided for direct use by programmers.

A little God complex, eh?

So if I worked for you, would I be unable to experiment and present
patentable ideas based on lock-free algos? Would you say: well, you're not a
sys programmer, so take your ideas elsewhere?

An attitude like that limits some of your employees' imaginations. How can
you identify the genius you may have working for you?


> I'm saying that the system should provide the primitives

What about a user-space library?

I can wrap a non-experimental, stable lock-free pthread library that exposes
many things that will outperform their lock-based counterparts right now.
This library would present simple APIs and easy-to-understand caveats.
Would that be a bad thing?

I suppose you think talented lock-free researchers have no place in the
stable library writing department? Perhaps we could code a library that
would outperform a number of pthread implementations exposed primitives?


> Expressed in this posting are my opinions. They are in no way related
> to opinions held by my employer, Sun Microsystems.

I bet Sun is interested in lock-free research. I think they even spend $$$
on it?

Are they wasting their time?

--
The designer of the experimental, SMP and HyperThread friendly, AppCore
library.

http://AppCore.home.comcast.net


Greg Menke

unread,
Dec 1, 2003, 2:43:56 PM12/1/03
to
"SenderX" <x...@xxx.xxx> writes:


Chill, dude. It's important to allow skeptics to have reservations.

You've made a number of grand claims. I'd like to know the following:

- In what circumstances are locking algorithms preferable to lock-free
alternatives?

- Under the same test scenarios, on different architectures, and as a
function of the # of processors, what are the bus and cpu
utilization costs of your algorithm relative to a conventional lock
approach?

- You've mentioned patents in a scary way several times. What is your
  position regarding others' commercial or noncommercial use and/or
  porting of your appcore algorithm? Your website doesn't speak to
  the issue. To be honest, I don't want to go anywhere near your code
  until that is made very clear.

Gregm

SenderX

unread,
Dec 1, 2003, 5:41:42 PM12/1/03
to
> You've made a number of grand claims. I'd like to know the following;
>
> - In what circumstances are locking algorithms preferable to lock-free
> alternatives?

If you can eliminate "most" contention in the first place, you can use
locks.

When you don't care that much about the performance gain. For most normal
applications, scalability and high-performance doesn't matter.

But server applications with many users, or operating systems that are
"expected" to scale, are good candidates for lock-free design. Maybe a 3d
game that uses multiple threads and wants to guarantee that blocking will
never occur.

Massive online games would benefit from scalable servers... The more
connections a shared collection can handle, the fewer computers you need to
buy.

Lock-free is perfect for real-time as well.

If the server suffers from bottlenecks due to critical sections in shared
collections, lock-free algos will help you there.

Also, if you care about context-switches in your worker threads, lock-free
will help decrease them. If you care about priority inversion in your
critical sections, lock-free will cure that as well.


> - Under the same test scenarios, on different architectures, and as a
> function of the # of processors, what are the bus and cpu
> utilization costs of your algorithm relative to a conventional lock
> approach?

I will include some details regarding that when I release my pthread
lock-free collection library.


If many threads are dequeuing unrelated objects (bad locality) from a heavily
contended lock-based shared queue, cache misses and bus traffic will be high
anyway. As an application programmer in general, you should be aware of data
locality.


Well, for a single update on a shared collection, a conventional lock would
have to do something like:

1. Load a shared lock value.

2. Check to see if the shared lock value is a certain value, by atomic
instruction.

3a. If the shared value is not a certain value (pessimistic), it has to go
into a shared waitset algo, which costs as well (kernel involvement).

3b. If the shared value is the value expected, it atomically stores a new
value to indicate to other threads that the critical section is locked.

< Critical Section Start >

4. It has to load the shared data that the critical section protects.

5. It has to modify and store the shared data.

< Critical Section End >

6. It has to atomically unlock the mutex via atomic instruction.

7. It has to wake contending threads in the shared waitset, which costs more
kernel time.

8. A contending thread will be woken, and go right back to step 1, because a
concurrent thread may slip in and acquire the mutex between the unlock (
step 6 ) and the waitset wake ( step 7 ).

That is 8 expensive steps.

8 steps for a single update, mixed with kernel and thread wakes?
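
For reference, a minimal sketch of that lock-based path in POSIX C (the
stack and node types here are made up for illustration, not taken from
anyone's actual library):

#include <pthread.h>
#include <stddef.h>

typedef struct node { struct node *next; void *data; } node_t;

typedef struct {
    pthread_mutex_t lock;  /* steps 1-3 and 6-8 all happen in here */
    node_t *head;
} lstack_t;

void lstack_push(lstack_t *s, node_t *n)
{
    pthread_mutex_lock(&s->lock);    /* may spin, may sleep in the kernel */
    n->next = s->head;               /* step 4: load the shared data */
    s->head = n;                     /* step 5: modify and store it */
    pthread_mutex_unlock(&s->lock);  /* steps 6-7: unlock, maybe wake waiters */
}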


With a lock-free approach, a thread does this to update the stack, queue, etc.:

1. Load the "actual stack header itself" from shared memory.

2. Update the "local" copy.

3a. If the shared data stayed the same (optimistic), it atomically changes
the shared critical data directly.

3b. If the shared data changed go to step 1.

Three steps for a single critical-section update is easier on the system.
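
A minimal sketch of that three-step CAS loop for a lock-free stack push,
using GCC's __sync builtins (the types are illustrative; note that push is
ABA-safe, but pop is not shown because it needs ABA protection such as a
version tag or safe memory reclamation):

typedef struct node { struct node *next; void *data; } node_t;
typedef struct { node_t *head; } lfstack_t;

void lfstack_push(lfstack_t *s, node_t *n)
{
    node_t *old;
    do {
        old = s->head;   /* step 1: load the shared header */
        n->next = old;   /* step 2: update the local copy */
    } while (!__sync_bool_compare_and_swap(&s->head, old, n));
                         /* step 3: CAS; on failure, back to step 1 */
}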


A simple pause instruction will reduce CAS failures in lock-free algos. An
SMP system can assign an individual backoff value to each CPU, and
incorporate that into the algo. There is a straightforward method, made up
of the above tricks and some others, to dramatically decrease the number of
CAS failures. I am not sure if I want to present it here, but I have seen it
avoid millions of CAS failures in extended testing.
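
A rough sketch of that backoff idea (the per-CPU backoff value described
above is approximated here by a simple capped exponential delay; x86/GCC
inline asm, reusing the lfstack_t/node_t types from the sketch above):

static inline void cpu_pause(void) { __asm__ __volatile__("pause"); }

void lfstack_push_backoff(lfstack_t *s, node_t *n)
{
    unsigned delay = 1;
    for (;;) {
        node_t *old = s->head;
        n->next = old;
        if (__sync_bool_compare_and_swap(&s->head, old, n))
            return;
        /* CAS failed: another thread made progress; back off briefly */
        for (unsigned i = 0; i < delay; ++i)
            cpu_pause();
        if (delay < 1024)  /* cap is arbitrary; tune per machine */
            delay <<= 1;
    }
}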

Lock-free reads are VERY cache friendly.

The big reduction in context switches provided by non-blocking algos, and an
application design that takes data locality into account, will be good to
the cache.

> - You've mentioned patents in a scary way several times. What is your
> position regarding others' commercial or noncommercial use and/or
> porting of your appcore algorithm? Your website doesn't speak to
> the issue. To be honest, I don't want to go anywhere near your code
> until that is made very clear.

Well, the algos that AppCore presents are open-source; if you provide a
link to the website, you can use and change them.

I may patent some of my new garbage collector designs, and a couple of
lock-free collections.

Those algos are not presented anywhere on the AppCore site.

I did however, release a stable 32-bit collector into the public domain, but
I don't want to release my 64-bit collector. I would rather create a
lock-free 64-bit library and license its use.

The "new" pthread AppCore will use algos that will not exist on the AppCore
site.


The current Win32 only AppCore Lock-Free Communication System uses an algo I
presented to this group a while back:

http://groups.google.com/groups?selm=plCAa.475147%24Si4.412103%40rwcrnsc51.ops.asp.att.net&rnum=1

Here is the implementation:

http://appcore.home.comcast.net/src/LockFree.c

So the AppCore System can be ported to other operating systems and 64-bit
processors.

;)

David Schwartz

unread,
Dec 1, 2003, 6:58:39 PM12/1/03
to

"SenderX" <x...@xxx.xxx> wrote in message
news:_gwyb.374819$Fm2.389232@attbi_s04...

> > Contention free algorithms are much nicer.
>
> Hell yeah!
>
> =)
>
> But, when you must code for contention, lock-free algos provide an elegant
> solution.

Actually, they provide a horribly inelegant solution. Waiting until
there is no contention is elegant. Suffering through the contention at
reduced performance and wasted CPU usage is inelegant.

> They gracefully handle contention, by allowing constant forward progress.
> Every CAS failure, means another thread moved forward through the algo. No
> CAS has to wait for any specific value.

You mean they allow threads to waste CPU and saturate the FSB rather
than scheduling tasks that can run without contention.

> > A lock free algorithm requires you to implement the equivalent
> > of mutex_lock()+ each time; that's far beyond the reach of
> > most programmers;
>
> What if the lock-free algos exist in a stable pthread compatible library,
> opaque to the user?

They're still only suitable for those cases where the tremendously
increased contention is balanced out by some other gain.

> How is a "correct" lock-free algo fragile?
>
> Many correct and stable lock-free algos have been posted to this group.

Because if the circumstances are such (or change to be such) that the
reduced context switches no longer pay off for the increased contention and
CPU consumption, performance will start to get really bad. A lock-free
algorithm is fragile in the sense that its benefits are quickly lost if you
are doing anything else (free from contention) at the same time.

You still haven't seemed to grasp my fundamental point. Imagine you
have 2 CPUs, with threads A and B running and contending for the same
resources and thread C ready-to-run and wanting to use resources that are
not contended for. A lock-free algorithm will let A and B continue to run at
FSB speed, using up both CPUs and making little forward progress. A locking
algorithm will deschedule B, and let A and C run concurrently, both at full
speed.

The problem is, whether or not there are other threads ready-to-run at
the same time that won't contend is something outside the scope of the
algorithm itself. So lock-free algorithms are fragile -- change something
totally outside the algorithm and not only does their performance drop, but
the performance of other tasks drops too (due to FSB saturation).

They scale horribly as well. The more CPUs you add, the worse the
problem gets. The more other threads you add, the worse the problem gets.

Yes, you can point to specific situations (particularly in OS code where
you have much more control) where they're a big win. But you can also point
to specific cases where you'd rather have an 18-wheeler than a Chevy; that
doesn't mean it's appropriate to recommend an 18-wheeler to someone who is
learning how to drive.

You can only responsibly elect to use a lock-free algorithm in two
cases. One is where performance doesn't matter and you know the lock-free
algorithm is correct. The other is where you totally understand the
performance implications.

DS


SenderX

unread,
Dec 1, 2003, 7:36:46 PM12/1/03
to
> They scale horribly as well. The more CPUs you add, the worse the
> problem gets. The more other threads you add, the worse the problem gets.

You have NEVER worked with a lock-free collection in your life then.

Lock-free collections scale badly? Tell that to the implementers of Java and
the Linux scalability people...

I challenge you to provide me with papers that show well-designed lock-free
collections do not scale. I suppose the Linux scalability effort is wrong
for using RCU?

I have asked you to code a well-published IBM algo and test it against a
lock-based version, but you have not come through... Are you afraid that you
will prove yourself 100% wrong? I think that has to be it.

Since you do not want to code a lock-free collection for yourself, look at
this one:

http://groups.google.com/groups?selm=3EF64AA8.180483F2%40xemaps.com

See how the wait-free algos outperform the lock-based ones? This small
example renders your stupid claim that lock-free doesn't scale well moot.
Really!

I can provide code that people can test for themselves, you seem to refuse
to do so.

Dave, present mutex-based linked-list code to this group that emulates the
IBM algo I want you to try. Your code can exploit ZERO lock-free techniques.
I will present a x86 lock-free version of your code. Different people can
test. My code will win.

:)


Rich Teer

unread,
Dec 1, 2003, 8:51:24 PM12/1/03
to
On Tue, 2 Dec 2003, SenderX wrote:

> Lock-free collections scale badly? Tell that to the implementers of Java and
> the Linux scalability people...

I'm no lock-free collection expert, but I wouldn't be using Linux
as an example of good scalability. Last I heard, the Linux kernel
starts having scalability problems beyond 4 (or was it 8) CPUs.

Compare that with Solaris' > 90% linearity all the way up to 100 way.

SenderX

unread,
Dec 1, 2003, 9:02:42 PM12/1/03
to
> I'm no lock-free collection expert, but I wouldn't be using Linux
> as an example of good scalability. Last I heard, the Linux kernel
> starts having scalability problems beyond 4 (or was it 8) CPUs.

That's the reason for the Linux scalability effort...


Greg Menke

unread,
Dec 1, 2003, 9:57:03 PM12/1/03
to
"SenderX" <x...@xxx.xxx> writes:

> > You've made a number of grand claims. I'd like to know the following;
> >
> > - In what circumstances are locking algorithms preferable to lock-free
> > alternatives?
>
> If you can eliminate "most" contention in the first place, you can use
> locks.
>
> When you don't care that much about the performance gain. For most normal
> applications, scalability and high performance don't matter.
>
> But server applications with many users, or operating systems that are
> "expected" to scale, are good candidates for lock-free design. Maybe a 3d
> game that uses multiple threads and wants to guarantee that blocking will
> never occur.
>
> Massive online games would benefit from scalable servers... The more
> connections a shared collection can handle, the fewer computers you need to
> buy.
>
> Lock-free is perfect for real-time as well.
>
> If the server suffers from bottlenecks due to critical sections in shared
> collections, lock-free algos will help you there.
>
> Also, if you care about context-switches in your worker threads, lock-free
> will help decrease them. If you care about priority inversion in your
> critical sections, lock-free will cure that as well.

If you're proposing lock-free algorithms as a universal tool, I think
you aren't adequately aware of their implications. Yes, I know all
about the doctrine and the reasoned conclusions; you've repeated them
several times. However, what you have not yet presented is data.

Show me the tests - demo code and results - of your algorithm that
prove it's always more efficient and always preferable to locking
algorithms for any contention problem on any architecture. If you
don't have the evidence to support that claim, then please show me the
evidence that indicates which architectures and which problems it's best
suited to. Anecdotal experience with the algorithm is not enough,
nor is reasoning about it. If you're pitching lock-free as a panacea
that all OS developers are fools not to exploit, you'd better have
some damn good repeatable and relevant measurements to back it up.


> > - You've mentioned patents in a scary way several times. What is your
> > position regarding others' commercial or noncommercial use and/or
> > porting of your appcore algorithm? Your website doesn't speak to
> > the issue. To be honest, I don't want to go anywhere near your code
> > until that is made very clear.
>
> Well, the algos that AppCore present are open-source and if you provide a
> link to the website, and you can use and change them.

Please change the comments in the sourcecode to reflect that. I will
scrupulously avoid looking at any of your code until you do- perhaps
simply including the BSD license or something like it may be a
reasonable solution. You've succeeded in making me paranoid about
your "licensing" goals.

Gregm

David Schwartz

unread,
Dec 2, 2003, 2:42:30 AM12/2/03
to

"SenderX" <x...@xxx.xxx> wrote in message
news:y8Ryb.275145$9E1.1431062@attbi_s52...

> Dave, present mutex based linked-list code to this group that emulates the
> IBM algo I want you to try. You code can exploit ZERO lock-free
techniques.
> I will present a x86 lock-free version of your code. Different people can
> test. My code will win.

Your algorithm will lose every time on a 2 CPU machine with four
threads, two of which are trying to use one lock-free linked list and two of
which are trying to use the other.

Think about it.

On a linked list with locks, one thread will always be running on each
linked list. If at any time two threads try to use the same linked list,
they'll hit a lock and one of them will be descheduled. The threads will run
at full speed and there will be zero contention.

On a lock-free linked list, you will frequently have two threads running
accessing the same linked-list. There will be massive contention.

The measured performance of the lock-free algorithm will heavily depend
upon the test environment. On benchmark code (especially where the system is
quiet and all the threads just mess with the same collection), the lock-free
code will win.

Just answer me one question. If you were being billed for CPU time,
would you prefer a lock-free algorithm or a lock-based one? Remember, with
the lock-based one, no CPU time is wasted due to contention, threads are
immediately descheduled. With the lock-free one, threads will run through
contention and burn CPU time as cache lines ping-pong across the FSB.

DS


Logan Shaw

unread,
Dec 2, 2003, 4:16:22 AM12/2/03
to
David Schwartz wrote:

> Your algorithm will lose every time on a 2 CPU machine with four
> threads, two of which are trying to use one lock-free linked list and two of
> which are trying to use the other.
>
> Think about it.
>
> On a linked list with locks, one thread will always be running on each
> linked list. If at any time two threads try to use the same linked list,
> they'll hit a lock and one of them will be descheduled. The threads will run
> at full speed and there will be zero contention.
>
> On a lock-free linked list, you will frequently have two threads running
> accessing the same linked-list. There will be massive contention.

I'm not sure it's as simple as this. The lock free algorithm seems
like it might be a win if the critical section is really really short.
Yes, the other threads will busy wait if there is contention, but
if the critical section is only, say, 10 instructions long (change a
pointer or increment an integer or something), that's not a big deal
because there will have to be a huge amount of contention before the
busy-waiting threads waste an appreciable amount of CPU time.

Also, since this discussion is cross-posted to comp.unix.solaris,
doesn't Solaris have an adaptive mutex implementation? If I understand
correctly, this means that on a two CPU Solaris machine, the version
with locks will also have the loser thread spinning and waiting for
the resource to come free. If it doesn't within a period of time,
only then will the thread be put to sleep. So, the version with
locks will suffer some of the same CPU wastage as the lock free version.
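
For illustration, a simplified spin-then-block sketch of that adaptive idea
in POSIX C (this is not Solaris' actual algorithm, which also checks whether
the lock owner is currently running on another CPU; the spin count here is
arbitrary):

#include <pthread.h>

#define SPIN_TRIES 100  /* arbitrary; a real implementation adapts this */

static inline void cpu_pause(void) { __asm__ __volatile__("pause"); }

void adaptive_lock(pthread_mutex_t *m)
{
    for (int i = 0; i < SPIN_TRIES; ++i) {
        if (pthread_mutex_trylock(m) == 0)
            return;          /* acquired while spinning */
        cpu_pause();
    }
    pthread_mutex_lock(m);   /* give up spinning and sleep in the kernel */
}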

> Just answer me one question. If you were being billed for CPU time,
> would you prefer a lock-free algorithm or a lock-based one? Remember, with
> the lock-based one, no CPU time is wasted due to contention, threads are
> immediately descheduled. With the lock-free one, threads will run through
> contention and burn CPU time as cache lines ping-pong across the FSB.

It seems like this might depend on the hardware. If CPU A is
constantly reading from location X and CPU B does some stuff and then
writes to it, it seems like location X will stay in CPU A's cache and
all will be well (relatively). If CPUs C and D are also reading from
location X, is that a problem if the system has a good bus-snooping
cache coherency algorithm? (It seems like it'd be safe for all processors
to have a cached copy of the memory and read from it.) I guess things
get really complicated on a NUMA machine...

- Logan

David Schwartz

unread,
Dec 2, 2003, 4:22:53 AM12/2/03
to

"Logan Shaw" <lshaw-...@austin.rr.com> wrote in message
news:GLYyb.72566$Ek.3...@twister.austin.rr.com...

> David Schwartz wrote:

> > Your algorithm will lose every time on a 2 CPU machine with four
> > threads, two of which are trying to use one lock-free linked list and
> > two of which are trying to use the other.
> >
> > Think about it.
> >
> > On a linked list with locks, one thread will always be running on
> > each linked list. If at any time two threads try to use the same linked
> > list, they'll hit a lock and one of them will be descheduled. The threads
> > will run at full speed and there will be zero contention.
> >
> > On a lock-free linked list, you will frequently have two threads
> > running accessing the same linked-list. There will be massive contention.

> I'm not sure it's as simple as this. The lock free algorithm seems
> like it might be a win if the critical section is really really short.

It doesn't really matter how big the critical section is. The CPU is
probably very fast compared to the cache coherency bus.

> Yes, the other threads will busy wait if there is contention, but
> if the critical section is only, say, 10 instructions long (change a
> pointer or increment an integer or something), that's not a big deal
> because there will have to be a huge amount of contention before the
> busy-waiting threads waste an appreciable amount of CPU time.

Every time the CPU has to wait for the shared memory to pass over the
bus, CPU time is wasted.

> Also, since this discussion is cross-posted to comp.unix.solaris,
> doesn't Solaris have an adaptive mutex implementation? If I understand
> correctly, this means that on a two CPU Solaris machine, the version
> with locks will also have the loser thread spinning and waiting for
> the resource to come free. If it doesn't within a period of time,
> only then will the thread be put to sleep. So, the version with
> locks will suffer some of the same CPU wastage as the lock free version.

Not if the adaptive mutexes adapt correctly. Sufficient contention
should cause the mutex to adapt and block. Once the mutex blocks once, the
thread that can run without contention will use up a full timeslice. The big
loss only happens when both threads continue blundering through the
contention for all or most of their timesilces.

> > Just answer me one question. If you were being billed for CPU time,
> > would you prefer a lock-free algorithm or a lock-based one? Remember,
> > with the lock-based one, no CPU time is wasted due to contention, threads
> > are immediately descheduled. With the lock-free one, threads will run
> > through contention and burn CPU time as cache lines ping-pong across the
> > FSB.

> It seems like this might depend on the hardware. If location CPU A is
> constantly reading from location X and CPU B does some stuff and then
> writes to it, it seems like location X will stay in CPU A's cache and
> all will be well (relatively).

No, because the first read after the write will cause the data to become
shared and the next write will unshare it.

> If CPUs C and D are also reading from
> location X, is that a problem if the system has a good bus snooping
> cache coherency algorithm? (It seems like it'd be safe for all processors
> to have a cached copy of the memory and read from it.) I guess things
> get really complicated on a NUMA machine...

It's not a problem, it's just very slow. Contention is bad. Lock-free
algorithms are ways to maximize contention. Locks are ways to avoid
contention by switching to a thread that can run without contention.

Yes, this is deliberately oversimplified. It's intended to be the
rebuttal to "context switches are bad, locking algorithms maximize context
switches while lock-free algorithms avoid them."

Locking algorithms only cause a context switch when there's contention.
If the thread that switches out would hit more contention but the thread
that switches in can run for a long time without contention, then that's a
win. If not, it can be a loss.

DS


Joe Seigh

unread,
Dec 2, 2003, 7:36:33 AM12/2/03
to

Rich Teer wrote:
>
> On Tue, 2 Dec 2003, SenderX wrote:
>
> > Lock-free collections scale badly? Tell that to the implementers of Java
> > and the Linux scalability people...
>
> I'm no lock-free collection expert, but I wouldn't be using Linux
> as an example of good scalability. Last I heard, the Linux kernel
> starts having scalability problems beyond 4 (or was it 8) CPUs.

I don't know what the practical limit for Linux is currently but the
fact is that RCU, a lock-free algorithm, played a key role in improving
the scalability of Linux.


>
> Compare that with Solaris' > 90% linearity all the way up to 100 way.
>

I don't think anyone actually stated how they did this. I can make
some pretty good guesses. However, I can't exclude lock-free. If you don't
know what Solaris is doing then you can't very well state what it is
not doing.

Joe Seigh

Joe Seigh

unread,
Dec 2, 2003, 8:17:54 AM12/2/03
to

David Schwartz wrote:
...


>
> Yes, this is deliberately oversimplified. It's intended to be the
> rebuttal to "context switches are bad, locking algorithms maximize context
> switches while lock-free algorithms avoid them."
>
> Locking algorithms only cause a context switch when there's contention.
> If the thread that switches out would hit more contention but the thread
> that switches in can run for a long time without contention, then that's a
> win. If not, it can be a loss.
>

I don't want to get into deconstructing your arguments to figure out what
you are really saying and then having to argue about what you are arguing
about. But are you talking about FIFO locks or about locks which allow
queue jumping?

Joe Seigh

Rich Teer

unread,
Dec 2, 2003, 12:08:43 PM12/2/03
to
On Tue, 2 Dec 2003, Joe Seigh wrote:

> I don't know what the practical limit for Linux is currently but the
> fact is that RCU, a lock-free algorithm, played a key role in improving
> the scalability of Linux.

I won't deny that - but if Linux's scalability increased from (say)
2 way to 16 way as a result of this algorithm, that's still not that
big of a deal. It's been a LONG time since Sun's biggest machine had
only 16 processors, for example.

> I don't think anyone actually stated how they did this. I can make
> some pretty good guesses. However, I can't exclude lock-free. If you don't
> know what Solaris is doing then you can't very well state what it is
> not doing.

I have access to the Solaris 8 source, through the (now discontinued)
Community Source program. However, I haven't trawled through all the
code to see how it handles this sort of thing - and even if I did, the
NDA I signed would prohibit me from releasing any details.

That notwithstanding, one can look at the data structures in /usr/include/sys,
and gather from the number of locks in structures that a fair amount
of locking is going on. But I don't doubt that some places use
lock-free algorithms when it's useful. As someone else said, Solaris
uses adaptive mutexes, so although the design uses locks, it's possible
that at a given time a lock will just spin until the resource becomes
available.

David Schwartz

unread,
Dec 2, 2003, 2:59:42 PM12/2/03
to

"Joe Seigh" <jsei...@xemaps.com> wrote in message
news:3FCC9461...@xemaps.com...

> David Schwartz wrote:

Either. Why do you think that makes any difference? Either way if a
thread tries to use a resource that another thread is using, it is not
permitted to make forward progress until the first thread is finished, so
the threads don't fight for cache lines over the FSB.

DS


SenderX

unread,
Dec 2, 2003, 3:38:02 PM12/2/03
to
> Show me the tests - demo code and results - of your algorithm that
> prove it's always more efficient and always preferable to locking
> algorithms for any contention problem on any architecture.

Well...

I will go ahead and post a couple of forced high-contention demos in a
couple of days, OK?

It will have a lock-free stack && queue.

It will also have a lock-based stack and queue.

It will run 2 tests, a lock-free and lock-based one.

256+ threads will be created.

The main thread will post 350,000+ reference-counted items.

The 256+ threads will cycle the items between themselves, and drop the items
when their references drop to zero. During the cycling the main thread will
send off commands to the workers, instructing them to iterate the queue
and/or stacks. A lock-based version will die right here.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Memory pools will be used as well, to eliminate as many calls to new and
free as possible...
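
To give a feel for the reference counting involved, here is a hedged sketch
of an atomic acquire/release pair using GCC's __sync builtins (the item type
is made up for illustration; a real version would return items to the memory
pool rather than call free):

#include <stdlib.h>

typedef struct item {
    int refs;
    void *payload;
} item_t;

void item_acquire(item_t *it)
{
    __sync_fetch_and_add(&it->refs, 1);
}

void item_release(item_t *it)
{
    if (__sync_sub_and_fetch(&it->refs, 1) == 0)
        free(it);  /* last reference dropped */
}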

You will be able to test it for yourself.

I don't want to use a read/write lock, because that exploits lock-free
iteration. The lock-based version cannot use ANY lock-free techniques.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Agreed?

I already know who will win, but you will be able to see for yourself...

The lock-free code "will be" in Intel-style x86 && win32-pthreads, and will
need inline asm or MASM.

You can port to your platform of choice.

Does this sound OK?

I won't post the test until you get back to me saying that the proposed test
will be OK.

I don't want to post a test, and get feedback saying its not a good test!

;)

I don't really want to post a lock-free linked-list test ( patent reasons ),
unless David S. takes my challenge I proposed to him...


> Please change the comments in the sourcecode to reflect that. I will
> scrupulously avoid looking at any of your code until you do- perhaps
> simply including the BSD license or something like it may be a
> reasonable solution. You've succeeded in making me paranoid about
> your "licensing" goals.

Sorry. The new site will have what you require.

AppCore is going through radical changes.

The one that is there now is basically for your experimentation purposes and
leisure. You can use the algos and change them.

SenderX

unread,
Dec 2, 2003, 3:42:43 PM12/2/03
to
> It will have a lock-free stack && queue.

I will "also" show lock-free semaphore vs. a lock-based semaphore as well...

The performance difference will blow you away...

=)


SenderX

unread,
Dec 2, 2003, 3:44:52 PM12/2/03
to
To clarify...

The lock-based test WILL NOT USE ANY lock-free optimizations.

The test is to compare lock-based tech with lock-free tech, so lock-based
has to be true to itself.

