
A Lock-Free vs. Lock-Based Test Here...


SenderX

Dec 4, 2003, 1:33:30 AM
The lock-free code avoids millions and millions of kernel calls!

Just look at the performance monitor differences on kernel usage between the
lock-based and lock-free versions!

Wor...

David S... You are wrong about lock-free. Dead wrong!

--
The designer of the experimental, SMP and HyperThread friendly, AppCore
library.

http://AppCore.home.comcast.net


SenderX

Dec 4, 2003, 1:36:02 AM
> Wor...
^^^^^^

>
> David S... You are wrong about lock-free. Dead wrong!

I mean...

WOW!

:)


David Schwartz

Dec 4, 2003, 2:49:59 AM

"SenderX" <x...@xxx.xxx> wrote in message
news:ZyAzb.22587$_M.98466@attbi_s54...

> The lock-free code avoids millions and millions of kernel calls!

Wow.

> Just look at the performance monitor differences on kernel usage between the
> lock-based and lock-free versions!
>
> Wor...
>
> David S... You are wrong about lock-free. Dead wrong!

So you're saying that code that uses locks must use more kernel calls
than lock-free code? Isn't it possible to implement any lock-free algorithm
as kernel calls? And isn't it possible to implement locks totally in user
space?

Yes, I will agree that in general, code that uses fewer kernel calls is
likely to outperform code that uses more kernel calls. For code that uses
equal numbers of kernel calls on P4 x86 SMP platforms, code that uses fewer
locked instructions will tend to outperform code that uses more locked
instructions.

How about comparing comparable code? How about comparing:

acquire_fast_lock();
i++;
j++;
release_fast_lock();

to

atomic_increment(&i);
atomic_increment(&j);
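
For concreteness, one plausible C11 reading of the primitives being compared
(the names come from the post; the bodies are assumptions: a user-space
test-and-set lock and a fetch-and-add):

#include <stdatomic.h>

static atomic_flag fast_lock = ATOMIC_FLAG_INIT;
static atomic_int i, j;

static void acquire_fast_lock(void) {
    while (atomic_flag_test_and_set(&fast_lock))
        ;                       /* spin in user space; no kernel call */
}

static void release_fast_lock(void) {
    atomic_flag_clear(&fast_lock);
}

static void atomic_increment(atomic_int *p) {
    atomic_fetch_add(p, 1);     /* e.g. a single LOCK XADD on x86 */
}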

DS


Alexander Terekhov

Dec 4, 2003, 7:52:59 AM

David Schwartz wrote:
[...]

> So you're saying that code that uses locks must use more kernel calls
> than lock-free code? Isn't it possible to implement any lock-free algorithm
> as kernel calls? And isn't it possible to implement locks totally in user
> space?

That's possible, of course. What you can't do without making kernel
calls [unless you have two level scheduler] is forcefully suspend a
thread and let some other thread try to make some progress. But what
if you simply have no other thread ready to make some progress (I
mean that each thread has a processor to run on)? Unlock "is likely"
to let waiter(s) "go ahead" immediately.

>
> Yes, I will agree that in general, code that uses fewer kernel calls is
> likely to outperform code that uses more kernel calls. For code that uses
> equal numbers of kernel calls on P4 x86 SMP platforms, code that uses fewer
> locked instructions will tend to outperform code that uses more locked
> instructions.
>
> How about comparing comparable code? How about comparing:
>
> acquire_fast_lock();
> i++;
> j++;
> release_fast_lock();
>
> to
>
> atomic_increment(&i);
> atomic_increment(&j);

That's not really comparable. Compare (assume no false-sharing)

atomic_increment(&i);
atomic_increment(&j);

to

acquire_fast_lock(l1);
i++;
release_fast_lock(l1);
acquire_fast_lock(l2);
j++;
release_fast_lock(l2);

regards,
alexander.

SenderX

Dec 4, 2003, 2:14:33 PM
> > David S... You are wrong about lock-free. Dead wrong!
>
> So you're saying that code that uses locks must use more kernel calls
> than lock-free code?

Hell yeah I am!

The test proves that beyond doubt.

You have to use lock-free algos in the lock-based primitives in order to
scale up to a 100% lock-free design.


> Isn't it possible to implement any lock-free algorithm
> as kernel calls?

Why in the world would you want to even think about doing that!?!??!!??!

Lock-free algos in the kernel should NOT be exposed to users.

Lock-free algos in userland are fine.


> And isn't it possible to implement locks totally in user
> space?

That is called a spin-lock. ( yuck! )

To answer your question, you have to use a kernel object as a waitset.

> For code that uses
> equal numbers of kernel calls on P4 x86 SMP platforms, code that uses fewer
> locked instructions will tend to outperform code that uses more locked
> instructions.

Most lock-free algos don't use the kernel AT ALL.

Look at my test code.

> How about comparing comparable code? How about comparing:
>
> acquire_fast_lock();
> i++;
> j++;
> release_fast_lock();

^^^^^^^^^^^^^^^^^^^^^^^^

You are saying I should use a lock-free algo in the mutex?!?

Thank you very much, you have proved my point.

Lock-free algos will improve the performance of lock-based ones virtually
every time.

CRITICAL_SECTION uses a lock-free algo for the first point of contention,
see? It's a lock-free algo that makes this possible.
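
The fast-path design being alluded to is commonly sketched like this (an
assumption about the general shape, not Windows' actual CRITICAL_SECTION
source): one atomic op when uncontended, a kernel waitset only under
contention.

#include <stdatomic.h>
#include <semaphore.h>

typedef struct {
    atomic_int lock_count;   /* -1 == unowned */
    sem_t      waitset;      /* kernel object, touched only on contention */
} fastlock;

void fastlock_init(fastlock *fl) {
    atomic_init(&fl->lock_count, -1);
    sem_init(&fl->waitset, 0, 0);
}

void fastlock_acquire(fastlock *fl) {
    /* Lock-free fast path: no kernel call if we went -1 -> 0. */
    if (atomic_fetch_add(&fl->lock_count, 1) >= 0)
        sem_wait(&fl->waitset);  /* slow path: block in the kernel */
}

void fastlock_release(fastlock *fl) {
    /* If someone was waiting, wake exactly one waiter. */
    if (atomic_fetch_sub(&fl->lock_count, 1) > 0)
        sem_post(&fl->waitset);
}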


>
> to
>
> atomic_increment(&i);
> atomic_increment(&j);

Now David... I know you're not stupid!

If you think that is comparable code, you're insane, not stupid.


P.S.

( It would only be comparable to DCAS, which would blow it away )

SenderX

Dec 4, 2003, 2:15:50 PM
> That's not really comparable. Compare (assume no false-sharing)
^^^^^^^^^^^^^^^^^^^^^^^

Yup.

>
> atomic_increment(&i);
> atomic_increment(&j);
>
> to
>
> acquire_fast_lock(l1);
> i++;
> release_fast_lock(l1);
> acquire_fast_lock(l2);
> j++;
> release_fast_lock(l2);

That's correct sir!

Thanks Alex for pointing that out to David. He seems like he doesn't believe
anything I say...


SenderX

Dec 4, 2003, 2:51:43 PM
> Lock-free algos will improve the performance of lock-based ones virtually
> every time.

Sorry about that. I am finding it hard to break my hard-coded "lock-free
beats all" rule...

;(

I have agreed with David B. on this issue.

What I should have said was:

Lock-free algos will "help" lock-based algos scale better.

Lock-Free is NOT a cure all. It is an elegant tool to help lock-based algos
and systems along the path of scalability.

David Schwartz

Dec 4, 2003, 4:51:37 PM

"Alexander Terekhov" <tere...@web.de> wrote in message
news:3FCF2E2B...@web.de...

> > How about comparing comparable code? How about comparing:
> >
> > acquire_fast_lock();
> > i++;
> > j++;
> > release_fast_lock();
> >
> > to
> >
> > atomic_increment(&i);
> > atomic_increment(&j);
>
> That's not really comparable. Compare (assume no false-sharing)

Yes, it is comparable. See my reply on alt.winsock.programming.

> atomic_increment(&i);
> atomic_increment(&j);
>
> to
>
> acquire_fast_lock(l1);
> i++;
> release_fast_lock(l1);
> acquire_fast_lock(l2);
> j++;
> release_fast_lock(l2);

Except that lock-free programmers actually do write the first type of
code and programmers who use locks never write the second type. So such a
comparison would be of lock-free algorithms the way they're actually used
against lock-based algorithms the way no sane person would use them.

See my more detailed reply on alt.winsock.programming.

DS


SenderX

Dec 4, 2003, 5:04:40 PM
> > That's not really comparable. Compare (assume no false-sharing)
>
> Yes, it is comparable. See my reply on alt.winsock.programming.

NOPE! It's NOT comparable AT ALL!

Get a clue David?

Let's keep this discussion on c.p.t please!


David Schwartz

Dec 4, 2003, 6:32:36 PM

"SenderX" <x...@xxx.xxx> wrote in message
news:XbOzb.425317$Tr4.1199432@attbi_s03...

Okay, let's keep it simple. I have twenty different FIFOs. I want to add
twenty elements to each of the twenty FIFOs. I'm on an SMP P4. Please show
me how to do this using lock-free algorithms with just a single atomic
instruction. It's not difficult to do it using locks.

With lock-based algorithms, you can tune the granularity of the locking.
This is one of the major ways Solaris achieved the scalability that Linux
can only dream about. It is, AFAIK, impossible to similarly tune a lock-free
algorithm.

DS


SenderX

Dec 4, 2003, 7:22:58 PM
> Okay, let's keep it simple. I have twenty different FIFOs. I want to add
> twenty elements to each of the twenty FIFOs. I'm on an SMP P4. Please show
> me how to do this using lock-free algorithms with just a single atomic
> instruction. It's not difficult to do it using locks.

:/

You would have to protect the twenty queues with a SINGLE lock in order for
that to be comparable with a single lock-free operation. Agreed?

Remember you said SINGLE operation, which means a single critical section.

It may be trivial to do if all the queues were protected by the same lock,
but it is a performance nightmare. I can prove this with code as well,
David. Do you want me to?

My test has the first part of my proof regarding this interesting situation
you presented to me. Have you even looked at the code? Are you having
trouble understanding the advanced lock-free algos you seem to know so much
about?

Now to get back to the question:

I can not, I repeat can not! Add twenty items to twenty different queues
using a single atomic op. No way, no how.


*** HOWEVER ***

I CAN, I repeat CAN. Add twenty elements or a thousand elements, any number
of them, to a SINGLE queue, using 3 atomic operations.

"Method and Apparatus for Inserting Multiple Nodes at Once in a Lock-Free
Queue"

What do ya have to say to that?
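
The patented method itself is not shown in the thread. For flavor, one
well-known exchange-based way to splice an arbitrary-length, privately
pre-linked chain into an intrusive multi-producer queue with a constant
number of atomic ops (illustrative only, not necessarily SenderX's
algorithm):

#include <stdatomic.h>
#include <stddef.h>

typedef struct node {
    _Atomic(struct node *) next;
} node;

typedef struct {
    _Atomic(node *) tail;   /* assumed initialized to point at a stub node */
} mpsc_queue;

/* Append the pre-linked chain first..last; the chain was built privately,
 * so its length never changes the number of atomic operations. */
void queue_push_chain(mpsc_queue *q, node *first, node *last) {
    atomic_store(&last->next, NULL);
    node *prev = atomic_exchange(&q->tail, last);  /* one atomic RMW */
    atomic_store(&prev->next, first);              /* publish the chain */
}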


> With lock-based algorithms, you can tune the granularity of the locking.
> This is one of the major ways Solaris achieved the scalability that Linux
> can only dream about. It is, AFAIK, impossible to similarly tune a lock-free
> algorithm.

My invention that can insert unlimited-multi-objects in a single queue,
using 3 atomic operations.

And my "working & stable" test of "MY" lock-free QUEUE & SUB-QUEUE API's:

http://appcore.home.comcast.net/LockFree/src/LockFree.c

Scroll down to the apis in question.

SEE! Direct control over the granularity of a lock-free algo!

Got it man?

P.S.

This idea of yours ( protecting multi-queues using single lock ) destroys
performance in many ways. My test will prove it.

All you have to do is protect the test SUB-QUEUE API with a single lock in
the LockBased.c file. Go ahead and see how your idea will ruin the
lock-based performance even further.

I can prove you dead wrong with actual code.

You have yet to modify my test in a way that will prove me wrong.

Alexander Terekhov

Dec 5, 2003, 6:08:04 AM

David Schwartz wrote:
[...]

> > atomic_increment(&i);
> > atomic_increment(&j);
> >
> > to
> >
> > acquire_fast_lock(l1);
> > i++;
> > release_fast_lock(l1);
> > acquire_fast_lock(l2);
> > j++;
> > release_fast_lock(l2);
>
> Except that lock-free programmers actually do write the first type of
> code and programmers who use locks never write the second type. So such a
> comparison would be of lock-free algorithms the way they're actually used
> against lock-based algorithms the way no sane person would use them.

All releases of boost that are out there do use "fast locks" to
protect the counts for {tr::}shared_ptr<>.

shared_ptr<T> a(a_);
shared_ptr<T> b(b_);

Many portable COW std::string implementations also use "fast
locks" for counting.

std::string a(a_);
std::string b(b_);

>
> See my more detailed reply on alt.winsock.programming.

I mostly agree. But you push it way too hard. ;-)

regards,
alexander.

SenderX

Dec 5, 2003, 6:19:23 AM
> All releases of boost that are out there do use "fast locks" to
> protect the counts for {tr::}shared_ptr<>.

They should be using the atomic increment function in your portable atomic
template!

I want that template bad!

I could use it for portable atomic ops and barriers to help the portability
of my lock-free algos.

What would it take for your library to become a reality?


> > See my more detailed reply on alt.winsock.programming.
>
> I mostly agree. But you push it way too hard. ;-)

David S. was talking about how protecting 20 "different" shared queues with
a "single" lock, would outperform 20 queues protected by a fine-grain
distributed lock-free algo that I present in the test code I posted.

:O

That is insane!

David Schwartz

Dec 5, 2003, 7:17:53 AM

"SenderX" <x...@xxx.xxx> wrote in message
news:%QZzb.308210$275.1052904@attbi_s53...

> David S. was talking about how protecting 20 "different" shared queues with
> a "single" lock, would outperform 20 queues protected by a fine-grain
> distributed lock-free algo that I present in the test code I posted.

Try it. Use one CPU and one thread.

DS


David Butenhof

Dec 5, 2003, 8:49:19 AM
SenderX wrote:

> I CAN, I repeat CAN. Add twenty elements or a thousand elements, any
> number of them, to a SINGLE queue, using 3 atomic operations.
>
> "Method and Apparatus for Inserting Multiple Nodes at Once in a Lock-Free
> Queue"
>
> What do ya have to say to that?

[...]

> My invention that can insert unlimited-multi-objects in a single queue,
> using 3 atomic operations.

"Invention"? Give me a break. Multiple queue insert has been done probably a
million times over the past 30 or 40 years. (Granted not all, though many,
atomically.)

>> With lock-based algorithms, you can tune the granularity of the
>> locking.

> And my "working & stable" test of "MY" lock-free QUEUE & SUB-QUEUE API's:


>
> http://appcore.home.comcast.net/LockFree/src/LockFree.c
>
> Scroll down to the apis in question.
>
> SEE! Direct control over the granularity of a lock-free algo!
>
> Got it man?

Which is to say, once again -- you can't or WON'T "get it".

> This idea of yours ( protecting multi-queues using single lock ) destroys
> performance in many ways. My test will prove it.

That depends entirely on what you're trying to accomplish and WHY. You're
focused on one narrow path and completely ignoring the rest of the world.
Again, please just spend a fraction of the time and energy you're using to
argue on stretching your mind to construct your own counterexample. It'll
be good for you.

Your test will only prove what it's constructed to prove. Which is merely
that the particular artificial code stream you've created behaves as it
behaves. Your misunderstanding of the term "proof" seems to be fundamental
to your attitude.

> All you have to do is protect the test SUB-QUEUE API with a single lock
> in the LockBased.c file. Go ahead and see how your idea will ruin the
> lock-based performance even further.
>
> I can prove you dead wrong with actual code.
>
> You have yet to modify my test in a way that will prove me wrong.

Don't mistake unwillingness to bother for impossibility of the task or even
the inability of any person to accomplish it. Again, refer back to your
fundamental misunderstanding of the word "proof". ;-)

--
/--------------------[ David.B...@hp.com ]--------------------\
| Hewlett-Packard Company Tru64 UNIX & VMS Thread Architect |
| My book: http://www.awl.com/cseng/titles/0-201-63392-2/ |
\----[ http://homepage.mac.com/dbutenhof/Threads/Threads.html ]---/

David Butenhof

Dec 5, 2003, 8:51:37 AM
SenderX wrote:

>> Lock-free algos will improve the performance of lock-based ones virtually
>> every time.
>
> Sorry about that. I am finding it hard to break my hard-coded "lock-free
> beats all" rule...
>
> ;(
>
> I have agreed with David B. on this issue.
>
> What I should have said was:
>
> Lock-free algos will "help" lock-based algos scale better.
>
> Lock-Free is NOT a cure all. It is an elegant tool to help lock-based algos
> and systems along the path of scalability.

Wow. I may print this post and frame it. ;-)

Alexander Terekhov

Dec 5, 2003, 9:04:34 AM

SenderX wrote:
[...]

> What would it take for your library to become a reality?

Monetarily? ;-) Seriously, one would need a memory model with all
those fancy membars AND compilers understanding that model. The
problem is that it's really more than "a library".

>
> > > See my more detailed reply on alt.winsock.programming.
> >
> > I mostly agree. But you push it way too hard. ;-)
>
> David S. was talking about how protecting 20 "different" shared queues with
> a "single" lock, would outperform 20 queues protected by a fine-grain
> distributed lock-free algo that I present in the test code I posted.

See <http://tinyurl.com/xuyc>.

regards,
alexander.

David Butenhof

Dec 5, 2003, 9:22:14 AM
SenderX wrote:

>> > David S... You are wrong about lock-free. Dead wrong!
>>
>> So you're saying that code that uses locks must use more kernel calls
>> than lock-free code?
>
> Hell yeah I am!

That's completely untrue, since locks (even blocking locks with true
contention, much less uncontended spinlocks) can be implemented within a
process without any kernel calls.

I'll grant at least that you're probably thinking only Windows and Linux,
where "threads" are kernel objects and any thread scheduling requires a
kernel call. Even so, though, not all locks (even "blocking" mutexes) need
actually block... if you can design carefully to avoid contention.



> The test proves that beyond doubt.

Oh, I think it's fairly safe to say that there's still plenty of room for
doubt. Again, your tests, however ambitious they may be, only prove that
they behave as they behave -- which is an intriguing tautology, but a
tautology all the same. They prove nothing about DIFFERENT workloads and
designs.

Y'know, we've more or less shown signs of getting past the "lock-free uber
alles"; can we do something about your use of the word "proof", please?

>> Isn't it possible to implement any lock-free algorithm
>> as kernel calls?
>
> Why in the world would you want to even think about doing that!?!??!!??!
>
> Lock-free algos in the kernel should NOT be exposed to users.
>
> Lock-free algos in userland are fine.

There are certainly likely to be specialized primitives in the kernel that
aren't applicable to user code and shouldn't (or at least needn't) be
exposed.

However, not to get into arguing on this particular point, but there are
some reasons why, sometimes, using a lock-free kernel API might be good.
There are many tradeoffs here, and I'm certainly not suggesting any
implication of "universal truth" here. However, often consolidating common
code into a routine rather than inlining it (against much common wisdom)
improves performance -- because modern processors support sufficiently
lightweight call linkages that the overhead of the call can be
substantially outweighed by the benefit (for commonly used code) of having
the code already in cache, valid branch predictions in place, and so forth.
Taking that a step further, given a sufficiently lightweight kernel call
(something of an oxymoron on Windows and many other systems... but not all
systems) that same code packaged as a kernel syscall, wired into physical
memory for all processes, could provide even more benefit.

>> And isn't it possible to implement locks totally in user
>> space?
>
> That is called a spin-lock. ( yuck! )

No, not at all. Given 2-level scheduling the entire sequence can take place
in user-mode, within the process. No problem at all. Not on Windows and
Linux, of course, or on Solaris 9; but on Tru64 UNIX, AIX, HP-UX...

> To answer your question, you have to use a kernel object as a waitset.

Only on some systems, or when sharing synchronization objects between
processes.

>> For code that uses
>> equal numbers of kernel calls on P4 x86 SMP platforms, code that uses
> fewer
>> locked instructions will tend to outperform code that uses more locked
>> instructions.
>
> Most lock-free algos don't use the kernel AT ALL.

That's not what he said.

> Look at my test code.
>
>> How about comparing comparable code? How about comparing:
>>
>> acquire_fast_lock();
>> i++;
>> j++;
>> release_fast_lock();
> ^^^^^^^^^^^^^^^^^^^^^^^^
>
> You are saying I should use a lock-free algo in the mutex?!?
>
> Thank you very much, you have proved my point.
>
> Lock-free algos will improve the performance of lock-based ones virtually
> every time.
>
> CRITICAL_SECTION uses a lock-free algo for the first point of contention,
> see? It's a lock-free algo that makes this possible.

Another slip, I see, into the "all lock-free or all non-lock free". You're
the only one who's ever cared or tried to create that dichotomy!

(Besides, since spinlocking and fast-path locking predates any use of the
term "lock-free" by decades, I think you're cheating here more than a tiny
bit.)

>> to
>>
>> atomic_increment(&i);
>> atomic_increment(&j);
>
> Now David... I know you're not stupid!
>
> If you think that is comparable code, you're insane, not stupid.

Wow. The castiron pot calling the teal kettle "black"? ;-)

He's addressing the "old SenderX", who was trying to position the entire
world as "all lock-free" or "all non-lock-free", and of course in an "all
lock-free" world you couldn't use the former, could you? The "new SenderX",
who realized this, wouldn't be calling him insane for pointing this out,
right? So please put "old SenderX" back in the box.

SenderX

Dec 5, 2003, 2:36:39 PM
"Invention"? Give me a break. Multiple queue insert has been done probably
a
> million times over the past 30 or 40 years. (Granted not all, though many,
> atomically.)
^^^^^^^^^^^

A lock-free multi-insert algo? Can you point me to a published paper? I am
very interested if they are using the same method I am using...


> Your test will only prove what it's constructed to prove. Which is merely
> that the particular artificial code stream you've created behaves as it
> behaves. Your misunderstanding of the term "proof" seems to be fundamental
> to your attitude.

Yes, the workers execute basically nothing, they just cycle the messages
through the shared queues as fast as possible. The queue iteration executes
nothing, just loops through, touching the memory of each node, as fast as
possible. I wanted the contention and load of the system to focus on the
central shared queue implementations. If the worker threads did "actual
work", file/socket io whatever, it would lower the contention for the shared
queue. It would not focus the load and contention of the system on the code
in question?

Lock-free algos have to be blasted with high-contention situations to catch
bugs quicker... If you don't put them under high load bugs will not appear
nearly as fast as they would in an artificially created high-contention
situation, focused on the lock-free algo in question?

SenderX

Dec 5, 2003, 2:40:33 PM
> > David S. was talking about how protecting 20 "different" shared queues with
> > a "single" lock, would outperform 20 queues protected by a fine-grain
> > distributed lock-free algo that I present in the test code I posted.
>
> Try it. Use one CPU and one thread.

huh? This is threaded programming?

How about 4-cpus and 64 threads?


SenderX

Dec 5, 2003, 3:17:19 PM
> > David S. was talking about how protecting 20 "different" shared queues with
> > a "single" lock, would outperform 20 queues protected by a fine-grain
> > distributed lock-free algo that I present in the test code I posted.
>
> See <http://tinyurl.com/xuyc>.

SMP contending for same lock is bad.

20 "different" queues sharing the "same lock", would focus load on the lock
and generate a bottleneck, not good?

At the very least, the locking should be fine-grain enough for each cpu to
have its own lock and queue?

SenderX

Dec 5, 2003, 5:01:12 PM
> That's completely untrue, since locks (even blocking locks with true
> contention, much less uncontended spinlocks) can be implemented within a
> process without any kernel calls.

I have had spinlocks in a couple of older designs, and they started to give
me bottlenecks when the server application's users started to grow...

I switched to fine-grain lock-based collections for a big performance boost.

Then I started to replace some critical sections that were protected by
locks, with lock-free algos to get even better performance... In some key
areas. Like the lock-free iteration of a large-collection...

Spinlocks...

I suppose the pause opcode makes them easier on the bus but...

I believe contending spinlocks will contend for the same shared memory?...
The lock state.

The contending spinning threads are checking a shared location with an
atomic op most likely, and they have to use the lock prefix.

That would be contention at the same level as a lock-free algo?

The only difference is:

Spinlock

When a spinlock fails, it means that another thread is in the critical
section that the spinlock is protecting, agreed?

If the same spinlock atomic op fails twice, three, maybe 50 times? It could
mean that the same thread that caused it to fail so many times before, could
have been in the exact same critical section? I mean the thread in the
critical section could hit a page fault, or context-switch to another
application... Then the waiting threads are crashing the bus making no
forward progress...

If you are going to make no forward progress on spinlock atomic op failure,
you would benefit from blocking big time. Right?

LockFree Cas

When a lock-free CAS fails, it means that the thread that caused it to fail
is guaranteed to have its write committed and be out of the critical
section, before it tries the lock-free algo again. This is a performance
increase over a spinlock...

See the difference here?

A lock-free algo can improve on the spinlock design, IF and only IF, the
entire critical section that the original spin-lock protected can be
atomically modified.

A lock-free queue or stack header can easily fit into a single CAS.


I think the lock-free design beats a spin-lock in this situation?
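
For reference, the single-CAS push being described is the classic Treiber
stack (a sketch; push really is one CAS, while a safe pop is exactly where
the ABA and garbage-collection machinery discussed in this thread comes in):

#include <stdatomic.h>
#include <stddef.h>

typedef struct snode { struct snode *next; } snode;

static _Atomic(snode *) stack_head;  /* the "header" that fits in one CAS */

void stack_push(snode *n) {
    snode *old = atomic_load(&stack_head);
    do {
        n->next = old;  /* 'old' is refreshed by each failed CAS below */
    } while (!atomic_compare_exchange_weak(&stack_head, &old, n));
}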


David Schwartz

Dec 5, 2003, 5:04:26 PM

"SenderX" <x...@xxx.xxx> wrote in message
news:jJ5Ab.308864$9E1.1551626@attbi_s52...

> > > David S. was talking about how protecting 20 "different" shared queues with
> > > a "single" lock, would outperform 20 queues protected by a fine-grain
> > > distributed lock-free algo that I present in the test code I posted.
> >
> > See <http://tinyurl.com/xuyc>.

> SMP contending for same lock is bad.

Only in a lock-free algorithm or when there's nothing else to do. With a
locking algorithm, all the contending threads get descheduled and threads
that don't contend get scheduled. Later when the thread that would have
contended releases the lock, the other thread will run and not contend.

> 20 "different" queues sharing the "same lock", would focus load on the
lock
> and generate a bottleneck, not good?

Only if there's nothing else to do. If you assume that these 20 queues
sharing the same lock are just 20 out of 20,000 queues, then the
"bottleneck" is actually efficient.

> At the very least, the locking should be fine-grain enough for each cpu to
> have its own lock and queue?

No matter how many times I explain it, you don't even come remotely
close to comprehending it. We're talking about designing a queue. What I'm
trying to show you is what makes a good thread safe queue implementation has
a lot to do with how you intend to use it.

DS


David Schwartz

Dec 5, 2003, 5:01:46 PM

"SenderX" <x...@xxx.xxx> wrote in message
news:Ra5Ab.33342$_M.142825@attbi_s54...

> > > David S. was talking about how protecting 20 "different" shared queues with
> > > a "single" lock, would outperform 20 queues protected by a fine-grain
> > > distributed lock-free algo that I present in the test code I posted.

> > Try it. Use one CPU and one thread.

> huh? This is threaded programming?

Yes, it is. Sometimes a threaded program runs on a machine with only one
CPU and only one thread happens to be accessing a particular collection at a
time. Some programs might find themselves in this situation so often that
they primarily optimize for this situation.

> How about 4-cpus and 64 threads?

Sure, go ahead and rig the test to get the results you want. But if
you're writing a program and the customer gets to choose how he uses it and
what hardware he runs it on, you may wind up in a situation with one CPU and
one thread, though your program has to be coded to handle 32 CPUs and 128
threads.

DS


SenderX

Dec 5, 2003, 6:22:24 PM
> Only if there's nothing else to do. If you assume that these 20 queues
> sharing the same lock are just 20 out of 20,000 queues, then the
> "bottleneck" is actually efficient.

:/

20,000 queues?

5 queues perform very well for 4-cpu 256+ threads, using a lock-free algo.

You have to use around 256+ queues to get even close to lock-free. The more
queues ( 10 lists for each cpu? ) the better the performance using locks.

5 lock-free queues is a little better than 256 lock-based queues, imagine
what 256 lock-free queues would do?

Lock-free reads are another matter.

Think if a thread has to go through a collection that has millions of items?

If reader threads decide to iterate the whole collection, all writer threads
would have to wait on these reads. Agreed?

You should allow for concurrent writes while you are reading big lists?

Reader threads can iterate while writers push and pop?


> No matter how many times I explain it, you don't even come remotely
> close to comprehending it. We're talking about designing a queue. What I'm
> trying to show you is what makes a good thread safe queue implementation has
> a lot to do with how you intend to use it.

All normal systems should use pthread read/write locks. At least they can
get concurrent reads.
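
The read/write lock pattern mentioned above, sketched in pthreads terms
(the function bodies are placeholders):

#include <pthread.h>

static pthread_rwlock_t list_lock = PTHREAD_RWLOCK_INITIALIZER;

void iterate_list(void) {
    pthread_rwlock_rdlock(&list_lock);  /* many readers may iterate at once */
    /* ... walk the collection ... */
    pthread_rwlock_unlock(&list_lock);
}

void push_item(void) {
    pthread_rwlock_wrlock(&list_lock);  /* writers still exclude everyone */
    /* ... modify the collection ... */
    pthread_rwlock_unlock(&list_lock);
}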

I'm talking about systems that experience heavy thread to thread comm, while
they are currently under load.

Like a chat server that has 35,000 clients on one box...

The chat server probably has to parse requests and send them to a queue that
could serve 64+ threads on 4+ CPU boxes.

I want my server to respond to requests as quickly as possible, so I consider
the method with which a thread passes messages to other threads important.

I want those 64+ threads to pull requests as fast as possible, and quickly
get to work.

I want to avoid blocking at all costs. Blocking in worker threads is very
bad wrt. loaded async server.

If you can decrease the latency from chat server client request to server
response, just by improving thread to thread comm... Would that be good?


David Schwartz

Dec 5, 2003, 10:50:15 PM

"SenderX" <x...@xxx.xxx> wrote in message
news:Qq8Ab.440196$Fm2.437099@attbi_s04...

> Like a chat server that has 35,000 clients on one box...

I happen to have a huge amount of experience with just such programs.

> The chat server probably has to parse requests and send them to a queue that
> could serve 64+ threads on 4+ CPU boxes.

> I want my server to respond to requests as quickly as possible, so I consider
> the method with which a thread passes messages to other threads important.

Except threads don't "pass messages to other threads" in that sense.
They add work items to a queue and threads serve that queue.
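
A sketch of the conventional lock-based work queue being described (names
and layout are illustrative; error handling omitted):

#include <pthread.h>
#include <stddef.h>

typedef struct work { struct work *next; void (*fn)(void *); void *arg; } work;

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  nonempty;
    work *head, *tail;
} workq;

void workq_push(workq *q, work *w) {
    w->next = NULL;
    pthread_mutex_lock(&q->lock);
    if (q->tail) q->tail->next = w; else q->head = w;
    q->tail = w;
    pthread_cond_signal(&q->nonempty);   /* wake one serving thread */
    pthread_mutex_unlock(&q->lock);
}

work *workq_pop(workq *q) {
    pthread_mutex_lock(&q->lock);
    while (q->head == NULL)
        pthread_cond_wait(&q->nonempty, &q->lock);
    work *w = q->head;
    q->head = w->next;
    if (q->head == NULL) q->tail = NULL;
    pthread_mutex_unlock(&q->lock);
    return w;
}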

> I want those 64+ threads to pull requests as fast as possible, and quickly
> get to work.

Then you'd have 64+ threads all trying to access the same chat server
data structures, which is a horrible disaster. (One area where Windows beats
the pants off of most UNIXes is that it allows you to use IOCP as a
scheduling mechanism and avoid this problem.)

> I want to avoid blocking at all costs. Blocking in worker threads is very
> bad wrt. loaded async server.

Why do you think that's true? Another thread will just get scheduled and
it will do just as much work as the original thread was going to do.

Yes, context switches have some cost. But so does fighting through
contention. There are two possible cases:

1) The performance of the queue doesn't matter because you don't access
it often. In this case, it doesn't matter what you do.

2) The performance of the queue does matter because some threads access
it very often. In this case, lock-free hurts you because those threads have
to suffer through contention rather than being scheduled when they don't
contend with each other.

> If you can decrease the latency from chat server client request to server
> response, just by improving thread to thread comm... Would that be good?

No, because the latency between request and response is more than
drowned out by the network latency. You'd rather be able to get more CPU
work done. That usually means minimizing context switches, which you can do
by protecting the queue with spinlocks.

The limiting factor tends to be the need for write locks on complex data
structures. The time it actually takes to put a command on the queue or take
one off is such a tiny fraction of the time it takes to parse the command
(before putting it on the job queue) or execute the command (after taking it
off the job queue) that a spinlock for the queue is ideal because it allows
the executing thread to keep its code in the caches.

The spinlock can usually be implemented by a single atomic operation. So
there is no possible way to reduce the measurable cost any further.
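
The kind of minimal spinlock meant here might look like this in C11 on x86
(a sketch; a production version would spin on a plain read before retrying
the atomic op):

#include <stdatomic.h>
#include <immintrin.h>          /* _mm_pause; x86-specific */

typedef atomic_flag spinlock_t;

void spin_acquire(spinlock_t *l) {
    /* The uncontended acquire is the single atomic operation. */
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
        _mm_pause();            /* the pause opcode mentioned upthread */
}

void spin_release(spinlock_t *l) {
    atomic_flag_clear_explicit(l, memory_order_release);
}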

DS


SenderX

Dec 5, 2003, 11:37:40 PM
Spinlock

When a spinlock fails, it means that another thread is in the critical
section that the spinlock is protecting, agreed?

If the same spinlock atomic op fails twice, three, maybe 50 times? It could
mean that the same thread that caused it to fail so many times before, could
have been in the exact same critical section? I mean the thread in the
critical section could hit a page fault, or context-switch to another
application... Then the waiting threads are crashing the bus making no
forward progress...

If you are going to make no forward progress on spinlock atomic op failure,
you would benefit from blocking big time. Right?

LockFree Cas

When a lock-free CAS fails, it means that the thread that caused it to fail
is guaranteed to have its write committed and be out of the critical
section, before it tries the lock-free algo again. This is a performance
increase over a spinlock...

See the difference here?




> No, because the latency between request and response is more than
> drowned out by the network latency. You'd rather be able to get more CPU
> work done. That usually means minimizing context switches, which you can do
> by protecting the queue with spinlocks.

That is one of the points that I have been saying about lock-free algos for
a while now!

They minimize context-switches! You have said that this is a bad thing.

They minimize kernel calls, when you learned this you switched your
arguments from locks are better, to spinlocks and fast-path locks are
better. A fast-path lock allows a single contending thread lock-free access
to the critical section, all threads that fail the fast-path block.

A simple lock-free tweak, mixed with a lock-based algo?


> The spinlock can usually be implemented by a single atomic operation. So
> there is no possible way to reduce the measurable cost any further.

A contending thread might have to test and fail through the spinlock many
times before it gets the critical section, that is more than one atomic
operation to modify the shared memory protected by the spin.


SenderX

Dec 6, 2003, 5:45:04 PM
> Sure, go ahead and rig the test to get the results you want. But if
> you're writing a program and the customer gets to choose how he uses it and
> what hardware he runs it on, you may wind up in a situation with one CPU and
> one thread, though your program has to be coded to handle 32 CPUs and 128
> threads.

Dave, your odd idea of 20 queues on 1 lock, on a single threaded app would be
fine. How could it experience contention?

The performance of 20 queues on 1 lock would degrade, as threads and
processors were added. 32 CPUS fighting over 20 queues protected by one
lock? How about 32 CPUS fighting over 20 queues protected by 20 lock-free
garbage headers, and 20 lock-free queue headers?

Now 20 CPUS can concurrently enter the 20 queues through different garbage
collector headers, and have lock-free access to all of the queues. The
distributed lock-free design will outperform your single lock idea, as CPUs
and threads are added. Your setup will run great on a single cpu, but it will
not scale...

David Schwartz

Dec 6, 2003, 5:58:47 PM

"SenderX" <x...@xxx.xxx> wrote in message
news:o2dAb.35724$_M.162718@attbi_s54...

> Spinlock
>
> When a spinlock fails, it means that another thread is in the critical
> section that the spinlock is protecting, agreed?
>
> If the same spinlock atomic op fails twice, three, maybe 50 times? It could
> mean that the same thread that caused it to fail so many times before, could
> have been in the exact same critical section? I mean the thread in the
> critical section could hit a page fault, or context-switch to another
> application... Then the waiting threads are crashing the bus making no
> forward progress...

Crashing the bus? If the spinlock stays locked, it never changes, so it
will be shareable in all the caches.

> If you are going to make no forward progress on spinlock atomic op failure,
> you would benefit from blocking big time. Right?

Yes. That's why you only use spinlocks in cases where a context-switch
while holding the spinlock is very unlikely. For example, a queue where the
CPU cost of generate items to put on the queue and processing items removed
from the queue grossly exceeds the CPU cost of accessing the queue itself.
You can also use spinlocks in cases where you have more control over context
switches, such as if you're using thread priorities or are writing the
scheduler too.

> When a lock-free CAS fails, it means that the thread that caused it to fail
> is guaranteed to have its write committed and be out of the critical
> section, before it tries the lock-free algo again. This is a performance
> increase over a spinlock...

Yes, but at the cost of more atomic operations. I wouldn't choose a
spinlock unless the cost of the atomic operations (on every single access to
the structure) outweighed the slight performance benefit on the off chance a
thread happens to be descheduled for the five instruction window while it's
holding the lock.

> See the difference here?

Yes. With a lock-free collection, every access to that collection
requires an atomic/interlocked operation. With a lock-based collection, I
get to choose exactly how many such instructions provides the best tradeoff
between concurrency across processors and concurrency within a processor.

> > No, because the latency between request and response is more than
> > drowned out by the network latency. You'd rather be able to get more CPU
> > work done. That usually means minimizing context switches, which you can do
> > by protecting the queue with spinlocks.
>
> That is one of the points that I have been saying about lock-free algos for
> a while now!
>
> They minimize context-switches! You have said that this is a bad thing.

*sigh*

> They minimize kernel calls, when you learned this you switched your
> arguments from locks are better, to spinlocks and fast-path locks are
> better. A fast-path lock allows a single contending thread lock-free access
> to the critical section, all threads that fail the fast-path block.

Huh?

> A simple lock-free tweak, mixed with a lock-based algo?

If you think that a spinlock hand-coded in assembly language used to
protect a queue is a "lock-free algorithm", then you're smoking something.

> > The spinlock can usually be implemented by a single atomic operation. So
> > there is no possible way to reduce the measurable cost any further.

> A contending thread might have to test and fail through the spinlock many
> times before it gets the critical section, that is more than one atomic
> operation to modify the shared memory protected by the spin.

Doesn't matter. The odds of a contending thread are so low that this
performance benefit would be near zero. As for spinlocking "many times",
this assumes the other thread spends more than one or two instructions in
the critical sections, which it would not do.

DS


David Schwartz

Dec 6, 2003, 6:02:23 PM

"SenderX" <x...@xxx.xxx> wrote in message
news:QZsAb.42922$_M.196063@attbi_s54...

> > Sure, go ahead and rig the test to get the results you want. But if
> > you're writing a program and the customer gets to choose how he uses it and
> > what hardware he runs it on, you may wind up in a situation with one CPU and
> > one thread, though your program has to be coded to handle 32 CPUs and 128
> > threads.

> Dave, your odd idea of 20 queues on 1 lock, on a single threaded app would be
> fine. How could it experience contention?

It couldn't. This accurately replicates cases where contention is very,
very rare. Sometimes you code in a case like this, so you optimize for the
zero contention case.

> The performance of 20 queues on 1 lock would degrade, as threads and
> processors were added.

Sure. So you wouldn't code like this if you expected there to be more
threads and more CPUs. However, sometimes you have a case where one thread
and one CPU is the most important case, so you optimize for that case.

> Now 20 CPUS can concurrently enter the 20 queues through different garbage
> collector headers, and have lock-free access to all of the queues. The
> distributed lock-free design will outperform your single lock idea, as CPUs
> and threads are added. Your setup will run great on a single cpu, but it will
> not scale...

Right. So my solution is much better than yours if the most important
case is one CPU and one thread. Sometimes that really is the most important
case.

FWIW, your solution won't scale either. When you get to 50 CPUs and 50
threads, the speed will be purely limited by the FSB. All your CPUs that
don't make forward progress will be stepping on each other. Meanwhile, I
will at least run one CPU at full speed and leave the other CPUs free to do
useful work.

DS


SenderX

Dec 6, 2003, 8:25:41 PM
> Sure. So you wouldn't code like this if you expected there to be more
> threads and more CPUs. However, sometimes you have a case where one thread
> and one CPU is the most important case, so you optimize for that case.

This is a threaded programming group. I know you would not code like this.

Your example will not scale.


> Right. So my solution is much better than yours if the most important
> case is one CPU and one thread. Sometimes that really is the most important
> case.

A user would not use any sync if the app was single threaded...

If the user wants a thread-safe scalable app they should stay away from the
multi-resource protected by a single lock idea you have...


> FWIW, your solution won't scale either.

A lock-free system can use proxy garbage collection to modify queue
granularity at runtime.


> When you get to 50 CPUs and 50
> threads, the speed will be purely limited by the FSB. All your CPUs that
> don't make forward progress will be stepping on each other. Meanwhile, I
> will at least run one CPU at full speed and leave the other CPUs free to do
> useful work.

If you have to share the computer, put your server's worker threads through a
special condition variable and try to allow only a couple of threads to run.
Like emulating Windows IOCP.

If the computer is dedicated to your server, the more CPUS concurrently
working on different server requests without context-switching, the
better...

50 CPUS pulled through a single lock protecting 20 different queues? Yikes!

It doesn't matter how many CPUS you add to the lock-free picture. A shared
request - response based queue system can create the infrastructure, on the
fly, to be distributed across all cpus at runtime. To help ensure that the
CPUS will each do a CAS on different ( with no false sharing ) pieces of
garbage collected headers. So there will be no CAS failures on many accesses
to the lock-free queues. A lock-free garbage collector or queue header is
only 32 or 64-bits.

David Schwartz

Dec 6, 2003, 9:07:28 PM

"SenderX" <x...@xxx.xxx> wrote in message
news:pkvAb.446452$Tr4.1248332@attbi_s03...

> > Right. So my solution is much better than yours if the most important
> > case is one CPU and one thread. Sometimes that really is the most important
> > case.

> A user would not use any sync if the app was single threaded...

Are you reading what I'm writing? This is a multi-thread app, but it
contains a collection that is mostly accessed by a single thread running on
a single CPU. For some applications, and for many collections inside
applications, that's the most important case, so that's the one you optimize
for. Yet you may still need thread safety for the occasional case where
another thread has to access it.

> If the user wants a thread-safe scalable app they should stay away from the
> multi-resource protected by a single lock idea you have...

Okay, I'm going to stop it here before I rip my hair out.

DS


David Butenhof

Dec 8, 2003, 11:11:47 AM
SenderX wrote:

> "Invention"? Give me a break. Multiple queue insert has been done
> probably
> a
>> million times over the past 30 or 40 years. (Granted not all, though
>> many, atomically.)
> ^^^^^^^^^^^
>
> A lock-free multi-insert algo? Can you point me to a published paper? I am
> very interested if they are using the same method I am using...

I don't know of any published papers. Most of the people who do work like
this aren't interested (nor do they have the time) to write up and publish
every little new coding wrinkle they devise. Of course, consequently, much
of their work never gets proper credit in history. (Like, for example, the
Burroughs people who developed and shipped virtual memory about 10 years
before IBM claimed credit for the "invention". It was just something that
was "obviously" needed for the system they were trying to build, and they
did it and moved on.)

>> Your test will only prove what it's constructed to prove. Which is merely
>> that the particular artificial code stream you've created behaves as it
>> behaves. Your misunderstanding of the term "proof" seems to be
>> fundamental to your attitude.
>
> Yes, the workers execute basically nothing, they just cycle the messages
> through the shared queues as fast as possible. The queue iteration
> executes nothing, just loops through, touching the memory of each node, as
> fast as possible. I wanted the contention and load of the system to focus
> on the central shared queue implementations. If the worker threads did
> "actual work", file/socket io whatever, it would lower the contention for
> the shared queue. It would not focus the load and contention of the system
> on the code in question?
>
> Lock-free algos have to be blasted with high-contention situations to
> catch bugs quicker... If you don't put them under high load bugs will not
> appear nearly as fast as they would in an artificially created
> high-contention situation, focused on the lock-free algo in question?

That's ONE "worst case", but not the only one. Random variations of load
often produce more "interesting" timing conditions...

David Butenhof

Dec 8, 2003, 11:50:03 AM
SenderX wrote:

>> That's completely untrue, since locks (even blocking locks with true
>> contention, much less uncontended spinlocks) can be implemented within a
>> process without any kernel calls.
>
> I have had spinlocks in a couple of older designs, and they started to
> give me bottlenecks when the server application's users started to grow...
>
> I switched to fine-grain lock-based collections for a big performance
> boost.
>
> Then I started to replace some critical sections that were protected by
> locks, with lock-free algos to get even better performance... In some key
> areas. Like the lock-free iteration of a large-collection...

And that's fine, when you're basing the changes on actual performance data
for the application. And that's exactly what I've been saying all along,
but you don't seem to be following.

> Spinlock
>
> When a spinlock fails, it means that another thread is in the critical
> section that the spinlock is protecting, agreed?
>
> If the same spinlock atomic op fails twice, three, maybe 50 times? It could
> mean that the same thread that caused it to fail so many times before,
> could have been in the exact same critical section? I mean the thread in
> the critical section could hit a page fault, or context-switch to another
> application... Then the waiting threads are crashing the bus making no
> forward progress...
>
> If you are going to make no forward progress on spinlock atomic op
> failure, you would benefit from blocking big time. Right?

Sometimes. It depends on a lot of factors, though; mostly the scheduling
overhead involved in blocking. Spinning 50, 100, even 1000 times
(particularly where you can spin in local cache as on Alpha or Itanium) may
still be many times cheaper than the cost of blocking and being unblocked;
and, depending on the nature of the synchronization protocols, that might
justify the spin for certain workloads.
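
The spin-then-block idea reads roughly like this in pthreads terms (a
sketch with an arbitrary bound; trylock in a loop is cruder than spinning
on a cached read, but it shows the shape):

#include <pthread.h>

enum { SPIN_LIMIT = 1000 };     /* illustrative bound */

void adaptive_lock(pthread_mutex_t *m) {
    for (int i = 0; i < SPIN_LIMIT; ++i)
        if (pthread_mutex_trylock(m) == 0)
            return;             /* acquired without blocking */
    pthread_mutex_lock(m);      /* give up and block in the kernel */
}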

> LockFree Cas
>
> When a lock-free CAS fails, it means that the thread that caused it to
> fail is guaranteed to have its write committed and be out of the critical
> section, before it tries the lock-free algo again. This is a performance
> increase over a spinlock...
>
> See the difference here?

Of course I do. The whole problem is that you seem to be having problems
with the larger engineering perspective. Increasing the performance of an
isolated operation is not necessarily in the best interests of the
APPLICATION. The parts of an application must cooperate, not compete, with
each other.

> A lock-free algo can improve on the spinlock design, IF and only IF, the
> entire critical section that the original spin-lock protected can be
> atomically modified.

First off, only some specific types of critical section fit that profile.
Secondly, improving the performance of those few that do may not
necessarily be good for the application.

Contention is part of any application with multiple instruction streams:
whether they're threads, processes, or MPI tasks. (They just contend for
different resources in different ways.) Contention is not something that
always needs to be MINIMIZED -- but rather something that needs to be
MANAGED across the application and balanced with its other requirements.

After all, a threaded application without any synchronization will almost
always be faster than one WITH synchronization. It just usually won't give
the right answers. (And there actually ARE times where "the right answer"
isn't an essential requirement; e.g., when polling randomly for "some
likely approximation of a recent value" might be just fine.)

> A lock-free queue or stack header can easily fit into a single CAS.
>
> I think the lock-free design beats a spin-lock in this situation?

One more time... this judgement depends upon how one defines "beats". You've
listed a number of advantages of lock-free, in straightline single stream
performance and in contention, and you continually miss the fact that I'm
not arguing against any of that. What I'm saying is that none of those
factors you quite correctly applaud in lock-free are necessarily an
advantage to all possible applications, and may sometimes be a
disadvantage. I'm talking about APPLICATION engineering, not design of an
isolated object method, and you seem unwilling to bridge that gap. The
SYSTEM needs to be balanced. If you can balance the system using mostly
lock-free, and care to take the time, or can objectively and relevantly
measure an improvement from that, then fine. In MOST cases, an application
designer's effort and time is best spent elsewhere.

SenderX

Dec 8, 2003, 5:14:12 PM
> Sometimes. It depends on a lot of factors, though; mostly the scheduling
> overhead involved in blocking. Spinning 50, 100, even 1000 times
> (particularly where you can spin in local cache as on Alpha or Itanium) may
> still be many times cheaper than the cost of blocking and being unblocked;

Agreed.

If a couple of spins avoids a kernel call to block, that is good.


> Of course I do. The whole problem is that you seem to be having problems
> with the larger engineering perspective. Increasing the performance of an
> isolated operation is not necessarily in the best interests of the
> APPLICATION. The parts of an application must cooperate, not compete, with
> each other.

Agreed.

Simple example:

If you use a lock-free algo to boost the performance of a shared queue that
serves requests that could tell worker threads to iterate a shared
collection, you could increase iteration load on the shared collection.
Iteration requests, writes, and other requests, will probably be served to
workers faster than before.

The bottleneck will shift from the shared queue's critical section, that you
replaced with a lock-free algo, to the lock-based critical-sections that
protect the shared collection. The shared collection might have to be
reworked to allow for lock-free iterations and concurrent writes, to consume
the new load put upon it by the new higher-performance queue.


> > A lock-free algo can improve on the spinlock design, IF and only IF, the
> > entire critical section that the original spin-lock protected can be
> > atomically modified.
>
> First off, only some specific types of critical section fit that profile.
> Secondly, improving the performance of those few that do may not
> necessarily be good for the application.

An application could require other modifications to handle the increased
parallelism and throughput that is generally generated from a correct
lock-free algo replacing an existing lock-based critical-section...


> First off, only some specific types of critical section fit that profile.

2 separate 32-bit headers can be used with CAS or LL/SC to render most
generic shared collections, lock-free, 100% ABA safe, and garbage-collected
on virtually all 32 or 64 bit cpus.
^^^^^^^^^^^^^^^^^
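
The usual shape of that trick, for readers following along (field sizes and
names are assumptions, not AppCore's layout):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Two 32-bit fields CAS'd together as one 64-bit header: a node index
 * (or offset) plus a version counter that defeats ABA. */
typedef union {
    uint64_t whole;
    struct { uint32_t top; uint32_t ver; } f;
} header;

static _Atomic uint64_t stack_hdr;

bool try_push(uint32_t node_id) {
    header oldh, newh;
    oldh.whole = atomic_load(&stack_hdr);
    /* ... link node_id's next field to oldh.f.top here ... */
    newh.f.top = node_id;
    newh.f.ver = oldh.f.ver + 1;  /* a recycled 'top' now fails the CAS */
    return atomic_compare_exchange_strong(&stack_hdr, &oldh.whole, newh.whole);
}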


> I'm talking about APPLICATION engineering, not design of an
> isolated object method, and you seem unwilling to bridge that gap. The
> SYSTEM needs to be balanced.
> If you can balance the system using mostly
> lock-free, and care to take the time, or can objectively and relevantly
> measure an improvement from that, then fine. In MOST cases, an application
> designer's effort and time is best spent elsewhere.

Which is why I am designing a library that will "clearly" list all the
caveats. Would that be ok?

If there was a library that presented many garbage-collected, user-space,
lock-free primitives to 32 and/or 64 bit Linux, Windows, and Mac, with a
paper describing all of the caveats that come with lock-free design, would
that be ok?

A good lock-free, garbage-collected, shared linked-list C API would be good
for Linux and Mac?

Randy Howard

Dec 8, 2003, 9:07:13 PM
In article <0Rwzb.20765$_M.79054@attbi_s54>, x...@xxx.xxx says...
> Here is the full-source code:
>
> http://appcore.home.comcast.net/LockFree/
>
> This test simply proves that lock-free beats lock-based, over and over
> again, under many HIGH-CONTENTION situations.

How portable is this to multiple platform architectures in contrast to
a pthread approach? Judging from the .asm source files, not at all.

I wonder how many real applications bear even a passing resemblance
to your test example?


--
Randy Howard _o
2reply remove FOOBAR \<,
______________________()/ ()______________________________________________
SCO Spam-magnet: postm...@sco.com

SenderX

Dec 9, 2003, 12:48:34 AM
> How portable is this to multiple platform architectures in contrast to
> a pthread approach? Judging from the .asm source files, not at all.

Correct, the posted asm is x86 32-bit, that's it. However, I have different
LockFree.asm, LockFree.c and LockFree.h files for different processors...

I have asm for x86-32, x86-64, ppc, and Itanium, and can have asm for ANY
other processor that has a CAS or LL/SC that "works on memory sizes equal to
sizeof( void* )". All of the posted algos are "very portable".

I cannot post the non x86-32 or other cpu asm code. The amd-64-bit code uses
a 32-bit header on 32 "and" 64-bit modes, so it is actually compatible with
a Linux futex. I also think a proxy garbage collector header can use a
futex? All of the portable lock-free asm code relies on pthread api's for
non-windows systems.
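
A Linux-only sketch of what "compatible with a futex" buys: contended
threads can park directly on the 32-bit header word (raw syscall shown
because glibc provides no futex wrapper; this is an assumed usage, not
AppCore code):

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdint.h>

static void header_wait(uint32_t *hdr, uint32_t seen) {
    /* Sleeps only if *hdr still holds the value we observed. */
    syscall(SYS_futex, hdr, FUTEX_WAIT, seen, NULL, NULL, 0);
}

static void header_wake(uint32_t *hdr) {
    syscall(SYS_futex, hdr, FUTEX_WAKE, 1, NULL, NULL, 0);
}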

I will provide a library that will expose x86-32 && 64, ppc, and itanium
compatible lock-free garbage collectors, lock-free primitives, and shared
collections. It will also include a new fast-path specialized condition
variable. The lock-free (fast-path) part of the condition variable can
handle 32 or 64 bits of user state.

I have a solution that provides user-space, pointer based, ABA free,
dynamic ( malloc && free compatible ) lock-free algos to all users of
AMD64! In other words you can use full 64-bit pointers to do memory safe,
garbage collected lock-free algos...

I think that is good?

:)


> I wonder how many real applications bear even a passing resemblance
> to your test example?

If real applications have more than one thread iterating, pushing and
popping objects into a shared collection, then they could fit into this test
schema, shared central collection/queue iterated by many threads? Servers
that store connections and other active state in shared linked-lists that
get accessed by many threads; lock-free iterations improve throughput.

The current test code takes the work out of the worker threads, and directs
the contention of 256+ threads at the lock-free queues. The points of
contention are tested themselves. Lock-free design lets cpus go through
heavily contended critical sections ( points of contention ) faster.

You can also see that worker threads can "iterate the shared queues, while
other thread are pushing and popping". This is part of what makes a
lock-free design shine.

The current test code is to show how distributed lock-free algos can
withstand focused contention. The test artificially focuses the CPUS on the
lock-free or lock-based queue algo itself. The lock-based algos fall apart
under direct contention by many threads and cpus. You have to use a lot of
locks to scale.

The good part is the test is lock-free and deadlock-free iteration of a
highly-contended shared queue. Pushes and pops are allowed during iteration.
That's good. The lock-based part can't really hold up to lock-free iteration
though...

I think optimized shared collection iteration exposed by a stable library
that relies on pthreads for non-windows platforms would be good for a lot of
server applications?

Joe's atomic_ptr and RCU are also good for lock-free iteration of shared
generic collections.

David Butenhof

Dec 9, 2003, 11:49:43 AM
SenderX wrote:

>> I'm talking about APPLICATION engineering, not design of an
>> isolated object method, and you seem unwilling to bridge that gap. The
>> SYSTEM needs to be balanced.
>> If you can balance the system using mostly
>> lock-free, and care to take the time, or can objectively and relevantly
>> measure an improvement from that, then fine. In MOST cases, an
>> application designer's effort and time is best spent elsewhere.
>
> Which is why I am designing a library, that will "clearly" list all the
> caveats. Would that be ok?
>
> If there was a library that presented many garbage-collected, user-space,
> lock-free primitives to 32 and/or 64 bit Linux, Windows, and Mac. With a
> paper describing all of the caveats that come with lock-free design, would
> that be ok?
>
> A good lock-free, garbage-collected, shared linked-list C API would be
> good for Linux and Mac?

OK, let's take another step back in a different direction. We're
crosstalking between two separate discussions, and you're responding on the
one I wasn't trying to address!

I'm NOT objecting to your attempts to build the library, though I have
complained that your "sell job" is not always entirely straightforward.
I've said many times that I think developers should understand lock-free,
and how to use it, and should certainly consider it where it makes sense
for them.

MY principal direction in these discussions has been against the generic
message that "using lock-free will make any application faster". You've
tempered that extremism lately, though not with quite the level of
consistency I'd like. I also very much dislike your repeated claims that
the testing you happen to have done on a package "proves" that it's correct
and safe; you should know better, and sometimes I think you do. Perhaps
you're just not communicating the caveats as clearly as you should.

I have no objections at all to you developing your package, or even to
distributing it. While I'm not even trying to pretend that I can require
you to carefully explain the limitations and design considerations of which
application developers should be aware, I think it'd be awesome if you
could do a good job of clearly and concisely describing them. (And even
obscure hints are better than nothing. ;-) )

So go for it. That's great.

But anywhere you say "lock-free is better/faster", I'm going to make sure
everyone knows that it's not ALWAYS better or faster, nor even always
applicable... even if I'm pretty sure that you understand that already.
(And sometimes I'm pretty sure you do; but other times I'm still not
entirely convinced. ;-) )
