A good example is the networking stack. Vista can process two
networking packets on two instances of the TCP/IP stack
implementation, as two threads serving two applications that want to
put information onto, or get information from, the network.
What is to become of Linux? Linux has a monolithic kernel: all of the
kernel's services are one big binary in memory. So when Application-1
has made a jump into the kernel, does everybody wait? It seems like a
design decision that cannot scale.
Are there sufficient threads running on a Linux box, in application
space alone, to harness the many cores to come?
You are confused:  having or not having a monolithic kernel (by the way, 
Vista's is hardly much different than Linux's in this regard), and 
employing or not employing multiple threads of execution in the kernel, 
are completely orthogonal.
- bill
Vista isn't the first to do this. Various BSDs have been doing it for
a while, but I'm not sure whether the results indicated a net
benefit. And you don't need two instances of the IP stack to do this;
you just need enough threads to demultiplex the packets to their
intended receivers. On the send side, you realistically only need as
many threads as you have hardware interfaces.
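For the curious, receive-side demultiplexing usually just means hashing each packet's addressing 4-tuple to pick a queue (and hence a thread). A rough sketch; the names `flow` and `demux_queue` are invented for illustration, and real stacks use fancier hashes (e.g. Toeplitz):

```c
#include <stdint.h>

#define NUM_RX_QUEUES 4

/* Hypothetical 4-tuple identifying a TCP/IP flow. */
struct flow {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
};

/* Simple multiplicative hash over the 4-tuple. */
static uint32_t flow_hash(const struct flow *f)
{
    uint32_t h = f->src_ip * 2654435761u;
    h ^= f->dst_ip * 2246822519u;
    h ^= ((uint32_t)f->src_port << 16) | f->dst_port;
    return h;
}

/* Pick the receive queue for this flow.  Packets of the same
 * connection always land on the same queue, so per-connection state
 * needs no cross-thread locking on the receive path. */
int demux_queue(const struct flow *f)
{
    return (int)(flow_hash(f) % NUM_RX_QUEUES);
}
```

The point is that the parallelism comes from having many flows, not from duplicating the stack.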
> What is to become of Linux? Linux has a monolithic kernel: all of the
> kernel's services are one big binary in memory. So when Application-1
> has made a jump into the kernel, does everybody wait? It seems like a
> design decision that cannot scale.
Linux has been internally multithreaded for a while, as have the BSDs.
A lot of the evolution has been getting rid of the big kernel lock as
more parts of the kernel get threaded. Linux, from my understanding
from talking with people, has problems with thread and process
scheduling, so if anything, that's where the penalty can be found.
When Vista loads, all the hundreds of parts of the kernel are 
dynamically linked into a single monolithic blob, just like Linux.  UNIX 
systems have separated "services" too, called daemons.
I'm not sure how the Vista (or other Windows) kernels work, but in Linux 
the user thread that calls the kernel is what actually executes the 
kernel code, not a separate "kernel thread".  That means that, in 
theory, if there are 100 user threads, all of them could be executing 
concurrently in kernel space.  You'll find some special daemons even 
show 0s user time because they spend their entire lives living in the 
kernel, merely providing a thread for kernel code to operate from; this 
is as close to a "kernel thread" as Linux gets.
Now, all those threads running in the kernel do have to share data for 
coordination, which requires locks.  This used to be a huge problem for 
Linux because there was originally a single lock (called the BKL, or Big 
Kernel Lock) that covered all shared data.  Considerable work has gone 
into Linux in the last few years to reduce the number of things that 
require grabbing the BKL (mainly by creating hundreds of little locks, 
IIRC).  That work was a fundamental requirement to get SMP and SMT 
machines performing well, and it'll carry over to multicore processors 
just fine.
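To make the contrast concrete, here's a minimal sketch of the "hundreds of little locks" idea: each object carries its own lock, so threads touching unrelated objects never serialize. (Illustrative userspace pthreads code with invented names, not actual kernel code.)

```c
#include <pthread.h>

/* Instead of one BKL guarding everything, give each object its own
 * lock: operations on different counters proceed fully in parallel. */
struct counter {
    pthread_mutex_t lock;   /* fine-grained: protects only this counter */
    long value;
};

void counter_init(struct counter *c)
{
    pthread_mutex_init(&c->lock, NULL);
    c->value = 0;
}

void counter_add(struct counter *c, long n)
{
    pthread_mutex_lock(&c->lock);   /* contends only with users of *this* object */
    c->value += n;
    pthread_mutex_unlock(&c->lock);
}

long counter_read(struct counter *c)
{
    pthread_mutex_lock(&c->lock);
    long v = c->value;
    pthread_mutex_unlock(&c->lock);
    return v;
}
```

With a single global lock, every `counter_add` anywhere in the system would serialize; with per-object locks, only accesses to the *same* counter do.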
Vista has no advantage here.  If you have two threads wanting to access 
the TCP/IP stack, there has to be some sort of locking so that they 
don't stomp on each other.  The mechanism may be different than Linux, 
or it may even be the same.  The same work still needs to be done, and 
both sides have clever folks working on finding better and better 
solutions.
S
-- 
Stephen Sprunk         "God does not play dice."  --Albert Einstein
CCIE #3723         "God is an inveterate gambler, and He throws the
K5SSS        dice at every possible opportunity." --Stephen Hawking 
> Now, all those threads running in the kernel do have to share data for 
> coordination, which requires locks.  This used to be a huge problem for 
> Linux because there was originally a single lock (called the BKL, or Big 
> Kernel Lock) that covered all shared data.  Considerable work has gone 
> into Linux in the last few years to reduce the number of things that 
> require grabbing the BKL (mainly by creating hundreds of little locks, 
> IIRC).  That work was a fundamental requirement to get SMP and SMT 
> machines performing well, and it'll carry over to multicore processors 
> just fine.
Linux is a monolithic kernel, meaning it's one giant data structure.
In order to safely access that structure, you need some form of synchronization.
One form is locking, but locking doesn't scale well.  Lock-free scales
better, and it's one thing Linux has been using more and more of lately,
namely in the form of RCU.
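A very rough userspace sketch of the RCU idea, using C11 atomics (names invented): readers take no lock at all, they just load a pointer; an updater publishes a fresh copy with a single atomic store. Real RCU also tracks a grace period so the old copy can be freed once all pre-existing readers are done, which is omitted here.

```c
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

struct config { int mtu; int ttl; };

static _Atomic(struct config *) current;

void publish(struct config *c)          /* install the initial version */
{
    atomic_store(&current, c);
}

int read_mtu(void)                      /* lock-free read path */
{
    struct config *c = atomic_load(&current);
    return c->mtu;
}

struct config *update_mtu(int mtu)      /* copy, modify, publish */
{
    struct config *old = atomic_load(&current);
    struct config *fresh = malloc(sizeof *fresh);
    memcpy(fresh, old, sizeof *fresh);
    fresh->mtu = mtu;
    atomic_store(&current, fresh);
    return old;     /* caller must free this only after a grace period */
}
```

The win is that the (vastly more common) read path touches no shared lock word at all, so readers scale linearly with core count.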
> 
> Vista has no advantage here.  If you have two threads wanting to access 
> the TCP/IP stack, there has to be some sort of locking so that they 
> don't stomp on each other.  The mechanism may be different than Linux, 
> or it may even be the same.  The same work still needs to be done, and 
> both sides have clever folks working on finding better and better 
> solutions.
> 
The stack is basically just code.  You only need synchronization if the
threads are doing something with something in common, e.g. the same
TCP/IP connection.
-- 
Joe Seigh
When you get lemons, you make lemonade.
When you get hardware, you make software. 
...
> Linux is a monolithic kernel, meaning it's one giant data structure.
Er, no.
A monolithic kernel is a kernel which all executes in the same address 
space (hence sacrificing some fault isolation for - one hopes - improved 
performance by interacting components) and which usually is linked 
together into one great (though of course often at least in part 
pageable) mass, though some parts may be loaded and unloaded dynamically.
That monolithic code base, however, controls myriads of often 
dynamically-allocated data structures, which can be locked or otherwise 
synchronized independently at as fine a grain as makes sense.
- bill
As a former kernel developer on one of those monolithic kernels, all I can
say is that going to finer grained locking is easier said than done.
Damn kernel programming - the same is true for almost every application!
Denying the difficulty of that issue is the great myth of SMP programming.
Regards,
Nick Maclaren.
http://www.theinquirer.net/default.aspx?article=34916
One of the problems, I suspect, is that they're taking programmers who've
never had any substantial experience doing multi-threaded programming
and expecting them to become instantly savvy in multi-threading.
Hey, it's true for most after-the-fact changes in *anything*, which is 
why doing the job right the first time is usually well worth the effort.
> 
> Denying the difficulty of that issue is the great myth of SMP programming.
I must have missed that myth:  I've never considered it easy myself, 
even after three decades' worth of experience doing it.
On the other hand, people looking for easy tasks really shouldn't be 
messing around with parallelism (or, for that matter, most other forms 
of performance optimization) in the first place (save possibly via 
intermediaries that can hide most of the necessary detail):  what else 
is new?
Locking works just fine in the absence of high local levels of 
contention - as long as one manages the potential for deadlocks (either 
by avoiding them - e.g., by imposing lock-acquisition ordering 
constraints - or by resolving them through detection and selective 
rejection).  The main advantages of 'optimistic' alternatives in 
low-contention situations are potentially simpler implementation and 
slightly lower resource usage.
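For example, the lock-acquisition ordering trick mentioned above can be as simple as always taking a pair of locks in address order, so no two threads can ever hold them in opposite orders. A hypothetical illustration (not from any particular system):

```c
#include <pthread.h>

/* Take two locks in a fixed (address) order, so a thread needing
 * (a, b) and a thread needing (b, a) can never deadlock. */
void lock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
    if (a == b) { pthread_mutex_lock(a); return; }
    if (a < b)  { pthread_mutex_lock(a); pthread_mutex_lock(b); }
    else        { pthread_mutex_lock(b); pthread_mutex_lock(a); }
}

void unlock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
    pthread_mutex_unlock(a);
    if (a != b) pthread_mutex_unlock(b);
}

/* Classic two-object operation that would deadlock under naive
 * "lock first argument, then second" discipline. */
struct account { pthread_mutex_t lock; long balance; };

void transfer(struct account *from, struct account *to, long amt)
{
    lock_pair(&from->lock, &to->lock);
    from->balance -= amt;
    to->balance   += amt;
    unlock_pair(&from->lock, &to->lock);
}
```

Two threads running `transfer(&x, &y, ...)` and `transfer(&y, &x, ...)` concurrently always acquire the two locks in the same global order, so neither can block the other forever.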
Locking often works *better* than more 'optimistic' lock-free 
alternatives in the *presence* of high levels of contention (in 
particular, it can guarantee forward progress without compromising data 
currency).
That said, there are certainly circumstances in which lock-free 
approaches can be superior, such as those where access to slightly stale 
data is acceptable and those where execution efficiency is more 
important than fairness (e.g., protection from live-lock).
- bill
I would like to think that you're correct, I'm planning to look very 
closely at high core count programming relatively soon.
OTOH, I have also spent quite a few hours with John Carmack, who by all 
accounts is one of the all-time great programmers, and he didn't see any 
way to really take advantage of multiple cores for gaming.
The obvious exception is in the servers for multiplayer games, but for 
end users this is a hard problem.
Terje
-- 
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
I seem to remember that the Amiga worked fast and smooth for its MHz;
this was due to locking being done on the message passing, i.e. the
linked-list building routine, which was fast. Every other process was
sequentially guaranteed, due to processing of messages (list items) in
sequence. Multiple threads could access the kernel by sending it a
message. Maybe I am wrong on this, but by making the list message
passer have n queues for an n-way SMP configuration, and having one
dispatch thread sending messages to all other tasks, only get-new-message
has to be locked.
cheers.
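If I follow the idea, it's something like this single-dispatcher inbox: only the enqueue/dequeue step is locked, and everything else is serialized simply by processing messages in order. (A made-up sketch, nothing Amiga-specific.)

```c
#include <pthread.h>

#define QSIZE 64

/* The only shared, locked structure: a FIFO of requests.  One
 * dispatcher thread drains it, so the handlers themselves need no
 * further locking -- message order serializes everything. */
struct msgq {
    pthread_mutex_t lock;       /* guards only enqueue/dequeue */
    int head, tail;
    void *slot[QSIZE];
};

int msgq_send(struct msgq *q, void *msg)
{
    pthread_mutex_lock(&q->lock);
    int next = (q->tail + 1) % QSIZE;
    if (next == q->head) {                      /* queue full */
        pthread_mutex_unlock(&q->lock);
        return -1;
    }
    q->slot[q->tail] = msg;
    q->tail = next;
    pthread_mutex_unlock(&q->lock);
    return 0;
}

void *msgq_recv(struct msgq *q)                 /* the one dispatcher */
{
    pthread_mutex_lock(&q->lock);
    void *msg = NULL;
    if (q->head != q->tail) {
        msg = q->slot[q->head];
        q->head = (q->head + 1) % QSIZE;
    }
    pthread_mutex_unlock(&q->lock);
    return msg;
}
```

The critical section is a few pointer updates, so the lock is held for almost no time, which matches the "locking only on the list routine, which was fast" observation.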
I agree that parallelizing arbitrary programs can be difficult.  But
with the advent of high core count machines, I'm prepared to take a
radically different approach to multi-threading.  Even to the point
of multi-threading things I considered only single threading because
of various existing overheads.  That advent may be a way off since it
will be a while before I get my hands on a machine with enough cores
to make a difference.  
If an algorithm is lock-free according to Herlihy's definition, then
some thread can make forward progress no matter what.  If there's
an issue about fairness, the ability of all threads to make forward
progress, with the interlocked instructions used in lock-free under
extreme contention, then lock based solutions will have a problem
also, since most lock implementations that I'm aware of use interlocked
instructions also.
I have a lock-free vs. lock testcase where I turn up the contention
scale to 11.  Lock-free scales better and has better forward progress
guarantees.  The lock-based solutions at best would run extremely slow,
or stall some or all of the threads.
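For reference, the canonical CAS-loop shape under discussion, lock-free in Herlihy's sense: a failed CAS means some *other* thread's CAS succeeded, so the system as a whole always makes progress, even though one individual thread might retry indefinitely under extreme contention. (A minimal illustrative counter, not Joe's actual testcase.)

```c
#include <stdatomic.h>

static atomic_long total;

/* Lock-free add: read, compute, and attempt to swap in the result.
 * On CAS failure 'old' is reloaded with the current value and we
 * simply recompute and retry. */
long lf_add(long n)
{
    long old = atomic_load(&total);
    while (!atomic_compare_exchange_weak(&total, &old, old + n))
        ;   /* another thread won; retry with the refreshed 'old' */
    return old + n;
}
```

Note what is *not* guaranteed: nothing says any particular thread's CAS ever succeeds, which is exactly the fairness question raised above.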
A lock-free system then, but it needs locks to do the double-compare
thing on single-word machines? Or an atomic write-jump of the proc with
write access to the result??
An md5 hash of a whitespace-removed { } type block and auto-generation
of stack I/O for the block makes for extremely fine threading? As
variable overlap could be used in scheduling of dependencies??
It needs the source though, or else all loop segments based on opcodes.
But how is nesting discovered? By execution-flow analysis?? And
removable graphs with one in and one out path??
cheers.
threaded character AIs??
> > The obvious exception is in the servers for multiplayer games, but for
> > end users this is a hard problem.
> >
>
> I agree that parallelizing arbitrary programs can be difficult.  But
> with the advent of high core count machines, I'm prepared to take a
> radically different approach to multi-threading.  Even to the point
> of multi-threading things I considered only single threading because
> of various existing overheads.  That advent may be a way off since it
> will be a while before I get my hands on a machine with enough cores
> to make a difference.
See http:indi.joox.net for my 222-logic-element MAX II CPLD processor
(16 bit), for the type of simple core that could become available.
It would currently support only co-operative multithreading, as it needs
mods to allow access to the P reg (program counter) from interrupt-mode
space. It's tiny.
cheers.
Yes and no.  Changing a procedural design to an object-oriented one
isn't hard, and even changing an imperative one to a functional one
is quite feasible.  Of course, the objective there is not performance
but clarity, extensibility and reliability, but even so all experience
is that merely converting old codes (without redesigning) rarely helps.
But converting a serial code to a fine-grain parallel one turns out
to be evil, even if all you are looking for is correctness and damn
any performance gain.
|> > Denying the difficulty of that issue is the great myth of SMP programming.
|> 
|> I must have missed that myth:  I've never considered it easy myself, 
|> even after three decades' worth of experience doing it.
Oh, it's quite widespread - and is usually propagated by people who
have done it only in theory or on carefully selected demonstration
examples.
The first form in which I came across it was that ALL you had to do
was to write in a functional language, and it would parallelise
automatically.  There is a theoretical sense in which that is true,
but it was rapidly discovered to be of no practical relevance.
The most recent form is that you can take an arbitrary serial code,
add some OpenMP (or UPC or ...) directives and tweak it a bit, and
it will run in parallel.  That one is still being propagated, too.
Regards,
Nick Maclaren.
As I said I will also try to look at this 'out of the box', but I don't 
expect it to be easy.
> of multi-threading things I considered only single threading because
> of various existing overheads.  That advent may be a way off since it
> will be a while before I get my hands on a machine with enough cores
> to make a difference. 
Even my laptops will have multiple cores from now on, but I don't know 
exactly when I'll get a desktop machine with 8, 16 or even more.
Quite.
Given the usual minimum overhead of a factor of 2, parallelising
most applications with below 4 cores is a waste of time (and Intel
Hyperthreading doesn't count), and not really worth the effort until
you have 8.  And some of the more interesting approaches don't really
become worthwhile until you have a lot more than that.
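Nick's factor-of-2 figure is easy to plug into Amdahl's law: if the parallel fraction costs k times the serial work, speedup on n cores is 1/(s + k(1-s)/n). With serial fraction s = 0 and overhead k = 2 you merely break even at 2 cores and only reach 2x at 4, which matches his claim. A trivial sketch:

```c
/* Amdahl's law with a constant-factor parallelization overhead k:
 * the serial fraction s runs as before, while the parallel fraction
 * does k times the work spread over n cores. */
double speedup(double s, double k, int n)
{
    return 1.0 / (s + k * (1.0 - s) / n);
}
```

Any nonzero serial fraction only makes the picture worse, pushing the break-even core count still higher.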
Regards,
Nick Maclaren.
Their point is well taken, but this is a poor example.
The article states that because cores #2 and #3 would wait on
core #1, there is no advantage to more than 2 cores. This is not
necessarily true, as core #1 can still contribute to lowering the
overall cycle time. Whether it is worthwhile is the question.
I think the biggest consideration is the market size (of multi-core
platforms) vs cost to parallelize, including the cost of finding
all the nasty race conditions, vs what your competitor is doing.
IOW the question cannot be considered in isolation.
> One of the problems I suspect is that they're taking programmers who've
> never had any substantial experience doing multi-threaded programming
> and expecting them to become instantly saavy in multi-threading.
There are really two classes of multi-threaded programs.
The first is the true concurrent execution of parallel algorithms.
I don't mean only things like weather-prediction programs, but also
apps where the problem can be decomposed into multiple independent
items in a work queue being chewed on by multiple processors.
From what I see in the public articles, in the vast majority of
multi-threaded programs the threads are used not for concurrency
but simply to separate program state. The two most popular
examples are having a thread per client to separate variables,
or threads whose sole purpose in life is to wait for something else
(an I/O or some other server) because the app designer was unable or
unwilling to perform the waiting operation asynchronously.
The separate-state class of apps appears by far to be the most
prevalent, and is unlikely to be helped at all by multiple cores.
In fact it has been my experience that such apps usually run slower
in true multi-processor environments, which also expose all the
latent race conditions that remained hidden on uniprocessors.
Eric
"High count", of course, is relative.
> OTOH, I have also spent quite a few hours with John Carmack, who by
> all accounts is one of the all-time great programmers, and he didn't
> see any way to really take advantage of multiple cores for gaming.
With all due respect to John (he _is_ a great game coder), I just read
about a new game engine from one of his competitors that uses five
threads: gameplay, graphics, physics, AI, and sound IIRC.  It
keeps a quad-core system busy, is playable at reduced quality on
dual-core systems, and is dead in the water on a single-core SMT
system -- and the quality blows away current games.
There are people who can come up with innovative ways to use new
hardware, and they're rarely the folks who built their fortunes on the
last innovation.  That's what makes being in tech so fun.
It's not one giant data structure; it's one binary image in memory. 
There are tens to hundreds of thousands of data structures for various 
specialized purposes, linked to each other in convoluted ways.  You only 
need to synchronize access to the same individual structure (or 
structures that point to it in some cases), not the entire kernel.  The 
BKL approach was downright flawed, Linus will readily admit, but it 
worked fine for the first decade or so until SMP finally caught on (or, 
alternately, until Linux became a contender for places SMP was already 
in use).
> One form is locking, but locking doesn't scale well.  Lock-free scales
> better, and it's one thing Linux has been using more and more of lately,
> namely in the form of RCU.
There are still spinlocks and such all over the place.  I know RCU is 
getting more common, but it's an optimization, not a new approach.  The 
truly new approach was breaking things out from under the BKL into their 
own synchronization mechanism; whether each structure uses a spinlock, 
RCU, or something else will vary and is architecturally uninteresting 
(though possibly important for performance reasons).
>> Vista has no advantage here.  If you have two threads wanting to 
>> access the TCP/IP stack, there has to be some sort of locking so that 
>> they don't stomp on each other.  The mechanism may be different than 
>> Linux, or it may even be the same.  The same work still needs to be 
>> done, and both sides have clever folks working on finding better and 
>> better solutions.
>
> The stack is basically just code.  You only need synchronization if 
> the
> threads are doing something with something in common, e.g. the same
> TCP/IP connection.
Different threads have to access common data structures to add and 
remove connections, get a single queue (from the driver) of incoming 
packets demultiplexed, etc.  It's only _user_ threads that would be 
sharing a single socket, and that's outside the scope of the problem 
under discussion.
...
>> Locking works just fine in the absence of high local levels of 
>> contention - as long as one manages the potential for deadlocks 
>> (either by avoiding them - e.g., by imposing lock-acquisition ordering 
>> constraints - or by resolving them through detection and selective 
>> rejection).  The main advantages of 'optimistic' alternatives in 
>> low-contention situations are potentially simpler implementation and 
>> slightly lower resource usage.
>>
>> Locking often works *better* than more 'optimistic' lock-free 
>> alternatives in the *presence* of high levels of contention (in 
>> particular, it can guarantee forward progress without compromising 
>> data currency).
> 
> If an algorithm is lock-free according to Herlihy's definition, then
> some thread can make forward progress no matter what.
That is insufficient in many circumstances, since it guarantees nothing 
about avoiding perpetual livelock for some subset of the threads (unless 
the thread exit rate exceeds the thread entry rate, in which case 
eventually only one thread will be left).
   If there's
> an issue about fairness,
There's usually *some* issue about fairness, but as long as unfairness 
doesn't get too out-of-hand it can often be tolerated.
  the ability of all threads to make forward
> progress, with the interlocked instructions used in lock-free under
> extreme contention, then lock based solutions will have a problem
> also, since most lock implementations that I'm aware of use interlocked
> instructions also.
It is possible that we're talking about two different things, then:  I'm 
talking about the difference between approaches that reserve (typically 
shared or exclusive) access to some object for a period of time, as 
distinguished from approaches which find ways to avoid reserving access 
to individual objects (e.g., by creating private copies or by detecting 
inadmissible sharing after the fact and acting at that time to eliminate 
it retroactively).
While the underlying low-level hardware mechanisms may be common to both 
traditional locking and to 'lock-free' approaches, forward-progress 
guarantees (save for any need to address inequities introduced by NUMA 
latency) are usually more critical to ensure at the higher 
software-contention level (where the actual logical locking - or not - 
persists) than at this hardware level (which is merely the mechanism by 
which the logical lock - or 'lock-free' mechanism - is created).
In fine-grained situations where the duration of the required 
synchronization starts to approach the time it takes to execute the 
interlocking hardware instruction(s) themselves, 'lock-free' approaches 
are great when they can be practically applied.  For lengthier 
synchronization in specific situations where use of stale copies of data 
can be tolerated they are also effective (read-only transactions in a 
multi-version database being a long-standing example).  But while the 
theory behind more general-purpose lock-free ('optimistic') database 
concurrency control has been well-understood for at least the past 
quarter-century, its near-complete absence in products testifies to the 
superiority of lock-based approaches for such general-purpose use.
- bill
On 10/8/06 4:49 AM, in article
452806c8$0$1346$834e...@reader.greatnowhere.com, "Eric P."
<eric_p...@sympaticoREMOVE.ca> wrote:
> Joe Seigh wrote:
>> 
>> One of the things you can do is deny that your application needs all
>> that many cores.
>> 
>> http://www.theinquirer.net/default.aspx?article=34916
> 
> Their point is well taken, but this is a poor example.
> The article states that because cores #2 and #3 would wait on
> core #1 that there is no advantage to more than 2 cores. This is not
> necessarily true as core #1 can still contribute to lowering the
> overall cycle time. Whether it is worthwhile is the question.
Why is it (almost) always necessary to run all the cores at equal speed?
Would it not be prudent to assign accelerated algorithms to threads (thread
= core) clocked *higher*? Here "higher" is a relative term; by higher I
mean at least an order of 2. If I understand this (thanks, Prof. Knuth), we
are still trying to find upper and lower bounds on the complexities of
multi-core. If it is functionally possible to re-model O(n) into O(1), the
converse is also true. As for the number of cores on a die, it is quadratic
anyway. Let me say the inputs provided here are functions of core
clocking; what would be the effect? Or am I trying to optimize away some
absurd random scenario?
Thanks.
-- 
First of all - I am an Engineer. I care less for Copyrights/Patents, at
least I have none of my own! I love software development & it pays me to run
my family. I try to dedicate some time thinking about Open Source movement &
sometime contributing to it actually. I often get paid by claiming knowledge
in software developed by Open Source community. Lots of things I know today
& still learning are due to Open Source community.
One reason is that most (all?) SMP OSes require all CPUs to be
exactly equal, including rev level, so they don't have to try to
deal with any asymmetric behavior. You don't want to have to deal
with, say, an FDIV bug that only shows up on one CPU but not another.
So that would seem to preclude different-speed CPUs in an SMP system.
Eric
[...]
>> If an algorithm is lock-free according to Herlihy's definition, then
>> some thread can make forward progress no matter what.
>
> That is insufficient in many circumstances, since it guarantees nothing 
> about avoiding perpetual livelock for some subset of the threads (unless 
> the thread exit rate exceeds the thread entry rate, in which case 
> eventually only one thread will be left).
I do agree that some of the "loop-based" lock-free algorithms (e.g., LL/SC 
and certain "CAS-loop" logic) can theoretically suffer from "livelock-like 
episodes":
http://groups.google.com/group/comp.arch/msg/5d1ff313519005f8
That's why I "try" to make use of loopless lock-free algorithms whenever I can. 
Although, and FWIW, lock-free algorithms are much better than 
obstruction-free algorithms wrt livelock. My basic point is that loopless 
lock-free algorithms simply cannot suffer from the type of livelock-like 
scenarios that you are rightly describing here:
http://groups.google.com/group/comp.lang.c++.moderated/msg/6b2ccf76ba145c9c
Also, loopless lock-free algorithms are about the "only" type of lock-free 
algorithms that can be "safely" used in any sort of hard real-time system:
http://groups.google.com/group/comp.sys.super/msg/642b6b608294e426
(I briefly mention lock-free wrt hard real-time near the end of the post)
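An example of what "loopless" means here: a primitive like C11's atomic_fetch_add completes in a single interlocked instruction (LOCK XADD on x86) with no retry loop, so every caller finishes in bounded time regardless of contention, which is the property a hard real-time system needs.

```c
#include <stdatomic.h>

static atomic_long tickets;

/* Loopless (indeed wait-free) ticket dispenser: one interlocked
 * instruction per call, no CAS retry loop to livelock in. */
long take_ticket(void)
{
    return atomic_fetch_add(&tickets, 1);
}
```

Contrast this with a CAS loop, whose per-call latency under contention is unbounded even though the algorithm as a whole is lock-free.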
>   If there's
>> an issue about fairness,
>
> There's usually *some* issue about fairness, but as long as unfairness 
> doesn't get too out-of-hand it can often be tolerated.
Priority inversions and lock convoys should be factored into any locking 
scheme... It's a good thing that clever use of lock-free and lock-based 
synchronization techniques can provide a solution to priority inversions and 
lock convoys... Is that what you are getting at here?
>  the ability of all threads to make forward
>> progress, with the interlocked instructions used in lock-free under
>> extreme contention, then lock based solutions will have a problem
>> also, since most lock implementations that I'm aware of use interlocked
>> instructions also.
>
> It is possible that we're talking about two different things, then:  I'm 
> talking about the difference between approaches that reserve (typically 
> shared or exclusive) access to some object for a period of time, as 
> distinguished from approaches which find ways to avoid reserving access to 
> individual objects (e.g., by creating private copies or by detecting 
> inadmissible sharing after the fact and acting at that time to eliminate 
> it retroactively).
>
> While the underlying low-level hardware mechanisms may be common to both 
> traditional locking and to 'lock-free' approaches,
Indeed.
> forward-progress guarantees (save for any need to address inequities 
> introduced by NUMA latency) are usually more critical to ensure at the 
> higher software-contention level (where the actual logical locking - or 
> not - persists) than at this hardware level (which is merely the mechanism 
> by which the logical lock - or 'lock-free' mechanism - is created).
>
> In fine-grained situations where the duration of the required 
> synchronization starts to approach the time it takes to execute the 
> interlocking hardware instruction(s) themselves, 'lock-free' approaches 
> are great when they can be practically applied.  For lengthier 
> synchronization in specific situations where use of stale copies of data 
> can be tolerated they are also effective (read-only transactions in a 
> multi-version database being a long-standing example).
I wonder how many commercial databases actually make "extensive" use of 
lock-free reader patterns?
https://coolthreads.dev.java.net/servlets/ProjectForumMessageView?forumID=1797&messageID=11068
Also, there are algorithms that can provide a solution to the fact that 
lock-free reader patterns can indeed iterate through some pointers that 
point to stale/old data structures:
http://groups.google.com/group/comp.programming.threads/msg/fdc665e616176dce
(I address stale data in the last 2 paragraphs of the post...)
Generally, **PDR can outperform fine-grain rw-locking by a fairly wide 
margin. Locking has some overhead:
http://groups.google.com/group/comp.lang.c++.moderated/msg/d3a6db6ba9e7fd95
http://groups.google.com/group/comp.lang.c++.moderated/msg/7c15b3488642d1a6
http://groups.google.com/group/comp.lang.c++.moderated/msg/a8ef173bee4f0c03
> But while the theory behind more general-purpose lock-free ('optimistic') 
> database concurrency control has been well-understood for at least the 
> past quarter-century, its near-complete absence in products testifies to 
> the superiority of lock-based approaches for such general-purpose use.
Most of the lock-free ('optimistic') concurrency control schemes are based 
around suboptimal lock-free techniques. They all seem to try to be 
wait-free, and have the same livelock problems that obstruction-free 
algorithms have. I actually would prefer to use a highly-tuned fine-grain 
locking scheme over any obstruction-free algorithm:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/e0c011baf08844c4
Humm... I need to think some more about the practical issues you raised...
Thank You,
Chris Thomasson
* I remember hearing somewhere that CAS can be "fair" when it's implemented 
in the hardware via locking (e.g., of the bus). Something like this:
http://groups.google.com/group/comp.arch/msg/18f0d50aa3c9a148
I think... Humm...
** PDR == Partial Copy-On-Write Deferred Reclamation:
http://groups.google.com/group/comp.arch/msg/2a0f4163f8e13f1e
(follow the links for basic descriptions/discussions of PDR) 
> Why is it (almost/always) necessary to run all the cores at equal 
> speed? Would it not be prudent to assign accelerated algorithms to
> threads (thread = core) clocked at *higher*. 
Because it's /simpler/. Take an example: say you want to make a four-core 
processor, with one really fast core, two medium-speed, and one slowish 
one for background tasks. 
You can achieve this at a uniform clock speed by making the cores of 
different designs: for a crude Intel example, a single Conroe core, two 
Willamettes and a Pentium. But then you have to get them all working 
together right, and reliably. And that's harder than having four uniform 
cores work together reliably. 
Alternatively, you can have cores of uniform design, and run them at 
different clock speeds. That sounds attractive in a way, because it 
means that you have a use for parts where only one core will run at the 
maximum designed speed. 
But then, is software to rely on the relative speed of these cores? Say 
they're running at X, 0.75X, 0.75X and 0.3X. Now, the chip manufacturer 
has a process shrink, and finds he can make it run at 1.5X, 1.2X, 1.2X 
and 0.5X. Suddenly, the medium-speed cores are running a bit faster 
relative to the top-speed one, and the slow one a bit slower. If your 
software relied on the relative speeds, it's all messed up. So you don't 
rely on the relative speeds. And then a processor with four identical 
cores is just as good. 
Since software gets run on all sorts of hardware, quite often hardware 
which didn't exist when the software was finalised, keeping things 
simple is just much /easier/. 
--- 
John Dallman                                        j...@cix.co.uk
"Any sufficiently advanced technology is indistinguishable from a 
well-rigged demo"
EP> So that would seem to preclude different speed cpus in an smp
EP> system.
Different-speed CPUs are coming, in the form of multicore CPUs which
can do power saving on each core. That way a pure single-threaded
workload will keep every core except one at a very low clock
frequency.
Operating systems will need to be adjusted to cope, of course.
Timekeeping can be fun, for one thing. Also, if you have just a single
thread running at half speed, and another thread comes along, do you
schedule it on the same core or do you wake another core from deep
sleep?
/Benny
Which is why I stated a bit later that I was waiting for at least 8 to 
16 cores, i.e. this is the minimum of what I'd consider 'high count'.
> 
>> OTOH, I have also spent quite a few hours with John Carmack, who by
>> all accounts is one of the all-time great programmers, and he didn't
>> see any way to really take advantage of multiple cores for gaming.
> 
> With all due respect to John (he _is_ a great game coder), I just read
> about a new game engine from one of his competitors that uses five
> threads: gameplay, graphics, physics, AI, and sound IIRC.  It
> keeps a quad-core system busy, is playable at reduced quality on
> dual-core systems, and is dead in the water on a single-core SMT
> system -- and the quality blows away current games.
This is a good example of multithreaded programming, also in the sense 
that it is almost completely non-symmetrical: You have a big main 
thread/program, and a bunch of helpers that all depend to a smaller or 
larger degree on what the main thread is doing.
I.e. 3D sound has to use the same (or a simplified) model of the world 
as the main game engine, so does AI and physics.
You can debate exactly how much belongs in 'gameplay' and how much in 
'graphics', but it is obvious that they have to overlap to at least some extent.
> There are people who can come up with innovative ways to use new
> hardware, and they're rarely the folks who built their fortunes on the
> last innovation.  That's what makes being in tech so fun.
Sure, I wouldn't discount John C quite yet though! :-)
...
>>   If there's
>>> an issue about fairness,
>> There's usually *some* issue about fairness, but as long as unfairness 
>> doesn't get too out-of-hand it can often be tolerated.
> 
> Priority inversions and lock-convoys should be factored into any locking 
> scheme
Indeed, though my specific comment referred to the ability to tolerate 
moderate but statistically insignificant unfairness at the hardware 
level that Joe mentioned as long as higher-level fairness is reasonably 
assured (effectively, as long as lock-duration greatly exceeds average 
lock-acquisition latency such that accessors who request locks at 
approximately the same time will eventually acquire them in 
approximately the order of their requests).
...
>> For lengthier
>> synchronization in specific situations where use of stale copies of data 
>> can be tolerated they are also effective (read-only transactions in a 
>> multi-version database being a long-standing example).
> 
> I wonder how many commercial databases actually make "extensive" use of 
> lock-free reader patterns?
Oracle has supported them for quite a while (rather than maintain 
multiple versions of data, it uses its undo logs to back out later 
changes, regressing pages for non-locking read-only transactions to an 
appropriate point in time).  IIRC DEC's (now for a decade Oracle's) Rdb 
does something like a snapshot to perform a similar service.  Interbase 
began life (also at DEC) by maintaining prior versions for both such 
read-only accessors and IIRC for transaction backout as well.  Postgres 
(perhaps also its predecessor Ingres) employs some form of no-overwrite 
update policy which, as I understand it, serves this purpose.
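The version-chain idea behind all of these non-locking reads can be sketched in a few lines of C. This is a hedged illustration only: the struct and function names are hypothetical, and no vendor's actual on-disk mechanism looks like this -- the point is just that a reader walks committed versions by snapshot id and never takes a lock.

```c
#include <stddef.h>

/* Hypothetical sketch of multi-version reads: each row keeps a chain of
 * versions stamped with the id of the transaction that wrote them.  A
 * read-only transaction carries a snapshot id and walks the chain until
 * it finds the newest version it is allowed to see -- no locks taken. */
typedef struct version {
    unsigned long   txn_id;   /* id of the writing transaction */
    int             value;    /* the row payload (simplified)  */
    struct version *older;    /* next-older committed version  */
} version;

/* Return the value visible to a snapshot taken at 'snapshot_id',
 * or 'missing' if the row did not yet exist at that point. */
int snapshot_read(const version *newest, unsigned long snapshot_id,
                  int missing)
{
    for (const version *v = newest; v != NULL; v = v->older)
        if (v->txn_id <= snapshot_id)
            return v->value;
    return missing;
}
```

A writer prepends a new version to the chain; concurrent readers keep seeing the older one, which is the whole trick.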
Those are just the ones that spring immediately to mind, and I simply 
don't know whether similar mechanisms exist in other current and former 
major players such as SQL Server, Sybase, Informix, etc. (though without 
knowing anything specific I'd expect that DB2 does:  they seldom pass up 
anything useful there).
- bill
Those still have to be the same make and model cpu. And I'm
sure changing the cpu clock divider on the fly presents a whole
new set of design problems and glitches.
But off the top of my head, I don't see how this would be of much
practical value. Currently if there is no work for a cpu the OS will
pause or power down anyway. And as soon as there is a computable
thread, a general purpose OS would run it as fast as possible because
it has no knowledge of thread speed or deadlines.
Now one might envision that threads could declare themselves as low
power candidates, or maybe define that priority 0 is low speed,
but only a few apps might ever do this so I don't see how this
would accomplish anything in the end.
A real time deadline based OS might use such a feature
but that is a whole different set of design issues and
they would probably not use SMP anyway.
Eric
EP> But off the top of my head, I don't see how this would be of much
EP> practical value. Currently if there is no work for a cpu the OS
EP> will pause or power down anyway. And as soon as there is a
EP> computable thread, a general purpose OS would run it as fast as
EP> possible because it has no knowledge of thread speed or deadlines.
These days speed changes are usually initiated from user space -- a
user space daemon notices that the CPU is 40% idle and slows down
the CPU to e.g. 70%. This is handy for watching video and some games.
The downside is that it takes a moment for the daemon to notice a new
CPU intensive program, so latency can increase.
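The decision such a userspace daemon makes boils down to two counter samples and a division. A minimal C sketch of just that logic (the types and thresholds here are hypothetical, not any particular governor's code; a real daemon would read the counters from the OS and write the speed back through its cpufreq interface):

```c
/* Hypothetical sketch of a userspace cpufreq daemon's decision logic:
 * sample cumulative (idle, total) jiffy counters twice, compute the
 * idle share of the interval, and ask for a slower clock when the CPU
 * is mostly idle. */
typedef struct {
    unsigned long idle;   /* cumulative idle jiffies  */
    unsigned long total;  /* cumulative total jiffies */
} cpu_sample;

/* Percentage of the interval [prev, cur] the CPU spent idle (0..100). */
int idle_percent(cpu_sample prev, cpu_sample cur)
{
    unsigned long dt = cur.total - prev.total;
    if (dt == 0)
        return 0;
    return (int)((cur.idle - prev.idle) * 100 / dt);
}

/* Policy: drop to a slower speed when idle share exceeds a threshold. */
int want_slow_speed(cpu_sample prev, cpu_sample cur, int idle_threshold)
{
    return idle_percent(prev, cur) > idle_threshold;
}
```

The latency complaint above falls straight out of this structure: a burst of work only shows up in the *next* pair of samples, so the clock lags the load by one polling interval.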
EP> Now one might envision that threads could declare themselves as
EP> low power candidates, or maybe define that priority 0 is low
EP> speed, but only a few apps might ever do this so I don't see how
EP> this would accomplish anything in the end.
It is possible for the daemon to decide that anything with a nice
value larger than 0 should not count in the busy percentage.
/Benny
> Because it's /simpler/. Take an example: say you want to make a 
> four-core processor, with one really fast core, two medium-speed, and 
> one slowish one for background tasks.
> You can achieve this at a uniform clock speed by making the cores of 
> different designs: for a crude Intel example, a single Conroe core, two 
> Willamettes and a Pentium. But then you have to get them all working 
> together right, and reliably. And that's harder than having four 
> uniform cores work together reliably.
I hear rumours that people also ship 8-core processors where some of 
the cores have failed as 6 or 4 core units.  But I never know if this 
actually happens. 
the one call operating system was something i've thought about.
in_from_que = tx(out_to_que)
self is possible out??
tx works in the following manner:
1. fill the data field of the message element
2. fill the task field of the message element
3. pick up new message element if available or yield until message
demon runs it again.
a message demon task (only one) goes through all message elements and
adapts the message in pointer of a task to be one of the messages for
it.  by changing the task field element of a message to the task number
of an outgoing task the packet is marked as read, modified and sent to
another task..
so you pick up a message when you pass one out.
txing a message may be a registration of a service, which waits for rx
packets.
process ids would have an instance count, to support multi threading of
the same message handler, unless registered as a single instance
service.
this structure can be built using lists, with only the message demon
handling list pointers. multiple list/message demons could run if
messages held a demon id.
processor load synchronization is via sync() which sends a message to
the scheduler indicating its tick requirement. a counter is increased
for the task and scheduling happens on the lowest (count-currentbase).
a task could be double ticked to reduce priority, preempted on the do
something else interval.
auto currentbase over-estimation detection could be used to double tick
non-criticals.
chee
I think Chris is referring to whether the implementation is largely
lock-free, as opposed to the external api being lock-free.  The only database I
know that claims this is this one
http://www.ants.com/index.php?option=com_content&task=view&id=525&Itemid=79#lockfree
I looked at their patents at one point.  One was for a lock-free LIFO queue using
double wide compare and swap, and the other was I think some kind of lock-free
FIFO queue using producer/consumer logic.  I'm not sure how they're doing the
lock-free index.
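For flavor, here is a hedged sketch of the kind of lock-free LIFO that first patent describes: a Treiber stack whose head carries a version counter to defeat ABA. The patent reportedly uses a double-wide CAS pairing a full pointer with the counter; to stay portable this sketch packs a 32-bit node *index* and a 32-bit version into a single 64-bit word instead -- same idea, narrower address space. Everything here (names, pool, NIL sentinel) is my own illustration, not ANTs' actual code:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Lock-free LIFO (Treiber stack), ABA-proofed by a version counter.
 * head = (version << 32) | top-index; every successful update bumps
 * the version so a recycled index can't be mistaken for an old one. */
#define NIL 0xffffffffu

typedef struct {
    uint32_t next;    /* index of next node in the stack, or NIL */
    int      value;
} node;

static node pool[64];                    /* nodes live in a fixed pool */
static _Atomic uint64_t head = NIL;      /* version 0, empty stack     */

static uint64_t pack(uint32_t ver, uint32_t idx)
{
    return ((uint64_t)ver << 32) | idx;
}

void lifo_push(uint32_t idx)
{
    uint64_t old = atomic_load(&head);
    do {
        pool[idx].next = (uint32_t)old;  /* link new node to old top */
    } while (!atomic_compare_exchange_weak(&head, &old,
                 pack((uint32_t)(old >> 32) + 1, idx)));
}

/* Returns the popped node's index, or NIL if the stack was empty. */
uint32_t lifo_pop(void)
{
    uint64_t old = atomic_load(&head);
    do {
        if ((uint32_t)old == NIL)
            return NIL;
    } while (!atomic_compare_exchange_weak(&head, &old,
                 pack((uint32_t)(old >> 32) + 1,
                      pool[(uint32_t)old].next)));
    return (uint32_t)old;
}
```

Without the version half of the word, a thread preempted inside pop could CAS successfully against a head that had been popped and re-pushed in the meantime -- the classic ABA failure the double-wide scheme exists to prevent.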
1.  Since he was responding specifically to the example of lock-free 
read-only database access that I had provided, this seems unlikely.
2.  That example has nothing to do with 'the external api', since 
typical database use does not address locking in its api at all (i.e., 
any locking is a strictly internal mechanism supporting the implicit 
isolation semantics of the higher-level api).
3.  The (usually row- or page-granularity) isolation level under 
discussion in this instance is the most important one in terms of 
database performance:  by comparison, performance variations achievable 
in lower-level synchronization mechanisms tend to be well down in the 
noise, and hence of relatively limited interest.
- bill
AI, physics, even the harder tasks of rendering (e.g. light maps) can be
calculated in parallel. The rendering itself already is parallelized even
if John Carmack doesn't realize it is (SLI) - all he does on his side is
calling OpenGL functions.
I think the main problem is when you just have two cores - the overhead will
eat up most of the speedup.
-- 
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/
Changing a program that was built with a different paradigm in mind is
always easier said than done. That's why it is often easier to ditch the
legacy program when significant changes come around, and rewrite from
scratch. This is often harder said than done, since most people weigh the
effort already invested into the project against the effort of the
rewrite, instead of weighing the effort of the rewrite against
the effort of changing the ill-adapted solution of the past.
>>
>> I think Chris is referring to whether the implementation is largely
>> lock-free, as opposed to the external api being lock-free.
> 
> 
> 1.  Since he was responding specifically to the example of lock-free 
> read-only database access that I had provided, this seems unlikely.
This question has come up before in other newsgroups.  That's what
he's talking about.
> 
> 2.  That example has nothing to do with 'the external api', since 
> typical database use does not address locking in its api at all (i.e., 
> any locking is a strictly internal mechanism supporting the implicit 
> isolation semantics of the higher-level api).
> 
> 3.  The (usually row- or page-granularity) isolation level under 
> discussion in this instance is the most important one in terms of 
> database performance:  by comparison, performance variations achievable 
> in lower-level synchronization mechanisms tend to be well down in the 
> noise, and hence of relatively limited interest.
> 
Maybe for OLTP. I wouldn't say that's necessarily true for OLAP.
...
>> 3.  The (usually row- or page-granularity) isolation level under 
>> discussion in this instance is the most important one in terms of 
>> database performance:  by comparison, performance variations 
>> achievable in lower-level synchronization mechanisms tend to be well 
>> down in the noise, and hence of relatively limited interest.
>>
> 
> Maybe for OLTP.  I wouldn't say that's necessarily true for OLAP.
It should be when the OLAP activity is occurring on a database 
concurrently shared with OLTP processing.  And when the database isn't 
being so shared, since most of the OLAP access is read-only there won't 
be any noticeable locking overhead (I suppose you could suggest that 
this then exposes underlying synchronization overheads more, but it's 
not at all clear that they'd constitute a significant portion of the 
overall processing load in such cases).
- bill
> Why is it (almost/always) necessary to run all the cores at equal speed?
> Would it not be prudent to assign accelerated algorithms to threads (thread
> = core) clocked at *higher*. 
Why not clock all of them at maximum?  Perhaps you could save a bit of
power if you have one thread that consumes a lot more cpu than all the
others together, but I can't see how it's worth adding a lot of
complexity to an SMP system in order to support heavily skewed
parallelism.
-k
-- 
If I haven't seen further, it is by standing in the footprints of giants
;)
"Bill Todd" <bill...@metrocast.net> wrote in message 
news:nJCdnbkrGYrNBrTY...@metrocastcablevision.com...
> Joe Seigh wrote:
>> Bill Todd wrote:
>>> Chris Thomasson wrote:
>>>
>>>>
>>>>
>>>> I wonder how many commercial databases actually make "extensive" use of 
>>>> lock-free reader patterns?
>>>
>>>
>>> Oracle has supported them for quite a while (rather than maintain 
>>> multiple versions of data, it uses its undo logs to back out later 
>>> changes to regress pages for non-locking read-only transactions to an 
>>> appropriate point in time.
I wonder exactly how much of an impact maintaining correct undo logs 
actually has on the "read-mostly/only" part of a given synchronization 
scheme? I know that the "flagrant" use of atomic operations and/or memory 
barriers can have a marked negative effect on the throughput, scalability 
and general performance of a synchronization algorithm's shared reader 
operations. You don't really want to be forced to execute an atomic 
operation and/or memory barrier, and basically blow the pipeline and any 
cache that may have accumulated, every time you need to acquire and 
subsequently read from a shared data-structure...
> 1.  Since he was responding specifically to the example of lock-free 
> read-only database access that I had provided, this seems unlikely.
>
> 2.  That example has nothing to do with 'the external api', since typical 
> database use does not address locking in its api at all (i.e., any locking 
> is a strictly internal mechanism supporting the implicit isolation 
> semantics of the higher-level api).
>
> 3.  The (usually row- or page-granularity) isolation level under 
> discussion in this instance is the most important one in terms of database 
> performance:  by comparison, performance variations achievable in 
> lower-level synchronization mechanisms tend to be well down in the noise, 
> and hence of relatively limited interest.
Well, IMHO and FWIW,  I think that a fairly well thought-out, and somewhat 
carefully designed lock-based synchronization strategy is usually way better 
than basically any sort of "crappy" lock-free implementation, or, I should 
more correctly say, emulation(s)... Humm... IMHO, it does "kind of seem" 
that a number of lock-free algorithms that exist, or are being currently 
designed to address the synchronization requirement "for" some different 
types of robust, enterprise-like databases bind their various forward 
progress aspects to, or even explicitly design them for, and/or totally 
depend on some sort of "multi-non-contiguous-word load-linked 
store-conditional" (e.g., MCAS, *KCSS, **STM, perhaps ***HTM) and/or 
contention manager functionality. Apparently, these type of lock-free 
algorithms "cannot seem to survive in the wild" without the aid of atomic 
multi-word LL/SC operations, and/or of course, some "explicit, or implicit?" 
assistance from a nifty "contention manager protocol"; which usually comes 
with its fair share of overhead, all by itself.
For those who happen to be not so familiar with this kind of stuff, a 
contention manager "usually" has to be able to identify and subsequently 
address scenarios in which multiple application threads could be 
experiencing repeated, and somewhat "persistent" failures. This 
could be during moderate to severe, possibly prolonged periods of 
load. They could be trying to make their way through a synchronization 
primitive's implementation (e.g., obstruction-free queue, skiplist, etc...) 
only to find themselves folding under and totally depending on the 
contention manager to give it emergency first-aid. The current "popular" 
trend wrt lock-free seems to be focused on obstruction-free-like 
synchronization techniques. That basically means that some sort of 
contention manager/arbiter logic has to "keep track" of certain things. This 
could, and usually does, involve the need to "query some state from some 
threads that happen to be currently accessing the shared data-structure 
during times of fairly intense, and perhaps prolonged periods of load"...
IMHO, all of the intricate work that "they" put into developing "a working 
multi-word ll/sc emulation" could have gone into developing practical 
usage paradigms aimed at reducing the "total number" of atomic 
operations/membars throughout a number of different synchronization 
infrastructures that their "user" applications make heavy/extensive use of. 
Unfortunately, it seems like these LL/SC emulation things are being 
advertised as a good solution for the "concurrency problem". These hefty 
obstruction-free algorithms, along with their contention managers, can be 
abstracted away into a library that is transparent to the end-user database 
API. I mean, does the end-user really care if the database he or she is 
making use of is implemented in a somewhat suboptimal manner? Well, they 
might start to care... Maybe... I could picture a company that has a perfect 
application that runs great and very fast on uni-processor, and 
un-deterministically dies from some "mystery race-condition" after a couple 
of days/weeks of running its "perfectly and obviously flawed logic" through 
a multi-processor system. How do you fix a race condition when your 
application is massive, and you never even had to think about this bug 
simply because it never showed up across years of a track record of 
successful reliability and good performance characteristics?
IMHO, all of this basically boils down to the contention manager performing 
introspection duties which should be really be used for profiling, 
experimenting and/or debugging application architecture in order to begin to 
leverage some of the various results against any experimental code 
optimizations you may have created. I guess the basic point I am getting at 
here is that an algorithm that needs to "basically rely" on some fairly 
intricate introspection techniques to guide it through periods of 
high contention has, IMHO, already been suffering from an inherent and 
somewhat severe disability, from birth. Once you design any algorithm to 
depend on a contention manager, you are basically folding the performance of 
some key areas of your algorithm under an unknown set of rules. The 
contention manager's behavior can't really be predicted by a "user" 
algorithm. So, you are kind of "blindly" binding some of the performance 
critical aspects of your algorithm to an unknown third-party in order to 
guide your threads through your application's points of contention, when the 
lights are out! Yikes!
;)
I have not looked at ANT Software's prior art. The patent(s) they hold seem 
to classify their "type" of lock-free algorithms into the "writer" category. I 
find it to be very useful when your synchronization logic can differentiate 
a "writer" thread, from a "reader" thread. You can actually implement this 
in 100% user-space using about a page's worth of SPARC 32/64 - IA-32 assembly 
language. Here is a suboptimal full-blown working example:
http://groups.google.com/group/comp.arch/browse_frm/thread/b2f751d5dad6c07b
I have to admit that even though this has the inherent ability to outperform 
basically any rw-mutex implementation, barring some hardware-assisted 
super ****rw-mutex of the future, it still has to use atomic operations and 
memory barriers. However, the presented code has scalability characteristics 
that exceed anything provided by most rw-mutex implementations. It has the 
ability to create a scenario in which writers and readers are no longer 
mutually exclusive, and it keeps any piece of memory that a thread has 
acquired in a PDR-collected region alive as long as any entity (e.g., a 
thread in a process) owns an acquired reference to it.
------
* 
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/b21b212dd175baab
** 
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/f6399b3b837b0a40
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/9c572b709248ae64
http://groups.google.com/group/comp.arch/msg/bbbc035cf1a8502c
http://groups.google.com/group/comp.arch/msg/11b14c4bda2d5d82
http://groups.google.com/group/comp.arch/msg/9b00fda2752966f9
http://groups.google.com/group/comp.arch/msg/335aeb22fd6fe526
http://groups.google.com/group/comp.arch/msg/1ace9400b1b16cd4
*** http://groups.google.com/group/comp.arch/msg/995379a16beb3b69
(simply excellent Wheeler post)
**** 
http://groups.google.com/group/comp.programming.threads/msg/6236a9029d80527a 
[...]
> I mean, does the end-user really care if the database he or she is making 
> use of is implemented in a somewhat suboptimal manner? Well, they might 
> start to care... Maybe...
The following comments were supposed to be sarcastically addressing a 
possible scenario in which a nasty race-condition renders an application 
dead in the water, with sharks circling...
> I could picture a company that has a perfect application that runs great 
> and very fast on uni-processor, and un-deterministically dies from some 
> "mystery race-condition" after a couple of days/weeks of running its 
> "perfectly and obviously flawed logic" through a multi-processor system. 
> How to fix a race-condition when your application is massive, and you 
> never even had to think about this bug simply because it never shows up 
> after years of running a track record that shows successful reliability 
> and good performance characteristics'.
I decided to remove the paragraph, and accidentally left this portion in. 
Sorry for any confusion!
:O
> I mean, does the end-user really care if the database he or she is making 
> use of is implemented in a somewhat suboptimal manner? Well, they might 
> start to care... Maybe...
They could be asking questions once they start to purchase high-end 
multi-processor systems and happen to notice that one of their applications 
processes its "critical tasks a little bit slower, or perhaps only barely 
faster than usual". They might start to expect their applications to have 
the inherent ability to scale-up to whatever type of crazy multi-processor 
system the industry can invent and throw at them... I am living in a 
pipedream, no? lol
:)
There are a number of databases aimed purely at the OLAP market.  They
usually run with the entire database in memory for speed.  Some of them
don't bother with making the database persistent by backing it to disk.
At least one of them is running into scalability issues.  They need to
expand their database from handling only up to 100 users to 1000's of
users.
A particularly hot market is the financial analysis sector.  While most
of the access is read only, there is a significant amount of update
activity.  This is a market where speed is of the essence.  I believe
they measure costs in dollars per microsecond response time.
We discussed this a few years ago, in the context of using XADD 
(eXchange & Add):
At the time, my conclusion was that if you can live with (temporarily) 
fixed limits on the size of your queue, then you can do a totally 
lock-free implementation using no more than 4 XADD-updated variables:
front, back, room & avail
front points at the front of the queue, back at the end, room contains 
the number of free slots and avail the number of available tasks.
This is loop-free as well, as long as the queue is neither full 
nor empty.
It might be possible to find a more efficient approach, but this was the 
best I could figure out at the time, while allowing multiple producers 
and consumers.
With a single producer and/or a single consumer, you can of course get 
away with a much simpler setup!
With XADD as the primitive, I would force the queue size to be a power 
of two, allocate it as an array, and use indices into this array instead 
of direct pointers.
The front/back indices would be allowed to grow until the register word 
size, while being interpreted modulo the queue size:
   for (;;) {
     if (avail > 0) {
       int myavail = XADD(&avail,-1); // Any available tasks?
       if (myavail > 0) break;
       XADD(&avail,1);                // Undo: none left after all
     }
     yield();
   }
   int myfront = XADD(&front,1);      // Get current task index
   myfront &= QUEUE_SIZE_MASK;
   taskp tp = queue_array[myfront];
   queue_array[myfront] = NULL;
   XADD(&room,1);                     // Increase free queue space
   return tp;
The real problem is to write this in such a way that it can survive if a 
thread is stuck somewhere between getting & incrementing the front index 
and retrieving & setting to NULL the queue element.
I really don't know how to write code that can survive even if an 
arbitrary thread/core can stop in an arbitrary location!
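The four-counter scheme sketched above can be rendered as compilable C11, with atomic_fetch_add standing in for XADD. This is my own rendering under the stated assumptions (power-of-two bounded queue, counters gating both sides); it returns failure instead of spinning/yielding, and it inherits exactly the weakness Terje raises -- a thread stalled between claiming an index and touching its slot can wedge its peers:

```c
#include <stdatomic.h>
#include <stddef.h>

/* 'room' gates producers, 'avail' gates consumers; 'front'/'back' are
 * free-running claim indices interpreted modulo a power-of-two size.
 * Each counter is only ever moved with a fetch-and-add (XADD), with an
 * undoing fetch-and-add when the claim turns out to be invalid. */
#define QUEUE_SIZE 8                 /* must be a power of two */
#define QUEUE_MASK (QUEUE_SIZE - 1)

typedef void *taskp;

static taskp       queue_array[QUEUE_SIZE];
static atomic_int  room  = QUEUE_SIZE;   /* free slots            */
static atomic_int  avail = 0;            /* queued tasks          */
static atomic_uint front = 0, back = 0;  /* consumer/producer idx */

/* Returns 0 on success, -1 if the queue was full. */
int queue_put(taskp tp)
{
    if (atomic_fetch_add(&room, -1) <= 0) {   /* claim a free slot */
        atomic_fetch_add(&room, 1);           /* undo: queue full  */
        return -1;
    }
    unsigned myback = atomic_fetch_add(&back, 1) & QUEUE_MASK;
    queue_array[myback] = tp;
    atomic_fetch_add(&avail, 1);              /* publish the task  */
    return 0;
}

/* Returns the task, or NULL if the queue was empty. */
taskp queue_get(void)
{
    if (atomic_fetch_add(&avail, -1) <= 0) {  /* claim a task      */
        atomic_fetch_add(&avail, 1);          /* undo: queue empty */
        return NULL;
    }
    unsigned myfront = atomic_fetch_add(&front, 1) & QUEUE_MASK;
    taskp tp = queue_array[myfront];
    queue_array[myfront] = NULL;
    atomic_fetch_add(&room, 1);               /* release the slot  */
    return tp;
}
```

Note the asymmetry: 'avail' is only incremented *after* the slot is filled, and 'room' only after the slot is emptied, which is what keeps a claimed index from being handed out twice while a peer is still touching it.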
...
> There's a number of databases aimed purely at the OLAP market.  They
> usually run with the entire database in memory for speed.  Some of them
> don't bother with making the database persistent by backing it to disk.
RAM-resident databases are indeed a different kettle of fish.  OTOH, 
especially with the more static ones they also experience a lot less 
low-level activity (e.g., allocation/deallocation and queuing for 
resources) than conventional databases, so the opportunities for 
benefits from low-level lock-free operation are fewer.
> At least one of them is running into scalability issues.  They need to
> expand their database from handling only up to 100 users to 1000's of
> users.
That is entirely believable, but it remains to be demonstrated that lock 
overheads are a significant contributor to their scalability problems 
(I'd certainly be interested in any evidence you may have to that 
effect, but in its absence will observe that a scattershot approach of 
"we'll just reduce any overheads we can find, regardless of their 
importance" is usually not the best when dealing with immediate 
real-world problems, especially when it may entail significant effort in 
restructuring code).
- bill
I suspect none whatsoever:  my understanding is that Oracle maintains 
the undo log for other reasons (such as backout of data dirtied on disk 
within a transaction which aborts, or complete database rollback to an 
earlier state) and merely capitalizes upon its existence to support long 
read-only transactions.
> I know that the "flagrant" use of atomic operations and/or memory
> barriers can have a marked negative effect on the throughput, scalability 
> and general performance of a synchronization algorithms shared reader 
> operations.
Only, of course, when the overhead associated with the atomic operations 
and/or memory barriers starts to become a significant fraction of the 
total overhead associated with processing the data structures so 
acquired - a situation for which in this particular instance I'd have to 
see actual evidence before giving it much credence:  otherwise, 
throughput, scalability, and general performance will not be 
significantly affected.
...
I guess Joe may have been right after all, and that you completely 
missed the point both in your earlier response and now here.
The locking or 'optimistic' mechanisms used to isolate transactions from 
each other, which I submit are probably the primary factor in 
synchronization-associated throughput and scalability limits in even 
relatively fine-grained OLTP, execute a couple of levels higher up than 
the primitives that you're talking about here and make the 
implementation choice of those primitives fairly inconsequential (as 
long as whatever choice is made is competently implemented).
- bill
It's based on the fact that even if you are only doing a read access of
the data, you still need some form of synchronization, be it conventional
locking, PDR (PCOW Deferred Reclamation), or some other synchronization
technique like chained locking.  If you have something like OLAP, where
memory access is the predominant performance factor, vs. an OLTP database
where network latency affects performance more than any internal synchronization,
lock-free outperforms lock-based by orders of magnitude.
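One concrete illustration of the point that even pure readers need *some* synchronization: a seqlock (my choice of example here, not the PDR scheme named above) replaces the reader's lock with a version check. This sketch is simplified -- a real multiprocessor version needs memory barriers around the plain data accesses, which are omitted here:

```c
#include <stdatomic.h>

/* Seqlock: the writer bumps 'seq' to odd before updating and to even
 * after; readers take no lock but retry if they observe that a write
 * overlapped their read.  (Real implementations add memory barriers
 * between the seq operations and the data accesses.) */
typedef struct {
    atomic_uint seq;
    int         a, b;     /* the protected data (simplified) */
} seqlock_t;

void seqlock_write(seqlock_t *s, int a, int b)
{
    atomic_fetch_add(&s->seq, 1);     /* now odd: write in progress */
    s->a = a;
    s->b = b;
    atomic_fetch_add(&s->seq, 1);     /* now even: write complete   */
}

void seqlock_read(seqlock_t *s, int *a, int *b)
{
    unsigned before, after;
    do {
        before = atomic_load(&s->seq);
        *a = s->a;
        *b = s->b;
        after = atomic_load(&s->seq);
    } while (before != after || (before & 1));  /* retry torn reads */
}
```

Readers pay two atomic loads and a compare per access instead of a lock acquisition -- cheap when writes are rare, which is exactly the read-mostly regime being argued about here.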
Restructuring could be a problem especially when the data structures weren't
designed with that form of synchronization in mind.  Java had to redo the
collection libraries when they did the java.util.concurrent stuff.  The old
collections were unrefactorable into lock-free collections.
At a couple of points, I had looked at making TDB (Trivial DataBase), which is
part of Samba, into a lock-free database.  It would have been a lot of work.
It turned out that Samba couldn't even say whether they really had a performance
problem, even though they had gone along with the initial proposal which wasn't
from me BTW.  It fell through, though IBM did get a lock-free patent application
out of it.
20060130061 "Use of rollback RCU with read-side modifications to RCU-protected data structures"
No, you do not:  you only need it when at least potentially sharing 
access with updating activity - save for low-level operations such as 
allocating/deallocating operation structures and managing activity 
queues (not queues resulting from data-access interlocks, however: 
there aren't any).
That's one reason why (most?) OLAP approaches explicitly segregate 
update activity into batches performed while the normal analytical 
processing has been temporarily suspended:  intense OLAP typically 
doesn't require up-to-the-minute data (day-old usually being more than 
sufficiently current), and typically does benefit significantly from the 
optimizations obtainable by not sharing analytical activity with updates.
> If you have something like OLAP, where
> memory access is the predominant performance factor, vs. an OLTP database
> where network latency affects performance more than any internal
> synchronization,
> lock-free outperforms lock-based by orders of magnitude.
You appear quite confused:  'network latency' plays very little part in 
OLTP performance (well, it can increase response latency to remote 
clients, but only by a constant that has nothing to do with aggregate 
throughput).
And (at least in synthetic OLTP benchmarks like TPC-C) even disk latency 
plays little part in aggregate throughput:  it's all about processor 
performance, processor caches, and memory latency (and occasionally 
memory throughput, though apparently in most cases that's not the 
limiting factor).
> 
> Restructuring could be a problem especially when the data structures 
> weren't
> designed with that form of synchronization in mind.  Java had to redo the
> collection libraries when they did the java.util.concurrent stuff.  The old
> collections were unrefactorable into lock-free collections.
> 
> At a couple of points, I had looked at making TDB (Trivial DataBase), 
> which is
> part of Samba, into a lock-free database.  It would have been a lot of 
> work.
> It turned out that Samba couldn't even say whether they really had a 
> performance
> problem, even though they had gone along with the initial proposal which 
> wasn't
> from me BTW.  It fell through, though IBM did get a lock-free patent 
> application
> out of it.
That's the point that a convivial venture capitalist with time on his 
hands would stress to any bright-eyed young enthusiasts who came to 
him/her with a proposal centered on having a 'better' solution to some 
problem (even to some problem in which s/he was intensely interested):
*How much better is it, and how much will it cost?*  (i.e., does it make 
a real difference, sufficient to justify the effort = money required?)
Crusty old engineers have a similar attitude, which may help explain why 
more people here haven't joined your lock-free crusade:  there are 
plenty of areas in which any given product could be 'improved', but most 
of them just don't matter.
If (and while this may be suggested for at least some specific 
applications it's really not yet in actual evidence) lock-free 
approaches are really a significantly better way to do certain things, 
then over time they will likely migrate into more common use for those 
applications.  But no one (save for religious zealots) is likely to 
bother to retrofit any existing code (even if it's being otherwise 
revised) absent some actual evidence that doing so will be worthwhile in 
a much higher-level sense.
And in all the cases you and Chris have brought up, the only points even 
remotely resembling evidence have been associated with very low-level 
primitives rather than with before-and-after performance measurements 
(or even just credible simulations) of the real-world applications you 
keep suggesting would materially benefit from their use.
- bill
> I hear rumours that people also ship 8-core processors where some of 
> the cores have failed as 6- or 4-core units.  But I never know whether 
> this actually happens. 
Well, when processors are available with variable amounts of cache, ones 
with less than the max often have failed areas of cache that have been 
disconnected. So disconnected cores would seem possible; what processors 
have you heard this of? 
I have HEARD this of all of the recent multi-core systems, but the only
one for which the vendor said that this was done was the IBM POWER4,
and the disabling was done to increase bandwidth per cpu.  The other
rumours were all so indirect as to be little more reliable than
"well-sourced intelligence".
Regards,
Nick Maclaren.
Not for financial trading applications.  Market updates are frequent.
You don't have the luxury of batching things.
...
You're ignoring the fact that RCU is being used in the Linux kernel.
They had to convince Linus that there would be an actual and substantial
performance improvement before he let it be put in.
> 
> And in all the cases you and Chris have brought up, the only points even 
> remotely resembling evidence have been associated with very low-level 
> primitives rather than with before-and-after performance measurements 
> (or even just credible simulations) of the real-world applications you 
> keep suggesting would materially benefit from their use.
> 
It's basically a scalability issue.  I don't have the resources to write
an enterprise scale application.  Also lock-free algorithms are being
patented as fast as they can be invented.  Dealing with patents is also
a resource issue.
And I guess you could claim the ANTs database isn't a real world application.
Also, there's the Noble library here http://www.pss-ab.com/ and Intel
has a library here
http://www.intel.com/cd/software/products/asmo-na/eng/threading/threadbuildblocks/294797.htm
Both of those use the usual lock-free write algorithms, with reads made
lock-free via hazard pointers, I believe.
Then you're mixing OLAP with OLTP updates in the manner I described. 
Even then, approaches which analyze the incoming update stream in 
conjunction with a read-only snapshot of the pre-existing data rather 
than attempt to coordinate running OLAP on the OLTP database itself may 
make a lot more sense.
What part of the fact that you have yet to provide a scintilla of 
evidence to back up the claims you've made for the benefit to the 
various applications you've cited still manages to escape you?  Yes, RCU 
is used in the Linux kernel, and had you pointed to that rather than 
waved your hands vaguely (but vigorously) in other directions you would 
have had a point (for the Linux kernel, though not something 
automatically more generally applicable).
> 
>>
>> And in all the cases you and Chris have brought up, the only points 
>> even remotely resembling evidence have been associated with very 
>> low-level primitives rather than with before-and-after performance 
>> measurements (or even just credible simulations) of the real-world 
>> applications you keep suggesting would materially benefit from their use.
>>
> 
> It's basically a scalability issue.  I don't have the resources to write
> an enterprise scale application.  Also lock-free algorithms are being
> patented as fast as they can be invented.  Dealing with patents is also
> a resource issue.
> 
> And I guess you could claim the ANTs database isn't a real world 
> application.
No, I'm simply observing (do you really have *that* much of a problem 
wrapping your mind around this?) that it doesn't mean shit without some 
kind of controlled comparative (lock-based vs. lock-free being the 
*only* differences) *application-level performance data* (or, as I said, 
at a minimum credible quantitative simulations).
- bill
So you have never heard of Sun's Sun Fire T2000 Server
available with 4, 6, or 8 cores at 1.0GHz (or 8 cores at 1.2GHz)?
Paul A. Clayton
just a technophile
>
>In article <YJ-dnXeh-rEBK7rY...@metrocastcablevision.com>,
>Bill Todd <bill...@metrocast.net> writes:
>|> 
>|> Hey, it's true for most after-the-fact changes in *anything*, which is 
>|> why doing the job right the first time is usually well worth the effort.
>
>Yes and no.  Changing a procedural design to an object-oriented one
>isn't hard, and even changing an imperative one to a functional one
>is quite feasible.  Of course, the objective there is not performance
>but clarity, extensibility and reliability, but even so all experience
>is that merely converting old codes (without redesigning) rarely helps.
>
>But converting a serial code to a fine-grain parallel one turns out
>to be evil, even if all you are looking for is correctness and damn
>any performance gain.
>
>|> > Denying the difficulty of that issue is the great myth of SMP programming.
>|> 
>|> I must have missed that myth:  I've never considered it easy myself, 
>|> even after three decades' worth of experience doing it.
>
>Oh, it's quite widespread - and is usually propagated by people who
>have done it only in theory or on carefully selected demonstration
>examples.
>
>The first form in which I came across it was that ALL you had to do
>was to write in a functional language, and it would parallelise
>automatically.  There is a theoretical sense in which that is true,
>but it was rapidly discovered to be of no practical relevance.
>
>The most recent form is that you can take an arbitrary serial code,
>add some OpenMP (or UPC or ...) directives and tweak it a bit, and
>it will run in parallel.  That one is still being propagated, too.
Are there any other useful papers available along the lines of those
giving the details of OS conversion to SMP for TOPS-10 (James M.
Flemming/Anthony (Tony) Wachs)
http://www.inwap.com/pdp10/paper-smp.txt
or CP/VM (Charles A. Salisbury/Lynn Wheeler)
http://www.garlic.com/~lynn/93.html#0
http://www.garlic.com/~lynn/94.html#27
http://www.garlic.com/~lynn/99.html#88
http://www.garlic.com/~lynn/2001k.html#66 etc...  
-- 
Thanks. Take care, Brian Inglis 	Calgary, Alberta, Canada
Brian....@CSi.com 	(Brian[dot]Inglis{at}SystematicSW[dot]ab[dot]ca)
    fake address		use address above to reply
In the distant past, there was a DEC technical paper on making VMS an
SMP system.  It was very detailed on the rationale, the different
possible approaches, why the chosen one was selected, and the debugging aids.
	Jan
However, from the point of view of using multiple cores for performance,
operating systems are much easier tasks than most applications, for two
reasons:
    They have a certain amount of natural parallelism, with existing,
clean interfaces between independent and interacting components.
    Quite a lot of the problems have already been resolved, because
they have to be for interrupt handling, security and preemption.
Regards,
Nick Maclaren.
> So you have never heard of Sun's Sun Fire T2000 Server
> available with 4, 6, or 8 cores at 1.0GHz (or 8 cores at 1.2GHz)?
I, at least, didn't know that it was available with less than 8 cores.