A good example is the networking stack. Vista can process two
networking packets on two instances of the TCP/IP stack
implementation, as two threads serving two applications that want to
put information onto, or get information from, the network.
What is to become of Linux? Linux has a monolithic kernel: all of the
kernel's services are one big binary in memory. So when Application-1
has made a jump into the kernel, does everybody wait? It seems like a
design decision that cannot scale.
Are there sufficient threads running on a Linux box, in application
space alone, to harness the many cores to come?
You are confused:  having or not having a monolithic kernel (by the way, 
Vista's is hardly much different than Linux's in this regard), and 
employing or not employing multiple threads of execution in the kernel, 
are completely orthogonal.
- bill
Vista isn't the first to do this. Various BSDs have been doing it for
a while, but I'm not sure whether the results indicated a net
benefit. And you don't need two instances of the IP stack to do this;
you just need enough threads to demultiplex the packets to their
intended receivers. On the send side, you realistically only need as
many threads as you have hardware interfaces.
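For the curious, receive-side demultiplexing usually just means hashing each packet's addressing 4-tuple to pick a queue (and hence a thread). A rough sketch; the names `flow` and `demux_queue` are invented for illustration, and real stacks use fancier hashes (e.g. Toeplitz):

```c
#include <stdint.h>

#define NUM_RX_QUEUES 4

/* Hypothetical 4-tuple identifying a TCP/IP flow. */
struct flow {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
};

/* Simple multiplicative hash over the 4-tuple. */
static uint32_t flow_hash(const struct flow *f)
{
    uint32_t h = f->src_ip * 2654435761u;
    h ^= f->dst_ip * 2246822519u;
    h ^= ((uint32_t)f->src_port << 16) | f->dst_port;
    return h;
}

/* Pick the receive queue for this flow.  Packets of the same
 * connection always land on the same queue, so per-connection state
 * needs no cross-thread locking on the receive path. */
int demux_queue(const struct flow *f)
{
    return (int)(flow_hash(f) % NUM_RX_QUEUES);
}
```

The point is that the parallelism comes from having many flows, not from duplicating the stack.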
> What is to become of Linux? Linux has a monolithic kernel: all of the
> kernel's services are one big binary in memory. So when Application-1
> has made a jump into the kernel, does everybody wait? It seems like a
> design decision that cannot scale.
Linux has been internally multithreaded for a while, as have the BSDs.
A lot of the evolution has been getting rid of the big kernel lock as
more parts of the kernel get threaded. Linux, from my understanding
from talking with people, has problems with thread and process
scheduling, so if anything, that's where the penalty can be found.
When Vista loads, all the hundreds of parts of the kernel are 
dynamically linked into a single monolithic blob, just like Linux.  UNIX 
systems have separated "services" too, called daemons.
I'm not sure how the Vista (or other Windows) kernels work, but in Linux 
the user thread that calls the kernel is what actually executes the 
kernel code, not a separate "kernel thread".  That means that, in 
theory, if there are 100 user threads, all of them could be executing 
concurrently in kernel space.  You'll find some special daemons even 
show 0s user time because they spend their entire lives living in the 
kernel, merely providing a thread for kernel code to operate from; this 
is as close to a "kernel thread" as Linux gets.
Now, all those threads running in the kernel do have to share data for 
coordination, which requires locks.  This used to be a huge problem for 
Linux because there was originally a single lock (called the BKL, or Big 
Kernel Lock) that covered all shared data.  Considerable work has gone 
into Linux in the last few years to reduce the number of things that 
require grabbing the BKL (mainly by creating hundreds of little locks, 
IIRC).  That work was a fundamental requirement to get SMP and SMT 
machines performing well, and it'll carry over to multicore processors 
just fine.
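To make the contrast concrete, here's a minimal sketch of the "hundreds of little locks" idea: each object carries its own lock, so threads touching unrelated objects never serialize. (Illustrative userspace pthreads code with invented names, not actual kernel code.)

```c
#include <pthread.h>

/* Instead of one BKL guarding everything, give each object its own
 * lock: operations on different counters proceed fully in parallel. */
struct counter {
    pthread_mutex_t lock;   /* fine-grained: protects only this counter */
    long value;
};

void counter_init(struct counter *c)
{
    pthread_mutex_init(&c->lock, NULL);
    c->value = 0;
}

void counter_add(struct counter *c, long n)
{
    pthread_mutex_lock(&c->lock);   /* contends only with users of *this* object */
    c->value += n;
    pthread_mutex_unlock(&c->lock);
}

long counter_read(struct counter *c)
{
    pthread_mutex_lock(&c->lock);
    long v = c->value;
    pthread_mutex_unlock(&c->lock);
    return v;
}
```

With a single global lock, every `counter_add` anywhere in the system would serialize; with per-object locks, only accesses to the *same* counter do.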
Vista has no advantage here.  If you have two threads wanting to access 
the TCP/IP stack, there has to be some sort of locking so that they 
don't stomp on each other.  The mechanism may be different than Linux, 
or it may even be the same.  The same work still needs to be done, and 
both sides have clever folks working on finding better and better 
solutions.
S
-- 
Stephen Sprunk         "God does not play dice."  --Albert Einstein
CCIE #3723         "God is an inveterate gambler, and He throws the
K5SSS        dice at every possible opportunity." --Stephen Hawking 
> Now, all those threads running in the kernel do have to share data for 
> coordination, which requires locks.  This used to be a huge problem for 
> Linux because there was originally a single lock (called the BKL, or Big 
> Kernel Lock) that covered all shared data.  Considerable work has gone 
> into Linux in the last few years to reduce the number of things that 
> require grabbing the BKL (mainly by creating hundreds of little locks, 
> IIRC).  That work was a fundamental requirement to get SMP and SMT 
> machines performing well, and it'll carry over to multicore processors 
> just fine.
Linux is a monolithic kernel, meaning it's one giant data structure.
In order to safely access that structure, you need some form of synchronization.
One form is locking, but locking doesn't scale well.  Lock-free scales
better, and it's one thing Linux has been using more and more of lately,
namely in the form of RCU.
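A very rough userspace sketch of the RCU idea, using C11 atomics (names invented): readers take no lock at all, they just load a pointer; an updater publishes a fresh copy with a single atomic store. Real RCU also tracks a grace period so the old copy can be freed once all pre-existing readers are done, which is omitted here.

```c
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

struct config { int mtu; int ttl; };

static _Atomic(struct config *) current;

void publish(struct config *c)          /* install the initial version */
{
    atomic_store(&current, c);
}

int read_mtu(void)                      /* lock-free read path */
{
    struct config *c = atomic_load(&current);
    return c->mtu;
}

struct config *update_mtu(int mtu)      /* copy, modify, publish */
{
    struct config *old = atomic_load(&current);
    struct config *fresh = malloc(sizeof *fresh);
    memcpy(fresh, old, sizeof *fresh);
    fresh->mtu = mtu;
    atomic_store(&current, fresh);
    return old;     /* caller must free this only after a grace period */
}
```

The win is that the (vastly more common) read path touches no shared lock word at all, so readers scale linearly with core count.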
> 
> Vista has no advantage here.  If you have two threads wanting to access 
> the TCP/IP stack, there has to be some sort of locking so that they 
> don't stomp on each other.  The mechanism may be different than Linux, 
> or it may even be the same.  The same work still needs to be done, and 
> both sides have clever folks working on finding better and better 
> solutions.
> 
The stack is basically just code.  You only need synchronization if the
threads are doing something with something in common, e.g. the same
TCP/IP connection.
-- 
Joe Seigh
When you get lemons, you make lemonade.
When you get hardware, you make software. 
...
> Linux is a monolithic kernel, meaning it's one giant data structure.
Er, no.
A monolithic kernel is a kernel which all executes in the same address 
space (hence sacrificing some fault isolation for - one hopes - improved 
performance by interacting components) and which usually is linked 
together into one great (though of course often at least in part 
pageable) mass, though some parts may be loaded and unloaded dynamically.
That monolithic code base, however, controls myriads of often 
dynamically-allocated data structures, which can be locked or otherwise 
synchronized independently at as fine a grain as makes sense.
- bill
As a former kernel developer on one of those monolithic kernels, all I can
say is that going to finer grained locking is easier said than done.
Damn kernel programming - the same is true for almost every application!
Denying the difficulty of that issue is the great myth of SMP programming.
Regards,
Nick Maclaren.
http://www.theinquirer.net/default.aspx?article=34916
One of the problems, I suspect, is that they're taking programmers who've
never had any substantial experience doing multi-threaded programming
and expecting them to become instantly savvy in multi-threading.
Hey, it's true for most after-the-fact changes in *anything*, which is 
why doing the job right the first time is usually well worth the effort.
> 
> Denying the difficulty of that issue is the great myth of SMP programming.
I must have missed that myth:  I've never considered it easy myself, 
even after three decades' worth of experience doing it.
On the other hand, people looking for easy tasks really shouldn't be 
messing around with parallelism (or, for that matter, most other forms 
of performance optimization) in the first place (save possibly via 
intermediaries that can hide most of the necessary detail):  what else 
is new?
Locking works just fine in the absence of high local levels of 
contention - as long as one manages the potential for deadlocks (either 
by avoiding them - e.g., by imposing lock-acquisition ordering 
constraints - or by resolving them through detection and selective 
rejection).  The main advantages of 'optimistic' alternatives in 
low-contention situations are potentially simpler implementation and 
slightly lower resource usage.
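For example, the lock-acquisition ordering trick mentioned above can be as simple as always taking a pair of locks in address order, so no two threads can ever hold them in opposite orders. A hypothetical illustration (not from any particular system):

```c
#include <pthread.h>

/* Take two locks in a fixed (address) order, so a thread needing
 * (a, b) and a thread needing (b, a) can never deadlock. */
void lock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
    if (a == b) { pthread_mutex_lock(a); return; }
    if (a < b)  { pthread_mutex_lock(a); pthread_mutex_lock(b); }
    else        { pthread_mutex_lock(b); pthread_mutex_lock(a); }
}

void unlock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
    pthread_mutex_unlock(a);
    if (a != b) pthread_mutex_unlock(b);
}

/* Classic two-object operation that would deadlock under naive
 * "lock first argument, then second" discipline. */
struct account { pthread_mutex_t lock; long balance; };

void transfer(struct account *from, struct account *to, long amt)
{
    lock_pair(&from->lock, &to->lock);
    from->balance -= amt;
    to->balance   += amt;
    unlock_pair(&from->lock, &to->lock);
}
```

Two threads running `transfer(&x, &y, ...)` and `transfer(&y, &x, ...)` concurrently always acquire the two locks in the same global order, so neither can block the other forever.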
Locking often works *better* than more 'optimistic' lock-free 
alternatives in the *presence* of high levels of contention (in 
particular, it can guarantee forward progress without compromising data 
currency).
That said, there are certainly circumstances in which lock-free 
approaches can be superior, such as those where access to slightly stale 
data is acceptable and those where execution efficiency is more 
important than fairness (e.g., protection from live-lock).
- bill
I would like to think that you're correct, I'm planning to look very 
closely at high core count programming relatively soon.
OTOH, I have also spent quite a few hours with John Carmack, who by all 
accounts is one of the all-time great programmers, and he didn't see any 
way to really take advantage of multiple cores for gaming.
The obvious exception is in the servers for multiplayer games, but for 
end users this is a hard problem.
Terje
-- 
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
I seem to remember that the Amiga worked fast and smooth for its MHz;
this was due to locking being done on the message passing, i.e. the
linked-list building routine, which was fast. Every other process was
sequentially guaranteed, due to processing of messages (list items) in
sequence. Multiple threads could access the kernel by sending it a
message. Maybe I am wrong on this, but by making the list message
passer have n queues for an n-way SMP configuration, and having one
dispatch thread sending messages to all other tasks, only get-new-message
has to be locked.
cheers.
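If I follow the idea, it's something like this single-dispatcher inbox: only the enqueue/dequeue step is locked, and everything else is serialized simply by processing messages in order. (A made-up sketch, nothing Amiga-specific.)

```c
#include <pthread.h>

#define QSIZE 64

/* The only shared, locked structure: a FIFO of requests.  One
 * dispatcher thread drains it, so the handlers themselves need no
 * further locking -- message order serializes everything. */
struct msgq {
    pthread_mutex_t lock;       /* guards only enqueue/dequeue */
    int head, tail;
    void *slot[QSIZE];
};

int msgq_send(struct msgq *q, void *msg)
{
    pthread_mutex_lock(&q->lock);
    int next = (q->tail + 1) % QSIZE;
    if (next == q->head) {                      /* queue full */
        pthread_mutex_unlock(&q->lock);
        return -1;
    }
    q->slot[q->tail] = msg;
    q->tail = next;
    pthread_mutex_unlock(&q->lock);
    return 0;
}

void *msgq_recv(struct msgq *q)                 /* the one dispatcher */
{
    pthread_mutex_lock(&q->lock);
    void *msg = NULL;
    if (q->head != q->tail) {
        msg = q->slot[q->head];
        q->head = (q->head + 1) % QSIZE;
    }
    pthread_mutex_unlock(&q->lock);
    return msg;
}
```

The critical section is a few pointer updates, so the lock is held for almost no time, which matches the "locking only on the list routine, which was fast" observation.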
I agree that parallelizing arbitrary programs can be difficult.  But
with the advent of high core count machines, I'm prepared to take a
radically different approach to multi-threading.  Even to the point
of multi-threading things I considered only single threading because
of various existing overheads.  That advent may be a way off since it
will be a while before I get my hands on a machine with enough cores
to make a difference.  
If an algorithm is lock-free according to Herlihy's definition, then
some thread can make forward progress no matter what.  If there's
an issue about fairness, the ability of all threads to make forward
progress, with the interlocked instructions used in lock-free under
extreme contention, then lock based solutions will have a problem
also, since most lock implementations that I'm aware of use interlocked
instructions also.
I have a lock-free vs. lock testcase where I turn up the contention
scale to 11.  Lock-free scales better and has better forward progress
guarantees.  The lock-based solutions at best would run extremely slow,
or stall some or all of the threads.
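For reference, the canonical CAS-loop shape under discussion, lock-free in Herlihy's sense: a failed CAS means some *other* thread's CAS succeeded, so the system as a whole always makes progress, even though one individual thread might retry indefinitely under extreme contention. (A minimal illustrative counter, not Joe's actual testcase.)

```c
#include <stdatomic.h>

static atomic_long total;

/* Lock-free add: read, compute, and attempt to swap in the result.
 * On CAS failure 'old' is reloaded with the current value and we
 * simply recompute and retry. */
long lf_add(long n)
{
    long old = atomic_load(&total);
    while (!atomic_compare_exchange_weak(&total, &old, old + n))
        ;   /* another thread won; retry with the refreshed 'old' */
    return old + n;
}
```

Note what is *not* guaranteed: nothing says any particular thread's CAS ever succeeds, which is exactly the fairness question raised above.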
A lock-free system then, but it needs locks to do the double-compare
thing on single-word machines? Or an atomic write-jump of the proc with
write access to the result??
An md5 hash of a whitespace-removed { } type block and auto-generation
of stack I/O for the block makes for extremely fine threading? As
variable overlap could be used in scheduling of dependencies??
It needs the source though, or else all loop segments based on opcodes.
But how is nesting discovered? By execution-flow analysis?? And
removable graphs with one in and one out path??
cheers.
threaded character AIs??
> > The obvious exception is in the servers for multiplayer games, but for
> > end users this is a hard problem.
> >
>
> I agree that parallelizing arbitrary programs can be difficult.  But
> with the advent of high core count machines, I'm prepared to take a
> radically different approach to multi-threading.  Even to the point
> of multi-threading things I considered only single threading because
> of various existing overheads.  That advent may be a way off since it
> will be a while before I get my hands on a machine with enough cores
> to make a difference.
See http:indi.joox.net for my 222-logic-element MAX II CPLD processor
(16 bit), for the type of simple core that could become available.
It would currently support only co-operative multithreading, as it needs
mods to allow access to the P reg (program counter) from interrupt-mode
space. It's tiny.
cheers.
Yes and no.  Changing a procedural design to an object-oriented one
isn't hard, and even changing an imperative one to a functional one
is quite feasible.  Of course, the objective there is not performance
but clarity, extensibility and reliability, but even so all experience
is that merely converting old codes (without redesigning) rarely helps.
But converting a serial code to a fine-grain parallel one turns out
to be evil, even if all you are looking for is correctness and damn
any performance gain.
|> > Denying the difficulty of that issue is the great myth of SMP programming.
|> 
|> I must have missed that myth:  I've never considered it easy myself, 
|> even after three decades' worth of experience doing it.
Oh, it's quite widespread - and is usually propagated by people who
have done it only in theory or on carefully selected demonstration
examples.
The first form in which I came across it was that ALL you had to do
was to write in a functional language, and it would parallelise
automatically.  There is a theoretical sense in which that is true,
but it was rapidly discovered to be of no practical relevance.
The most recent form is that you can take an arbitrary serial code,
add some OpenMP (or UPC or ...) directives and tweak it a bit, and
it will run in parallel.  That one is still being propagated, too.
Regards,
Nick Maclaren.
As I said I will also try to look at this 'out of the box', but I don't 
expect it to be easy.
> of multi-threading things I considered only single threading because
> of various existing overheads.  That advent may be a way off since it
> will be a while before I get my hands on a machine with enough cores
> to make a difference. 
Even my laptops will have multiple cores from now on, but I don't know 
exactly when I'll get a desktop machine with 8, 16 or even more.
Quite.
Given the usual minimum overhead of a factor of 2, parallelising
most applications with below 4 cores is a waste of time (and Intel
Hyperthreading doesn't count), and not really worth the effort until
you have 8.  And some of the more interesting approaches don't really
become worthwhile until you have a lot more than that.
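Nick's factor-of-2 figure is easy to plug into Amdahl's law: if the parallel fraction costs k times the serial work, speedup on n cores is 1/(s + k(1-s)/n). With serial fraction s = 0 and overhead k = 2 you merely break even at 2 cores and only reach 2x at 4, which matches his claim. A trivial sketch:

```c
/* Amdahl's law with a constant-factor parallelization overhead k:
 * the serial fraction s runs as before, while the parallel fraction
 * does k times the work spread over n cores. */
double speedup(double s, double k, int n)
{
    return 1.0 / (s + k * (1.0 - s) / n);
}
```

Any nonzero serial fraction only makes the picture worse, pushing the break-even core count still higher.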
Regards,
Nick Maclaren.
Their point is well taken, but this is a poor example.
The article states that because cores #2 and #3 would wait on
core #1, there is no advantage to more than 2 cores. This is not
necessarily true, as core #1 can still contribute to lowering the
overall cycle time. Whether it is worthwhile is the question.
I think the biggest consideration is the market size (of multi-core
platforms) vs cost to parallelize, including the cost of finding
all the nasty race conditions, vs what your competitor is doing.
IOW the question cannot be considered in isolation.
> One of the problems I suspect is that they're taking programmers who've
> never had any substantial experience doing multi-threaded programming
> and expecting them to become instantly saavy in multi-threading.
There are really two classes of multi-threaded programs.
The first is the true concurrent execution of parallel algorithms.
I don't mean only things like weather-prediction programs, but also
apps where the problem can be decomposed into multiple independent
items in a work queue being chewed on by multiple processors.
From what I see in the public articles, in the vast majority of
multi-threaded programs the threads are used not for concurrency
but simply to separate program state. The two most popular
examples are having a thread per client to separate variables,
or threads whose sole purpose in life is to wait for something else
(an I/O or some other server) because the app designer was unable or
unwilling to perform the waiting operation asynchronously.
The separate-state class of apps appears by far to be the most
prevalent, and is unlikely to be helped at all by multiple cores.
In fact it has been my experience that such apps usually run slower
in true multi-processor environments, which also expose all the
latent race conditions that remained hidden on uniprocessors.
Eric
"High count", of course, is relative.
> OTOH, I have also spent quite a few hours with John Carmack, who by
> all accounts is one of the all-time great programmers, and he didn't
> see any way to really take advantage of multiple cores for gaming.
With all due respect to John (he _is_ a great game coder), I just read
about a new game engine from one of his competitors that uses five
threads: gameplay, graphics, physics, AI, and sound IIRC.  It
keeps a quad-core system busy, is playable at reduced quality on
dual-core systems, and is dead in the water on a single-core SMT
system -- and the quality blows away current games.
There are people who can come up with innovative ways to use new
hardware, and they're rarely the folks who built their fortunes on the
last innovation.  That's what makes being in tech so fun.
It's not one giant data structure; it's one binary image in memory. 
There are tens to hundreds of thousands of data structures for various 
specialized purposes, linked to each other in convoluted ways.  You only 
need to synchronize access to the same individual structure (or 
structures that point to it in some cases), not the entire kernel.  The 
BKL approach was downright flawed, Linus will readily admit, but it 
worked fine for the first decade or so until SMP finally caught on (or, 
alternately, until Linux became a contender for places SMP was already 
in use).
> One form is locking, but locking doesn't scale well.  Lock-free scales
> better, and it's one thing Linux has been using more and more of lately,
> namely in the form of RCU.
There are still spinlocks and such all over the place.  I know RCU is 
getting more common, but it's an optimization, not a new approach.  The 
truly new approach was breaking things out from under the BKL into their 
own synchronization mechanism; whether each structure uses a spinlock, 
RCU, or something else will vary and is architecturally uninteresting 
(though possibly important for performance reasons).
>> Vista has no advantage here.  If you have two threads wanting to 
>> access the TCP/IP stack, there has to be some sort of locking so that 
>> they don't stomp on each other.  The mechanism may be different than 
>> Linux, or it may even be the same.  The same work still needs to be 
>> done, and both sides have clever folks working on finding better and 
>> better solutions.
>
> The stack is basically just code.  You only need synchronization if 
> the
> threads are doing something with something in common, e.g. the same
> TCP/IP connection.
Different threads have to access common data structures to add and 
remove connections, get a single queue (from the driver) of incoming 
packets demultiplexed, etc.  It's only _user_ threads that would be 
sharing a single socket, and that's outside the scope of the problem 
under discussion.
...
>> Locking works just fine in the absence of high local levels of 
>> contention - as long as one manages the potential for deadlocks 
>> (either by avoiding them - e.g., by imposing lock-acquisition ordering 
>> constraints - or by resolving them through detection and selective 
>> rejection).  The main advantages of 'optimistic' alternatives in 
>> low-contention situations are potentially simpler implementation and 
>> slightly lower resource usage.
>>
>> Locking often works *better* than more 'optimistic' lock-free 
>> alternatives in the *presence* of high levels of contention (in 
>> particular, it can guarantee forward progress without compromising 
>> data currency).
> 
> If an algorithm is lock-free according to Herlihy's definition, then
> some thread can make forward progress no matter what.
That is insufficient in many circumstances, since it guarantees nothing 
about avoiding perpetual livelock for some subset of the threads (unless 
the thread exit rate exceeds the thread entry rate, in which case 
eventually only one thread will be left).
   If there's
> an issue about fairness,
There's usually *some* issue about fairness, but as long as unfairness 
doesn't get too out-of-hand it can often be tolerated.
  the ability of all threads to make forward
> progress, with the interlocked instructions used in lock-free under
> extreme contention, then lock based solutions will have a problem
> also, since most lock implementations that I'm aware of use interlocked
> instructions also.
It is possible that we're talking about two different things, then:  I'm 
talking about the difference between approaches that reserve (typically 
shared or exclusive) access to some object for a period of time, as 
distinguished from approaches which find ways to avoid reserving access 
to individual objects (e.g., by creating private copies or by detecting 
inadmissible sharing after the fact and acting at that time to eliminate 
it retroactively).
While the underlying low-level hardware mechanisms may be common to both 
traditional locking and to 'lock-free' approaches, forward-progress 
guarantees (save for any need to address inequities introduced by NUMA 
latency) are usually more critical to ensure at the higher 
software-contention level (where the actual logical locking - or not - 
persists) than at this hardware level (which is merely the mechanism by 
which the logical lock - or 'lock-free' mechanism - is created).
In fine-grained situations where the duration of the required 
synchronization starts to approach the time it takes to execute the 
interlocking hardware instruction(s) themselves, 'lock-free' approaches 
are great when they can be practically applied.  For lengthier 
synchronization in specific situations where use of stale copies of data 
can be tolerated they are also effective (read-only transactions in a 
multi-version database being a long-standing example).  But while the 
theory behind more general-purpose lock-free ('optimistic') database 
concurrency control has been well-understood for at least the past 
quarter-century, its near-complete absence in products testifies to the 
superiority of lock-based approaches for such general-purpose use.
- bill
On 10/8/06 4:49 AM, in article
452806c8$0$1346$834e...@reader.greatnowhere.com, "Eric P."
<eric_p...@sympaticoREMOVE.ca> wrote:
> Joe Seigh wrote:
>> 
>> One of the things you can do is deny that your application needs all
>> that many cores.
>> 
>> http://www.theinquirer.net/default.aspx?article=34916
> 
> Their point is well taken, but this is a poor example.
> The article states that because cores #2 and #3 would wait on
> core #1 that there is no advantage to more than 2 cores. This is not
> necessarily true as core #1 can still contribute to lowering the
> overall cycle time. Whether it is worthwhile is the question.
Why is it (almost) always necessary to run all the cores at equal speed?
Would it not be prudent to assign accelerated algorithms to threads (thread
= core) clocked *higher*? Here "higher" is a relative term; by higher I
mean at least an order of 2. If I understand this (thanks, Prof. Knuth), we
are still trying to find upper and lower bounds on the complexities of
multi-core. If it is functionally possible to re-model O(n) into O(1), the
converse is also true. As for the number of cores on a die, it is quadratic
anyway. Let me say the inputs provided here are functions of core
clocking; what would be the effect? Or am I trying to optimize away some
absurd random scenario?
Thanks.
-- 
First of all - I am an Engineer. I care less for Copyrights/Patents, at
least I have none of my own! I love software development & it pays me to run
my family. I try to dedicate some time thinking about Open Source movement &
sometime contributing to it actually. I often get paid by claiming knowledge
in software developed by Open Source community. Lots of things I know today
& still learning are due to Open Source community.
One reason is that most (all?) SMP OSes require all CPUs to be
exactly equal, including rev level, so they don't have to try to
deal with any asymmetric behavior. You don't want to have to deal
with, say, an FDIV bug that only shows up on one CPU but not another.
So that would seem to preclude different-speed CPUs in an SMP system.
Eric
[...]
>> If an algorithm is lock-free according to Herlihy's definition, then
>> some thread can make forward progress no matter what.
>
> That is insufficient in many circumstances, since it guarantees nothing 
> about avoiding perpetual livelock for some subset of the threads (unless 
> the thread exit rate exceeds the thread entry rate, in which case 
> eventually only one thread will be left).
I do agree that some of the "loop-based" lock-free algorithms (e.g., LL/SC 
and certain "CAS-loop" logic) can theoretically suffer from "livelock-like 
episodes":
http://groups.google.com/group/comp.arch/msg/5d1ff313519005f8
That's why I "try" to make use of loopless lock-free algorithms whenever I can. 
Although, and FWIW, lock-free algorithms are much better than 
obstruction-free algorithms wrt livelock. My basic point is that loopless 
lock-free algorithms simply cannot suffer from the type of livelock-like 
scenarios that you are rightly describing here:
http://groups.google.com/group/comp.lang.c++.moderated/msg/6b2ccf76ba145c9c
Also, loopless lock-free algorithms are about the "only" type of lock-free 
algorithms that can be "safely" used in any sort of hard real-time system:
http://groups.google.com/group/comp.sys.super/msg/642b6b608294e426
(I briefly mention lock-free wrt hard real-time near the end of the post)
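An example of what "loopless" means here: a primitive like C11's atomic_fetch_add completes in a single interlocked instruction (LOCK XADD on x86) with no retry loop, so every caller finishes in bounded time regardless of contention, which is the property a hard real-time system needs.

```c
#include <stdatomic.h>

static atomic_long tickets;

/* Loopless (indeed wait-free) ticket dispenser: one interlocked
 * instruction per call, no CAS retry loop to livelock in. */
long take_ticket(void)
{
    return atomic_fetch_add(&tickets, 1);
}
```

Contrast this with a CAS loop, whose per-call latency under contention is unbounded even though the algorithm as a whole is lock-free.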
>   If there's
>> an issue about fairness,
>
> There's usually *some* issue about fairness, but as long as unfairness 
> doesn't get too out-of-hand it can often be tolerated.
Priority inversions and lock convoys should be factored into any locking 
scheme... It's a good thing that clever use of lock-free and lock-based 
synchronization techniques can provide a solution to priority inversions and 
lock convoys... Is that what you are getting at here?
>  the ability of all threads to make forward
>> progress, with the interlocked instructions used in lock-free under
>> extreme contention, then lock based solutions will have a problem
>> also, since most lock implementations that I'm aware of use interlocked
>> instructions also.
>
> It is possible that we're talking about two different things, then:  I'm 
> talking about the difference between approaches that reserve (typically 
> shared or exclusive) access to some object for a period of time, as 
> distinguished from approaches which find ways to avoid reserving access to 
> individual objects (e.g., by creating private copies or by detecting 
> inadmissible sharing after the fact and acting at that time to eliminate 
> it retroactively).
>
> While the underlying low-level hardware mechanisms may be common to both 
> traditional locking and to 'lock-free' approaches,
Indeed.
> forward-progress guarantees (save for any need to address inequities 
> introduced by NUMA latency) are usually more critical to ensure at the 
> higher software-contention level (where the actual logical locking - or 
> not - persists) than at this hardware level (which is merely the mechanism 
> by which the logical lock - or 'lock-free' mechanism - is created).
>
> In fine-grained situations where the duration of the required 
> synchronization starts to approach the time it takes to execute the 
> interlocking hardware instruction(s) themselves, 'lock-free' approaches 
> are great when they can be practically applied.  For lengthier 
> synchronization in specific situations where use of stale copies of data 
> can be tolerated they are also effective (read-only transactions in a 
> multi-version database being a long-standing example).
I wonder how many commercial databases actually make "extensive" use of 
lock-free reader patterns?
https://coolthreads.dev.java.net/servlets/ProjectForumMessageView?forumID=1797&messageID=11068
Also, there are algorithms that can provide a solution to the fact that 
lock-free reader patterns can indeed iterate through some pointers that 
point to stale/old data structures:
http://groups.google.com/group/comp.programming.threads/msg/fdc665e616176dce
(I address stale data in the last 2 paragraphs of the post...)
Generally, **PDR can outperform fine-grain rw-locking by a fairly wide 
margin. Locking has some overhead:
http://groups.google.com/group/comp.lang.c++.moderated/msg/d3a6db6ba9e7fd95
http://groups.google.com/group/comp.lang.c++.moderated/msg/7c15b3488642d1a6
http://groups.google.com/group/comp.lang.c++.moderated/msg/a8ef173bee4f0c03
> But while the theory behind more general-purpose lock-free ('optimistic') 
> database concurrency control has been well-understood for at least the 
> past quarter-century, its near-complete absence in products testifies to 
> the superiority of lock-based approaches for such general-purpose use.
Most of the lock-free ('optimistic') concurrency control schemes are based 
around suboptimal lock-free techniques. They all seem to try to be 
wait-free, and have the same livelock problems that obstruction-free 
algorithms have. I actually would prefer to use a highly-tuned fine-grain 
locking scheme over any obstruction-free algorithm:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/e0c011baf08844c4
Humm... I need to think some more about the practical issues you raised...
Thank You,
Chris Thomasson
* I remember hearing somewhere that CAS can be "fair" when it's implemented 
in the hardware via locking (e.g., of the bus). Something like this:
http://groups.google.com/group/comp.arch/msg/18f0d50aa3c9a148
I think... Humm...
** PDR == Partial Copy-On-Write Deferred Reclamation:
http://groups.google.com/group/comp.arch/msg/2a0f4163f8e13f1e
(follow the links for basic descriptions/discussions of PDR) 
> Why is it (almost/always) necessary to run all the cores at equal 
> speed? Would it not be prudent to assign accelerated algorithms to
> threads (thread = core) clocked at *higher*. 
Because it's /simpler/. Take an example: say you want to make a four-core 
processor, with one really fast core, two medium-speed, and one slowish 
one for background tasks. 
You can achieve this at a uniform clock speed by making the cores of 
different designs: for a crude Intel example, a single Conroe core, two 
Willamettes and a Pentium. But then you have to get them all working 
together right, and reliably. And that's harder than having four uniform 
cores work together reliably. 
Alternatively, you can have cores of uniform design, and run them at 
different clock speeds. That sounds attractive in a way, because it 
means that you have a use for parts where only one core will run at the 
maximum designed speed. 
But then, is software to rely on the relative speed of these cores? Say 
they're running at X, 0.75X, 0.75X and 0.3X. Now, the chip manufacturer 
has a process shrink, and finds he can make it run at 1.5X, 1.2X, 1.2X 
and 0.5X. Suddenly, the medium-speed cores are running a bit faster 
relative to the top-speed one, and the slow one a bit slower. If your 
software relied on the relative speeds, it's all messed up. So you don't 
rely on the relative speeds. And then a processor with four identical 
cores is just as good. 
Since software gets run on all sorts of hardware, quite often hardware 
which didn't exist when the software was finalised, keeping things 
simple is just much /easier/. 
--- 
John Dallman                                        j...@cix.co.uk
"Any sufficiently advanced technology is indistinguishable from a 
well-rigged demo"
EP> So that would seem to preclude different speed cpus in an smp
EP> system.
Different-speed CPUs are coming, in the form of multicore CPUs which
can do power saving on each core. That way a pure single-threaded
workload will keep every core except one at a very low clock
frequency.
Operating systems will need to be adjusted to cope, of course.
Timekeeping can be fun, for one thing. Also, if you have just a single
thread running at half speed, and another thread comes along, do you
schedule it on the same core or do you wake another core from deep
sleep?
/Benny
Which is why I stated a bit later that I was waiting for at least 8 to 
16 cores, i.e. this is the minimum of what I'd consider 'high count'.
> 
>> OTOH, I have also spent quite a few hours with John Carmack, who by
>> all accounts is one of the all-time great programmers, and he didn't
>> see any way to really take advantage of multiple cores for gaming.
> 
> With all due respect to John (he _is_ a great game coder), I just read
> about a new game engine from one of his competitors that uses five
> threads: gameplay, graphics, physics, AI, and sound IIRC.  It
> keeps a quad-core system busy, is playable at reduced quality on
> dual-core systems, and is dead in the water on a single-core SMT
> system -- and the quality blows away current games.
This is a good example of multithreaded programming, also in the sense 
that it is almost completely non-symmetrical: You have a big main 
thread/program, and a bunch of helpers that all depend to a smaller or 
larger degree on what the main thread is doing.
I.e. 3D sound has to use the same (or a simplified) model of the world 
as the main game engine, so does AI and physics.
You can debate exactly how much belongs in 'gameplay' and how much in 
'graphics', but it is obvious that they have to overlap to at least some extent.
> There are people who can come up with innovative ways to use new
> hardware, and they're rarely the folks who built their fortunes on the
> last innovation.  That's what makes being in tech so fun.
Sure, I wouldn't discount John C quite yet though! :-)
...
>>   If there's
>>> an issue about fairness,
>> There's usually *some* issue about fairness, but as long as unfairness 
>> doesn't get too out-of-hand it can often be tolerated.
> 
> Priority inversions and lock-convoys should be factored into any locking 
> scheme
Indeed, though my specific comment referred to the ability to tolerate 
moderate but statistically insignificant unfairness at the hardware 
level that Joe mentioned as long as higher-level fairness is reasonably 
assured (effectively, as long as lock-duration greatly exceeds average 
lock-acquisition latency such that accessors who request locks at 
approximately the same time will eventually acquire them in 
approximately the order of their requests).
...
>> For lengthier
>> synchronization in specific situations where use of stale copies of data 
>> can be tolerated they are also effective (read-only transactions in a 
>> multi-version database being a long-standing example).
> 
> I wonder how many commercial databases actually make "extensive" use of 
> lock-free reader patterns?
Oracle has supported them for quite a while (rather than maintain 
multiple versions of data, it uses its undo logs to back out later 
changes, regressing pages for non-locking read-only transactions to an 
appropriate point in time).  IIRC DEC's (now for a decade Oracle's) Rdb 
does something like a snapshot to perform a similar service.  Interbase 
began life (also at DEC) by maintaining prior versions for both such 
read-only accessors and IIRC for transaction backout as well.  Postgres 
(perhaps also its predecessor Ingres) employs some form of no-overwrite 
update policy which, as I understand it, serves this purpose.
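The version-chain idea behind all of these non-locking reads can be sketched in a few lines of C. This is a hedged illustration only: the struct and function names are hypothetical, and no vendor's actual on-disk mechanism looks like this -- the point is just that a reader walks committed versions by snapshot id and never takes a lock.

```c
#include <stddef.h>

/* Hypothetical sketch of multi-version reads: each row keeps a chain of
 * versions stamped with the id of the transaction that wrote them.  A
 * read-only transaction carries a snapshot id and walks the chain until
 * it finds the newest version it is allowed to see -- no locks taken. */
typedef struct version {
    unsigned long   txn_id;   /* id of the writing transaction */
    int             value;    /* the row payload (simplified)  */
    struct version *older;    /* next-older committed version  */
} version;

/* Return the value visible to a snapshot taken at 'snapshot_id',
 * or 'missing' if the row did not yet exist at that point. */
int snapshot_read(const version *newest, unsigned long snapshot_id,
                  int missing)
{
    for (const version *v = newest; v != NULL; v = v->older)
        if (v->txn_id <= snapshot_id)
            return v->value;
    return missing;
}
```

A writer prepends a new version to the chain; concurrent readers keep seeing the older one, which is the whole trick.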
Those are just the ones that spring immediately to mind, and I simply 
don't know whether similar mechanisms exist in other current and former 
major players such as SQL Server, Sybase, Informix, etc. (though without 
knowing anything specific I'd expect that DB2 does:  they seldom pass up 
anything useful there).
- bill
Those still have to be the same make and model cpu. And I'm
sure changing the cpu clock divider on the fly presents a whole
new set of design problems and glitches.
But off the top of my head, I don't see how this would be of much
practical value. Currently if there is no work for a cpu the OS will
pause or power down anyway. And as soon as there is a computable
thread, a general purpose OS would run it as fast as possible because
it has no knowledge of thread speed or deadlines.
Now one might envision that threads could declare themselves as low
power candidates, or maybe define that priority 0 is low speed,
but only a few apps might ever do this so I don't see how this
would accomplish anything in the end.
A real time deadline based OS might use such a feature
but that is a whole different set of design issues and
they would probably not use SMP anyway.
Eric
EP> But off the top of my head, I don't see how this would be of much
EP> practical value. Currently if there is no work for a cpu the OS
EP> will pause or power down anyway. And as soon as there is a
EP> computable thread, a general purpose OS would run it as fast as
EP> possible because it has no knowledge of thread speed or deadlines.
These days speed changes are usually initiated from user space -- a
user space daemon notices that the CPU is 40% idle and slows down
the CPU to e.g. 70%. This is handy for watching video and some games.
The downside is that it takes a moment for the daemon to notice a new
CPU intensive program, so latency can increase.
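The decision such a userspace daemon makes boils down to two counter samples and a division. A minimal C sketch of just that logic (the types and thresholds here are hypothetical, not any particular governor's code; a real daemon would read the counters from the OS and write the speed back through its cpufreq interface):

```c
/* Hypothetical sketch of a userspace cpufreq daemon's decision logic:
 * sample cumulative (idle, total) jiffy counters twice, compute the
 * idle share of the interval, and ask for a slower clock when the CPU
 * is mostly idle. */
typedef struct {
    unsigned long idle;   /* cumulative idle jiffies  */
    unsigned long total;  /* cumulative total jiffies */
} cpu_sample;

/* Percentage of the interval [prev, cur] the CPU spent idle (0..100). */
int idle_percent(cpu_sample prev, cpu_sample cur)
{
    unsigned long dt = cur.total - prev.total;
    if (dt == 0)
        return 0;
    return (int)((cur.idle - prev.idle) * 100 / dt);
}

/* Policy: drop to a slower speed when idle share exceeds a threshold. */
int want_slow_speed(cpu_sample prev, cpu_sample cur, int idle_threshold)
{
    return idle_percent(prev, cur) > idle_threshold;
}
```

The latency complaint above falls straight out of this structure: a burst of work only shows up in the *next* pair of samples, so the clock lags the load by one polling interval.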
EP> Now one might envision that threads could declare themselves as
EP> low power candidates, or maybe define that priority 0 is low
EP> speed, but only a few apps might ever do this so I don't see how
EP> this would accomplish anything in the end.
It is possible for the daemon to decide that anything with a nice
value larger than 0 should not count in the busy percentage.
/Benny
> Because it's /simpler/. Take an example: say you want to make a 
> four-core processor, with one really fast core, two medium-speed, and 
> one slowish one for background tasks.
> You can achieve this at a uniform clock speed by making the cores of 
> different designs: for a crude Intel example, a single Conroe core, two 
> Willamettes and a Pentium. But then you have to get them all working 
> together right, and reliably. And that's harder than having four 
> uniform cores work together reliably.
I hear rumours that people also ship 8-core processors where some of 
the cores have failed as 6 or 4 core units.  But I never know if this 
actually happens. 
the one call operating system was something i've thought about.
in_from_que = tx(out_to_que)
self is possible out??
tx works in the following manner:
1. fill the data field of the message element
2. fill the task field of the message element
3. pick up new message element if available or yield until message
demon runs it again.
a message demon task (only one) goes through all message elements and
adapts the message in pointer of a task to be one of the messages for
it.  by changing the task field element of a message to the task number
of an outgoing task the packet is marked as read, modified and sent to
another task..
so you pick up a message when you pass one out.
txing a message may be a registration of a service, which waits for rx
packets.
process ids would have an instance count, to support multi threading of
the same message handler, unless registered as a single instance
service.
this structure can be built using lists, with only the message demon
handling list pointers. multiple list/message demons could run if
messages held a demon id.
processor load synchronization is via sync() which sends a message to
the scheduler indicating its tick requirement. a counter is increased
for the task and scheduling happens on the lowest (count-currentbase).
a task could be double ticked to reduce priority, preempted on the do
something else interval.
auto currentbase over-estimation detection could be used to double tick
non-criticals.
chee
I think Chris is referring to whether the implementation is largely
lock-free, as opposed to the external api being lock-free.  The only database I
know that claims this is this one
http://www.ants.com/index.php?option=com_content&task=view&id=525&Itemid=79#lockfree
I looked at their patents at one point.  One was for a lock-free LIFO queue using
double wide compare and swap, and the other was I think some kind of lock-free
FIFO queue using producer/consumer logic.  I'm not sure how they're doing the
lock-free index.
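For flavor, here is a hedged sketch of the kind of lock-free LIFO that first patent describes: a Treiber stack whose head carries a version counter to defeat ABA. The patent reportedly uses a double-wide CAS pairing a full pointer with the counter; to stay portable this sketch packs a 32-bit node *index* and a 32-bit version into a single 64-bit word instead -- same idea, narrower address space. Everything here (names, pool, NIL sentinel) is my own illustration, not ANTs' actual code:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Lock-free LIFO (Treiber stack), ABA-proofed by a version counter.
 * head = (version << 32) | top-index; every successful update bumps
 * the version so a recycled index can't be mistaken for an old one. */
#define NIL 0xffffffffu

typedef struct {
    uint32_t next;    /* index of next node in the stack, or NIL */
    int      value;
} node;

static node pool[64];                    /* nodes live in a fixed pool */
static _Atomic uint64_t head = NIL;      /* version 0, empty stack     */

static uint64_t pack(uint32_t ver, uint32_t idx)
{
    return ((uint64_t)ver << 32) | idx;
}

void lifo_push(uint32_t idx)
{
    uint64_t old = atomic_load(&head);
    do {
        pool[idx].next = (uint32_t)old;  /* link new node to old top */
    } while (!atomic_compare_exchange_weak(&head, &old,
                 pack((uint32_t)(old >> 32) + 1, idx)));
}

/* Returns the popped node's index, or NIL if the stack was empty. */
uint32_t lifo_pop(void)
{
    uint64_t old = atomic_load(&head);
    do {
        if ((uint32_t)old == NIL)
            return NIL;
    } while (!atomic_compare_exchange_weak(&head, &old,
                 pack((uint32_t)(old >> 32) + 1,
                      pool[(uint32_t)old].next)));
    return (uint32_t)old;
}
```

Without the version half of the word, a thread preempted inside pop could CAS successfully against a head that had been popped and re-pushed in the meantime -- the classic ABA failure the double-wide scheme exists to prevent.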
1.  Since he was responding specifically to the example of lock-free 
read-only database access that I had provided, this seems unlikely.
2.  That example has nothing to do with 'the external api', since 
typical database use does not address locking in its api at all (i.e., 
any locking is a strictly internal mechanism supporting the implicit 
isolation semantics of the higher-level api).
3.  The (usually row- or page-granularity) isolation level under 
discussion in this instance is the most important one in terms of 
database performance:  by comparison, performance variations achievable 
in lower-level synchronization mechanisms tend to be well down in the 
noise, and hence of relatively limited interest.
- bill
AI, physics, even the harder tasks of rendering (e.g. light maps) can be
calculated in parallel. The rendering itself already is parallelized even
if John Carmack doesn't realize it is (SLI) - all he does on his side is
calling OpenGL functions.
I think the main problem is when you just have two cores - the overhead will
eat up most of the speedup.
-- 
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/
Changing a program that was built with a different paradigm in mind is
always easier said than done. That's why it is often easier to ditch the
legacy program when significant changes come around, and rewrite from
scratch. This is often harder said than done, since most people weigh the
effort already invested into the project against the effort of the
rewrite, instead of weighing the effort of the rewrite against
the effort of changing the ill-adapted solution of the past.
>>
>> I think Chris is referring to whether the implementation is largely
>> lock-free, as opposed to the external api being lock-free.
> 
> 
> 1.  Since he was responding specifically to the example of lock-free 
> read-only database access that I had provided, this seems unlikely.
This question has come up before in other newsgroups.  That's what
he's talking about.
> 
> 2.  That example has nothing to do with 'the external api', since 
> typical database use does not address locking in its api at all (i.e., 
> any locking is a strictly internal mechanism supporting the implicit 
> isolation semantics of the higher-level api).
> 
> 3.  The (usually row- or page-granularity) isolation level under 
> discussion in this instance is the most important one in terms of 
> database performance:  by comparison, performance variations achievable 
> in lower-level synchronization mechanisms tend to be well down in the 
> noise, and hence of relatively limited interest.
> 
Maybe for OLTP. I wouldn't say that's necessarily true for OLAP.
...
>> 3.  The (usually row- or page-granularity) isolation level under 
>> discussion in this instance is the most important one in terms of 
>> database performance:  by comparison, performance variations 
>> achievable in lower-level synchronization mechanisms tend to be well 
>> down in the noise, and hence of relatively limited interest.
>>
> 
> Maybe for OLTP.  I wouldn't say that's necessarily true for OLAP.
It should be when the OLAP activity is occurring on a database 
concurrently shared with OLTP processing.  And when the database isn't 
being so shared, since most of the OLAP access is read-only there won't 
be any noticeable locking overhead (I suppose you could suggest that 
this then exposes underlying synchronization overheads more, but it's 
not at all clear that they'd constitute a significant portion of the 
overall processing load in such cases).
- bill
> Why is it (almost/always) necessary to run all the cores at equal speed?
> Would it not be prudent to assign accelerated algorithms to threads (thread
> = core) clocked at *higher*. 
Why not clock all of them at maximum?  Perhaps you could save a bit of
power if you have one thread that consumes a lot more cpu than all the
others together, but I can't see how it's worth adding a lot of
complexity to an SMP system in order to support heavily skewed
parallelism.
-k
-- 
If I haven't seen further, it is by standing in the footprints of giants
;)
"Bill Todd" <bill...@metrocast.net> wrote in message 
news:nJCdnbkrGYrNBrTY...@metrocastcablevision.com...
> Joe Seigh wrote:
>> Bill Todd wrote:
>>> Chris Thomasson wrote:
>>>
>>>>
>>>>
>>>> I wonder how many commercial databases actually make "extensive" use of 
>>>> lock-free reader patterns?
>>>
>>>
>>> Oracle has supported them for quite a while (rather than maintain 
>>> multiple versions of data, it uses its undo logs to back out later 
>>> changes to regress pages for non-locking read-only transactions to an 
>>> appropriate point in time.
I wonder exactly how much of an impact maintaining correct undo logs 
actually has on the "read-mostly/only" part of a given synchronization 
scheme? I know that the "flagrant" use of atomic operations and/or memory 
barriers can have a marked negative effect on the throughput, scalability 
and general performance of a synchronization algorithm's shared reader 
operations. You don't really want to be forced to execute an atomic 
operation and/or memory barrier, and basically blow the pipeline and any 
cache that may have accumulated, every time you need to acquire and 
subsequently read from a shared data-structure...
> 1.  Since he was responding specifically to the example of lock-free 
> read-only database access that I had provided, this seems unlikely.
>
> 2.  That example has nothing to do with 'the external api', since typical 
> database use does not address locking in its api at all (i.e., any locking 
> is a strictly internal mechanism supporting the implicit isolation 
> semantics of the higher-level api).
>
> 3.  The (usually row- or page-granularity) isolation level under 
> discussion in this instance is the most important one in terms of database 
> performance:  by comparison, performance variations achievable in 
> lower-level synchronization mechanisms tend to be well down in the noise, 
> and hence of relatively limited interest.
Well, IMHO and FWIW,  I think that a fairly well thought-out, and somewhat 
carefully designed lock-based synchronization strategy is usually way better 
than basically any sort of "crappy" lock-free implementation, or, I should 
more correctly say, emulation(s)... Humm... IMHO, it does "kind of seem" 
that a number of lock-free algorithms that exist, or are being currently 
designed to address the synchronization requirement "for" some different 
types of robust, enterprise-like databases bind their various forward 
progress aspects to, or even explicitly design them for, and/or totally 
depend on some sort of "multi-non-contiguous-word load-linked 
store-conditional" (e.g., MCAS, *KCSS, **STM, perhaps ***HTM) and/or 
contention manager functionality. Apparently, these type of lock-free 
algorithms "cannot seem to survive in the wild" without the aid of atomic 
multi-word LL/SC operations, and/or of course, some "explicit, or implicit?" 
assistance from a nifty "contention manager protocol"; which usually comes 
with its fair share of overhead, all by itself.
For those who happen to be not so familiar with this kind of stuff, a 
contention manager "usually" has to be able to identify and subsequently 
address scenarios in which multiple application threads could be 
experiencing repeated, and somewhat "persistent" failures. This 
could be during moderate to severe, possibly prolonged periods of 
load. They could be trying to make their way through a synchronization 
primitive's implementation (e.g., obstruction-free queue, skiplist, etc...) 
only to find themselves folding under and totally depending on the 
contention manager to give it emergency first-aid. The current "popular" 
trend wrt lock-free seems to be focused on obstruction-free-like 
synchronization techniques. That basically means that some sort of 
contention manager/arbiter logic has to "keep track" of certain things. This 
could, and usually does, involve the need to "query some state from some 
threads that happen to be currently accessing the shared data-structure 
during times of fairly intense, and perhaps prolonged periods of load"...
IMHO, all of the intricate work that "they" put into developing "a working 
multi-word ll/sc emulation" could have gone into developing practical 
usage paradigms aimed at reducing the "total number" of atomic 
operations/membars throughout a number of different synchronization 
infrastructures that their "user" applications make heavy/extensive use of. 
Unfortunately, it seems like these LL/SC emulation things are being 
advertised as a good solution for the "concurrency problem". These hefty 
obstruction-free algorithms, along with their contention managers, can be 
abstracted away into a library that is transparent to the end-user database 
API. I mean, does the end-user really care if the database he or she is 
making use of is implemented in a somewhat suboptimal manner? Well, they 
might start to care... Maybe... I could picture a company that has a perfect 
application that runs great and very fast on uni-processor, and 
un-deterministically dies from some "mystery race-condition" after a couple 
of days/weeks of running its "perfectly and obviously flawed logic" through 
a multi-processor system. How do you fix a race condition when your 
application is massive, and you never even had to think about this bug 
simply because it never showed up across years of a track record of 
successful reliability and good performance characteristics?
IMHO, all of this basically boils down to the contention manager performing 
introspection duties which should be really be used for profiling, 
experimenting and/or debugging application architecture in order to begin to 
leverage some of the various results against any experimental code 
optimizations you may have created. I guess the basic point I am getting at 
here is that an algorithm that needs to "basically rely" on some fairly 
intricate introspection techniques to guide it through periods of 
high contention has, IMHO, already been suffering from an inherent and 
somewhat severe disability, from birth. Once you design any algorithm to 
depend on a contention manager, you are basically folding the performance of 
some key areas of your algorithm under an unknown set of rules. The 
contention manager's behavior can't really be predicted by a "user" 
algorithm. So, you are kind of "blindly" binding some of the performance 
critical aspects of your algorithm to an unknown third-party in order to 
guide your threads through your application's points of contention, when the 
lights are out! Yikes!
;)
I have not looked at ANT Software's prior art. The patent(s) they hold seem 
to classify their "type" of lock-free algorithms into the "writer" category. I 
find it to be very useful when your synchronization logic can differentiate 
a "writer" thread, from a "reader" thread. You can actually implement this 
in 100% user-space using about a page's worth of SPARC 32/64 - IA-32 assembly 
language. Here is a suboptimal full-blown working example:
http://groups.google.com/group/comp.arch/browse_frm/thread/b2f751d5dad6c07b
I have to admit that even though this has the inherent ability to outperform 
basically any rw-mutex implementation, barring some hardware-assisted 
super ****rw-mutex of the future, it still has to use atomic operations and 
memory barriers. However, the presented code has scalability characteristics 
that exceed anything provided by most rw-mutex implementations. It has the 
ability to create a scenario in which writers and readers are no longer 
mutually exclusive, and it keeps any piece of memory that a thread has 
acquired in a PDR-collected region alive as long as any entity (e.g., a 
thread in a process) owns an acquired reference to it.
------
* 
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/b21b212dd175baab
** 
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/f6399b3b837b0a40
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/9c572b709248ae64
http://groups.google.com/group/comp.arch/msg/bbbc035cf1a8502c
http://groups.google.com/group/comp.arch/msg/11b14c4bda2d5d82
http://groups.google.com/group/comp.arch/msg/9b00fda2752966f9
http://groups.google.com/group/comp.arch/msg/335aeb22fd6fe526
http://groups.google.com/group/comp.arch/msg/1ace9400b1b16cd4
*** http://groups.google.com/group/comp.arch/msg/995379a16beb3b69
(simply excellent Wheeler post)
**** 
http://groups.google.com/group/comp.programming.threads/msg/6236a9029d80527a 
[...]
> I mean, does the end-user really care if the database he or she is making 
> use of is implemented in a somewhat suboptimal manner? Well, they might 
> start to care... Maybe...
The following comments were supposed to be sarcastically addressing a 
possible scenario in which a nasty race-condition renders an application 
dead in the water, with sharks circling...
> I could picture a company that has a perfect application that runs great 
> and very fast on uni-processor, and un-deterministically dies from some 
> "mystery race-condition" after a couple of days/weeks of running its 
> "perfectly and obviously flawed logic" through a multi-processor system. 
> How to fix a race-condition when your application is massive, and you 
> never even had to think about this bug simply because it never shows up 
> after years of running a track record that shows successful reliability 
> and good performance characteristics'.
I decided to remove the paragraph, and accidentally left this portion in. 
Sorry for any confusion!
:O
> I mean, does the end-user really care if the database he or she is making 
> use of is implemented in a somewhat suboptimal manner? Well, they might 
> start to care... Maybe...
They could be asking questions once they start to purchase high-end 
multi-processor systems and happen to notice that one of their applications 
processes its "critical tasks a little bit slower, or perhaps only barely 
faster than usual". They might start to expect their applications to have 
the inherent ability to scale-up to whatever type of crazy multi-processor 
system the industry can invent and throw at them... I am living in a 
pipedream, no? lol
:)
There are a number of databases aimed purely at the OLAP market.  They
usually run with the entire database in memory for speed.  Some of them
don't bother with making the database persistent by backing it to disk.
At least one of them is running into scalability issues.  They need to
expand their database from handling only up to 100 users to 1000's of
users.
A particularly hot market is the financial analysis sector.  While most
of the access is read only, there is a significant amount of update
activity.  This is a market where speed is of the essence.  I believe
they measure costs in dollars per microsecond response time.
We discussed this a few years ago, in the context of using XADD 
(eXchange & Add):
At the time, my conclusion was that if you can live with (temporarily) 
fixed limits on the size of your queue, then you can do a totally 
lock-free implementation using no more than 4 XADD-updated variables:
front, back, room & avail
front points at the front of the queue, back at the end, room contains 
the number of free slots and avail the number of available tasks.
This is loop-free as well, as long as the queue is neither full 
nor empty.
It might be possible to find a more efficient approach, but this was the 
best I could figure out at the time, while allowing multiple producers 
and consumers.
With a single producer and/or a single consumer, you can of course get 
away with a much simpler setup!
With XADD as the primitive, I would force the queue size to be a power 
of two, allocate it as an array, and use indices into this array instead 
of direct pointers.
The front/back indices would be allowed to grow until the register word 
size, while being interpreted modulo the queue size:
   for (;;) {
     if (avail > 0) {
       int myavail = XADD(&avail,-1); // Any available tasks?
       if (myavail > 0) break;
       XADD(&avail,1);                // Undo: none left after all
     }
     yield();
   }
   int myfront = XADD(&front,1);      // Get current task index
   myfront &= QUEUE_SIZE_MASK;
   taskp tp = queue_array[myfront];
   queue_array[myfront] = NULL;
   XADD(&room,1);                     // Increase free queue space
   return tp;
The real problem is to write this in such a way that it can survive if a 
thread is stuck somewhere between getting & incrementing the front index 
and retrieving & setting to NULL the queue element.
I really don't know how to write code that can survive even if an 
arbitrary thread/core can stop in an arbitrary location!
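The four-counter scheme sketched above can be rendered as compilable C11, with atomic_fetch_add standing in for XADD. This is my own rendering under the stated assumptions (power-of-two bounded queue, counters gating both sides); it returns failure instead of spinning/yielding, and it inherits exactly the weakness Terje raises -- a thread stalled between claiming an index and touching its slot can wedge its peers:

```c
#include <stdatomic.h>
#include <stddef.h>

/* 'room' gates producers, 'avail' gates consumers; 'front'/'back' are
 * free-running claim indices interpreted modulo a power-of-two size.
 * Each counter is only ever moved with a fetch-and-add (XADD), with an
 * undoing fetch-and-add when the claim turns out to be invalid. */
#define QUEUE_SIZE 8                 /* must be a power of two */
#define QUEUE_MASK (QUEUE_SIZE - 1)

typedef void *taskp;

static taskp       queue_array[QUEUE_SIZE];
static atomic_int  room  = QUEUE_SIZE;   /* free slots            */
static atomic_int  avail = 0;            /* queued tasks          */
static atomic_uint front = 0, back = 0;  /* consumer/producer idx */

/* Returns 0 on success, -1 if the queue was full. */
int queue_put(taskp tp)
{
    if (atomic_fetch_add(&room, -1) <= 0) {   /* claim a free slot */
        atomic_fetch_add(&room, 1);           /* undo: queue full  */
        return -1;
    }
    unsigned myback = atomic_fetch_add(&back, 1) & QUEUE_MASK;
    queue_array[myback] = tp;
    atomic_fetch_add(&avail, 1);              /* publish the task  */
    return 0;
}

/* Returns the task, or NULL if the queue was empty. */
taskp queue_get(void)
{
    if (atomic_fetch_add(&avail, -1) <= 0) {  /* claim a task      */
        atomic_fetch_add(&avail, 1);          /* undo: queue empty */
        return NULL;
    }
    unsigned myfront = atomic_fetch_add(&front, 1) & QUEUE_MASK;
    taskp tp = queue_array[myfront];
    queue_array[myfront] = NULL;
    atomic_fetch_add(&room, 1);               /* release the slot  */
    return tp;
}
```

Note the asymmetry: 'avail' is only incremented *after* the slot is filled, and 'room' only after the slot is emptied, which is what keeps a claimed index from being handed out twice while a peer is still touching it.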
...
> There's a number of databases aimed purely at the OLAP market.  They
> usually run with the entire database in memory for speed.  Some of them
> don't bother with making the database persistent by backing it to disk.
RAM-resident databases are indeed a different kettle of fish.  OTOH, 
especially with the more static ones they also experience a lot less 
low-level activity (e.g., allocation/deallocation and queuing for 
resources) than conventional databases, so the opportunities for 
benefits from low-level lock-free operation are fewer.
> At least one of them is running into scalability issues.  They need to
> expand their database from handling only up to 100 users to 1000's of
> users.
That is entirely believable, but it remains to be demonstrated that lock 
overheads are a significant contributor to their scalability problems 
(I'd certainly be interested in any evidence you may have to that 
effect, but in its absence will observe that a scattershot approach of 
"we'll just reduce any overheads we can find, regardless of their 
importance" is usually not the best when dealing with immediate 
real-world problems, especially when it may entail significant effort in 
restructuring code).
- bill
I suspect none whatsoever:  my understanding is that Oracle maintains 
the undo log for other reasons (such as backout of data dirtied on disk 
within a transaction which aborts, or complete database rollback to an 
earlier state) and merely capitalizes upon its existence to support long 
read-only transactions.
> I know that the "flagrant" use of atomic operations and/or memory
> barriers can have a marked negative effect on the throughput, scalability 
> and general performance of a synchronization algorithms shared reader 
> operations.
Only, of course, when the overhead associated with the atomic operations 
and/or memory barriers starts to become a significant fraction of the 
total overhead associated with processing the data structures so 
acquired - a situation for which in this particular instance I'd have to 
see actual evidence before giving it much credence:  otherwise, 
throughput, scalability, and general performance will not be 
significantly affected.
...
I guess Joe may have been right after all, and that you completely 
missed the point both in your earlier response and now here.
The locking or 'optimistic' mechanisms used to isolate transactions from 
each other, which I submit are probably the primary factor in 
synchronization-associated throughput and scalability limits in even 
relatively fine-grained OLTP, execute a couple of levels higher up than 
the primitives that you're talking about here and make the 
implementation choice of those primitives fairly inconsequential (as 
long as whatever choice is made is competently implemented).
- bill
It's based on the fact that even if you are only doing a read access of
the data, you still need some form of synchronization, be it conventional
locking, PDR (PCOW Deferred Reclamation), or some other synchronization
technique like chained locking.  If you have something like OLAP, where
memory access is the predominant performance factor, vs. an OLTP database
where network latency affects performance more than any internal synchronization,
lock-free outperforms lock-based by orders of magnitude.
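One concrete illustration of the point that even pure readers need *some* synchronization: a seqlock (my choice of example here, not the PDR scheme named above) replaces the reader's lock with a version check. This sketch is simplified -- a real multiprocessor version needs memory barriers around the plain data accesses, which are omitted here:

```c
#include <stdatomic.h>

/* Seqlock: the writer bumps 'seq' to odd before updating and to even
 * after; readers take no lock but retry if they observe that a write
 * overlapped their read.  (Real implementations add memory barriers
 * between the seq operations and the data accesses.) */
typedef struct {
    atomic_uint seq;
    int         a, b;     /* the protected data (simplified) */
} seqlock_t;

void seqlock_write(seqlock_t *s, int a, int b)
{
    atomic_fetch_add(&s->seq, 1);     /* now odd: write in progress */
    s->a = a;
    s->b = b;
    atomic_fetch_add(&s->seq, 1);     /* now even: write complete   */
}

void seqlock_read(seqlock_t *s, int *a, int *b)
{
    unsigned before, after;
    do {
        before = atomic_load(&s->seq);
        *a = s->a;
        *b = s->b;
        after = atomic_load(&s->seq);
    } while (before != after || (before & 1));  /* retry torn reads */
}
```

Readers pay two atomic loads and a compare per access instead of a lock acquisition -- cheap when writes are rare, which is exactly the read-mostly regime being argued about here.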
Restructuring could be a problem especially when the data structures weren't
designed with that form of synchronization in mind.  Java had to redo the
collection libraries when they did the java.util.concurrent stuff.  The old
collections were unrefactorable into lock-free collections.
At a couple of points, I had looked at making TDB (Trivial DataBase), which is
part of Samba, into a lock-free database.  It would have been a lot of work.
It turned out that Samba couldn't even say whether they really had a performance
problem, even though they had gone along with the initial proposal which wasn't
from me BTW.  It fell through, though IBM did get a lock-free patent application
out of it.
20060130061 "Use of rollback RCU with read-side modifications to RCU-protected data structures"
No, you do not:  you only need it when at least potentially sharing 
access with updating activity - save for low-level operations such as 
allocating/deallocating operation structures and managing activity 
queues (not queues resulting from data-access interlocks, however: 
there aren't any).
That's one reason why (most?) OLAP approaches explicitly segregate 
update activity into batches performed while the normal analytical 
processing has been temporarily suspended:  intense OLAP typically 
doesn't require up-to-the-minute data (day-old usually being more than 
sufficiently current), and typically does benefit significantly from the 
optimizations obtainable by not sharing analytical activity with updates.
> If you have something like OLAP, where
> memory access is the predominant performance factor, vs. an OLTP database
> where network latency affects performance more than any internal
> synchronization,
> lock-free outperforms lock-based by orders of magnitude.
You appear quite confused:  'network latency' plays very little part in 
OLTP performance (well, it can increase response latency to remote 
clients, but only by a constant that has nothing to do with aggregate 
throughput).
And (at least in synthetic OLTP benchmarks like TPC-C) even disk latency 
plays little part in aggregate throughput:  it's all about processor 
performance, processor caches, and memory latency (and occasionally 
memory throughput, though apparently in most cases that's not the 
limiting factor).
> 
> Restructuring could be a problem especially when the data structures 
> weren't
> designed with that form of synchronization in mind.  Java had to redo the
> collection libraries when they did the java.util.concurrent stuff.  The old
> collections were unrefactorable into lock-free collections.
> 
> At a couple of points, I had looked at making TDB (Trivial DataBase), 
> which is
> part of Samba, into a lock-free database.  It would have been a lot of 
> work.
> It turned out that Samba couldn't even say whether they really had a 
> performance
> problem, even though they had gone along with the initial proposal which 
> wasn't
> from me BTW.  It fell through, though IBM did get a lock-free patent 
> application
> out of it.
That's the point that a convivial venture capitalist with time on his 
hands would stress to any bright-eyed young enthusiasts who came to 
him/her with a proposal centered on having a 'better' solution to some 
problem (even to some problem in which s/he was intensely interested):
*How much better is it, and how much will it cost?*  (i.e., does it make 
a real difference, sufficient to justify the effort = money required?)
Crusty old engineers have a similar attitude, which may help explain why 
more people here haven't joined your lock-free crusade:  there are 
plenty of areas in which any given product could be 'improved', but most 
of them just don't matter.
If (and while this may be suggested for at least some specific 
applications it's really not yet in actual evidence) lock-free 
approaches are really a significantly better way to do certain things, 
then over time they will likely migrate into more common use for those 
applications.  But no one (save for religious zealots) is likely to 
bother to retrofit any existing code (even if it's being otherwise 
revised) absent some actual evidence that doing so will be worthwhile in 
a much higher-level sense.
And in all the cases you and Chris have brought up, the only points even 
remotely resembling evidence have been associated with very low-level 
primitives rather than with before-and-after performance measurements 
(or even just credible simulations) of the real-world applications you 
keep suggesting would materially benefit from their use.
- bill
> I hear rumours that people also ship 8-core processors where some of 
> the cores have failed as 6- or 4-core units.  But I never know whether 
> this actually happens. 
Well, when processors are available with variable amounts of cache, ones 
with less than the max often have failed areas of cache that have been 
disconnected. So disconnected cores would seem possible; what processors 
have you heard this of? 
I have HEARD this of all of the recent multi-core systems, but the only
one for which the vendor said that this was done was the IBM POWER4,
and the disabling was done to increase bandwidth per cpu.  The other
rumours were all so indirect as to be little more reliable than
"well-sourced intelligence".
Regards,
Nick Maclaren.
Not for financial trading applications.  Market updates are frequent.
You don't have the luxury of batching things.
...
You're ignoring the fact that RCU is being used in the Linux kernel.
They had to convince Linus that there would be an actual and substantial
performance improvement before he let it be put in.
> 
> And in all the cases you and Chris have brought up, the only points even 
> remotely resembling evidence have been associated with very low-level 
> primitives rather than with before-and-after performance measurements 
> (or even just credible simulations) of the real-world applications you 
> keep suggesting would materially benefit from their use.
> 
It's basically a scalability issue.  I don't have the resources to write
an enterprise scale application.  Also lock-free algorithms are being
patented as fast as they can be invented.  Dealing with patents is also
a resource issue.
And I guess you could claim the ANTs database isn't a real world application.
Also, there's the Noble library here http://www.pss-ab.com/ and Intel
has a library here
http://www.intel.com/cd/software/products/asmo-na/eng/threading/threadbuildblocks/294797.htm
Both of those use the usual lock-free write algorithms, with reads made
lock-free via hazard pointers, I believe.
Then you're mixing OLAP with OLTP updates in the manner I described. 
Even then, approaches which analyze the incoming update stream in 
conjunction with a read-only snapshot of the pre-existing data rather 
than attempt to coordinate running OLAP on the OLTP database itself may 
make a lot more sense.
What part of the fact that you have yet to provide a scintilla of 
evidence to back up the claims you've made for the benefit to the 
various applications you've cited still manages to escape you?  Yes, RCU 
is used in the Linux kernel, and had you pointed to that rather than 
waved your hands vaguely (but vigorously) in other directions you would 
have had a point (for the Linux kernel, though not something 
automatically more generally applicable).
> 
>>
>> And in all the cases you and Chris have brought up, the only points 
>> even remotely resembling evidence have been associated with very 
>> low-level primitives rather than with before-and-after performance 
>> measurements (or even just credible simulations) of the real-world 
>> applications you keep suggesting would materially benefit from their use.
>>
> 
> It's basically a scalability issue.  I don't have the resources to write
> an enterprise scale application.  Also lock-free algorithms are being
> patented as fast as they can be invented.  Dealing with patents is also
> a resource issue.
> 
> And I guess you could claim the ANTs database isn't a real world 
> application.
No, I'm simply observing (do you really have *that* much of a problem 
wrapping your mind around this?) that it doesn't mean shit without some 
kind of controlled comparative (lock-based vs. lock-free being the 
*only* differences) *application-level performance data* (or, as I said, 
at a minimum credible quantitative simulations).
- bill
So you have never heard of Sun's Sun Fire T2000 Server
available with 4, 6, or 8 cores at 1.0GHz (or 8 cores at 1.2GHz)?
Paul A. Clayton
just a technophile
>
>In article <YJ-dnXeh-rEBK7rY...@metrocastcablevision.com>,
>Bill Todd <bill...@metrocast.net> writes:
>|> 
>|> Hey, it's true for most after-the-fact changes in *anything*, which is 
>|> why doing the job right the first time is usually well worth the effort.
>
>Yes and no.  Changing a procedural design to an object-oriented one
>isn't hard, and even changing an imperative one to a functional one
>is quite feasible.  Of course, the objective there is not performance
>but clarity, extensibility and reliability, but even so all experience
>is that merely converting old codes (without redesigning) rarely helps.
>
>But converting a serial code to a fine-grain parallel one turns out
>to be evil, even if all you are looking for is correctness and damn
>any performance gain.
>
>|> > Denying the difficulty of that issue is the great myth of SMP programming.
>|> 
>|> I must have missed that myth:  I've never considered it easy myself, 
>|> even after three decades' worth of experience doing it.
>
>Oh, it's quite widespread - and is usually propagated by people who
>have done it only in theory or on carefully selected demonstration
>examples.
>
>The first form in which I came across it was that ALL you had to do
>was to write in a functional language, and it would parallelise
>automatically.  There is a theoretical sense in which that is true,
>but it was rapidly discovered to be of no practical relevance.
>
>The most recent form is that you can take an arbitrary serial code,
>add some OpenMP (or UPC or ...) directives and tweak it a bit, and
>it will run in parallel.  That one is still being propagated, too.
Are there any other useful papers available along the lines of those
giving the details of OS conversion to SMP for TOPS-10 (James M.
Flemming/Anthony (Tony) Wachs)
http://www.inwap.com/pdp10/paper-smp.txt
or CP/VM (Charles A. Salisbury/Lynn Wheeler)
http://www.garlic.com/~lynn/93.html#0
http://www.garlic.com/~lynn/94.html#27
http://www.garlic.com/~lynn/99.html#88
http://www.garlic.com/~lynn/2001k.html#66 etc...  
-- 
Thanks. Take care, Brian Inglis 	Calgary, Alberta, Canada
Brian....@CSi.com 	(Brian[dot]Inglis{at}SystematicSW[dot]ab[dot]ca)
    fake address		use address above to reply
In the distant past, there was a DEC technical paper on making VMS an
SMP system.  It was very detailed on the rationale, the different
possible approaches, why the chosen one was selected, and the debugging aids.
	Jan
However, from the point of view of using multiple cores for performance,
operating systems are much easier tasks than most applications, for two
reasons:
    They have a certain amount of natural parallelism, with existing,
clean interfaces between independent and interacting components.
    Quite a lot of the problems have already been resolved, because
they have to be for interrupt handling, security and preemption.
Regards,
Nick Maclaren.
> So you have never heard of Sun's Sun Fire T2000 Server
> available with 4, 6, or 8 cores at 1.0GHz (or 8 cores at 1.2GHz)?
I, at least, didn't know that it was available with less than 8 cores.