From the Blog:
"I’ll get right to the point. If your multicore CPU or concurrent
programming language or operating system does not support fine-grain,
instruction-level parallelism in a MIMD (multiple instruction,
multiple data) environment, it is crap. Ok, I know that you are
offended and you don’t think that your company’s CPU, language, or OS
is crap but it is. And I mean Crap with a capital C. The computer
market is looking for fine-grain parallel systems that are fast,
secure, easy to program, auto-scalable and bug free. The crap that I
see out there does not even come close to delivering what the market
wants. Let me twist the knife in the wound a little deeper because I
don’t think you people (I’m tempted to say, “you idiots”, but I’ll let
it slide) are getting the message. Here is a list of highly desirable
things that the market wants right now, things that your crappy
products are not supporting but should be."
"No job is too difficult for the man
who doesn't have to do it"
Maybe he can team up with Arthur T. Murray of Mentifex MindForth
[dis]fame? They seem like peas in a pod.
I like Erlang and message passing simply because I can do it efficiently in
software :>
http://groups.google.com/group/comp.arch/msg/ec70a077c0057e97
http://groups.google.com/group/comp.arch/msg/280dff4b930590f0
Any thoughts?
When you have calmed down from your ranting you should take a look at the
National Instruments offerings. Their systems offer quite a bit of what you
are looking for. As for the rest of us, we are more involved in getting
maximum functionality into ever smaller enclosures with ever decreasing
power budgets and ever higher security and integrity requirements for the
least cost.
Of course, if you are not satisfied that even NI have what you are looking for
then you will just have to jolly well sit down and create it yourself.
--
********************************************************************
Paul E. Bennett...............<email://Paul_E....@topmail.co.uk>
Forth based HIDECS Consultancy
Mob: +44 (0)7811-639972
Tel: +44 (0)1235-811095
Going Forth Safely ..... EBA. www.electric-boat-association.org.uk..
********************************************************************
Now, THAT's unfair. Persuasiveness and inventiveness very rarely
occur in the same person.
However, let's stick to the technical aspects, as we are better at
those - well, I am :-) To describe the author of that as an idiot
is unfair on the vast majority of idiots. This is actually SO bad
that I find it amusing. Let's select some of the points he has
highlighted.
I'll get right to the point. If your multicore CPU or concurrent
programming language or operating system does not support
fine-grain, instruction-level parallelism in a MIMD (multiple
instruction, multiple data) environment, it is crap.
OK, I'll buy that. I have been saying something similar for years,
after all, except that I probably don't mean what he does. I will
post a SEPARATE message on the points I agree with him.
Fast, fine-grain, instruction-level parallelism using MIMD.
What is the point of parallelism otherwise? ...
Exposure of ignorance, plus refusal to learn. 'Nuff said?
Easy software composition. This means a graphical interface for
non-algorithmic programming, among other things. It also means
plug-compatible components. Just drag'm and drop'm. No more
(text-based) computer languages, por favor!
Ditto. That has been a hot research topic for 30+ years and, like
the semantic analysis of natural languages, the more that you learn,
the less of the problem that you even understand. It's actually a
deep psychological problem as well as an IT one.
Deterministic timing of events at the elementary operation level.
Deterministic temporal order is a must for reliability. This is
possible only by adopting a convention whereby all instructions
(elementary operations) are parallel reactive processes and have
equal durations based on a virtual system-wide clock.
Ditto. Mathematicians can prove both that it is inherently not
scalable and that it isn't enough, and engineers have confirmed
that the mathematics matches reality.
Automatic resolution of data dependencies. This eliminates blind
code and otherwise hidden side effects of code modification. It
is an essential ingredient for reliability. It can only be done
using a reactive synchronous software model.
Ditto. Both mathematicians and engineers can witness that it helps
with only one cause of such problems, there are other known approaches
to reliability, and his pet technique doesn't provide it anyway.
Impregnable security. This is possible only in a system that
enforces deterministic timing.
Ditto. Both mathematicians and engineers can witness it is twaddle.
Regards,
Nick Maclaren.
If your multicore CPU or concurrent programming language or
operating system does not support fine-grain, instruction-level
parallelism in a MIMD (multiple instruction, multiple data)
environment, it is crap.
He's right. The current hardware, operating systems and languages
make it artificially difficult to do this, though there are a few
signs of slight improvement. It should be possible for an ordinary,
unprivileged application to manage, schedule, synchronise, suspend,
restart, communicate with and otherwise control all of the cores/
microthreads that it has been provided with. WITHOUT needing to
ask the operating system to do it.
Basically, unless it becomes possible for ordinary application
developers to experiment with new, very low-level paradigms, we are
unlikely to see much progress. Better support would not solve the
problem, but would enable research and development.
However, DESIGNING architectures and programming languages to make
that possible is damn hard - you can make everything deterministic
and defined, and run like a drain - or you can leave everything
unspecified, and make the design unusable. Designing a worthwhile
compromise needs a great team, and the lack of salesdroids riding
on their backs.
Implicit parallelism at the design level, ...
He's right there, too. You can't bolt parallelism onto a serial
design and expect either good RAS or good performance. But we've
known that for at least three decades.
Regards,
Nick Maclaren.
If by fine-grained, you mean every couple-of-dozen instructions one
wants to be able to communicate with another computing device, then,
SORRY, you are totally out of luck. You see, it is currently taking
200-ish clocks to communicate with DRAM (cache fill/spill), and if the
other interested party wants to read the data you may have just
written, it will take another 200-ish clocks. So, to a first order,
you have to have a communications "grain" of 400-500 clocks. There is
a greater chance of this number going up (slightly) than down
(slightly or significantly) over time.
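A crude way to see that grain for yourself is to ping-pong a value between
two threads through shared memory and time the round trips. This is only a
sketch, not anyone's benchmark; the numbers depend heavily on whether the two
parties share a cache or have to go through DRAM, but each hop costs on the
order of a cache miss, not an instruction.
_____________________
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

int main() {
    std::atomic<int> token{0};
    const int rounds = 1000000;

    // Responder: wait for the ping, answer with a pong, repeat.
    std::thread responder([&] {
        for (int i = 0; i < rounds; ++i) {
            while (token.load(std::memory_order_acquire) != 1) { }
            token.store(0, std::memory_order_release);
        }
    });

    // Initiator: time 'rounds' complete round trips.
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < rounds; ++i) {
        token.store(1, std::memory_order_release);
        while (token.load(std::memory_order_acquire) != 0) { }
    }
    auto stop = std::chrono::steady_clock::now();
    responder.join();

    double ns = std::chrono::duration<double, std::nano>(stop - start).count();
    std::printf("average round trip: %.0f ns\n", ns / rounds);
    return 0;
}
_____________________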
As a good friend of mine once said: Good, Fast, Cheap: choose any 2.
Mitch
I don't know what that, er, intellectually challenged person wants,
but I would be happy with communicating only between threads of the
same process running on cores of the same chip. It's still not a
1-cycle activity, but it's a HELL of a sight better than going via
any external object (whether DRAM, InfiniBand or other).
|> As a good friend of mine once said: Good, Fast, Cheap: choose any 2.
And that's only the theoretical limit :-(
Regards,
Nick Maclaren.
Labview is pretty cool but it does not support fine-grained
parallelism, AFAIK, nor is it a true reactive (100% change-based)
environment. Besides, Labview could not possibly support fine-grain
parallelism since fine-grain, MIMD, multicore CPUs don't exist. In the
kind of parallel environment that I envision, cores are both invisible
and transparent and load-balancing is automatic. The developer should
be focusing on the application, not the hardware.
>Of course, if you are not satisfied that even NI have what you are looking for
>then you will just have to jolly well sit down and create it yourself.
Nope. Somebody else has got to pay for it. That's what venture
capitalists are for, to pay for things.
>If by fine-grained, you mean every couple-of-dozen instructions one
>wants to be able to communicate with another computing device, then,
>SORRY, you are totally out of luck.
No. I mean fine-grain in the same sense that it is used in the
parallel computing industry. An example will illustrate. Let's say,
you need to initialize four variables for a parallel quicksort module
using a many-core CPU. The variables should be initialized by four
separate cores simultaneously. The initialization should not be done
sequentially within a thread a la Erlang. Thread-based, coarse-grain
parallelism is a joke. Again, unless it is fine-grain and MIMD, it's
crap. And don't tell me that it cannot be done. I know otherwise.
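For reference, here is roughly what that example looks like with today's
coarse-grain, thread-based tools (a sketch in portable C++; the variable
names are made up). The cost of spawning and joining the threads dwarfs the
four stores they perform, which is exactly the problem.
_____________________
#include <thread>

int main() {
    // The four variables the parallel quicksort module needs initialized.
    int lo = 0, hi = 0, pivot = 0, depth = 0;

    // Coarse-grain version: one OS thread per initialization.
    std::thread t1([&] { lo    = 0;    });
    std::thread t2([&] { hi    = 1023; });
    std::thread t3([&] { pivot = 511;  });
    std::thread t4([&] { depth = 0;    });

    // Each spawn/join costs on the order of microseconds -- thousands of
    // times more than the single store it wraps.
    t1.join(); t2.join(); t3.join(); t4.join();

    // ...the parallel quicksort itself would start here...
    return 0;
}
_____________________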
I personally like using a distributed thread-to-thread communication
scheme based entirely on per-thread unbounded wait-free
single-producer/consumer (SPSC) queuing for the fast-paths, and
per-wait-bucket lock-free eventcounts to handle all the slow-paths. I can do
a SPSC queue in x86 using:
_____________________
The data-structure (e.g., the anchor of the queue) is two contiguous pointers:
head and tail. The push function is made up of five (5 MOV) instructions. The
pop function is (1 PUSH; 5 MOV; 1 CMP; 1 XOR; 1 POP; 1 RET) for 10
instructions.... The push and pop combined is only 15 instructions. IMHO,
that is virtually zero-overhead when compared to an implementation that uses
the LOCK prefix or an mfence instruction.
_____________________
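For concreteness, the same two-pointer (head/tail) layout looks roughly like
this in portable C++. This is only a sketch: the names and the dummy-node
detail are illustrative, not the exact x86 sequence counted above.
_____________________
#include <atomic>

// Sketch of an unbounded wait-free SPSC queue: head is owned by the
// consumer, tail by the producer, and a dummy node keeps both non-null.
template <typename T>
class spsc_queue {
    struct node {
        std::atomic<node*> next{nullptr};
        T value{};
    };
    node* head;  // consumer-owned
    node* tail;  // producer-owned

public:
    spsc_queue() { head = tail = new node; }              // dummy node
    ~spsc_queue() {
        while (node* n = head) { head = n->next.load(); delete n; }
    }

    // Producer: plain stores plus one release store; no LOCK prefix, no mfence.
    void push(T v) {
        node* n = new node;
        n->value = std::move(v);
        tail->next.store(n, std::memory_order_release);   // publish
        tail = n;
    }

    // Consumer: one acquire load decides empty vs. non-empty.
    bool pop(T& out) {
        node* next = head->next.load(std::memory_order_acquire);
        if (!next) return false;                           // empty -> slow-path
        out = std::move(next->value);
        delete head;                                       // retire old dummy
        head = next;                                       // next becomes dummy
        return true;
    }
};
_____________________
On x86 the release store and acquire load each compile down to a plain MOV,
which is why a hand-coded version needs nothing heavier than MOVs.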
That's fairly
The slow-path (e.g., the queue-empty condition) is handled as follows:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/aa8c62ad06dbb380
(my lock-free eventcount invention...)
As you can see, the eventcount can be used as a heavily fast-pathed (e.g.,
fast signaling) condition-variable for non-blocking synchronization
primitives. Here is the basic usage pattern for waiting on a non-blocking
queue condition, queue-empty in this case:
_____________________
// try and pop a node
node* msg = spscq_pop(...);

while (! msg) {
  // INITIAL SLOW-PATH:
  // acquire an eventcount key
  eventcount_key const key = eventcount_get(...);

  // try one more time
  msg = spscq_pop(...);
  if (msg) { break; }

  // FINAL SLOW-PATH:
  // wait on the key
  eventcount_wait(..., key);

  // try and pop a node again
  msg = spscq_pop(...);
}

// do some work...
node_do_work(msg);
_____________________
The eventcount is completely separated from the entity that has conditions.
This can be used with lock-based algorithms as well. In fact, it actually
can outperform many POSIX condition-variable implementations out there.
As I said before, this would be highly-distributed multiplexed queuing...
Here is a very brief description of the method I personally use:
http://groups.google.com/group/comp.programming.threads/msg/4230a9d3cbc20ea9
(the first paragraph gives a simple outline...)
http://groups.google.com/group/comp.programming.threads/msg/cab0505f87084c2b
IMHO, this is a highly scalable design... In practice, it seems to scale
exceptionally well indeed. What do you think about it? Do you think it's
complete crap!?
:^0
> |> As a good friend of mine once said: Good, Fast, Cheap: choose any 2.
>
> And that's only the theoretical limit :-(
Agreed.
To continue that sentence:
IMO, that's a fairly low-overhead implementation.
[...]
Perhaps my ignorance is showing, but I would expect that caching
effects would make it better for a single CPU to fill in all four
variables and then flush the cache line out rather than have four
separate ones trying to touch the same one. Cache coherency has got to
be a real bear with multiple separate processors jumping onto memory
in such close proximity.
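A rough sketch of that effect in portable C++ (the 64-byte line size and the
names here are assumptions): four threads updating counters that share one
cache line force the line to ping-pong between caches, while padding each
counter onto its own line removes the contention even though the work is
identical.
_____________________
#include <atomic>
#include <thread>
#include <vector>

// Each counter padded out to its own (assumed) 64-byte cache line.
// Remove the alignas and all four typically land on one line, and the
// updates become dramatically slower purely from coherency traffic.
struct padded_counter {
    alignas(64) std::atomic<long> value{0};
};

int main() {
    padded_counter counters[4];
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i) {
        workers.emplace_back([&counters, i] {
            for (long n = 0; n < 10000000; ++n)
                counters[i].value.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto& t : workers) t.join();
    return 0;
}
_____________________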
It seems to me that we need to break our problems down into chunks
that are separable enough not to need to be so closely coupled.
- Tim.
Oops! I forgot the RET instruction. A push is (5 MOV; 1 RET).
> The pop function is (1 PUSH; 5 MOV; 1 CMP; 1 XOR; 1 POP; 1 RET) for 10
> instructions.... The push and pop combined is only 15 instructions.
That is 16 instructions total.
> IMHO, that is virtually zero-overhead when compared to an implementation
> that uses the LOCK prefix or an mfence instruction.
> _____________________
[...]
Sorry for that.
Excellent point. The way I see it, it's more than just a cache
problem. Whenever you have multiple cores accessing shared memory
through a bottleneck, you are going to have major bus contention
issues to deal with, both in main and cache memory. IMO, that's the
biggest problem facing parallel computing, the second being the search
for a good parallel programming model. How can we design a cache
system for a multicore CPU that is randomly accessible by all cores
without running into gridlocks?
We could do as you suggest: have each core maintain its own cache and
keep the cores from running into each other's territories as much as
possible. The biggest performance hit will come from inter-core
communication. This is the approach taken by all the multicore
vendors. The problem with this is that it kills the performance
increase one would gain from fine-grain parallelism. Some tasks (e.g.,
parallel QuickSort, matrix operations, etc...) are tailor-made for
fine-grain execution.
My approach to multicore architecture is radically different. You can
read about it at the link below. I'll explain how caching can be
handled smoothly in my next post later today or Monday. I got to catch
some sleep right now. Come to think of it, I should insert an addendum
to my blog to do just that.
How to Design a Fine-Grain, Self-Balancing, MIMD Multicore CPU:
http://rebelscience.blogspot.com/2007/09/how-to-design-self-balancing-multicore_08.html
I don't. I can't speak for the perpetrator of the blog.
That is why I said that asking what the point is in any other form of
parallelism than fast, fine-grain, instruction-level using MIMD merely
exposed his ignorance. There are a zillion requirements for it, at
the basic problem formulation level, and many almost demand different
models. Many problems are naturals for distributed message passing.
Some are naturals for SIMD. Some are naturals for BSP. Many are
naturals for the 'sea of threads'. Etc.
Where I agreed with him was in context. GIVEN the current approaches
to multi-core chip design, AND the wish to improve the performance of
existing (largely serial) applications and designs, THEN providing
that form of parallelism at the application level is essential.
Unfortunately, it is very hard to do and I know that I am not good
enough to design it (I have tried, many times, and failed).
The alternative is to change the industry's objectives. Well, I have
tried that one, too, and failed even more dismally :-(
Regards,
Nick Maclaren.
It is one of the great mysteries of all time, how the industry manages to
remain so impervious to suggestions from users.
--
Joe Seigh
When you get lemons, you make lemonade.
When you get hardware, you make software.
A satisfied user is dangerous: he won't spend money any more.
Stefan
Stefan Monnier <mon...@iro.umontreal.ca> spake the secret code
<jwvsl4dpk8d.fsf-...@gnu.org> thusly:
>> It is one of the great mysteries of all time, how the industry manages to
>> remain so impervious to suggestions from users.
>
>A satisfied user is dangerous: he won't spend money any more.
The satisfied user is a myth. Users' expectations increase over time.
--
"The Direct3D Graphics Pipeline" -- DirectX 9 draft available for download
<http://www.xmission.com/~legalize/book/download/index.html>
Legalize Adulthood! <http://blogs.xmission.com/legalize/>
But, that's impossible. Since the idiots who invented MIMD,
where idiots with atomic clocks, not people with computers.
Or, better said as:
"The Children Of Jehovah, rather than people with
computers".
Some users indeed are never satisfied, but it doesn't make any sense to try
to satisfy them either since they're never satisfied anyway!
Stefan
No person "invented" MIMD. Mike Flynn at Stanford described the idea of
multiple independent streams before any existed during the time of early
time sharing. Mike has far from an idiot and perhaps one of the
brightest guys in the field when he retired. All he did was partition
the language (English).
Erlang: boy, that's a name I've not heard in a while.
--
No mystery.
Systems are complex. A lot of years go into designing and making them
(systems, not mere chips or boards or boxes).
You have to be able (not merely willing) to convince janitors.
Otherwise roll your own. Users had better learn as a minimum to sling code.
Stefan Monnier <mon...@iro.umontreal.ca> spake the secret code
><jwvsl4dpk8d.fsf-...@gnu.org> thusly:
>>A satisfied user is dangerous: he won't spend money any more.
Oh, the Osborne, and now Jobs iPhone, problem.
In article <feurhq$9d1$1...@news.xmission.com>, Richard <> wrote:
>The satisfied user is a myth. User's expectations increase over time.
True.
--
No, users get briefly satisfied, but that period of time is too damn
short to derive any pleasure. So don't count on it.
--
MIMD processing has been used for large scientific calculations for
some centuries. All the computers do is implement the old manual
techniques, just a little faster and on a larger scale :-)
Regards,
Nick Maclaren.
ahahaha... Eugene, you are responding to a stupid usenet robot.
zzbunker has been messing around on the newsgroups for years. Read
zzbunker's autistic-sounding response again. It makes no sense, does
it? Almost all of its replies have the word "since" in them, no
originality. Text-based or language-based AI, also known as symbolic
AI, is the worst possible way to research AI but zzbunker's
creators never seem to get the point. Every once in a while, it can be
very funny. ahahaha... AHAHAHA... ahahaha...
Ah Mr. Maclaren..... Going to be in Reno?
The arguable adjective in your response is "large."
Your key second word is scale. As Mike published his basic ideas
in 1966 thru the early 70s I would say that "large prior to the mid-20th
century" were really pretty small scale for that. The argument for
things like trig tables as big scale was that their algorithms are
comparatively simple. They blow up in fairly easily understood known ways.
When I took a job in the building which had the ILLIAC wing I worked
with ex-ILLIAC IV guys (still here) who used MIMD and "concurrent"
specifically rather than "parallel". I think they were splitting hairs.
Half the time. The other half now I agree with them. Those guys by
and large still only really did SIMD computing. Their problem with
singularities was and is still quite small (otherwise vector processors
would not have been as successful either). I think the ICL DAP guys
would go along with that.
Really embarrassingly easily decomposed MIMD can luckily happen on the
web, say in places where power is cheap, and the special places where
people are willing to pay for it are still fairly small scale.
Scale here might be say 8-9 orders of magnitude (not my figure).
I think Mike quit at the right time. I recall after I gave a seminar
for him on performance measurement that he smiled when I cited
Muybridge's work which took place on another part of the Stanford campus.
He told that class of students to go by and see the monument to that work.
Stanford was really lucky to have him.
People are only confusing a description with a prescription and
natural languages are mostly just conventions. The users define a
language not the language lawyers.
--
You must have half or more of comp.arch rolling on the floor laughing,
because you could say substitute my name for his, but neither of you
openly posts like most of the rest of the c.a. respondents with our
real names.
>Almost all of its replies have the word "since" in them, no
>originality. Text-based or language-based AI, also known as symbolic
>AI, is the worst possible way to research AI but zzbunker's
>creators never seem to get the point. Every once in a while, it can be
>very funny. ahahaha... AHAHAHA... ahahaha...
We'll remember that the next time c.a. gets together for dinner.
--
I doubt it, I am afraid.
|> The arguable adjective in your response is "large."
|> Your key second word is scale. As Mike published his basic ideas
|> in 1966 thru the early 70s I would say that "large prior to the mid-20th
|> century" were really pretty small scale for that. The argument for
|> things like trig tables as big scale was that their algorithms are
|> comparatively simple. They blow up in fairly easily understood known ways.
Yes and no. A fairly reliable source once told me that some of those
calculations were done by hundreds of inter-relating groups/people,
which scale wasn't reached in practical computing until the late
1970s, perhaps even early 1980s. And it wasn't just tables we were
discussing, which I agree are embarrassingly parallel. But I failed
to track down an authoritative reference.
One big difference is that the manual MIMD could rely on a skilled
human to do the overall control and error recovery, whereas designing
for computers needs to produce a precise algorithm.
Regards,
Nick Maclaren.
> No person "invented" MIMD.
zzbunker has been posting for years on sci.military.naval. The general
consensus there is that it is some kind of bot.
Ken Young
> Yes and no. A fairly reliable source once told me that some of those
> calculations were done by hundreds of inter-relating groups/people,
> which scale wasn't reached in practical computing
Log trig and astronomical tables were produced by manual methods using
differences. One reason differences were used was that it made using
parallel computing easy. In some cases there could be scores of people
doing the scut work. These people were known as computers.
Babbage's first design was for a "Difference Engine" to remove human
error in this process. So massive parallel computing was alive and well
in the 19th century.
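As a toy illustration of the method (the table here is just an example): for
a quadratic such as n squared, the second difference is constant, so each new
table entry needs only two additions and no multiplication, which is exactly
the sort of scut work that could be farmed out.
_____________________
#include <cstdio>

// Tabulating f(n) = n*n by differences: the second difference of a
// quadratic is the constant 2, so each entry costs two additions.
int main() {
    long f  = 0;        // f(0)
    long d1 = 1;        // first difference, f(1) - f(0)
    const long d2 = 2;  // second difference, constant for a quadratic

    for (int n = 0; n <= 10; ++n) {
        std::printf("%2d squared = %ld\n", n, f);
        f  += d1;   // next value of f
        d1 += d2;   // next first difference
    }
    return 0;
}
_____________________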
I have not got a handy book, but when I mentioned this elsewhere,
googling on the Difference Engine and differences did apparently produce
relevant hits, judging by the response I got.
Ken Young
In article <ff60g8$p2n$1...@gemini.csx.cam.ac.uk>,
Nick Maclaren <nm...@cus.cam.ac.uk> wrote:
>I doubt it, I am afraid.
Pity. I'm only there 2 separate days. I have to come back down to the
Bay Area and convince the founding chair to come up for 1 day.
>|> The arguable adjective in your response is "large."
>|> Your key second word is scale. As Mike published his basic ideas
>
>Yes and no. A fairly reliable source once told me that some of those
>calculations were done by hundreds of inter-relating groups/people,
>which scale wasn't reached in practical computing until the late
>1970s, perhaps even early 1980s. And it wasn't just tables we were
>discussing, which I agree are embarrassingly parallel. But I failed
>to track down an authoritative reference.
Oh yes, that kind of thing started in the small scale with guys like
Kepler (merely an era place holder) and on to the 100s (of
correspondents) by the 1800s. You need a reference?
>One big difference is that the manual MIMD could rely on a skilled
>human to do the overall control and error recovery, whereas designing
>for computers needs to produce a precise algorithm.
There's (as you realize) quite a number of click-work experiments in
addition to SETI@home/proteinfolding@home, such as a colleague's
Martian crater counting work. The width and independence of the
instruction stream. You will get IBM people like Lynn who are happy
with large scale job shopping but that doesn't help the high end fine
grain data interested types (Dennis/Arvind). None of that helps the
automated decomposition problem (just had to do a Wikipedia session).
--
Humm... I can offer an abstraction over some very low-level distributed
multi-threading paradigms; however, it's obviously not that popular because I
have only managed to license it to a mere handful of interested clients...
Luckily for me, those that actually used it ended up liking it rather
quickly.
Oh well...
At a guess, I am talking about a lower level than you are. I really
am talking about enabling application-controlled parallelism on the
scale of 5-10 machine instructions. And I don't mean overhead - I
mean that you can parallelise 20 instructions and get them to run
in the time of 15.
My analysis, and that of some other people, is that you need to get
down there to make use of the accessible parallelism in a lot of
existing, serial code. Now, there is another viewpoint that doing
that is a stupid idea, and the only sane approach is to redesign,
which may well be correct.
I think that it could be done - but not starting from here. But I
do believe that, even starting from here, we could get down to a
fairly small scale. What we CAN'T do is to do that in UNPRIVILEGED
code using existing hardware architectures and operating system
facilities. And it is that which I feel needs attention.
Regards,
Nick Maclaren.
Okay.
> I really
> am talking about enabling application-controlled parallelism on the
> scale of 5-10 machine instructions. And I don't mean overhead - I
> mean that you can parallelise 20 instructions and get them to run
> in the time of 15.
In the case of traversing dynamic linked data-structures, I would like to
batch up and parallelize a plurality of loads to the list anchor(s) and/or
the "next" pointer(s), and perhaps pre-fetch a number of nodes ahead to
"attempt" a possible reduction of memory latency; see the sketch after this
paragraph. I guess you could bind the group affinity to a group of
processors, and only operate on its local resources; think NUMA here.
Something like Tilera allowing one to segregate groups of processors from
running certain code. I would like to be able to programmatically compose the
cache-coherency semantics of a multi-core chip at OS boot (e.g., 64+ cores)
into small groups of 2-4 processors, which are adjacent to each other, into
ccNUMA semantics wrt the groups' per-core memory, and connect the groups
together using a relaxed NUMA model in the sense that a group of processors
is a remote entity.
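Here is a rough sketch of the node-prefetching part using the GCC/Clang
__builtin_prefetch intrinsic (the depth, names, and visit() callback are made
up; and of course the traversal still has to chase pointers to learn the
addresses, which is the very latency being fought):
_____________________
#include <cstddef>

struct node {
    node* next;
    int   payload;
};

enum { PREFETCH_DEPTH = 4 };  // how far ahead to run the prefetch window

void traverse(node* head, void (*visit)(node*)) {
    // Prime a window of prefetches ahead of the visit pointer.
    node* ahead = head;
    for (int i = 0; ahead != nullptr && i < PREFETCH_DEPTH; ++i) {
        __builtin_prefetch(ahead, 0 /* read */, 1 /* low temporal locality */);
        ahead = ahead->next;
    }
    // Visit each node while keeping the window PREFETCH_DEPTH nodes ahead,
    // so the cache fill for a node has (hopefully) started before we need it.
    for (node* n = head; n != nullptr; n = n->next) {
        if (ahead != nullptr) {
            __builtin_prefetch(ahead, 0, 1);
            ahead = ahead->next;
        }
        visit(n);
    }
}
_____________________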
I guess you could focus FPGA/chip-programming to create groups of cores and
dramatically weaken or strengthen the cache-coherency on a per-group basis.
> My analysis, and that of some other people, is that you need to get
> down there to make use of the accessible parallelism in a lot of
> existing, serial code.
> Now, there is another viewpoint that doing
> that is a stupid idea, and the only sane approach is to redesign,
> which may well be correct.
>
> I think that it could be done - but not starting from here. But I
> do believe that, even starting from here, we could get down to a
> fairly small scale. What we CAN'T do is to do that in UNPRIVILEGED
> code using existing hardware architectures and operating system
> facilities. And it is that which I feel needs attention.
It seems extremely complicated to get a practical design.
When are the chip vendors going to be making their own memory? Anybody
trying to integrate a fairly large amount of physical memory and multi-core
chip arrays into a single entity? Programming model is distributed NUMA,
simple?
:^0
Would the parallel/pipelined computer (all female, it appears) system at Los
Alamos that Feynman describes in his book apply?
Jan
Yup :-) Did he describe the actual calculations? If so, that would
be an excellent reference.
Regards,
Nick Maclaren.
It seemed pretty clear to me from the description, including the way they
handled computational errors on the fly, that this was iterating difference
equations to solve differential equations numerically - that is, the usual
stuff you would expect.
Jan