
Mainstream parallel processing: who will write the software and how?


Robert Myers

Mar 1, 2004, 7:32:55 PM
Hello all,

Arstechnica has a recent "inside look" at the Pentium M:

http://www.arstechnica.com/cpu/004/pentium-m/pentium-m-1.html

The article doesn't present much that is completely new, but it closes
with a bit of speculation about the future:

'Gelsinger, giving the final speech of an Intel technology forum,
showed the audience a slide of the impossibly high power needs of
computer processors as a way of arguing that chip designers must
radically change chip architectures, and that Intel would be the
company to do just that. "We need a fresh approach," Gelsinger said.
"We need an architectural paradigm shift." That "fresh approach" might
look something like a future version of the Pentium M, in a system
that gangs together multiple low-power processors and relies heavily
on multi-threaded software. Not coincidentally, this looks something
like what I've elsewhere described as IBM's plan for the PowerPC 970
and its offspring.'

He goes on to say that, as others have speculated, the Pentium M core
might form the basis of a Plan B if Intel finally has to admit that
maybe Netburst wasn't such a great idea, after all.

If garden variety desktop (or server) processors turn into a
collection of low power processors, I can imagine two extreme
scenarios. One extreme scenario is that people will bumble along with
no major changes in languages, tools, or compilers, doing the best
they can with languages and tools now mostly used in HPC.

The other extreme scenario, which I actually think more likely, is
that if Intel, in particular, goes that way, it won't go there without
providing tools to make coding for such a processor more routine.
That could range from providing high-performance primitives such as
they do now for SIMD to a much more aggressive auto-parallelizing
compiler.

The latter "extreme" scenario might not be so extreme. People would
incorporate performance primitives into their code and change some
compiler flags. On the other hand, my guess is that a major, er,
paradigm shift like that would occasion more significant competition
among tool-builders for parallel computing than HPC has been able to
create.
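
To make the "primitives and compiler flags" end of that range concrete, here's a
minimal sketch in C (OpenMP-style; the exact pragma and flag spellings depend on
the compiler, and this is only an illustration, not a guess at what Intel would
actually ship):

/* Sketch only: the "pragmas plus compiler flags" route.  Built with an
   OpenMP-aware compiler and its OpenMP switch, the second version runs
   the loop across however many cores are available; built without the
   switch, it is still a correct serial loop. */
#include <stddef.h>

void saxpy_plain(float *y, const float *x, float a, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)          /* how most code is written today */
        y[i] += a * x[i];
}

void saxpy_parallel(float *y, const float *x, float a, size_t n)
{
    long i, len = (long)n;
    #pragma omp parallel for         /* one directive, one compiler flag */
    for (i = 0; i < len; i++)
        y[i] += a * x[i];
}

/* The other end of the range is a tuned vendor primitive: the code
   would simply call something like a BLAS saxpy() and never see the
   threads (or the SIMD units) at all. */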

Or perhaps it will just hasten the movement of labor-intensive coding
offshore. Any guesses?

RM

del cecchi

Mar 1, 2004, 8:28:17 PM

"Robert Myers" <rmy...@rustuck.com> wrote in message
news:8dj740lqeg76i72jc...@4ax.com...

> Hello all,
>
> Arstechnica has a recent "inside look" at the Pentium M:
>
> http://www.arstechnica.com/cpu/004/pentium-m/pentium-m-1.html
>
> The article doesn't present much that is completely new, but it closes
> with a bit of speculation about the future:
>
> 'Gelsinger, giving the final speech of an Intel technology forum,
> showed the audience a slide of the impossibly high power needs of
> computer processors as a way of arguing that chip designers must
> radically change chip architectures, and that Intel would be the
> company to do just that. "We need a fresh approach," Gelsinger said.
> "We need an architectural paradigm shift." That "fresh approach" might
> look something like a future version of the Pentium M, in a system
> that gangs together multiple low-power processors and relies heavily
> on multi-threaded software.
snip

>
> The latter "extreme" scenario might not be so extreme. People would
> incorporate performance primitives into their code and change some
> compiler flags. On the other hand, my guess is that a major, er,
> paradigm shift like that would occasion more significant competition
> among tool-builders for parallel computing than HPC has been able to
> create.
>
> Or perhaps it will just hasten the movement of labor-intensive coding
> offshore. Any guesses?
>
> RM

I think that in the long run programming will be done in what might
be described as "really, really High Level Languages". Or maybe the
computers will write their own programs. Certainly the current
languages like C or Scheme or any of those get the programmer too bogged
down in minutiae. Some baby steps have been taken in this direction in
some areas with languages like SAS, Matlab, Spice, etc. Even SQL is a
step in the right direction. Google is a bigger step.

del cecchi


Terje Mathisen

Mar 2, 2004, 3:44:37 AM
Robert Myers wrote:
> If garden variety desktop (or server) processors turn into a
> collection of low power processors, I can imagine two extreme
> scenarios. One extreme scenario is that people will bumble along with
> no major changes in languages, tools, or compilers, doing the best
> they can with languages and tools now mostly used in HPC.
>
> The other extreme scenario, which I actually think more likely, is
> that if Intel, in particular, goes that way, it won't go there without
> providing tools to make coding for such a processor more routine.
> That could range from providing high-performance primitives such as
> they do now for SIMD to a much more aggressive auto-parallelizing
> compiler.

Wow!

At even odds, I'd be willing to bet against that prediction.

In fact, I'd even entertain worse odds than that. :-(

Robert, let me put it this way:

While what you're hoping for would be very worthwhile, I don't think it
has much chance at all of turning into reality.

Terje

--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

Robert Myers

Mar 2, 2004, 4:34:19 PM
On Tue, 02 Mar 2004 09:44:37 +0100, Terje Mathisen
<terje.m...@hda.hydro.com> wrote:

>Robert Myers wrote:

<snip>



>> The other extreme scenario, which I actually think more likely, is
>> that if Intel, in particular, goes that way, it won't go there without
>> providing tools to make coding for such a processor more routine.
>> That could range from providing high-performance primitives such as
>> they do now for SIMD to a much more aggressive auto-parallelizing
>> compiler.
>
>Wow!
>
>At even odds, I'd be willing to bet against that prediction.
>
>In fact, I'd even entertain worse odds than that. :-(
>
>Robert, let me put it this way:
>
>While what you're hoping for would be very worthwhile, I don't think it
>has much chance at all of turning into reality.
>

I will admit to an occasional propensity for overstating the case, for
better or worse, but I can't imagine mainstream processors that depend
on a large number of executable threads being available to achieve
anywhere near their potential without some kind of change taking
place.

Most naively-written code doesn't do well on the P4 as compared to AMD
processors, which is why people who do a lot of on-the-fly coding often
prefer AMD processors, but both Intel and ISVs seem to get what they
need out of P4 when it matters most.

The learning curve for Itanium is even steeper, and I _know_ I'm
voting against the majority when I say I think it will be climbed
eventually.

A processor that needs lots of threads to do well just seems like one
more turn of the screw. P4 is a challenge, Itanium is even more of a
challenge, and a CPU with lots of potential threads will be more of
the same.

Whether P4 or Itanium were the right moves or not is another
discussion. The fact is that the existence of a challenging mass
market processor has pushed the state of the art when it comes to
software.

Right now, consumer software that can take advantage of even two
concurrent threads is a rarity. Matlab, which is not consumer
software, would most definitely benefit from being multi-threaded, but
it isn't. I'm sure there are plenty of other examples.

Right now, an ISV can just ride the curve of ever-more-powerful single
cores. If Gelsinger's "prediction" (it read more like a warning to
me) is correct, the days of single-threaded software are probably
numbered. If no one else does, Intel and Microsoft will go at it
hammer and tongs. If the speculation about the Power PC 970 is
anywhere near correct, IBM can't be planning on standing by and
watching idly, either.

RM

Andrew Reilly

Mar 2, 2004, 4:43:11 PM

Hmm, Robert's post hasn't made it to my computer, yet. I too think it
unlikely that Intel (or any other chip vendor) will create yet another new
language to support desktop parallel processors. That's because the tools
are already here. Just look where multi-processors are already having a
significant impact (not HPC, because HPC doesn't look much like
interactive desktop activity): web and database serving. Those don't look
much like desktop workloads either, but they are similar (to the extent
that they use much grunt) in that the important characteristics are
embarrassing parallelism of small pieces of work (with a few tricky
bottlenecks, perhaps), and a requirement for individual task completion
times in the tens of milliseconds, rather than hours or days. The way
people generally deal with that today is to code in, say, Java, with a
model based on a sea of threads (a thread per task). Sure, there's some
inefficiency there, due perhaps to unnecessary task switching overhead,
but if that's the only bottleneck, you can bet that that's where the
optimization work will be done.
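
As a rough illustration of the thread-per-task shape (sketched in C with POSIX
threads rather than Java, error handling mostly omitted, and the echo "service"
just standing in for real work):

/* Minimal "thread per task" server sketch, C + POSIX threads, in the
   spirit of the sea-of-threads model above: each accepted connection
   gets its own (detached) thread; any state shared between tasks would
   need its own locking. */
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

static void *handle_client(void *arg)
{
    int fd = *(int *)arg;
    free(arg);
    char buf[512];
    ssize_t n = read(fd, buf, sizeof buf);   /* one small piece of work... */
    if (n > 0)
        write(fd, buf, n);                   /* ...here just an echo */
    close(fd);
    return NULL;
}

int main(void)
{
    int srv = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { 0 };
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);
    bind(srv, (struct sockaddr *)&addr, sizeof addr);
    listen(srv, 64);

    for (;;) {
        int *fd = malloc(sizeof *fd);
        *fd = accept(srv, NULL, NULL);
        if (*fd < 0) { free(fd); continue; }
        pthread_t t;
        pthread_attr_t a;
        pthread_attr_init(&a);
        pthread_attr_setdetachstate(&a, PTHREAD_CREATE_DETACHED);
        pthread_create(&t, &a, handle_client, fd);   /* one thread per task */
        pthread_attr_destroy(&a);
    }
}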

Robert: what problems did you have in mind where the "sea of Java threads"
programming model, perhaps augmented with the "canned large-vector
arithmetic routines model", a-la Intel's IPP libraries, or VSIPL, or NAG,
or even netlib, isn't up to the task? I know that you don't like the
apparent lack of magic involved, but I think that the space of abstract
computational problem solving just involves work sometimes.

I have my own problem domain that doesn't map to that model very well at
all. Well, it could, but on today's hardware, with today's target markets
and today's problem sizes, the overhead is too high. Luckily for me, it
does map quite simply to fairly arbitrarily arranged collections of
processors. As such, C and assembler will do me just fine. I expect
their successor to be VHDL or verilog, rather than another
von-Neumann-based language/implementation model.

--
Andrew

Andrew Reilly

Mar 2, 2004, 6:04:49 PM
On Tue, 02 Mar 2004 17:34:19 -0500, Robert Myers wrote:
> The learning curve for Itanium is even steeper, and I _know_ I'm
> voting against the majority when I say I think it will be climbed
> eventually.

The Itanium seems particularly ill-suited to the "zillions of threads"
model of software. That could be interesting for it, other than where it
is used in boxes with zillions of processors, so that there aren't many
contending threads on each processor. Which seems to be mostly where it
is, now.

> Right now, consumer software that can take advantage of even two
> concurrent threads is a rarity.

Is it? Do you use a single-threaded web browser? A single-pipe graphics
card? Ever let an MP3 player run in the background while you're reading
usenet? Run Photoshop? Plenty of embarrassing parallelism there. If
it's not parallel now, is that because it's hard, or because there hasn't
been any point, because desktops only have one processor at the moment?
Your consumer desktop is certainly connected to a bunch of multi-threaded,
probably multi-processor boxes at the other end of the network, busy
serving web pages, looking up databases and what-not.

> Matlab, which is not consumer software,
> would most definitely benefit from being multi-threaded, but it isn't.
> I'm sure there are plenty of other examples.

Matlab (the company) seem to be fairly sensitive to the size of their
target markets. They dropped the Macintosh version a while back due to
demand not matching the effort required. I would be surprised if they
didn't support multi-processor systems if their market research showed
them that they could get enough extra sales to make the effort worthwhile.
In fact, I'd bet that they're planning it.

> Right now, an ISV can just ride the curve of ever-more-powerful single
> cores.

Why is that surprising? That's what's been happening, what the consumers
have been buying.

> If Gelsinger's "prediction" (it read more like a warning to me)
> is correct, the days of single-threaded software are probably numbered.

That's probably over-stating it a bit. Individual processors aren't
likely to regress to be less powerful than (say) an XScale or Pentium-M.
There are still a lot of things that fit onto the "consumer desktop" that
will run just fine in a single thread.

> If no one else does, Intel and Microsoft will go at it hammer and
> tongs.

Go at what?

Microsoft's already going at the multi-thread thing: C# is Java, and it
knows about threads.

> If the speculation about the Power PC 970 is anywhere near correct, IBM
> can't be planning on standing by and watching idly, either.

IBM is heavily into Java, which does threads.

--
Andrew

Robert Myers

Mar 2, 2004, 6:51:40 PM
On Tue, 02 Mar 2004 21:43:11 GMT, Andrew Reilly
<and...@gurney.reilly.home> wrote:

<snip>

>
>Robert: what problems did you have in mind where the "sea of Java threads"
>programming model, perhaps augmented with the "canned large-vector
>arithmetic routines model", a-la Intel's IPP libraries, or VSIPL, or NAG,
>or even netlib, isn't up to the task?

In the "sea of Java threads," I think it's the embarrassingly parallel
part that's important, not whether the code is written in Java or not.
Even embarrassingly parallel tasks have to have some shared resources.
Maybe all of them could fit into the client-server model of my Folding
at Home client.

A multi-threaded grep would be nice to have, and it would probably be
fairly easy to write one on the client-server model such that access
to any shared resources would be only through the server. That puts
all the burden on writing the server. I could imagine a skeleton
server that could be deformed to fit all kinds of problems. Who
knows, maybe such a thing exists.
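
Something like the following is what I mean by a skeleton server, boiled down to
a C/pthreads sketch: all of the shared state sits behind one lock, and the
workers only ever touch it through get_work() and put_result(), which are just
names made up for the illustration.

/* Skeleton "server" sketch: all shared state lives behind one mutex;
   worker threads only touch it through get_work()/put_result().
   do_work() is a stand-in for the real job (e.g. grepping one file). */
#include <pthread.h>
#include <stdio.h>

#define NWORK   100
#define NWORKER 8

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int next_work = 0;          /* shared: index of next work unit */
static long total = 0;             /* shared: accumulated results     */

static int get_work(void)          /* the only way to claim a work unit */
{
    pthread_mutex_lock(&lock);
    int w = (next_work < NWORK) ? next_work++ : -1;
    pthread_mutex_unlock(&lock);
    return w;
}

static void put_result(long r)     /* the only way to report a result */
{
    pthread_mutex_lock(&lock);
    total += r;
    pthread_mutex_unlock(&lock);
}

static long do_work(int w) { return w; }   /* stand-in for the real job */

static void *worker(void *arg)
{
    (void)arg;
    for (int w; (w = get_work()) >= 0; )
        put_result(do_work(w));
    return NULL;
}

int main(void)
{
    pthread_t t[NWORKER];
    for (int i = 0; i < NWORKER; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NWORKER; i++)
        pthread_join(t[i], NULL);
    printf("total = %ld\n", total);
    return 0;
}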

Let's assume that's the right way to do business and that such an
approach will handle a large class of problems. At a very minimum, it
changes the way people write code. How many people naturally
formulate problems on a client-server paradigm?

That's why I asked the question. It's not that I believe that a
miracle has to occur. I have neither the experience nor the judgment
to say one way or another.

If Gelsinger says we're soon going to have chips made up of lots of
little cores and almost all code now being written is single-threaded,
something has to change between here and there, and I don't imagine
that I know enough to say just how much has to change or how.

As to numerically-intensive problems, the fact that Matlab isn't now
parallel and that MathWorks had, last time I checked, no intention of
making it parallel should say something. You can use parallelized
backends, but there are plenty of people who just wait for their
single-threaded Matlab problem to grind through a model that might
represent man-years of work.

>I know that you don't like the
>apparent lack of magic involved, but I think that the space of abstract
>computational problem solving just involves work sometimes.
>

That's an interesting glimpse into how I appear to other people. ;-).

My apparent preference for abstraction has two sources:

1. My observation from experience that the most general (and usually
the most abstract) approach almost always turns out to be the most
powerful.

2. The fact that I myself personally make a very poor abstract symbol
processor, especially as compared to a computer. If a problem can be
reduced to the formal manipulation of abstract symbols according to an
algorithm, then I would very much prefer to do it just once and let
the computer do the rest.

>I have my own problem domain that doesn't map to that model very well at
>all. Well, it could, but on today's hardware, with today's target markets
>and today's problem sizes, the overhead is too high. Luckily for me, it
>does map quite simply to fairly arbitrarily arranged collections of
>processors. As such, C and assembler will do me just fine. I expect
>their successor to be VHDL or verilog, rather than another
>von-Neumann-based language/implementation model.

I don't know whether VHDL and verilog are particularly attractive to
you because of the nature of the problem you are working on, but
hardware specification and design languages have always struck me as
particularly promising models because

1. They include a notion of real time, and if you're going to code to
hide latency, you have to have a notion of real time.

2. There are people who actually use them to do non-trivially
concurrent problems where the price of failure is higher than a patch
downloaded from the Microsoft website.

RM

Rupert Pigott

Mar 2, 2004, 7:36:28 PM
Robert Myers wrote:

[SNIP]

> A multi-threaded grep would be nice to have, and it would probably be

I doubt you need that, application of your average UNIX-head's
scripting skills would do the trick. I figure the prob would be
you'd end up I/O bound anyway. :/

Cheers,
Rupert

Andrew Reilly

Mar 2, 2004, 8:17:58 PM
On Tue, 02 Mar 2004 19:51:40 -0500, Robert Myers wrote:
> In the "sea of Java threads," I think it's the embarrassingly parallel
> part that's important, not whether the code is written in Java or not.

True, but Java and its ilk have some advantages for moving in
this direction, I think: being object oriented and multi-threaded from the
beginning, the existing libraries are based on the standard container
types, and these are inherently monitor-like things that serialize
multi-threaded access. So while you will probably need skill to avoid
bottlenecks and deadlocks, you shouldn't get badly surprised when you
start running zillions of these pieces of code in parallel. Well, I don't
know that for sure: I don't do Java yet.

> Even embarrassingly parallel tasks have to have some shared resources.
> Maybe all of them could fit into the client-server model of my Folding
> at Home client.

That sort of thing, or (more likely, I think), an extension of the already
event-driven GUI models. Most of these multi-thread things are based on
doing something simple and single-threaded for each of a blizzard of
incoming events. Up to now, people have been largely content to
process their events one at a time, because there was only one
processor to do it anyway. Maybe that's what the Folding at Home client
does?

> A multi-threaded grep would be nice to have, and it would probably be
> fairly easy to write one on the client-server model such that access to
> any shared resources would be only through the server. That puts all
> the burden on writing the server. I could imagine a skeleton server
> that could be deformed to fit all kinds of problems. Who knows, maybe
> such a thing exists.

Make is fairly close, if you look at it from far enough away... Works as
an auto-parallelizer of compiles and other such dependency-list tasks.

> Let's assume that's the right way to do business and that such an
> approach will handle a large class of problems. At a very minimum, it
> changes the way people write code. How many people naturally formulate
> problems on a client-server paradigm?

I think that a big change will come when it becomes a common language
construct to say "foreach i in collection", rather than explicitly say
that you need to process each element in the collection in some particular
order. At the moment, parallelizing compilers have to (a) infer that your
counting for loop is really a foreach block, and (b) perform the
next-order optimization of blocking some arbitrary number of passes back
into a sequential loop because it knows that it is not efficient to spawn
too many threads (or indeed more threads than available processors).
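
In C, with OpenMP notation standing in for a real "foreach" (just a sketch, not
a claim about any particular compiler), the programmer's side of that bargain
looks like this; the runtime does the blocking and decides how many threads to
use:

/* A "foreach"-flavoured loop, sketched with OpenMP in C.  The pragma
   asserts the iterations may run in any order; the reduction clause
   covers the one shared result; the runtime, not the programmer,
   decides how many threads to use and how to block the iterations. */
double dot(const double *a, const double *b, long n)
{
    double sum = 0.0;
    long i;
    #pragma omp parallel for reduction(+:sum) schedule(static)
    for (i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}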

> If Gelsinger says we're soon going to have chips made up of lots of
> little cores and almost all code now being written is single-threaded,

[Who is this Gelsinger, and where did he say this? Not that I think he's
wrong, mind.]

> something has to change between here and there, and I don't imagine that
> I know enough to say just how much has to change or how.
>
> As to numerically-intensive problems, the fact that Matlab isn't now
> parallel and that MathWorks had, last time I checked, no intention of
> making it parallel should say something. You can use parallelized
> backends, but there are plenty of people who just wait for their
> single-threaded Matlab problem to grind through a model that might
> represent man-years of work.

Matlab did start out on Crays and vector supers. Maybe Cleve just likes
that style better.

>>I know that you don't like the
>>apparent lack of magic involved, but I think that the space of abstract
>>computational problem solving just involves work sometimes.
>>
>>
> That's an interesting glimpse into how I appear to other people. ;-).
>
> My apparent preference for abstraction has two sources:
>
> 1. My observation from experience that the most general (and usually the
> most abstract) approach almost always turns out to be the most powerful.

True, but perhaps only in retrospect? "The good is the enemy of the best"
means that abstraction usually incurs a penalty, compared to direct
solutions. So the abstract, general answer is only right when the problem
is sufficiently large/complex to amortize that cost?

> 2. The fact that I myself personally make a very poor abstract symbol
> processor, especially as compared to a computer. If a problem can be
> reduced to the formal manipulation of abstract symbols according to an
> algorithm, then I would very much prefer to do it just once and let the
> computer do the rest.

I agree. I like one of Bertrand Meyer's comments regarding guiding
principles for his Eiffel language: make the compiler do as much work as
possible. Never make the programmer do book-keeping or busywork that the
compiler could do instead. I suspect that the type-inference languages
take this stance even further.

The hitch comes when there isn't already a suitable automaton to iterate
the manipulation. Is it more work to build the automaton or to do the
manual work to solve the problem at hand?

>>I have my own problem domain that doesn't map to that model very well at
>>all. Well, it could, but on today's hardware, with today's target
>>markets and today's problem sizes, the overhead is too high. Luckily for
>>me, it does map quite simply to fairly arbitrarily arranged collections
>>of processors. As such, C and assembler will do me just fine. I expect
>>their successor to be VHDL or verilog, rather than another
>>von-Neumann-based language/implementation model.
>
> I don't know whether VHDL and verilog are particularly attractive to you
> because of the nature of the problem you are working on

Partly, but mostly because the market pressure to reduce unit costs and
power consumption seems to be stronger than that to increase
functionality, at the moment. Power consumption is strongly correlated
with clock rate, so the answer is sometimes custom ASICs (or FPGAs) with
many parallel functional units or custom processing circuits, running at a
low(ish) clock rate. The market in question being (portable) consumer
electronics, of course.

>, but hardware
> specification and design languages have always struck me as particularly
> promising models because
>
> 1. They include a notion of real time, and if you're going to code to
> hide latency, you have to have a notion of real time.

I like real time. In some sense, it reinforces the understanding of
parallelism, I think. It's easy to imagine task synchronization points as
real-time deadlines. But lo: I burble.

> 2. There are people who actually use them to do non-trivially concurrent
> problems where the price of failure is higher than a patch downloaded
> from the Microsoft website.

There was an article recently about evidence that consumers of consumer
electronics equipment were starting to react badly to the notion of
installing patches or flash upgrades, particularly if these were to
implement promised features that weren't actually part of the original
shipping product. I can't say that that surprises me. Trying to find a
stable/reliable flash image for my wireless access point at home is
driving me nuts. Devices should just work. If you don't want people to
expect it to work as designed, don't make it look like a device...

Cheers,

--
Andrew

Stephen Sprunk

Mar 2, 2004, 9:20:15 PM
"Andrew Reilly" <and...@gurney.reilly.home> wrote in message
news:pan.2004.03.03....@gurney.reilly.home...

> I think that a big change will come when it becomes a common language
> construct to say "foreach i in collection", rather than explicitly say
> that you need to process each element in the collection in some particular
> order. At the moment, parallelizing compilers have to (a) infer that your
> counting for loop is really a foreach block,

I've taken a (brief) look at the OpenMP support in the Portland Group's
compilers, and it automatically parallelizes loops that do not have
dependencies between iterations; this seems to cover what you're discussing.
It also has directives to mark critical sections, barriers, etc.

> and (b) perform the next-order optimization of blocking some arbitrary
> number of passes back into a sequential loop because it knows that it is
> not efficient to spawn too many threads (or indeed more threads than
> available processors).

It does that too...
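
For reference, the directives look roughly like this in generic OpenMP (a
sketch, not PGI-specific syntax):

/* Generic OpenMP sketch of the directives mentioned above (worksharing
   loop, critical section, barrier). */
#include <stdio.h>

void histogram(const int *data, long n, long *bins, int nbins)
{
    #pragma omp parallel
    {
        long i;
        #pragma omp for
        for (i = 0; i < n; i++) {
            int b = ((data[i] % nbins) + nbins) % nbins;
            #pragma omp critical    /* serialize updates to the shared bins */
            bins[b]++;
        }
        #pragma omp barrier         /* every thread gets here before we report */
        #pragma omp single
        printf("histogram done\n");
    }
}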

S

--
Stephen Sprunk "Stupid people surround themselves with smart
CCIE #3723 people. Smart people surround themselves with
K5SSS smart people who disagree with them." --Aaron Sorkin


del cecchi

Mar 2, 2004, 11:25:18 PM

"Robert Myers" <rmy...@rustuck.com> wrote in message
news:b74a409ogfa7ddbdd...@4ax.com...
>
snip

> If Gelsinger says we're soon going to have chips made up of lots of
> little cores and almost all code now being written is single-threaded,
> something has to change between here and there, and I don't imagine
> that I know enough to say just how much has to change or how.
>
snip

Define lots and little. My personal opinion is that the cores will be
more powerful than today's processors, if only because it is "easy" to
swizzle to the new technology and today's processors are already done.
And contention and arbitration and coherence make it hard to have too
many processors.

del

Robert Myers

Mar 2, 2004, 11:41:25 PM

xosview says I get virtually 100% cpu utilization out of grep on my
home directory. That's not even on the system with striped RAID.

Another way to measure it is that it takes *much* longer to grep the
directory than it does to copy it.

RM

Robert Myers

Mar 3, 2004, 12:25:52 AM
On Wed, 03 Mar 2004 01:17:58 GMT, Andrew Reilly
<and...@gurney.reilly.home> wrote:

>On Tue, 02 Mar 2004 19:51:40 -0500, Robert Myers wrote:
>> In the "sea of Java threads," I think it's the embarrassingly parallel
>> part that's important, not whether the code is written in Java or not.
>
>True, but Java and its ilk have some advantages for moving in
>this direction, I think: being object oriented and multi-threaded from the
>beginning, the existing libraries are based on the standard container
>types, and these are inherently monitor-like things that serialize
>multi-threaded access. So while you will probably need skill to avoid
>bottlenecks and deadlocks, you shouldn't get badly surprised when you
>start running zillions of these pieces of code in parallel. Well, I don't
>know that for sure: I don't do Java yet.
>

Me, neither. I've heard mixed things about Java threads, the war
between IBM, Sun, and Microsoft seems likely to maim Java if not to
kill it, and it is, in any case, a long way from true platform
independence at the moment.

>> Even embarrassingly parallel tasks have to have some shared resources.
>> Maybe all of them could fit into the client-server model of my Folding
>> at Home client.
>
>That sort of thing, or (more likely, I think), an extension of the already
>event-driven GUI models. Most of these multi-thread things are based on
>doing something simple and single-threaded for each of a blizzard of
>incoming events. Up to now, people have been largely content to
>process their events one at a time, because there was only one
>processor to do it anyway. Maybe that's what the Folding at Home client
>does?
>

I only chose Folding at Home (and the corresponding client-server
model) because the access to shared resources is fully exposed: my
client requests a work unit from a far away server, gets it, and
returns the work unit when it is done. There are no critical sections
in my client (at least none that have to do with the essentials of
solving the problem; there is a GUI that surely has event-driven
concurrency). The critical sections are all in the server. I assume
there are many other ways of accomplishing the same thing (most of the
threads are completely independent; only a few have to worry about
concurrency).

<snip>


>
>> If Gelsinger says we're soon going to have chips made up of lots of
>> little cores and almost all code now being written is single-threaded,
>
>[Who is this Gelsinger, and where did he say this? Not that I think he's
>wrong, mind.]
>

That's the OP that you apparently never got:

OP>http://www.arstechnica.com/cpu/004/pentium-m/pentium-m-1.html
OP>
OP>The article doesn't present much that is completely new, but it closes
OP>with a bit of speculation about the future:
OP>
OP>'Gelsinger, giving the final speech of an Intel technology forum,
OP>showed the audience a slide of the impossibly high power needs of
OP>computer processors as a way of arguing that chip designers must
OP>radically change chip architectures, and that Intel would be the
OP>company to do just that. "We need a fresh approach," Gelsinger said.
OP>"We need an architectural paradigm shift." That "fresh approach" might
OP>look something like a future version of the Pentium M, in a system
OP>that gangs together multiple low-power processors and relies heavily
OP>on multi-threaded software. Not coincidentally, this looks something
OP>like what I've elsewhere described as IBM's plan for the PowerPC 970
OP>and its offspring.'

Gelsinger is listed as Intel's CTO.

<snip>


>>
>> As to numerically-intensive problems, the fact that Matlab isn't now
>> parallel and that MathWorks had, last time I checked, no intention of
>> making it parallel should say something. You can use parallelized
>> backends, but there are plenty of people who just wait for their
>> single-threaded Matlab problem to grind through a model that might
>> represent man-years of work.
>
>Matlab did start out on Crays and vector supers. Maybe Cleve just likes
>that style better.
>

Matlab made a start at making the code parallel and then decided that
it wasn't worth the investment. Semi-independent efforts to provide
parallel backends continue, one of them practically in the lap of the
MathWorks main office near MIT. The episode tells me to be wary
if anybody gives a big wave of the arm and says, "Don't worry.
Nothing to it."

<snip>


>>
>> 1. My observation from experience that the most general (and usually the
>> most abstract) approach almost always turns out to be the most powerful.
>
>True, but perhaps only in retrospect? "The good is the enemy of the best"
>means that abstraction usually incurs a penalty, compared to direct
>solutions. So the abstract, general answer is only right when the problem
>is sufficiently large/complex to amortize that cost?
>

I'd be willing to bet against such a generalization in the long run.
I don't understand concurrent programming well enough to make emphatic
statements about how it is taught, but I am willing to risk the wrath
of the usual suspects by insisting that things will go a lot more
smoothly when people have no choice but to stop treating concurrent
processes as an awkward generalization of programs for von Neumann
architectures.

<snip>


>
>The hitch comes when there isn't already a suitable automaton to iterate
>the manipulation. Is it more work to build the automaton or to do the
>manual work to solve the problem at hand?
>

I'm back to my original bet that, if garden variety processors almost
demand significant concurrency, the tools to support significant
concurrency will almost inevitably get better.

<snip>


>
>>, but hardware
>> specification and design languages have always struck me as particularly
>> promising models because
>>
>> 1. They include a notion of real time, and if you're going to code to
>> hide latency, you have to have a notion of real time.
>
>I like real time. In some sense, it reinforces the understanding of
>parallelism, I think. It's easy to imagine task synchronization points as
>real-time deadlines. But lo: I burble.
>

Is that kind of burbling to be avoided? Seems like a good point to
me.

<snip>

>Trying to find a
>stable/reliable flash image for my wireless access point at home is
>driving me nuts. Devices should just work. If you don't want people to
>expect it to work as designed, don't make it look like a device...
>

I bet I could guess the manufacturer. That's only for home networking
gear. Cisco has had some really big problems, I gather. That's not
home networking gear. The minute something starts to look like
software, all hell breaks loose, it seems.

RM

Robert Myers

Mar 3, 2004, 2:19:32 AM
On Tue, 2 Mar 2004 22:25:18 -0600, "del cecchi" <dce...@msn.com>
wrote:

>
>"Robert Myers" <rmy...@rustuck.com> wrote in message
>news:b74a409ogfa7ddbdd...@4ax.com...
>>
>snip
>> If Gelsinger says we're soon going to have chips made up of lots of
>> little cores and almost all code now being written is single-threaded,
>> something has to change between here and there, and I don't imagine
>> that I know enough to say just how much has to change or how.
>>
>snip
>
>Define lots

Eight (e.g. Niagara)? Sixteen (e.g. Tukwila/Tanglewood)?
Tukwila/Tanglewood may have (probably will have?) a pared down issue
width. Processors with four or more cores, as yet unnamed, will come
when Intel announces it is throwing in the towel on NetBurst (the
two-core version of the Pentium M has already been announced).

>and little.

Not so little by today's standards. There's a Patterson presentation from
some (five?) years ago suggesting that we had already passed the point of
common sense in terms of the number of transistors per core. I could dig it
up if you like. Maybe lower voltage, much lower power, slower clocks.
Wherever the sweet spot is in terms of joules per useful unit of work,
which is increasingly going to be the figure of merit that counts.

>My personal opinion is that the cores will be
>more powerful that today's processors, if only because it is "easy" to
>swizzle to the new technology and today's processors are already done.

But the importance of keeping power consumption low argues against
continuing current trends. It may even argue for reversing them as
in, dare I mention it, Blue Gene.

>And contention and arbitration and coherence make it hard to have too
>many processors.
>

Coherence comes from shared cache, which turns coherence into a
contention and arbitration issue, I think. Apparently Intel thinks it
can talk sixteen cores with a straight face. I noticed you snipped
the reference to PPC 970 in responding to my first post. How many is
IBM thinking about? ;-).

RM

Terje Mathisen

Mar 3, 2004, 7:52:28 AM
Robert Myers wrote:

That's simply broken then!

I.e. any reasonably optimized grep will do the core operations on any
given fixed string which is part of the search term, using something
like Boyer-Moore.

If you can't make that run at least as fast as (say) 50 MB/s, which is at the
high end for sequential reads these days, then you need to fix your code.
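
The core loop I have in mind is just the usual skip-table search, something
like this Horspool-style simplification of Boyer-Moore (a sketch; a production
grep adds buffering, case folding and so on):

/* Sketch of the Boyer-Moore(-Horspool) inner loop: skip ahead based on
   the byte aligned with the end of the pattern, so most bytes of the
   buffer are never touched at all. */
#include <stddef.h>
#include <string.h>

const char *bm_search(const char *hay, size_t hlen,
                      const char *pat, size_t plen)
{
    size_t skip[256];
    if (plen == 0 || plen > hlen)
        return NULL;

    for (size_t c = 0; c < 256; c++)
        skip[c] = plen;                        /* default: jump a whole pattern */
    for (size_t i = 0; i + 1 < plen; i++)
        skip[(unsigned char)pat[i]] = plen - 1 - i;

    for (size_t pos = 0; pos + plen <= hlen; ) {
        if (memcmp(hay + pos, pat, plen) == 0)
            return hay + pos;                  /* match found */
        pos += skip[(unsigned char)hay[pos + plen - 1]];
    }
    return NULL;
}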

Peter "Firefly" Lund

Mar 3, 2004, 9:06:35 AM
On Wed, 3 Mar 2004, Terje Mathisen wrote:

> That's simply broken then!

Another broken program is wc.

> I.e. any reasonably optimized grep will do the core operations on any
> given fixed string which is part of the search term, using something
> like Boyer-Moore.

grep currently has some problems when run in UTF8 locales - it will
basically convert between UTF8 and wchar_t for each character...

> If you can't make that run at least as fast as (say) 50 MB/s, which is at the
> high end for sequential reads these days, then you need to fix your code.

Agreed!

-Peter

Sander Vesik

Mar 3, 2004, 10:58:23 AM

I don't think grep is presently I/O bound though.

>
> Cheers,
> Rupert

--
Sander

+++ Out of cheese error +++

Rupert Pigott

Mar 3, 2004, 8:21:58 PM
Sander Vesik wrote:
> Rupert Pigott <r...@dark-try-removing-this-boong.demon.co.uk> wrote:
>
>>Robert Myers wrote:
>>
>>[SNIP]
>>
>>
>>>A multi-threaded grep would be nice to have, and it would probably be
>>
>>I doubt you need that, application of your average UNIX-head's
>>scripting skills would do the trick. I figure the prob would be
>>you'd end up I/O bound anyway. :/
>
>
> I don't think grep is presently I/O bound though.

It is I/O bound on my 3 year old HDDs. :)

Seriously though I have seen grep flip-flop from I/O bound
to CPU bound over the years, and of course there are a huge
number of variables involved... Like the content of the
files, the layout & size of the files etc.

The fact of the matter is : If Robert is *already* CPU bound
with a single threaded grep (as his reply suggests) what
advantage would multi-threading it give ? If he runs it on
a SMP type setup he may well gain something and then make
it an I/O bound problem again.

For 99.9% of the uses of Grep that I have I figure that the
problem really *should* be I/O bound.

Cheers,
Rupert

Robert Myers

Mar 3, 2004, 9:10:47 PM
On Thu, 04 Mar 2004 01:21:58 +0000, Rupert Pigott
<r...@dark-try-removing-this-boong.demon.co.uk> wrote:

<snip>


>
>For 99.9% of the uses of Grep that I have I figure that the
>problem really *should* be I/O bound.
>

So I didn't pick the best example. I chose it as a task that's easy
to understand that can easily be made embarrassingly parallel.
Suppose I had to take the current power of my (relatively) beefy CPU
and accomplish the same thing with eight little CPU's that have
equivalent computing power for embarrassingly parallel tasks.

The hidden objection is that many here don't believe that desktop
applications will ever *need* eight-way CPU's. That would be the
Forrest Curve true believers, a group to which I do not belong.

If grep isn't a compelling example, and it isn't, because you can only
push it so far _without_ it becoming I/O bound, there are other
possible examples, even without trying to answer the general Forrest
curve objection.

The example that got me thinking about this post was a chess playing
program that made good use of SMP and of HT (35% boost going from two
threads to four on a two-way Xeon system. Nice.). So there's a
problem that can sop up endless power, that you can imagine running on
a desktop, and that can be partitioned nicely so that it can make good
use of a large number of threads.

The chess-playing program isn't a perfect example because people who
write chess-playing programs probably feel that the realities of
concurrent programming only add to the exhilaration of rising to the
challenge. If Gelsinger's prediction is correct, though, lots of
people who won't feel particularly exhilarated will wind up trying to
figure out how to get eight threads to work together, and I was
wondering if people could visualize how that would happen.

RM


Rabbit

Mar 3, 2004, 9:50:05 PM
Robert,

> The chess-playing program isn't a perfect example because people who
> write chess-playing programs probably feel that the realities of
> concurrent programming only add to the exhilaration of rising to the
> challenge. If Gelsinger's prediction is correct, though, lots of
> people who won't feel particularly exhilarated will wind up trying to
> figure out how to get eight threads to work together, and I was
> wondering if people could visualize how that would happen.

A PC running Windows, to use the example of the "average user
experience", is a busy event-driven system. Not only waiting for
response from user inputs (mouse clicks and key presses) but also timer
interrupts, device i/o and so on. Looking at the performance monitor on
my Windows 2000 box shows about 200 interrupts per second and all I am
actively doing is typing into a Putty session.

In this sort of environment multiple processors would be a natural fit
as there would be a pool of available processors to service event processing.

Stuff also gets implicitly multi-threaded without the application always
being aware. Consider the playing of a video file in DirectX. The
video playing application may construct a filter graph which could have
stream splitting, decoding, processing and rendering as the logical
chain of graph objects for playing the video. Each of these graph
objects could spawn its own thread if that was an appropriate strategy
for keeping the graph busy and not stalled in any particular section
(think of each object consuming upstream events and generating events
for the downstream object). And then you may have an audio stream with
its own filter graph and a bunch of objects processing.

Again, multiple processors would be useful as they could keep the graph
objects busy processing.
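
Stripped of all the DirectX machinery, the shape of such a pipeline is just two
threads handing buffers downstream through a little queue; here is a C/pthreads
sketch with made-up stage work (the "decode" and "render" steps are placeholders):

/* Two-stage pipeline sketch (decode -> render), C + POSIX threads.
   Each "graph object" runs in its own thread and passes buffers
   downstream through a one-slot queue guarded by a mutex/condvar. */
#include <pthread.h>
#include <stdio.h>

#define FRAMES 10

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
static int slot_full = 0, done = 0;
static int slot;                         /* one-frame "queue" between stages */

static void *decoder(void *arg)
{
    (void)arg;
    for (int f = 0; f < FRAMES; f++) {
        int frame = f * f;               /* pretend decode work */
        pthread_mutex_lock(&m);
        while (slot_full)
            pthread_cond_wait(&cv, &m);  /* wait for renderer to drain */
        slot = frame;
        slot_full = 1;
        pthread_cond_broadcast(&cv);
        pthread_mutex_unlock(&m);
    }
    pthread_mutex_lock(&m);
    done = 1;
    pthread_cond_broadcast(&cv);
    pthread_mutex_unlock(&m);
    return NULL;
}

static void *renderer(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&m);
        while (!slot_full && !done)
            pthread_cond_wait(&cv, &m);  /* wait for the upstream stage */
        if (!slot_full && done) {
            pthread_mutex_unlock(&m);
            return NULL;
        }
        int frame = slot;
        slot_full = 0;
        pthread_cond_broadcast(&cv);
        pthread_mutex_unlock(&m);
        printf("rendered %d\n", frame);  /* pretend render work */
    }
}

int main(void)
{
    pthread_t d, r;
    pthread_create(&d, NULL, decoder, NULL);
    pthread_create(&r, NULL, renderer, NULL);
    pthread_join(d, NULL);
    pthread_join(r, NULL);
    return 0;
}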

As for getting your application to harness a pile of threads to do work
that's not event processing, well, that's another issue others are
covering in other, umm, threads.

-r.

Robert Myers

Mar 3, 2004, 10:17:28 PM
On Thu, 4 Mar 2004 02:50:05 +0000 (UTC), Rabbit <nos...@nospammm.com>
wrote:

<snip>


>
>As for getting your application to harness a pile of threads to do work
>that's not event processing, well, that's another issue others are
>covering in other, umm, threads.
>

Sure. It's not really a comp.arch question. People could just build
processors and hope that somebody will figure out how to make use of
them. Just like garden-variety architects plan houses that they hope
people will be able to enjoy living in, I would imagine that computer
architects should regard it as a part of their job to design computers
that people can really use.

That there are people who make a specialty of handling concurrency is
not news to me; some of them do hang around comp.arch. The fact is
that most applications that make it to desktop computers can't use
even two concurrent threads effectively, never mind eight. I was
hoping for broad perspective--the kind an architect would get from
examining a potential building site in the context of its
surroundings.

RM

David C. DiNucci

Mar 4, 2004, 12:13:23 AM
Robert Myers wrote:
> My apparent preference for abstraction has two sources:
>
> 1. My observation from experience that the most general (and usually
> the most abstract) approach almost always turns out to be the most
> powerful.

Not to mention portable. After all, abstraction implies that you are
leaving certain details unconsidered, and those are often architectural.

> I don't know whether VHDL and verilog are particularly attractive to
> you because of the nature of the problem you are working on, but
> hardware specification and design languages have always struck me as
> particularly promising models because
>
> 1. They include a notion of real time, and if you're going to code to
> hide latency, you have to have a notion of real time.

If you're not willing to at least abstract time, how can you abstract
much of anything?

Even if you are suggesting that the programmer is going to try hide
latency explicitly, within the code, rather than rely on runtime system
and/or programming model to help, real time notions still won't help
without knowing the real time constraints of everything in the system,
including communication, operation timings, external events, etc. And
even if *that* could all be known, the complexity required to make use
of it would be overwhelming for all but the very simplest problems.

So *you* do not have to have a notion of real time to hide latency.
Ultimately, the thing which schedules operations (at whatever
granularity you wish to define "operation") can benefit from being able
to predict the future based on the present, so then a notion of real
time is important, as it applies to some model that it has of the
hardware and/or software. Note that, unlike runtime systems, code-time
(or even compile- or link-time) latency hiding approaches only have one
"present" from which to make future predictions--and that's one before
the program runs. It's like the challenge of giving a car a fixed
algorithm to get to a destination rather than just sitting in it and
driving and adapting to curves, traffic, weather, etc.: The only way to
get the same benefit as a dynamic scheduler from raw program code is to
effectively build the scheduler into the program.

-- Dave
-----------------------------------------------------------------
David C. DiNucci Elepar Tools for portable grid,
da...@elepar.com http://www.elepar.com parallel, distributed, &
503-439-9431 Beaverton, OR 97006 peer-to-peer computing

john jakson

Mar 4, 2004, 12:34:01 AM
Robert Myers <rmy...@rustuck.com> wrote in message news:<tk0b40luq45dk05ir...@4ax.com>...

If Intel does start going N-way with much slower cpus, wouldn't that
open the door again to new cpu architectures, away from
x86? x86 isn't going to bloom into a Transputer in a million years, for
a variety of reasons I'll leave for later (starting with not having
workspaces in cache v. teeny reg files and heavy context swap).

Also, I believe MS is the other partner in crime here, perpetuating
these ridiculous room heaters by bloating Windows to hell and back. I
notice that PCs actually get slower over time, in spite of faster cpus;
bandwidth goes up, but latency seems the same or worse. If anyone here
ever saw/used BeOS, you would know what MS has been doing wrong all
these years; it is pervasively threaded and light to the touch. More
responsive on old HW than Windows on new HW.

I am really surprised by this thread.

Many posters admit little knowledge of Java and C#. Clearly also some
lack of knowledge of history too.

1st, Java & C# are not thread safe; see the Kent U work on JavaCSP and
JavaPP (IIRC). Sun has even admitted as much concerning Swing, and I
doubt it has been fixed. The Java env isn't even written in Java, which
says much about it. In spite of these stabs, I like it as a language,
hate it as a religion, and its system-programming perf is poor.

For the kind of problems I could think up that could sit on 100s of
cpus each able to run 1000s of processes (I won't use the thread word
here), I couldn't imagine using Java/C# languages or most conventional
so called par SW languages that have heavy context costs associated
with them. I can most definitely imagine doing such coding in an Occam
or CSP descendant (maybe HandelC) or better still a HDL hybrid that
has been given a SW usability makeover. Possibly SystemVerilog but it
needs more work by far. True parallel programming won't be pervasive
till context-swap, message-passing, and scheduling overheads are in the
same ballpark as mul ops, i.e. 1 to 20 cycles. Um, IIRC Inmos has been
there, done that.
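
A crude way to put a number on that gap: ping-pong a byte between two processes
over a pair of pipes and divide by the round trips (Unix-only sketch; you will
typically see microseconds per trip, i.e. thousands of cycles, nowhere near
mul-op territory):

/* Crude context-switch / message-pass cost probe: two processes
   ping-pong one byte over a pair of pipes.  Each round trip is two
   switches plus two tiny messages; divide total time by iterations. */
#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/wait.h>

int main(void)
{
    enum { ITERS = 100000 };
    int a[2], b[2];                       /* a: parent->child, b: child->parent */
    char c = 'x';
    if (pipe(a) || pipe(b)) { perror("pipe"); return 1; }

    if (fork() == 0) {                    /* child: echo everything back */
        for (int i = 0; i < ITERS; i++) {
            read(a[0], &c, 1);
            write(b[1], &c, 1);
        }
        _exit(0);
    }

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int i = 0; i < ITERS; i++) {     /* parent: send, wait for echo */
        write(a[1], &c, 1);
        read(b[0], &c, 1);
    }
    gettimeofday(&t1, NULL);
    wait(NULL);

    double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
    printf("%.2f us per round trip\n", us / ITERS);
    return 0;
}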

When I design HW cpu as hundreds of pipelines, I can also imagine them
as par processes communicating with each other through wires or Occam
style channels. The HW event simulator is just a specialised par event
messaging system with many levels of priority (ie gate delay events) v
2 of Transputer. Many of the examples in the Occam books are for HW
like entities, fifos queues etc. I always thought that HW like
entities were better described with HDL languages where the system
took care of the underlying magic. Occam always seemed to me to be
fixing the not quite a HDL problem.

By the way. Occam includes something like par (;;) {} as well as for
(;;) {}. Does same thing but all {} statements are concurrent. If
multiple cpus available, then each process could be on different cpus
if so placed. It's pervasive to the language; all par processes are
near equal class to seq processes, just that over use of par
primitives could slow a program down some.

If decent enough cpu speed can be got from FPGA cpus (say a 300MHz base
line), the inherently higher cost of FPGA and generally 5x slower
performance can cover the 10,000x higher NREs of full custom cpus that
must be spread over the higher volume of users.

It seems amusing to me that an FPGA cpu built by 1 person can now
easily match a P2 5yrs dated that probably needed 1000s of engineers,
yet most of the worlds PCs are probably in that same class. The FPGA
cpu if its a Transputer type can also be scaled with the same message
passing simplicity of bygone days.

So this begs the question: how many cpus is sweetest? Well, as many as
can be instanced into the largest FPGA with the best bang per buck, which
would be about 4-16 also. The larger FPGAs get richer by the boatload
in LUT cells, but BlockRams follows closer to the edge length. The
smaller FPGAs may be closer to the sweet spot. I am thinking of the
sp3-50 to 400, a few $ each if Xilinx ever gets them to little guys. I
am not even convinced that 32b FPUs is going to be difficult either,
most people never expected much from FPGA computing anyway.

I admit to not actually ever having built a par program (except HW
models ofcourse) but then nobody ever wanted to build one that was
affordable to me and the SW industry never settled on a pervasive par
programming std. But when I do get one, I will be interested to work
on parallelizing EDA tools more than the typical difficult problems
usually mentioned, so I only need 100s of cpus.

enough ranting

johnjakson_usa_com

David C. DiNucci

Mar 4, 2004, 1:10:38 AM
Robert Myers wrote:
> The hidden objection is that many here don't believe that desktop
> applications will ever *need* eight-way CPU's. That would be the
> Forrest Curve true believers, a group to which I do not belong.

The problem with the Forrest Curve claim is that it's a self-fulfilling
prophecy, which can be seen in some of the posts in this thread. "Well,
I want to do this and that, and I can't really do it on my desktop very
well, but that's OK, because it's not really a desktop application." So
why should we be surprised that desktop owners are more and more
satisfied that their machines can handle the workload they're being
assigned? It is completely wrong to read that as an indication that
people have all the cycles that they need, just that they are (for the
time being) willing to go elsewhere to get them.

This all reminds me of a (canceled) presentation that I fairly recently
prepared for Intel Capital. This notion that cycles are currently free
is hogwash, and Intel (and every other chip producer) would do well to
recognize it and support very portable and "mainstream" parallel
programming methodologies. Why? To do away with the invisible "oh, it
needs to run on a single processor" wall that software people
(designers, architects, engineers, etc.) bump up against when sizing up
what constitutes an "application" (and especially a "desktop
application").

If cycles were really free (and I, apparently like Gelsinger, believe
that they will be much closer to free before long), then you would be
able to decouple your program design from the CPU power available to run
it. Instead of making it do only what will fit in one, two, or four
threads on a single chip, you could make it do what you think it should
do, then provide the cycles that it needs to do that. Yes, there are
obvious limits to possible speedup, especially when factoring in
inter-chip latency, but those aren't the limits that most desktop app
designers and implementers are dealing with, and even if they are, it's
only in the context of archaic (or non-existent) tools and programming
methodologies.

Perhaps MS is already pretending that cycles are free, but using their
current programming methodologies, they're not, which is why some report
that their software gets slower as the chips get faster.

> The example that got me thinking about this post was a chess playing
> program that made good use of SMP and of HT (35% boost going from two
> threads to four on a two-way Xeon system. Nice.). So there's a
> problem that can sop up endless power, that you can imagine running on
> a desktop, and that can be partitioned nicely so that it can make good
> use of a large number of threads.
>
> The chess-playing program isn't a perfect example because people who
> write chess-playing programs probably feel that the realities of
> concurrent programming only add to the exhiliration of rising to the
> challenge. If Gelsinger's prediction is correct, though, lots of
> people who won't feel particularly exhilirated will wind up trying to
> figure out how to get eight threads to work together, and I was
> wondering if people could visualize how that would happen.

This relates back to some of our first encounters here on comp.arch,
where I claimed that virtually any problem that one might want to solve
could be considered a software problem, and you apparently wished to
restrict the domain. Very few people spend much of their time playing
chess. Virtually everyone spends large amounts of time interacting with
their machines, interacting with other people, extracting and recording
and organizing and exploiting information from numerous sources, etc.
(This is really what grids/resource collectives are all about.) The
death of the WIMP interface has been predicted for years, but hasn't
come to pass, not because it's the best way to interact with a machine,
but primarily because it's pretty efficient computationally. The
challenge of many other computationally challenging but useful
interactive technologies is often waved away with a dismissive "oh,
that's AI, let's not get into that". Well, at some point, I say, yes,
let's get into that.

It won't be long before people look back to now and say "and they
thought cycles were free?!" They're going to be grabbing for every cycle
they can reach.

Rupert Pigott

Mar 4, 2004, 1:51:31 AM
Rabbit wrote:

[SNIP]

> A PC running Windows, to use the example of the "average user
> experience", is a busy event-driven system. Not only waiting for
> response from user inputs (mouse clicks and key presses) but also timer
> interrupts, device i/o and so on. Looking at the performance monitor on
> my Windows 2000 box shows about 200 interrupts per second and all I am
> actively doing is typing into a Putty session.

From 'systat vm' on my OpenBSD box :

Interrupts
229 total
100 clock
bha3
pciide1
128 rtc
ne1
1 vr0
vr1

FWIW the CPU was "99.2% idle" too. :)

That's a 600MHz AMD K7; the Pentium Pro 200 that used to
run that load would idle with those figures too. You
will note that they are mostly timer interrupts, some
servicing the system clock, some servicing the NICs. It
ain't what I'd call "very busy" though !

Cheers,
Rupert

Rupert Pigott

Mar 4, 2004, 2:11:11 AM
john jakson wrote:

[SNIP]

> For the kind of problems I could think up that could sit on 100s of
> cpus each able to run 1000s of processes (I won't use the thread word
> here), I couldn't imagine using Java/C# languages or most conventional
> so called par SW languages that have heavy context costs associated
> with them. I can most definitely imagine doing such coding in an Occam
> or CSP descendant (maybe HandelC) or better still a HDL hybrid that

It's frustrating trying to explain this to people isn't it ?
Every time I try I end up feeling like some wild-eyed crack
pot who has returned with a serious case of sun stroke from
40 days in the desert.

Few people have actually worked with tools that allow you to
effectively manage this scale of parallelism (eg: CSP/OCCAM),
so I guess it's only natural that they don't "get it". HW
folks are actually ahead of the game on this one and I've
often noticed that they smile smugly and keep quiet in the
MPP flamewars. :)

> has been given a SW usability makeover. Possibly SystemVerilog but it
> needs more work by far. True parallel programming won't be pervasive
> till context swap and message passing, scheduling overheads are in
> same ballpark as mul ops ie 1 to 20cycles. Um, IIRC Inmos been there,
> done that.

> By the way. Occam includes something like par (;;) {} as well as for


> (;;) {}. Does same thing but all {} statements are concurrent. If
> multiple cpus available, then each process could be on different cpus
> if so placed. Its pervasive to the language, all par processes are
> near equal class to seq processes, just that over use of par
> primitives could slow a program down some.

It's a safe bet that OCCAM would actually lend itself far
more readily to modern optimisation techniques than C, for
example. The semantics are far more clearly defined for a
start, and side-effects are well contained, therefore it should
be far easier to auto-parallelize those large SEQ constructs
too. Flattening PARs down into SEQs shouldn't be too hard
either.

The question is: when do you do these optimisations? I
don't think a fully-dynamic approach is entirely sane (and
I would *hate* to debug like that). If you went for an ANDF/
TenDRA style approach and did this kind of stuff at install
or load time I think you could be onto a winner. Load time
would probably be a good thing to do for large multi-user
systems, install time would probably suit desktops etc.
Been thinking about it (again) having tripped over TenDRA
(again). :)

[SNIP]

> usually mentioned, so I only need 100s of cpus.

I was lucky enough to dig up some B035s and have a muck
about with them. Lack of per-node RAM was the killer
there; they either needed more bandwidth or more RAM,
ideally both. BG/L nodes are pretty much what I used
to wish for in terms of comms/local memory/compute. :)


Cheers,
Rupert

Terje Mathisen

unread,
Mar 4, 2004, 3:14:03 AM3/4/04
to
Peter "Firefly" Lund wrote:

> On Wed, 3 Mar 2004, Terje Mathisen wrote:
>
>
>>That's simply broken then!
>
> Another broken program is wc.

Wow!

I'll have to do something about that, my wc has always been more or less
the fastest in the world. I clocked 10+ MB/s on a 486-33, with perfect
scaling to twice that on a Pentium.

Since then I've rewritten it for P6/P4 class machines, including pure C
versions.


>
>
>>I.e. any reasonably optimized grep will do the core operations on any
>>given fixed string which is part of the search term, using something
>>like Boyer-Moore.
>
> grep currently has some problems when run in UTF8 locales - it will
> basically convert between UTF8 and wchar_t for each character...

Ouch. If the same thing happens for wc, then that's _really_ stupid.

The only real problem with utf8 and wc is calculating the number of
actual characters; this does require some code/states to detect the
beginning/end of a utf8 sequence.

Let's see...

First classify characters as one of the following:

Whitespace, including all word separators (can be user-specified)
CR
LF
regular in-word character
utf8 lead-in
utf8 middle
utf8 tail

This gives 7 different cases, round it up to 8.

Make a 64 K table returning the combined state for two input characters
as a 0..63 number.

pair = inbuf16[i++]; // Grab the next pair of bytes
pair_class = pair_table[pair];// Lookup the corresponding state

Combine this value with the last char from the previous pair:
state = ((state & 7) << 6) | pair_class; // Three last bytes

Here's a neat trick: I use a combined lookup table with increments for
the character count (11 bits), the line count (11 bits) and the word
count (10 bits):

accumulator += combined_increments[state]; // 512-entry 32-bit table

The code above can handle up to 2046 input characters with no
possibility of overflow between the individual counters, whereupon I'll
have to extract the various components (I'd do this after 1 K so that I
can keep the file buffer accesses page aligned):

char_count += accumulator & 2047;
line_count += (accumulator >> 11) & 2047;
word_count += accumulator >> 22;
accumulator = 0;

The inner loop has just two operations as part of the latency-limiting
critical loop:

state = ((state & 7) << 6) | pair_class; // pair_class = pair_table[pair]
increments = combined_increments[state];

All the other operations can be run in parallel with these two basic
operations, so we've reduced the problem to (more or less) a linked list
traversal, with the speed determined by the latency of table lookups
that always hit L1 (the combined_increment[] table is just 2 KB).

I.e. the sequential access of the input buffer will be the real limiter;
this can run at ~GB/s on current cpus, right?

Anyway, this leaves an order of magnitude between the speed of a striped
local disk pair, and the wc counting speed.
:-)

Terje
PS. One caveat: The code above will not work as-is if any utf8-encoded
characters are to be treated as word separators!
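
As a rough illustration of the table-driven counting scheme above (not
Terje's actual code): the sketch below is a simplified single-byte
variant that uses one 2x256 increment table instead of the 64 K pair
table, but keeps the packed accumulator with chars in bits 0-10, lines
in bits 11-21 and words in bits 22-31. All names, table sizes and the
flush interval are my own choices.

/* Single-byte sketch of a table-driven wc: one lookup per input byte
 * yields a packed (char, line, word) increment. */
#include <stdio.h>
#include <stdint.h>

static uint32_t inc_tab[2][256];  /* [prev_in_word][byte] -> packed increments */

static int is_sep(int c)
{
    return c == ' ' || c == '\t' || c == '\n' ||
           c == '\r' || c == '\v' || c == '\f';
}

static void build_table(void)
{
    for (int prev = 0; prev < 2; prev++) {
        for (int c = 0; c < 256; c++) {
            int cont = (c & 0xC0) == 0x80;             /* UTF-8 continuation byte */
            uint32_t chars = cont ? 0 : 1;             /* count code points, not bytes */
            uint32_t lines = (c == '\n') ? 1 : 0;
            uint32_t words = (!is_sep(c) && !prev) ? 1 : 0;  /* word starts here */
            inc_tab[prev][c] = chars | (lines << 11) | (words << 22);
        }
    }
}

int main(void)
{
    unsigned long chars = 0, lines = 0, words = 0;
    uint32_t acc = 0;
    int prev_in_word = 0, n = 0, c;

    build_table();
    while ((c = getchar()) != EOF) {
        acc += inc_tab[prev_in_word][c];
        prev_in_word = !is_sep(c);
        if (++n == 512) {             /* flush well before any field can overflow */
            chars += acc & 2047;
            lines += (acc >> 11) & 2047;
            words += acc >> 22;
            acc = 0;
            n = 0;
        }
    }
    chars += acc & 2047;
    lines += (acc >> 11) & 2047;
    words += acc >> 22;
    printf("%lu %lu %lu\n", lines, words, chars);
    return 0;
}

Feed it a file on stdin; flushing every 512 bytes keeps each packed
field comfortably inside its width. The two-byte pair table and the
per-pair state carry are left out here, which is exactly where the
extra speed (and the extra care) in the scheme above comes from.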

Terje Mathisen

unread,
Mar 4, 2004, 3:21:31 AM3/4/04
to
Robert Myers wrote:

> On Thu, 04 Mar 2004 01:21:58 +0000, Rupert Pigott
> <r...@dark-try-removing-this-boong.demon.co.uk> wrote:
>
> <snip>
>
>>For 99.9% of the uses of Grep that I have I figure that the
>>problem really *should* be I/O bound.
>>
>
> So I didn't pick the best example. I chose it as a task that's easy
> to understand that can easily be made embarrassingly parallel.

Oh, but it cannot!

Rather the opposite: you should take great pains never to try to
run more than a single grep scan on a given disk spindle at any given
point in time!

If you _know_ that you're grabbing data from multiple disks, then you
could consider forking another process to handle each new drive
encountered, maybe using mount points as the key.
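
A rough sketch of that forking idea, assuming the caller already knows
which directories live on separate spindles and passes one directory
per disk on the command line. This is not how any real grep behaves;
the structure and names are illustrative only.

/* One child grep per directory, so each spindle sees a single
 * sequential scan; the parent just reaps the children. */
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(int argc, char **argv)
{
    if (argc < 3) {
        fprintf(stderr, "usage: %s pattern dir [dir...]\n", argv[0]);
        return 1;
    }
    for (int i = 2; i < argc; i++) {
        pid_t pid = fork();
        if (pid == 0) {
            execlp("grep", "grep", "-r", argv[1], argv[i], (char *)NULL);
            perror("execlp");
            _exit(127);
        } else if (pid < 0) {
            perror("fork");
        }
    }
    while (wait(NULL) > 0)    /* wait for all children */
        ;
    return 0;
}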

Nick Maclaren

unread,
Mar 4, 2004, 3:25:02 AM3/4/04
to

In article <lf2d40dvk3q12og02...@4ax.com>,

Robert Myers <rmy...@rustuck.com> writes:
|>
|> The example that got me thinking about this post was a chess playing
|> program that made good use of SMP and of HT (35% boost going from two
|> threads to four on a two-way Xeon system. Nice.). So there's a
|> problem that can sop up endless power, that you can imagine running on
|> a desktop, and that can be partitioned nicely so that it can make good
|> use of a large number of threads.

Er, really? At best, that is very poor evidence that the problem
will scale. It isn't clear whether the (low) figure of 35% is due to
the HyperThreading limits or the problem limits or both. But let's
ignore the former, and look at SMP experience.

Getting a 35% improvement for a 100% increase in the number of 'CPUs'
is generally a sign that the problem has reached its limits of
scalability. In such cases, it is common for a further 100% increase
in the number of 'CPUs' to deliver only a small improvement (under,
say, 10%) or even a performance loss of the same order.


Regards,
Nick Maclaren.

Jouni Osmala

unread,
Mar 4, 2004, 3:26:25 AM3/4/04
to
> >>>A multi-threaded grep would be nice to have, and it would probably be
> >>
> >>I doubt you need that, application of your average UNIX-head's
> >>scripting skills would do the trick. I figure the prob would be
> >>you'd end up I/O bound anyway. :/
> >
> >
> > I don't think grep is presently I/O bound though.
>
> It is I/O bound on my 3 year old HDDs. :)
>
> Seriously though I have seen grep flip-flop from I/O bound
> to CPU bound over the years, and of course there are a huge
> number of variables involved... Like the content of the
> files, the layout & size of the files etc.
>
> The fact of the matter is : If Robert is *already* CPU bound
> with a single threaded grep (as his reply suggests) what
> advantage would multi-threading it give ? If he runs it on
> a SMP type setup he may well gain something and then make
> it an I/O bound problem again.
>
> For 99.9% of the uses of Grep that I have I figure that the
> problem really *should* be I/O bound.

One thing that comes to mind is that Linux uses whatever memory you
don't use otherwise as a disk cache. You only need to access the file
for a write or read within a reasonable time frame to make grep CPU
bound. Consider, for instance, a machine that has, say, 2GB of RAM for
some specific purpose like occasionally running large simulations, or
doing some GFX editing that requires that much RAM.
The next question is: when you are configuring things, do you need to
access the disk for reads after you have once accessed all the files
you are using while trying things out? I think not. That would make
grep 100% CPU bound.

Jouni Osmala
Helsinki University of Technology.

Nick Maclaren

unread,
Mar 4, 2004, 3:40:29 AM3/4/04
to

In article <c26ou8$a73$1...@osl016lin.hda.hydro.com>,

Terje Mathisen <terje.m...@hda.hydro.com> writes:
|> Robert Myers wrote:
|> > On Thu, 04 Mar 2004 01:21:58 +0000, Rupert Pigott
|> > <r...@dark-try-removing-this-boong.demon.co.uk> wrote:
|> >
|> >>For 99.9% of the uses of Grep that I have I figure that the
|> >>problem really *should* be I/O bound.
|> >
|> > So I didn't pick the best example. I chose it as a task that's easy
|> > to understand that can easily be made embarrassingly parallel.
|>
|> Oh, but it cannot!
|>
|> Rather the opposite: You should take great pain that you never try to
|> run more than a single grep scan on a given disk spindle at any given
|> point in time!
|>
|> If you _know_ that you're grabbing data from multiple disks, then you
|> could consider forking another process to handle each new drive
|> encountered, maybe using mount points as the key.

Grrk. YOU might, but 99.9% of programmers shouldn't. Far too many
people try that approach to optimisation, and it doesn't work,
because it is so far beyond most programmers' abilities that they
don't even understand the issues.

If you are doing that, you need to consider where the bottlenecks
are, what properties they have, and tune accordingly. That is a
diabolically complex activity on modern hardware, made much worse
by the appalling software. I know where to start, and it is to
fill in an application for early retirement :-)

FAR better would be to use a system with decent I/O support and to
use a single stream, which would be parallelised by the system.
Come back mainframe and mini-computer operating systems, all is
forgiven! You might use multiple threads/processes/etc. for CPU
performance, but they would take their data from a single logical
stream.

For SERIOUS parallelism (> 1,000 way), you hit bottlenecks in the
single interface and therefore need a model that exports the I/O
parallelism to the application. But that is a task that should be
left to real experts - even if you have an interface (like MPI-2
I/O that provides it semi-cleanly).


Regards,
Nick Maclaren.

Terje Mathisen

unread,
Mar 4, 2004, 3:55:35 AM3/4/04
to
Jouni Osmala wrote:
> On thing comes in mind that linux uses what ever memory you don't use
> othervice as a disc cache. You only need to access the file for write
> or read on reasonable time frame to make grep CPU bound. For instance
> a machine that has la 2GB of ram, for some specific purpose like
> running some large simulations sometimes, or do some GFX editing that
> requires that much of RAM.
> Next thing is when you configure things, do you need to access disk
> for reads after you have once accessed all the files you are using
> while trying things out? I think not. That would make grep 100% CPU
> bound.

As far back as in my 486 days, my wc ran _faster_ than Microsofts disk
cache (actually RAM disk) could transfer it into my file buffers.
:-)

Terje Mathisen

unread,
Mar 4, 2004, 5:11:31 AM3/4/04
to
Jouni Osmala wrote:
> On thing comes in mind that linux uses what ever memory you don't use
> othervice as a disc cache. You only need to access the file for write
> or read on reasonable time frame to make grep CPU bound. For instance
> a machine that has la 2GB of ram, for some specific purpose like
> running some large simulations sometimes, or do some GFX editing that
> requires that much of RAM.
> Next thing is when you configure things, do you need to access disk
> for reads after you have once accessed all the files you are using
> while trying things out? I think not. That would make grep 100% CPU
> bound.

Yes, but in that case performance is probably good enough anyway. :-)

With ~ 1 GB/s throughput for wc, grep should not be much slower, and for
most searches (i.e. those with at least a few constant characters in the
search term) it should be faster.

If you have 2 GB RAM, and 1 GB is your current disk cache, then you
should get answers to most cache-based queries in a second or two,
except that the OS directory scan and file open/close operations could
start to dominate for everything except really big files.

Ken Hagan

unread,
Mar 4, 2004, 5:18:28 AM3/4/04
to
Robert Myers wrote:
>
> So I didn't pick the best example. I chose it as a task that's easy
> to understand that can easily be made embarrassingly parallel.

Embarrassing parallelism isn't terribly useful. Spawning zillions
of threads will kill any (?) hardware stone dead. So of course we
don't do that; we ask a single thread to perform the operations
sequentially.

We could break the problem up into a small (2,4,8?) number of sub-
problems and create worker threads to grind through them all and
simply wait for the workers to finish in our main thread. If "the
problem" really is the heart and soul of our program then the
effort required is probably worth it.
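
A minimal sketch of that "few workers, then wait" pattern with POSIX
threads; the "problem" here is just summing an array, and the worker
count, names and sizes are placeholders rather than anything from a
real application.

/* Split one big loop across a handful of worker threads, then join. */
#include <pthread.h>
#include <stdio.h>

#define NWORKERS 4
#define N 1000000

static double data[N];

struct slice { int lo, hi; double sum; };

static void *worker(void *arg)
{
    struct slice *s = arg;
    double acc = 0.0;
    for (int i = s->lo; i < s->hi; i++)
        acc += data[i];
    s->sum = acc;
    return NULL;
}

int main(void)
{
    pthread_t tid[NWORKERS];
    struct slice s[NWORKERS];
    double total = 0.0;

    for (int i = 0; i < N; i++)
        data[i] = 1.0;

    /* Break the problem into a small number of sub-problems... */
    for (int i = 0; i < NWORKERS; i++) {
        s[i].lo = i * (N / NWORKERS);
        s[i].hi = (i == NWORKERS - 1) ? N : (i + 1) * (N / NWORKERS);
        pthread_create(&tid[i], NULL, worker, &s[i]);
    }
    /* ...and simply wait for the workers to finish in the main thread. */
    for (int i = 0; i < NWORKERS; i++) {
        pthread_join(tid[i], NULL);
        total += s[i].sum;
    }
    printf("total = %g\n", total);
    return 0;
}

Compile with -pthread; matching NWORKERS to 2, 4 or 8 cores (or SMT
contexts) is the whole of the tuning effort in this style.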

If the embarrassing parallelism is found at some lower level and in
dozens of different ways, not only does the effort of programming
it become too great, but the run-time overheads of creating all those
threads might start to exceed the speed-up.

And of course *everything* eventually becomes I/O bound (or more
likely memory bound) so the expected return on investment probably
won't happen.

Having said that, I can see fairly easy ways of making my own
company's software use "several" threads. Now that our customers
are likely to be running SMT P4s, it is worth our time exploiting
that limited degree of parallelism. Other programmers may feel
the same way. In past postings I've also noted that Microsoft's
component plumbing partly automates the process of spawning a few
threads to grind through many parallel tasks. It may soon be worth
Intel's time to stuff a few more cores or threads onto a chip. On
an optimistic day, I can see a slow migration to more and more
parallel software and hardware.

No revolutions, though. In 2010, most software will be written in
C/C++, just like the operating systems they sit upon, and compiled
to a machine language that the 8086 designers would recognise. If
you want "radical", the best you can hope for is that the actual
PCs will all be tablets -- probably A3 size for "desktops" and
only a centimetre thick, but no wires or separate keyboard. I'll
leave open the question of whether they are smart terminals or
self-contained PCs.


Sander Vesik

unread,
Mar 4, 2004, 9:56:43 AM3/4/04
to
Terje Mathisen <terje.m...@hda.hydro.com> wrote:
> Peter "Firefly" Lund wrote:
>
> > On Wed, 3 Mar 2004, Terje Mathisen wrote:
> >
> >
> >>That's simply broken then!
> >
> > Another broken program is wc.
>
> Wow!
>
> I'll have to do something about that, my wc has always been more or less
> the fastest in the world. I clocked 10+ MB/s on a 486-33, with perfect
> scaling to twice that on a Pentium.

If it is reasonably portable and has a BSD style licence, making people
use it and distribute it shouldn't be hard ;-)

john jakson

unread,
Mar 4, 2004, 11:02:20 AM3/4/04
to
Rupert Pigott wrote in message

snipping


>
> It's frustrating trying to explain this to people isn't it ?
> Every time I try I end up feeling like some wild-eyed crack
> pot who has returned with a serious case of sun stroke from
> 40 days in the desert.
>

Precisely

> Few people have actually worked with tools that allow you
> effectively manage this scale of parallelism (eg: CSP/OCCAM),
> so I guess it's only natural that they don't "get it". HW
> folks are actually ahead of the game on this one and I've
> often noticed that they smile smugly and keep quiet in the
> MPP flamewars. :)
>

One thing about massively parallel HW design is that it is all message
based, using discrete wires to carry private channels between all
processes (er, HW modules). Some channels are multiplexed or tristated,
but most are private. There is almost no notion of "shared" data that
is in fact mostly unshared. When resources are shared, that often
introduces more headaches than just duplicating the HW; it all depends
on the lesser evil. In VLSI 101, 20yrs ago, the message was loud &
clear: wires are expensive, computing is cheap. It's only gotten truer
with time with high-resistance wiring.

In parallel computing with shared memory, that message seems totally
lost; it reminds me of the E. Dijkstra paper concerning gotos, except
today it would be about unnecessary common address spaces or use of
globals. Oh well, after Intel has pushed one rock up the hill, industry
may as well push shared data up as well.

>
> It's a safe bet that OCCAM would actually lend itself far
> more readily to modern optimisation techinques than C for
> example. The semantics are far more clearly defined for a
> start, side-effects are well contained, therefore it should
> be far easier to auto-parallelize those large SEQ constructs
> too. Flattening PARs down into SEQs shouldn't be too hard
> either.
>

Of course. I have the Occam book by Alan xxx, who also wrote an Ada
book. Sorry for forgetting the name, but it made it crystal clear to
me; I never realised how simple it was till well after it was gone.
That also explains why the Oxford folks went on to do HandelC. Now the
conundrum: to be successful, a language must look like C even if
par/seq/alt are tacked on. The C language is vast in its underlying
type complexity (see the Hanson & Fraser lcc book). Combining it with
Verilog only makes things much worse, except that Verilog is very C-ish
already.

But it should be possible to dump out an Occam-style internal tree
that other processes can then give a going-over. From Alan's book,
it seems as if par/seq compares well with the simple De Morgan rules
used implicitly by all HW guys without even thinking. There are plenty
of formal tools for checking and manipulating bool logic expressed in
any HDL; mostly it's called synthesis today. Again that leads back to
HandelC & Celoxica, using C syntax instead of Verilog.

> The question is : When do you do these optimisations ? I
> don't think a fully-dynamic approach is entirely sane (and
> I would *hate* to debug like that). If you went for a ANDF/
> TenDRA style approach and do this kind of stuff a install
> or load time I think you could be onto a winner. Load time
> would probably be a good thing to do for large multi-user
> systems, install time would probably suit desktops etc.
> Been thinking about it (again) having tripped over TenDRA
> (again). :)
>

I will have to look up TenDRA etc

>>
> I was lucky enough to dig up some B035s and have a muck
> about with them. Lack of per-node RAM was the killer
> there, they either needed more bandwidth or more RAM,
> ideally both. BG/L nodes are pretty much what I used
> to wish for in terms of comms/local memory/compute. :)
>

Local memory is no problem anymore. Each cpu in an FPGA can have a
4kbyte (sounds familiar, doesn't it) local cache, ie 2 BlockRams, or
more if paired, up to a max of 200Kbytes or so if all BlockRam is
given to 1 cpu. FPGAs have enormous bandwidth available to drive DDR
RAM and even RLDRAM1 at RAS cycle rates down to 20ns. This means cpu
cycles are in the same ballpark as a DRAM cycle, so cache perf is less
of an issue. I am looking at 1 256Mb RLDRAM per 8 cpus or so, ie 2
chips per TRAM-style module giving 4Mbyte per cpu. That raises the
question of whether I should waste effort on an MMU; I think not. If a
cpu is trying to host more than a few meg of code or data, it's
probably not been partitioned well.

regards

johnjakson_usa_com

Robert Myers

unread,
Mar 4, 2004, 11:51:30 AM3/4/04
to

If it's to be a subject of discussion, then everyone should be able to
look at the same evidence:

http://www.aceshardware.com/read.jsp?id=60000285

You get a 76% boost by doubling the number of physical processors in
use on a dual xeon machine (boost due to SMP), and a further 35% boost
by using hyperthreading on those two processors (boost due to SMT).

I didn't take that particular implementation as evidence that the
process would scale, only my expectation that a sufficiently large
tree search can be made to exploit a large number of threads. In the
instance where I needed a scalable tree search, I would have my pick
of predesigned and in some cases precoded algorithms to choose from.

As to the 35% boost in performance from HT for a relatively tiny
number of extra transistors, I'll take it, and I don't think it
provides evidence that the problem is reaching its limits of
scalability.

RM

Robert Myers

unread,
Mar 4, 2004, 12:14:32 PM3/4/04
to
On Wed, 03 Mar 2004 21:13:23 -0800, "David C. DiNucci"
<da...@elepar.com> wrote:

>Robert Myers wrote:

<snip>

>>
>> 1. They include a notion of real time, and if you're going to code to
>> hide latency, you have to have a notion of real time.
>
>If you're not willing to at least abstract time, how can you abstract
>much of anything?
>

<snip>

> Note that, unlike runtime systems, code-time
>(or even compile- or link-time) latency hiding approaches only have one
>"present" from which to make future predictions--and that's one before
>the program runs. It's like the challenge of giving a car a fixed
>algorithm to get to a destination rather than just sitting in it and
>driving and adapting to curves, traffic, weather, etc.: The only way to
>get the same benefit as a dynamic scheduler from raw program code is to
>effectively build the scheduler into the program.
>

This could easily present the opportunity to recycle all of the
discussions I've been involved in about static vs. dynamic scheduling,
with or without feedback, and with or without a priori information
from the programmer. I hope it suffices just to refer to those and
other related threads to say that there is plenty to discuss.

If one took a vote, either of opinion or of results from experience, I
think that, at the moment, run-time scheduling would win by a
landslide. Even if we took that weight of evidence and opinion as
decisive as to the value of run-time scheduling, it leaves open the
issue of whether a priori analysis has value (the evidence is that it
does), whether feedback has value (the evidence is that it does), and
whether informed programmer input has value (the evidence is that it
does).

So the question is not just simply static vs. dynamic scheduling.
Given the choice, you will use some kind of run time scheduling. The
question then becomes, how do you acquire other information and how do
you use it.

I didn't mean to propose any particular mechanism for incorporating
information about latency either into the code or into the feedback
data, or even that it would necessarily be visible to the day-to-day
programmer. I only wanted to note that hardware design languages
already include a notion of real time and that I couldn't imagine much
you could do proactively about latency without having a notion of real
time at your disposal, wherever it was included in the process.

RM

David C. DiNucci

unread,
Mar 4, 2004, 1:18:34 PM3/4/04
to
Robert Myers wrote:
> If one took a vote, either of opinion or of results from experience, I
> think that, at the moment, run-time scheduling would win by a
> landslide. Even if we took that weight of evidence and opinion as
> decisive as to the value of run-time scheduling, it leaves open the
> issue of whether a priori analysis has value (the evidence is that it
> does), whether feedback has value (the evidence is that it does), and
> whether informed programmer input has value (the evidence is that it
> does).

Right. Use all of the reliable info you can. Just don't use info for
situation X that only applies to situation Y.

> So the question is not just simply static vs. dynamic scheduling.
> Given the choice, you will use some kind of run time scheduling. The
> question then becomes, how do you acquire other information and how do
> you use it.
>
> I didn't mean to propose any particular mechanism for incorporating
> information about latency either into the code or into the feedback
> data, or even that it would necessarily be visible to the day-to-day
> programmer. I only wanted to note that hardware design languages
> already include a notion of real time and that I couldn't imagine much
> you could do proactively about latency without having a notion of real
> time at your disposal, wherever it was included in the process.

And here's where we differ. Hardware design languages include a notion
of real time because, simply put, it applies to that situation: You're
not going to change real time constraints without changing the hardware
you're designing (at least for now). It's not true with
software. Real time factors for any one piece of software will be
changing out from under it constantly, not only as it moves from one
machine to the next, but as events occur under it, and I (for one) would
prefer that my software not change each time. Of course, even that
depends on the definition of software: Compilers or other tools will
often optimize an executable for my program based on some model of the
target platform (including real time constraints), so be it, and you
might call that executable "the software". But my source program is no
place for such a hardware model, or for data to be fed into it.

Rupert Pigott

unread,
Mar 4, 2004, 1:20:01 PM3/4/04
to
Terje Mathisen wrote:

[SNIP]

> If you _know_ that you're grabbing data from multiple disks, then you
> could consider forking another process to handle each new drive
> encountered, maybe using mount points as the key.

The thing is: why the hell bother? Grep has two basic modes
of usage, interactive and scripted. Interactive is largely
one-off searches; we're not *too* bothered about the speed in
most cases - we may be worried about thrashing a multi-user
box to death though. Naive implementations of grep are
*unlikely* to defeat the OS's attempts to share cycles and
bandwidth between users/apps.

Scripted would be repeated searches, perhaps part of some kind
of a batch. In this case it's *fairly* likely that the script
in question will be pretty much "single purpose", so the guy
who writes the script and those who use it probably have a good
idea about the data and nature of the searches involved. The
right place to do the parallelisation would probably be in the
script itself.

IMO. :)

Cheers,
Rupert

Sander Vesik

unread,
Mar 4, 2004, 1:19:40 PM3/4/04
to
Ken Hagan <K.H...@thermoteknix.co.uk> wrote:
> Robert Myers wrote:
> >
> > So I didn't pick the best example. I chose it as a task that's easy
> > to understand that can easily be made embarrassingly parallel.
>
> Embarrassing parallelism isn't terribly useful. Spawning zillions
> of threads will kill any (?) hardware stone dead. So of course we
> don't do that; we ask a single thread to perform the operations
> sequentially.

It won't kill every kind of hardware - hardware that has zillions
of processors or zillions of thread contexts (assuming a Tera-style
processor in an SMP config) will be very happy.

>
> We could break the problem up into a small (2,4,8?) number of sub-
> problems and create worker threads to grind through them all and
> simply wait for the workers to finish in our main thread. If "the
> problem" really is the heart and soul of our program then the
> effort required is probably worth it.

Right. Now what if such threads came for free and were widely available?

>
> If the embarrasing parallelism is found at some lower level and in
> dozens of different ways, not only does the effort of programming
> it become too great, the run-time overheads of creating all those
> threads might start to exceed the speed up.
>
> And of course *everything* eventually becomes I/O bound (or more
> likely memory bound) so the expected return on investment probably
> won't happen.

As things stand, most things are far from I/O bound. Also, making them
less so is not that hard if you have more than one PCI-X bus.

>
> Having said that, I can see fairly easy ways of making my own
> company's software use "several" threads. Now that our customers
> are likely to be running SMT P4s, it is worth our time exploiting
> that limited degree of parallelism. Other programmers may feel
> the same way. In past postings I've also noted that Microsoft's
> component plumbing partly automates the process of spawning a few
> threads to grind through many parallel tasks. It may soon be worth
> Intel's time to stuff a few more cores or threads onto a chip. On
> an optimistic day, I can see a slow migration to more and more
> parallel software and hardware.

Right, and now imagine if the P4 had, say, 32 threads instead of two.
How would you change the software? You wouldn't really ignore the
chance to have extra worker threads, would you?

>
> No revolutions, though. In 2010, most software will be written in
> C/C++, just like the operating systems they sit upon, and compiled
> to a machine language that the 8086 designers would recognise. If

So? This in no way means threads - or even a large amount of threads
is unusable solution.

> you want "radical", the best you can hope for is that the actual
> PCs will all be tablets -- probably A3 size for "desktops" and
> only a centimetre thick, but no wires or separate keyboard. I'll
> leave open the question of whether they are smart terminals or
> self-contained PCs.
>

Tablets are a horrible, special use idea.

Greg Lindahl

unread,
Mar 4, 2004, 1:35:55 PM3/4/04
to
In article <c26vkl$2am$1$8300...@news.demon.co.uk>,
Ken Hagan <K.H...@thermoteknix.co.uk> wrote:

>Embarrassing parallelism isn't terribly useful. Spawning zillions
>of threads will kill any (?) hardware stone dead. So of course we
>don't do that; we ask a single thread to perform the operations
>sequentially.

You seem to have confused the concepts of "embarrassingly parallel"
and a really bad implementation of solving an embarrassingly parallel
problem.

-- greg


Rupert Pigott

unread,
Mar 4, 2004, 1:42:43 PM3/4/04
to
Ken Hagan wrote:

> Robert Myers wrote:
>
>>So I didn't pick the best example. I chose it as a task that's easy
>>to understand that can easily be made embarrassingly parallel.
>
>
> Embarrassing parallelism isn't terribly useful. Spawning zillions
> of threads will kill any (?) hardware stone dead. So of course we

See Jon Jakson's post. :)

Well, zillions perhaps, but hundreds or thousands, perhaps even
millions, shouldn't be a problem *if* you use a sane scheduling
mechanism and keep process context to a minimum (note: I say
processes, not threads), combined with message passing as the
Transputer did all those years ago.

I didn't have a problem with firing up 100 processes on a single
T800, most of them would sit idle waiting for data. On the rare
occasion they were all active you'd lose responsiveness but there
were very very few wasted cycles. Local IPC was near as dammit
"for free" on the Transputer.

> don't do that; we ask a single thread to perform the operations
> sequentially.
>
> We could break the problem up into a small (2,4,8?) number of sub-
> problems and create worker threads to grind through them all and
> simply wait for the workers to finish in our main thread. If "the
> problem" really is the heart and soul of our program then the
> effort required is probably worth it.

As I've explained to others down the years, I commonly applied the
"pipeline" model to apps; it was interesting because it raised some
questions as to *where* you should slice up the problem. Having
"pipelined" an app I might then consider whether to replicate
the pipeline - of course that depended on whether I could easily
partition the datasets. That's just one way to slice the cake;
there are plenty of others, and that's what makes the parallelism
field so interesting. :)

I had a few rules of thumb that I've long since forgotten now,
however I figure that with the profiling, simulation and
optimisation technology we have today it *should* be possible to
make better decisions.

Cheers,
Rupert

Nick Maclaren

unread,
Mar 4, 2004, 1:42:45 PM3/4/04
to
In article <c26vkl$2am$1$8300...@news.demon.co.uk>,
Ken Hagan <K.H...@thermoteknix.co.uk> wrote:
>
>Embarrassing parallelism isn't terribly useful. Spawning zillions
>of threads will kill any (?) hardware stone dead. So of course we
>don't do that; we ask a single thread to perform the operations
>sequentially.

Have you hugged a tree today?


Regards,
Nick Maclaren.

Robert Myers

unread,
Mar 4, 2004, 2:00:07 PM3/4/04
to
On Thu, 04 Mar 2004 10:18:34 -0800, "David C. DiNucci"
<da...@elepar.com> wrote:

<snip>

>...Compilers or other tools will


>often optimize an executable for my program based on some model of the
>target platform (including real time constraints), so be it, and you
>might call that executable "the software". But my source program is no
>place for such a hardware model, or for data to be fed into it.
>

I think we do disagree on this point, er, up to a point.

Your goal of having the programmer specify exactly what he wants, no
more and no less, in a hardware-independent and completely portable
fashion, is commendable. I just don't think it's realistic at the
moment, and I'm more ready to compromise than you are.

Maybe that allies me with the good that is the enemy of the best, but I
hope I've said enough positive things about your ideas that you will
forgive me this particular apostasy.

On the one extreme, we have coding styles that are very closely tied
to the hardware, and, at the moment, single-threaded hardware. At the
other extreme, we have coding styles that pretend that nothing ever
actually has to be executed on physical hardware. I don't find either
extreme particularly attractive at this particular stage of hardware
and software development.

For me, real-world software is a gnu/linux tgz file. Gnu/linux has
managed to achieve a very high degree of portability, but at what a
cost: a huge fraction of the characters in the tgz file are committed
to coping with all the various hardware and software possibilities the
software might encounter (which generally include running under cygwin
on a Windows platform, and sometimes include the possibility of
running in a completely non-gnu environment).

Even for that kind of ugly compromise to work, the programmer had to
anticipate a range of possibilities, and thus, the code is not
completely portable. I'd settle for that degree of portability with a
significant improvement in readability, optimizability, and
verifiability. If that means making some assumptions about the
hardware that show up in the code (the processor has cache, the cache
has finite size, the processor may be singly or multiply-threaded,
etc.), I can live with that.

RM

Rupert Pigott

unread,
Mar 4, 2004, 2:11:54 PM3/4/04
to
john jakson wrote:

> Rupert Pigott wrote in message
>
> snipping
>
>>It's frustrating trying to explain this to people isn't it ?
>>Every time I try I end up feeling like some wild-eyed crack
>>pot who has returned with a serious case of sun stroke from
>>40 days in the desert.
>>
>
>
> Precisely
>
>
>>Few people have actually worked with tools that allow you
>>effectively manage this scale of parallelism (eg: CSP/OCCAM),
>>so I guess it's only natural that they don't "get it". HW
>>folks are actually ahead of the game on this one and I've
>>often noticed that they smile smugly and keep quiet in the
>>MPP flamewars. :)
>>
>
>
> One thing about massively parallel HW design is that it is all message
> based using descrete wires to carry private channels between all
> processes (er HW modules). Some channels are multiplexed or tristated

There is something far more subtle at work here... Shared
memory systems are inevitably implemented using message
passing, albeit at the hardware level. Consider a ccNUMA
machine running an MPI app...

Hardware : Message Passing.
ISA : Shared Memory.
API : Message Passing.
App : Message Passing.

What a waste of hardware & design effort. All that extra
stuff that can break too, all that extra complication.

It does not matter what the hardware looks like to your
code; eventually you will have to read or write to storage
a long way away, and at that point you are having to think
"Message Passing" whether you like it or not. Hell, even
the memory subsystems we work with routinely are forcing
the hardware designers, the compiler writers AND the
application developers to think "Message Passing" even
though they call it a "Caching Problem" or "Shared Memory".

As an aside Terje's caching sig should really be at the
start of any Parallelism textbook.

The question of MP or SHM is orthogonal to the fundamental
latency problem. MP/SHM is purely about how you manage the
complexity and latency in large parallel applications. My
preference is to cut out the middle man (ie: cache coherency
& arbitration hardware) and code as God always intended
us to code: Message Passing. It simplifies the behaviour
of the hardware (good), and it is the path of least surprise
(why is hitting this array element taking 1us?) etc.

Sigh. Rant over I guess.

[SNIP]

> Local memory is no problem anymore. Each cpu in FPGA can have 4kbyte
> (sounds familiar doesn't it) local cache ie 2 Blockrams, or more as
> paired upto max of 200Kbytes or so if all BlockRam given to 1 cpu.

Have you looked at IBM's BlueGene/L ? Although I'm not the
PowerPC's biggest fan, I like the approach they have taken
with their nodes. I suspect that for a fair number of apps
you might well get a substantial cost and power saving by
eliminating the external memory on the compute nodes and
scattering a few "fast storage" nodes with large chunks of
RAM. Option of automatic paging from internal eDRAM to the
nearest fast-storage node perhaps. That is a shot in the
dark though, I am sure that the guys behind BG/L have been
very careful to balance the FLOPS/Kbyte ratio for their
target apps.


Cheers,
Rupert

Kees van Reeuwijk

unread,
Mar 4, 2004, 3:20:00 PM3/4/04
to
john jakson <johnj...@yahoo.com> wrote:


> So this begs the question, how many cpus is sweetest, well as many as
> can be instanced into the largest FPGA with best bang per buck, which
> would be about 4-16 also. The larger FPGAs get richer by the boatload
> in LUT cells, but BlockRams follows closer to the edge length. The
> smaller FPGAs may be closer to the sweet spot. I am thinking of the
> sp3-50 to 400, a few $ each if Xilinx ever gets them to little guys. I
> am not even convinced that 32b FPUs is going to be difficult either,
> most people never expected much from FPGA computing anyway.

Why talk in terms of CPUs in the first place? That's just a legacy
artefact: FPGAs are a valid computer architecture in their own right;
they are just not the traditional von Neumann architecture.

Sure, imperative programming languages (including Occam) do not
translate well to the FPGA architecture, but that just means that a
different category of programming language must be used. What I have in
mind is something like the synchronous language Lustre, but a language
that is specifically designed for the purpose would be even better.

David C. DiNucci

unread,
Mar 4, 2004, 3:28:56 PM3/4/04
to
Robert Myers wrote:
>
> On Thu, 04 Mar 2004 10:18:34 -0800, "David C. DiNucci"
> <da...@elepar.com> wrote:
>
> <snip>
>
> >...Compilers or other tools will
> >often optimize an executable for my program based on some model of the
> >target platform (including real time constraints), so be it, and you
> >might call that executable "the software". But my source program is no
> >place for such a hardware model, or for data to be fed into it.

<snip>

> On the one extreme, we have coding styles that are very closely tied
> to the hardware, and, at the moment, single-threaded hardware. At the
> other extreme, we have coding styles that pretend that nothing ever
> actually has to be executed on physical hardware. I don't find either
> extreme particularly attractive at this particular stage of hardware
> and software development.

I wouldn't call those extremes different programming styles, but
descriptions of the same program at different points in a journey. At
the beginning, it is a specification of how to solve a real-world
problem, understandable and specifiable by a human, as generally or
specifically as is necessary for the task at hand. At the end, it is
instructions to one or more machines, hopefully to perform that same
task as efficiently as possible on the available hardware. Information
can be added along the way, by people, by tools, by runtime systems, by
experience/feedback, etc.

The software will meet several forks in the road as it heads off to
different target platforms. It makes sense to allow a human to provide
information that can be useful at each fork, thereby refining the
original algorithm there in ways that help it to run efficiently on the
target without changing its high-level behavior. It's called
optimization, and the formal model behind SC provides guidance into the
sorts of optimization that can be applied and the effects it will (not)
have. The info at the fork pertains to the software, not parameters for
constructing a hardware model, which is separate. If it's necessary
later to take a different fork, it's best to still have the un-optimized
code to which the optimization for the new target can be applied.
That's very different from describing the algorithm in the first place
with all sorts of options of how it should behave differently on many
different targets. In that latter case, the architecture-independent
algorithm may never be apparent even from the initial code.

> For me, real-world software is a gnu/linux tgz file. Gnu/linux has
> managed to achieve a very high degree of portability, but at what a
> cost: a huge fraction of the characters in the tgz file are committed
> to coping with all the various hardware and software possibilities the
> software might encounter (which generally include running under cygwin
> on a Windows platform, and sometimes include the possibility of
> running in a completely non-gnu environment).

I'm not sure I get your example. Assuming that a tgz file is just any
file processed by tar and gzip, are you suggesting that the file formats
used by tar and/or gzip include much baggage to make them work on a wide
variety of platforms? Or that any random software you might find that
has been packaged into a .tgz file will contain such cruft? Or maybe
that the tar and/or gzip software packages themselves have such cruft in
the source code? The first I doubt, and the second depends on your
sampling. The last I might well believe, but I don't view it as a best
practice.

Benjamin Ylvisaker

unread,
Mar 4, 2004, 12:57:18 PM3/4/04
to
On Mon, 01 Mar 2004 19:32:55 -0500
Robert Myers <rmy...@rustuck.com> wrote:

> Hello all,
>
> Arstechnica has a recent "inside look" at the Pentium M:
>
> http://www.arstechnica.com/cpu/004/pentium-m/pentium-m-1.html
>
> The article doesn't present much that is completely new, but it
> closes with a bit of speculation about the future:
>
> 'Gelsinger, giving the final speech of an Intel technology forum,
> showed the audience a slide of the impossibly high power needs of
> computer processors as a way of arguing that chip designers must
> radically change chip architectures, and that Intel would be the
> company to do just that. "We need a fresh approach," Gelsinger said.
> "We need an architectural paradigm shift." That "fresh approach"
> might look something like a future version of the Pentium M, in a
> system that gangs together multiple low-power processors and relies
> heavily on multi-threaded software. Not coincidentally, this looks
> something like what I've elsewhere described as IBM's plan for the
> PowerPC 970 and its offspring.'
>
> He goes on to say that, as others have speculated, the Pentium M
> core might form the basis of a Plan B if Intel finally has to admit
> that maybe Netburst wasn't such a great idea, after all.
>
> If garden variety desktop (or server) processors turn into a
> collection of low power processors, I can imagine two extreme
> scenarios. One extreme scenario is that people will bumble along
> with no major changes in languages, tools, or compilers, doing the
> best they can with languages and tools now mostly used in HPC.
>
> The other extreme scenario, which I actually think more likely, is
> that if Intel, in particular, goes that way, it won't go there
> without providing tools to make coding for such a processor more
> routine. That could range from providing high-performance
> primitives such as they do now for SIMD to a much more aggressive
> auto-parallelizing compiler.
>
> The latter "extreme" scenario might not be so extreme. People would
> incorporate performance primitives into their code and change some
> compiler flags. On the other hand, my guess is that a major, er,
> paradigm shift like that would occasion more significant competition
> among tool-builders for parallel computing than HPC has been able to
> create.
>
> Or perhaps it will just hasten the movement of labor-intensive
> coding offshore. Any guesses?
>
> RM

I personally have little doubt that garden variety computers will
become single chip MP systems sooner or later. Allan Hartstein and
Thomas Puzak from IBM research gave some good evidence for my belief
in a paper last year on optimum power/performance pipeline depth:
http://www.microarch.org/micro36/html/pdf/hartstein-OptimumPower.pdf
The primary conclusion from the paper is that if you want to
simultaneously minimize the amount of time and the amount of energy it
takes to do some computational task, the ideal processor pipeline is
probably a whole lot shallower than you might have guessed.

I'll also take this opportunity to advertise for my favorite
concurrent language, CML:
http://people.cs.uchicago.edu/~jhr/papers/1992/phd-thesis.ps.gz
Writing concurrent programs is, on "average", a good deal harder to
think about than writing sequential programs. I really hope we don't
shoot ourselves in the foot by making "sequential language +
ill-defined concurrent junk clobbered onto the side" the standard way
to program the widespread MP systems of the future.

Cheers,
Benjamin

Robert Myers

unread,
Mar 4, 2004, 6:46:11 PM3/4/04
to
On Thu, 04 Mar 2004 12:28:56 -0800, "David C. DiNucci"
<da...@elepar.com> wrote:

>Robert Myers wrote:
>>
>> On Thu, 04 Mar 2004 10:18:34 -0800, "David C. DiNucci"
>> <da...@elepar.com> wrote:
>>
>> <snip>
>>
>> >...Compilers or other tools will
>> >often optimize an executable for my program based on some model of the
>> >target platform (including real time constraints), so be it, and you
>> >might call that executable "the software". But my source program is no
>> >place for such a hardware model, or for data to be fed into it.
>
><snip>
>
>> On the one extreme, we have coding styles that are very closely tied
>> to the hardware, and, at the moment, single-threaded hardware. At the
>> other extreme, we have coding styles that pretend that nothing ever
>> actually has to be executed on physical hardware. I don't find either
>> extreme particularly attractive at this particular stage of hardware
>> and software development.
>
>I wouldn't call those extremes different programming styles, but
>descriptions of the same program at different points in a journey.

In an ideal world that would be so. As the world is at the moment,
both extremes describe coding styles.

>At
>the beginning, it is a specification of how to solve a real-world
>problem, understandable and specifiable by a human, as generally or
>specifically as is necessary for the task at hand. At the end, it is
>instructions to one or more machines, hopefully to perform that same
>task as efficiently as possible on the available hardware. Information
>can be added along the way, by people, by tools, by runtime systems, by
>experience/feedback, etc.
>

No disagreement with any of that.

>The software will meet several forks in the road as it heads off to
>different target platforms. It makes sense to allow a human to provide
>information that can be useful at each fork, thereby refining the
>original algorithm there in ways that help it to run efficiently on the
>target without changing it's high-level behavior. It's called
>optimization, and the formal model behind SC provides guidance into the
>sorts of optimization that can be applied and the effects it will (not)
>have. The info at the fork pertains to the software, not parameters for
>constructing a hardware model, which is separate. If it's necessary
>later to take a different fork, it's best to still have the un-optimized
>code to which the optimization for the new target can be applied.
>That's very different from describing the algorithm in the first place
>with all sorts of options of how it should behave differently on many
>different targets. In that latter case, the architecture-independent
>algorithm may never be apparent even from the initial code.
>

I find _that_ hard to disagree with. Editors exist that will let you
intervene manually in a graph and then tell you whether what you have
done still makes syntactic sense. There may even be such editors that
will tell you whether the manual transformation you just forced on the
graph leaves the graph semantically equivalent to what you started
with. If so, I haven't encountered such a tool.

As it is, in a perfect world, one would need a tool that I can almost
guarantee does not exist; viz, one that would allow you to add
information that leaves the description of the algorithm unchanged but
changes the semantic content of the graph in the sense that it gives
specific guidance as to how to cope with specific hardware. I can see
the desirability of enforcing the distinction, but I'm not optimistic
that we'll see such tools any time soon.

The thought that one might manually or otherwise intervene in an SC
program to add semantic information that is beyond the scope of the
semantics of the original algorithm doesn't seem to bother you. The
one thing you are adamant about is that a formal description of the
algorithm should exist in a hardware-independent form, and that seems
like a reasonable goal. As I've seen things go, though, unless the
relationship between the original hardware-independent description and
a version with hardware-specific information in it can be enforced or
checked automatically, the two will rapidly drift apart as the code
evolves under the pressure of actual use.

>> For me, real-world software is a gnu/linux tgz file. Gnu/linux has
>> managed to achieve a very high degree of portability, but at what a
>> cost: a huge fraction of the characters in the tgz file are committed
>> to coping with all the various hardware and software possibilities the
>> software might encounter (which generally include running under cygwin
>> on a Windows platform, and sometimes include the possibility of
>> running in a completely non-gnu environment).
>
>I'm not sure I get your example. Assuming that a tgz file is just any
>file processed by tar and gzip, are you suggesting that the file formats
>used by tar and/or gzip include much baggage to make them work on a wide
>variety of platforms?

Of course not.

>Or that any random software you might find that
>has been packaged into a .tgz file will contain such cruft? Or maybe
>that the tar and/or gzip software packages themselves have such cruft in
>the source code? The first I doubt, and the second depends on your
>sampling. The last I might well believe, but I don't view it as a best
>practice.
>

I mean the second, and if I have ever encountered a software
distribution that didn't have hardware-specific information in the
autoconf file and/or the source code, I can't remember it.

RM

Stephen Sprunk

unread,
Mar 4, 2004, 7:18:07 PM3/4/04
to
"Robert Myers" <rmy...@rustuck.com> wrote in message
news:2mme40lphgggmfgcs...@4ax.com...

> If it's to be a subject of discussion, then everyone should be able to
> look at the same evidence:
>
> http://www.aceshardware.com/read.jsp?id=60000285
>
> You get a 76% boost by doubling the number of physical processors in
> use on a dual xeon machine (boost due to SMP), and a further 35% boost
> by using hyperthreading on those two processors (boost due to SMT).

It all depends on your benchmark.

http://www.anandtech.com/IT/showdoc.html?i=1982&p=8

AnandTech's tests show a 39% performance increase in going from two Xeon
3.0GHz 4MB processors to four -- hardly stellar scaling from Intel's
flagship x86 server processor. While exact figures aren't given, they say
SMT gave only a 3-5% increase so was enabled in all tests (presumably to
reduce redundant datapoints).

Opterons do noticeably better in SMP because they're not limited by a shared
FSB, but that still does not support your point.

S

--
Stephen Sprunk "Stupid people surround themselves with smart
CCIE #3723 people. Smart people surround themselves with
K5SSS smart people who disagree with them." --Aaron Sorkin


Robert Myers

unread,
Mar 4, 2004, 7:58:06 PM3/4/04
to
On Fri, 05 Mar 2004 00:18:07 GMT, "Stephen Sprunk"
<ste...@sprunk.org> wrote:

>"Robert Myers" <rmy...@rustuck.com> wrote in message
>news:2mme40lphgggmfgcs...@4ax.com...
>> If it's to be a subject of discussion, then everyone should be able to
>> look at the same evidence:
>>
>> http://www.aceshardware.com/read.jsp?id=60000285
>>
>> You get a 76% boost by doubling the number of physical processors in
>> use on a dual xeon machine (boost due to SMP), and a further 35% boost
>> by using hyperthreading on those two processors (boost due to SMT).
>
>It all depends on your benchmark.
>
>http://www.anandtech.com/IT/showdoc.html?i=1982&p=8
>
>AnandTech's tests show a 39% performance increase in going from two Xeon
>3.0GHz 4MB processors to four -- hardly stellar scaling from Intel's
>flagship x86 server processor. While exact figures aren't given, they say
>SMT gave only a 3-5% increase so was enabled in all tests (presumably to
>reduce redundant datapoints).
>
>Opterons do noticeably better in SMP because they're not limited by a shared
>FSB, but that still does not support your point.
>

I'm not sure I know what point you want me to take away from the
example. That reasonable scaling with number of processors is not
guaranteed for all problems and all implementations? That HT doesn't
always pay off by all that much and may even cause performance
degradation? I'll stipulate all that.

The chess benchmark was striking to me for a number of reasons. One
is that multi-threaded benchmarks for non-OLTP non-HPC applications
aren't all that common. Another is that Xeon, with its off-die memory
controller and deep pipeline even managed to stay in the contest with
Opteron on a branchy problem--but only with the help of SMT. The 76%
lift going from one processor to two isn't wonderful, but the
benchmark wasn't intended as a test of scalability to massive
parallelism (even if "massive" is, say, eight threads), and, as I said
to Nick, I wasn't relying on the benchmark for evidence that the
problem at hand would scale if you wanted to work at it. The
benchmark does seem like an example of HT doing exactly what an
optimist might hope it would do: mask the deficiencies of the NetBurst
architecture on a problem where it really shouldn't be expected to do
well.

RM

john jakson

unread,
Mar 4, 2004, 8:21:03 PM3/4/04
to
Rupert Pigott <r...@dark-try-removing-this-boong.demon.co.uk> wrote in message news:<10784257...@saucer.planet.gong>...

> Ken Hagan wrote:
>
> > Robert Myers wrote:
> >
> >>So I didn't pick the best example. I chose it as a task that's easy
> >>to understand that can easily be made embarrassingly parallel.
> >
> >
> > Embarrassing parallelism isn't terribly useful. Spawning zillions
> > of threads will kill any (?) hardware stone dead. So of course we
>
> See Jon Jakson's post. :)
>
> Well, zillions perhaps, but hundreds or thousands, perhaps even
> millions shouldn't be a problem *if* you use a sane scheduling
> mechanism, keep process context to a minimum (note: I say
> processes, not threads) combined with message passing as the
> Transputer did all those years ago.
>

From a HW viewpoint there are a few trivial HW problems that could be
run on a massively par cpu, but probably shouldn't. The classic
example would be our old friend FFT. If the FFT and its buddies can be
run in OK time on 1 cpu, that's probably going to be more efficient as
long as the kernel is tight. But where Transputer style can really
help is to break up big chunks of process code into somewhat smaller
pieces for smoother flow of data through a system and to allow
partitioning across cpus with allocated bandwidth over the IOs. Just
like a chip/pcb designer does across modules/chips/boards.

The important thing is to get useful work done per the cost of
supporting context switch and message overheads. The current design of
many par languages using the OS for thread support breaks that, as the
ratio is way too high to be usable.

Another example is HDL simulation: every decent HDL simulator has an
internal event wheel capable of supporting a completely arbitrary
number of parallel processes that are very short-lived but static.
Such processes might be as simple as a logic gate model or as complex
as a cpu model. The event wheel essentially supports a very light form
of context switch, since all the processes generally have minimal
state. The scheduler would have potentially many levels of priority or
event times in the future; each priority could have thousands of
processes on its list. I've never heard of zillions of processes
breaking such a simulator or a HW-accelerated version of the same. But
then in the HW model, processes are statically built at runtime from a
compile unit, so the memory management is not too difficult. Similar
to the way Inmos constrained Occam to static compile so all memory
could be laid out in advance.
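
Stripped to its bones, such an event wheel is tiny. Here is a sketch
in C of the generic idea -- single priority level, delays assumed
smaller than the wheel, heap-allocated events standing in for static
workspaces -- not the internals of any particular simulator:

/* Minimal sketch of a simulator-style event wheel: a circular array of
   buckets indexed by (time mod WHEEL_SIZE).  Delays are assumed to be
   smaller than WHEEL_SIZE so no overflow list is needed. */
#include <stdio.h>
#include <stdlib.h>

#define WHEEL_SIZE 256

typedef struct event {
    void (*fire)(void *ctx);   /* the "process body": a gate eval, etc. */
    void *ctx;
    struct event *next;
} event;

static event *wheel[WHEEL_SIZE];
static unsigned long now;

static void schedule(void (*fire)(void *), void *ctx, unsigned delay)
{
    event *e = malloc(sizeof *e);
    e->fire = fire;
    e->ctx = ctx;
    unsigned slot = (now + delay) % WHEEL_SIZE;
    e->next = wheel[slot];       /* push onto the bucket for that time */
    wheel[slot] = e;
}

static void advance(void)        /* run everything due at the current time */
{
    unsigned slot = now % WHEEL_SIZE;
    event *e = wheel[slot];
    wheel[slot] = NULL;
    while (e) {
        event *next = e->next;
        e->fire(e->ctx);         /* "context switch" is just a call */
        free(e);
        e = next;
    }
    now++;
}

static void hello(void *ctx) { printf("t=%lu: %s\n", now, (char *)ctx); }

int main(void)
{
    schedule(hello, "gate A evaluated", 3);
    schedule(hello, "gate B evaluated", 3);
    schedule(hello, "cpu model stepped", 10);
    for (int i = 0; i < 16; i++)
        advance();
    return 0;
}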

I am not so sure I would want such a restricted model. I would like to
be able to new [] or delete hierarchies of module/process instances on
the fly (reconfigurable simulated computing) and reclaim space
afterwards; this turns the problem into a malloc problem for zillions
of tiny workspaces & process contexts, and maybe brings along garbage
collection and some linked-list ops too. I can see process creation
calling new_workspace and getting another workspace in a few cycles,
very similar to stack frame creation using something like arenas. But
reclaiming them out of order means memory management in HW, or a mix
of HW & SW on the GC side.

The big question I have for all the cpu architects out there that are
doing hyperthreading is: why stop there? If you can get a few process
contexts to swap state in 0 cycles, why not take advantage of that to
add the message support and scheduling as well? This can be done in
parallel with the process that is running. Probably the reason most HT
cpus can't get there is that they are locked into the register myth of
more-is-better, especially in light of the AMD64 story.

If we can go back to memory-based workspaces that are cached and
support 3-way RRW access, then we can move forward with light context
switches etc. That turns the context switch problem into a HW cache
swapping problem done in the background. As long as the cache can hold
a few hundred workspaces, it will outperform any and all fixed
register models, and we can forget all about stacks too. It also has
the nice property that all workspaces are unbounded, at least in terms
of the cpu core adding 3 workspace offsets to wp to reach into cache
for the 3 operands in a typical ld/st ISA. The instruction encoding
might impose some limit (say 8 bits), but essentially all register
allocation policies go out the window too. Kinda nice to get rid of
stacks and reg limits, let all local operands be in reg (I mean
workspace), and get scheduling, communication, memory allocation and
maybe GC in one fell swoop.

Regards

johnjakson_usa_com


Occam-HDL really all the same thing

del cecchi

unread,
Mar 4, 2004, 8:30:18 PM3/4/04
to

"Robert Myers" <rmy...@rustuck.com> wrote in message
news:tk0b40luq45dk05ir...@4ax.com...
> On Tue, 2 Mar 2004 22:25:18 -0600, "del cecchi" <dce...@msn.com>

> wrote:
>
> >
> >"Robert Myers" <rmy...@rustuck.com> wrote in message
> >news:b74a409ogfa7ddbdd...@4ax.com...
> >>
> >snip
> >> If Gelsinger says we're soon going to have chips made up of lots of
snip
> >snip
> >
> >Define lots
>
> Eight (e.g. Niagara)? Sixteen (e.g. Tukwila/Tanglewood)?
> Tukwila/Tanglewood may have (probably will have?) a pared down issue
> width. Processors with four or more cores as yet to be named when
> Intel announces it is throwing in the towel on NetBurst (the two core
> version of Pentium-M has already been announced).
>
Interesting thought...

> >and little.
>
> Not so little by todays standards. Some (five) years old Patterson
> presentation suggesting that we had already passed the point of common
> sense in terms of the number of transistors per core. I could dig it
> up if you like. Maybe lower voltage, much lower power, slower clocks.
> Wherever the sweet spot is in terms of joules per useful unit of work,
> which is increasingly going to be the figure of merit that counts.

Common sense seems to have little to do with it. Speed costs
transistors, how fast do you want to go? (to paraphrase the hotrodders
of my youth). If spending another 20 million dollars (100 py) and
adding a bunch of transistors makes your chip 20 percent faster than the
competition, is that common sense or not?


>
> >My personal opinion is that the cores will be
> >more powerful that today's processors, if only because it is "easy"
to
> >swizzle to the new technology and today's processors are already
done.
>
> But the importance of keeping power consumption low argues against
> continuing current trends. It may even argue for reversing them as
> in, dare I mention it, Blue Gene.
>
> >And contention and arbitration and coherence make it hard to have too
> >many processors.
> >
> Coherence comes from shared cache, which turns coherence into a
> contention and arbitration issue, I think. Apparently Intel thinks it
> can talk sixteen cores with a straight face. I noticed you snipped
> the reference to PPC 970 in responding to my first post. How many is
> IBM thinking about? ;-).
>
> RM

I don't really know what IBM is planning, and of course if I did I
couldn't say. I think the problem with putting large numbers of
processors in a general-purpose box is that of software. Intel may
believe that breakthroughs in software can be had; I am a little more
skeptical. And in the case of Intel vs. the skeptics it seems to be
Skeptics 1, Intel 0. Or maybe it is tied if you count all the skeptics
who never thought an X86 could go that fast. :-)

I guess we'll see how their 16 cores work.

del


Brian Hurt

unread,
Mar 4, 2004, 8:43:14 PM3/4/04
to
"Ken Hagan" <K.H...@thermoteknix.co.uk> wrote in message news:<c26vkl$2am$1$8300...@news.demon.co.uk>...

> Robert Myers wrote:
> >
> > So I didn't pick the best example. I chose it as a task that's easy
> > to understand that can easily be made embarrassingly parallel.
>
> Embarrassing parallelism isn't terribly useful. Spawning zillions
> of threads will kill any (?) hardware stone dead. So of course we
> don't do that; we ask a single thread to perform the operations
> sequentially.
>

The question is, which is easier to do automatically: parallelize
sequential code, or sequentialize parallel code? I'd say the latter,
and then tell you to take a good look at Cilk:
http://supertech.lcs.mit.edu/cilk/
as an example of sequentializing parallel code.
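
The property Cilk leans on is its "serial elision": delete the spawn
and sync keywords and what remains is ordinary sequential C with the
same meaning. A rough sketch (the keywords are Cilk-5's, shown here as
comments on plain C so the elision is literal; this is not code from
the Cilk distribution itself):

/* Sketch of Cilk-style divide-and-conquer.  In Cilk-5 the commented
   keywords are real syntax; deleting them ("serial elision") leaves
   exactly this sequential C program, which is why sequentializing the
   parallel code is the easy direction. */
#include <stdio.h>

/* cilk */ long fib(int n)
{
    if (n < 2)
        return n;
    long x = /* spawn */ fib(n - 1);   /* may run in parallel with... */
    long y = /* spawn */ fib(n - 2);   /* ...this call */
    /* sync; */                        /* wait for both before using x, y */
    return x + y;
}

int main(void)
{
    printf("fib(30) = %ld\n", fib(30));
    return 0;
}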

Greg Lindahl

unread,
Mar 4, 2004, 9:55:50 PM3/4/04
to
In article <81f0f84e.04030...@posting.google.com>,
Brian Hurt <bh...@spnz.org> wrote:

>The question is, which is easier to do automatically: parallelize
>sequential code, or sequentialize parallel code? I'd say the latter,
>and then tell you to take a good look at Cilk:
>http://supertech.lcs.mit.edu/cilk/
>as an example of sequentializing parallel code.

That isn't the question. The question is: what will people actually
use?

-- greg

Robert Myers

unread,
Mar 4, 2004, 9:58:43 PM3/4/04
to
On Thu, 4 Mar 2004 19:30:18 -0600, "del cecchi" <dce...@msn.com>
wrote:

>
>"Robert Myers" <rmy...@rustuck.com> wrote in message
>news:tk0b40luq45dk05ir...@4ax.com...
>> On Tue, 2 Mar 2004 22:25:18 -0600, "del cecchi" <dce...@msn.com>
>> wrote:

<snip>

>
>> >and little.
>>
>> Not so little by todays standards. Some (five) years old Patterson
>> presentation suggesting that we had already passed the point of common
>> sense in terms of the number of transistors per core. I could dig it
>> up if you like. Maybe lower voltage, much lower power, slower clocks.
>> Wherever the sweet spot is in terms of joules per useful unit of work,
>> which is increasingly going to be the figure of merit that counts.
>
>Common sense seems to have little to do with it. Speed costs
>transistors, how fast do you want to go? (to paraphrase the hotrodders
>of my youth). If spending another 20 million dollars (100 py) and
>adding a bunch of transistors makes your chip 20 percent faster than the
>competition, is that common sense or not?
>

The IBM paper that is cited elsewhere in this thread considers the
tradeoff of power consumption vs. performance. In a fairly short
period of time, we have gone from speed at any cost, to speed only if
we can manage to cool it without further heroic measures, to some as
yet undefined trade between speed and power.

Computational density (useful unit of work per unit volume) probably
works as a good figure of merit if you rule out liquid cooling. It
works as a figure of merit for commercial applications (cooling and
real estate requirements) and for HPC (speed of light considerations).
Those are all (cooling, real estate, and time to traverse the system
at the speed of light) fairly hard requirements and would seem to
subordinate all other considerations, including development costs and
the difficulty and expense of writing software. Allowing liquid
cooling just produces a different competition class with the same
figure of merit (computational density).

<snip>
>
>... I think the problem with putting large numbers of


>processors in a general purpose box is that of software.

That's why I asked the question. :-).

>Intel may
>believe that breakthroughs in software can be had, I am a little more
>skeptical. And in the case of Intel vrs skeptics it seems to be
>Skeptics 1 Intel 0. Or maybe it is tied if you count all the skeptics
>who never thought an X86 could go that fast. :-)
>
>I guess we'll see how their 16 cores work.
>

While you're gloating, I notice that IBM slipped back into Fortune's
list of ten most admired American companies, after an absence of
seventeen long years. Congratulations.

RM

glen herrmannsfeldt

unread,
Mar 4, 2004, 10:54:22 PM3/4/04
to
Terje Mathisen wrote:

(snip regarding disk buffers)

> Yes, but in that case performance is probably good enough anyway. :-)

> With ~ 1 GB/s throughput for wc, grep should not be much slower, and for
> most searches (i.e. those with a least a few constant characters in the
> search term) it should be faster.

Last I knew GNU grep did a Boyer-Moore search using the longest fixed
string on the query. Also, fixed strings are a large fraction of grep
searches, anyway. With a reasonable length string it can be very fast.
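
For reference, the skip-table idea is small enough to sketch. This is
the Horspool simplification rather than whatever GNU grep actually
ships (which adds further tricks), but it shows why a longer fixed
string means bigger skips and fewer characters ever touched:

/* Sketch of Boyer-Moore-Horspool substring search (a simplification of
   what GNU grep does for the fixed part of a pattern).  Returns the
   offset of the first match, or -1.  Longer patterns give bigger skips,
   so most of the text is never even looked at. */
#include <stdio.h>
#include <string.h>

static long bmh_search(const unsigned char *text, long n,
                       const unsigned char *pat, long m)
{
    long skip[256];
    if (m == 0 || m > n)
        return -1;
    for (int c = 0; c < 256; c++)
        skip[c] = m;                    /* default: jump a whole pattern */
    for (long i = 0; i < m - 1; i++)
        skip[pat[i]] = m - 1 - i;       /* distance from char to pattern end */

    long pos = 0;
    while (pos <= n - m) {
        long i = m - 1;
        while (i >= 0 && text[pos + i] == pat[i])
            i--;                        /* compare right to left */
        if (i < 0)
            return pos;                 /* full match */
        pos += skip[text[pos + m - 1]]; /* shift by the last window char */
    }
    return -1;
}

int main(void)
{
    const char *text = "the quick brown fox jumps over the lazy dog";
    const char *pat = "lazy";
    long at = bmh_search((const unsigned char *)text, strlen(text),
                         (const unsigned char *)pat, strlen(pat));
    printf("found at offset %ld\n", at);
    return 0;
}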

> If you have 2 GB RAM, and 1 GB is your current disk cache, then you
> should get answers to most cache-based queries in a second or two,
> except that the OS directory scan and file open/close operations could
> start to dominate for everything except really big files.

I remember doing some timing tests on a machine running a 500MB file
through a program, and noticing that it was faster than the network
it was running on (from an NFS disk). Then I remembered that a 4GB
machine can have a very large disk cache.

-- glen

David C. DiNucci

unread,
Mar 5, 2004, 2:08:25 AM3/5/04
to
Robert Myers wrote:
>
> On Thu, 04 Mar 2004 12:28:56 -0800, "David C. DiNucci"
> <da...@elepar.com> wrote:

<snip>

> >The software will meet several forks in the road as it heads off to
> >different target platforms. It makes sense to allow a human to provide
> >information that can be useful at each fork, thereby refining the
> >original algorithm there in ways that help it to run efficiently on the
> >target without changing it's high-level behavior. It's called
> >optimization, and the formal model behind SC provides guidance into the
> >sorts of optimization that can be applied and the effects it will (not)
> >have. The info at the fork pertains to the software, not parameters for
> >constructing a hardware model, which is separate. If it's necessary
> >later to take a different fork, it's best to still have the un-optimized
> >code to which the optimization for the new target can be applied.
> >That's very different from describing the algorithm in the first place
> >with all sorts of options of how it should behave differently on many
> >different targets. In that latter case, the architecture-independent
> >algorithm may never be apparent even from the initial code.
> >
> I find _that_ hard to disagree with. Editors exist that will let you
> intervene manually in a graph and then tell you whether what you have
> done still makes syntactic sense. There may even be such editors that
> will tell you whether the manual transformation you just forced on the
> graph leaves the graph semantically equivalent to what you started
> with. If so, I haven't encountered such a tool.

The tools you seek are possible (or feasible) only when there is a
sufficient underlying model/theory. My key phrase from the above is


"the formal model behind SC provides guidance into the sorts of
optimization that can be applied and the effects it will (not) have".

> As it is, in a perfect world, one would need a tool that I can almost


> guarantee does not exist; viz, one that would allow you to add
> information that leaves the description of the algorithm unchanged but
> changes the semantic content of the graph in the sense that it gives
> specific guidance as to how to cope with specific hardware.

Exactly, with the possible exception of your use of the word
"semantic". That is, if the algorithm is unchanged relative to the
formal description of SC, then the additional information is necessarily
semantically neutral wrt SC. However, I really only stated that the
model provides guidance into the (semantic) effects they will or won't
have. If some optimizations might change the semantics in some
restricted ways relative to SC's model, then it may still be necessary
to independently show that important properties of the program are still
preserved relative to the algorithm. This might still be easy; it's
just not necessarily free.

Consider your own earlier post:
http://groups.google.com/groups?selm=qenenv8hbi2rtbuvlvqflcvv6rp17onajb%404ax.com
Wouldn't it be nice to have a basic dot product algorithm, which you can
convince yourself (perhaps through proof) is correct, then a set of
optimizations which in some sense "refine" that algorithm for
different situations/platforms in the ways you describe in your post,
without needing to reprove all (or perhaps any) of the correctness all
over again each time?

> I can see
> the desirability of enforcing the distinction, but I'm not optimistic
> that we'll see such tools any time soon.

As you might guess from the above, I have reason to be much more
optimistic, but Elepar IP considerations prevent me from going much
deeper into this just now.

> The thought that one might manually or otherwise intervene in an SC
> program to add semantic information that is beyond the scope of the
> semantics of the original algorithm doesn't seem to bother you. The
> one thing you are adamant about is that a formal description of the
> algorithm should exist in a hardware-independent form, and that seems
> like a reasonable goal. As I've seen things go, though, unless the
> relationship between the original hardware-independent description and
> a version with hardware-specific information in it can be enforced or
> checked automatically, the two will rapidly drift apart as the code
> evolves under the pressure of actual use.

To a great extent, this is a tools and data management issue -- i.e.
keeping the optimizations which apply to this platform separate from the
ones which apply to that other platform separate from the original
hardware-independent form, and applying the appropriate optimizations at
the right times and places.

<snip>

> I mean the second, and if I have ever encountered a software
> distribution that didn't have hardware-specific information in the autoconf
> file and/or the source code, I can't remember it.

This is again a tools issue, which is again heavily influenced by
language model issues, and I'm not about to criticize or restrict an
approach which people find workable. However, one factor with SC is that
you (almost*) need to have tools more advanced than a text editor to
help you in the programming process anyway, so you might as well have
them help you in the other ways as well, especially with a powerful
programming model to give them leverage.

*With ESCORT, it is theoretically possible to program in SC with just a
text editor.

-- Dave
-----------------------------------------------------------------
David C. DiNucci Elepar Consulting for portable

503-439-9431 Beaverton, OR 97006 grid, and P2P computing

Russell Wallace

unread,
Mar 5, 2004, 2:47:44 AM3/5/04
to
On Thu, 04 Mar 2004 18:46:11 -0500, Robert Myers <rmy...@rustuck.com>
wrote:

>As it is, in a perfect world, one would need a tool that I can almost
>guarantee does not exist; viz, one that would allow you to add
>information that leaves the description of the algorithm unchanged but
>changes the semantic content of the graph in the sense that it gives
>specific guidance as to how to cope with specific hardware. I can see
>the desirability of enforcing the distinction, but I'm not optimistic
>that we'll see such tools any time soon.

Not sure exactly what you're looking for, but would DECLARE forms in
Lisp be an example of it? (Or the (now obsolete) 'register' qualifier
in C for that matter?)

--
"Sore wa himitsu desu."
To reply by email, remove
the small snack from address.
http://www.esatclear.ie/~rwallace

Jouni Osmala

unread,
Mar 5, 2004, 3:23:55 AM3/5/04
to
> > So I didn't pick the best example. I chose it as a task that's easy
> > to understand that can easily be made embarrassingly parallel.
>
> Embarrassing parallelism isn't terribly useful. Spawning zillions
> of threads will kill any (?) hardware stone dead. So of course we
> don't do that; we ask a single thread to perform the operations
> sequentially.

That's because the hardware you run on has been built that way.
Consider a situation where you have had at least SMT- if not CMP-style
processors available for a few years, so software developers would be
adding more of them. The next thing to consider is the number of
processors doubling every year. From a HARDWARE point of view, trying
to get faster in ways other than adding more processors to the problem
has become quite an uphill battle. I could envision a "PC" which would
have, let's say, 256 simple cores running in parallel, with memories
in the same package and multiple VERY high speed busses running
between the memory chip and the CPU chip. I'm not saying we get there
tomorrow, or next year, but let's say 10 years. Power saving would be
just taking power off from unused cores/FPU blocks.

> We could break the problem up into a small (2,4,8?) number of sub-
> problems and create worker threads to grind through them all and
> simply wait for the workers to finish in our main thread. If "the
> problem" really is the heart and soul of our program then the
> effort required is probably worth it.
>

> If the embarrasing parallelism is found at some lower level and in
> dozens of different ways, not only does the effort of programming
> it become too great, the run-time overheads of creating all those
> threads might start to exceed the speed up.

Well, then you don't need the power. That's it. If you need processing
power you add more threads; if you are just fine with one core you
don't add threads. Not all problems are parallel, but luckily enough
of the problems which need the performance are parallel. So CPU
manufacturers go that way.



> And of course *everything* eventually becomes I/O bound (or more
> likely memory bound) so the expected return on investment probably
> won't happen.

Uhh. When you have lots of gigabytes of RAM on your desktop, with
multiple TB/s of bandwidth and a reasonably sized on-die cache, you
don't say it's I/O bound or memory bound for one or two processors.
And I/O is speeding up rapidly; there might even be a leap where we
replace the magnetic discs with something with orders-of-magnitude
better latency, like making memory chips that could be used as
permanent storage. Consider that in 10 years a single memory chip
could handle 128 times as much as today, and if that's permanent
storage we could put let's say 8 of them in the same package with the
CPU. Now we are looking at around a TB worth of extremely high
bandwidth, low latency memory that is the permanent storage, in which
case we don't need any disk drives anymore. Of course we are internet
bound anyway ;)

Jan C. Vorbrüggen

unread,
Mar 5, 2004, 3:33:23 AM3/5/04
to
> I mean the second, and if I have ever encountered a software
> distribution that didn't have hardware-specific information in the autoconf
> file and/or the source code, I can't remember it.

Ugh. What counts as a "software distribution"? For instance, would SPEC
CPU2000 count? AFAIK, there's precious little in the SPEC tools
themselves (apart from having two versions for Unix and WNT, for
obvious OS- and not hardware-specific reasons), and some parts - e.g.,
the one I contributed - definitely have none.

Jan

Terje Mathisen

unread,
Mar 5, 2004, 3:43:37 AM3/5/04
to
glen herrmannsfeldt wrote:

> Terje Mathisen wrote:
>
>> With ~ 1 GB/s throughput for wc, grep should not be much slower, and
>> for most searches (i.e. those with a least a few constant characters
>> in the search term) it should be faster.
>
> Last I knew GNU grep did a Boyer-Moore search using the longest fixed

Yeah, that's the obvious approach.

> string on the query. Also, fixed strings are a large fraction of grep
> searches, anyway. With a reasonable length string it can be very fast.

My point was that it should be very hard to make grep slower than wc,
and wc should run at main memory streaming rates.

>> If you have 2 GB RAM, and 1 GB is your current disk cache, then you
>> should get answers to most cache-based queries in a second or two,
>> except that the OS directory scan and file open/close operations could
>> start to dominate for everything except really big files.
>
> I remember doing some timing tests on a machine running a 500MB file
> through a program, and noticing that it was faster than the network
> it was running on (from an NFS disk). Then I remembered that a 4GB
> machine can have a very large disk cache.

BTDT. :-(

When I benchmarked multi-stream laptop disk file writing, I used the
normal C library and generated 6 100 MB files.

When I then went back and tried to read the same data, it took less than
a second, since it was all still in cache.

The proper way to do it would have been to make it OS-specific, using
low-level calls to do page-aligned, unbuffered transfers to/from the
disk.
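
On Linux, for instance, that would mean something along these lines --
a sketch only, with 4 KB assumed as an acceptable alignment and block
size for the device, and most error handling omitted:

/* Sketch of bypassing the OS file cache on Linux with O_DIRECT.
   Assumes 4 KB is an acceptable alignment/block size; real code would
   query that instead of hard-coding it. */
#define _GNU_SOURCE            /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define CHUNK (1 << 20)        /* 1 MB per read, a multiple of 4 KB */

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, 4096, CHUNK) != 0) { perror("memalign"); return 1; }

    long long total = 0;
    ssize_t n;
    while ((n = read(fd, buf, CHUNK)) > 0)   /* data comes from the disk, */
        total += n;                          /* not from the buffer cache */

    printf("read %lld bytes uncached\n", total);
    free(buf);
    close(fd);
    return 0;
}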

Terje

--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

Ken Hagan

unread,
Mar 5, 2004, 6:23:13 AM3/5/04
to
I *think* I've been misunderstood, so I'll pick the friendliest
reply (!) and try to clarify. I was responding to Robert's remark
that...

> So I didn't pick the best example. I chose it as a task that's easy
> to understand that can easily be made embarrassingly parallel.

...which I took to presuppose that embarrassingly parallel problems
should be easy to speed up. I was trying to say why an unexceptional
programmer like myself might find it harder than that.

Sander Vesik wrote:
>
> It won't kill every kind of hardware - hardware that has zillions
> of processors or zillions of thread contexts (asuming a tera style
> processor in SMP conf) will be very happy.

True. I was (as might have been more obvious later) assuming that
the platform was Linux-cum-Wintel shaped. (Probably not a fair
assumption in this group!) So my arguments are based on a small
number of fairly heavyweight threads (well, processes sharing an
address space, really).

John Jakson's reply explicitly challenged that assumption and I'd
happily withdraw my conclusions if someone could deliver a much
lighter threading primitive. Say, something lightweight enough
that a compiler could unroll loops with it. :)

> Right. Now what if such threads came for free and were widely

> available?

This was roughly where my closing remarks were going. I do expect
to see future hardware being N-way parallel for small values of N
even if there is only one chip in the box. (ie, for free)

But I'm getting ahead of myself. First, I made the point that
actually producing a modestly parallel implementation (and I'm
assuming existing tools, aka "worst case choice of language")
is sufficiently hard that you'd need to be fairly confident of
reaping the benefits. Problem: you can have no such confidence
as long as your N-way CPU is connected to a single main memory
because everything ends up waiting for the data to turn up.

> As things stand, most things are far from I/O bound. Also,


> making them less so is not that hard if you have more than
> one PCI-X bus.

I shouldn't have said "I/O bound (or more likely memory bound)".
I should have said "memory bound" and left it at that. I was
distracted by the "grep" example which had already come up.

> Right, and now imagine if P4 has say 32 threads instead of two. How
> wuould you change the software? You wouldn't really ignore the chance
> to have extra worker threads, would you?

This depends. If I have to split my work into a dozen pieces or
more in order to approach the peak performance of the hardware,
then...

I will have a go. Right now that would be something of a
re-write, but in a few years' time it will be less so, as
long as the intervening years saw me targeting 4-way or
8-way systems.

I doubt whether customers will buy the chip.

If the chip performance is pretty good with only a few threads
active, I won't bother; the market won't exist for the chip and
the likes of Intel won't make it.

Which brings me nicely to my "no revolutions" remark at the end.

>> No revolutions, though. In 2010, most software will be written in
>> C/C++, just like the operating systems they sit upon, and compiled
>> to a machine language that the 8086 designers would recognise. If
>
> So? This in no way means threads - or even a large amount of threads
> is unusable solution.

That depends on what you call "large". You mentioned 32 and Rupert
Pigott mentioned millions. I doubt the same hardware can do both,
since otherwise we'd have heard. That being the case, the hardware
can't change dramatically without losing its market.

Another way of looking at the software problem is that almost every
program has parts which are naturally N-way parallel. The problem is
that the value of N varies from 1 to millions and back again in the
space of a few instructions. That's why embarrasing parallelism
isn't always terribly useful -- it is simply too hard to isolate a
given stretch of N-way code and map it onto M-way hardware, except
for the trivial case of M=1.

Better languages can help if they automate the mapping process.
However, the unwashed hordes have shown that any language that
doesn't look like C (ignore the semantics, I'm talking syntax and
tokens here) won't get shelf-loads of books and marketing. I find
it depressing that language designers haven't learned this lesson
and still produce new languages which differ gratuitously from
the "norm". (Once again, I'm aware that most of this group can
probably remember a time before C even existed, let alone became
the syntax norm, but it's 2004 now.)

If we insist on shipping precompiled applications, then parts of
the mapping have to be done before the value of M is known. This
appears to me to be harder still.

As with "C-like syntax", we know the solution to this problem as
well, since as far as I can tell it has been the standard way of
distributing software since the dawn of time in every part of the
industry except PCs. There is presumably some economic or political
reason why PCs chose to do it differently, but VMs seem to be coming
into vogue. (They probably aren't useful, since most of the high-
level program knowledge has been discarded by the time you get to
byte-code or MSIL, but they may get users used to the idea of
compiling their own apps.)


Sander Vesik

unread,
Mar 5, 2004, 8:03:10 AM3/5/04
to

Fortran, C, C++ and Java? Well, OK, and a smattering of other languages.

>
> -- greg

john jakson

unread,
Mar 5, 2004, 9:44:13 AM3/5/04
to
reeu...@few.vu.nl (Kees van Reeuwijk) wrote in message news:<1ga565d.10u4e031u0yzquN%reeu...@few.vu.nl>...

Actually the language I have in mind and in compiler development
thanks to lcc is a hybrid of Verilog, C & Occam. I only just switched
back to cpu development to push that side of the project along.

It's easy to see how to put Occam primitives into C; that's been done
a few times. But combining Verilog allows for HDL or CSP style, which
I believe can be complementary. Further, if programs are written from,
say, math descriptions and are embarrassingly parallel, models can be
written in both styles and evaluated. The C/Occam style will be
written more quickly and perform well. On a Transputer style, such
programs are also scalable.

Now also imagine the HDL form, which can use either a non-synth
Verilog events style somewhat similar to other event languages or,
better still, writing in RTL style. If the Transputer model supports
event wheel scheduling as well as par,alt?! scheduling, then guess
what: it runs maybe even slower than Occam style (there is more
overhead to support), BUT it can be synthed into FPGA HW and sit side
by side with the cpu.

Ultimately Transputing and FPGAs are mutually complementary, no reason
to force one over the other. Different parts of a problem go to the
side which best supports it. If it looks like an engine, code it RTL
style and place on FPGA as many times as needed. If it doesn't, keep
it in code. But IO is always likely to be the true bottleneck.

Robert Myers

unread,
Mar 5, 2004, 11:53:33 AM3/5/04
to
On Fri, 05 Mar 2004 07:47:44 GMT, wallacet...@eircom.net (Russell
Wallace) wrote:

>On Thu, 04 Mar 2004 18:46:11 -0500, Robert Myers <rmy...@rustuck.com>
>wrote:
>
>>As it is, in a perfect world, one would need a tool that I can almost
>>guarantee does not exist; viz, one that would allow you to add
>>information that leaves the description of the algorithm unchanged but
>>changes the semantic content of the graph in the sense that it gives
>>specific guidance as to how to cope with specific hardware. I can see
>>the desirability of enforcing the distinction, but I'm not optimistic
>>that we'll see such tools any time soon.
>
>Not sure exactly what you're looking for, but would DECLARE forms in
>Lisp be an example of it? (Or the (now obsolete) 'register' qualifier
>in C for that matter?)

They would be, but (within the limits of my understanding of DECLARE
in lisp) they aren't terribly powerful.

I'm not sure why I worry about all this stuff so much, because I could
have gotten through a fair part of my career with a good
multi-dimensional FFT, and you can almost always count on someone
having beaten multi-dimensional FFT's to death on practically any
hardware platform you can imagine.

To do it properly, though, you have to know a lot about the machine:
how big is the cache, how many processors are available, what kinds of
data grouping and movement are the most economical. That information
in hand, you have to do things that are more than merely advisory: you
choose different strategies depending on the actual hardware.

Such things are straightforwardly accomplished in C and probably now in
Fortran, but the code is messy.
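
To give a flavour of the messiness, here is a sketch in C of the kind
of run-time dispatch involved -- a naive versus cache-blocked
transpose, the sort of data-movement choice a tuned FFT makes
internally. The cache size is passed in as a hint because even
obtaining it is platform-specific, and the thresholds are made-up
heuristics, not tuned numbers:

/* Sketch of hardware-dependent strategy selection: pick a naive or a
   cache-blocked matrix transpose based on a cache-size hint supplied
   by the caller.  How you obtain the hint (sysconf, /proc, cpuid, ...)
   is exactly the platform-specific mess being discussed. */
#include <stddef.h>
#include <stdio.h>

static void transpose_naive(double *dst, const double *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            dst[j * n + i] = src[i * n + j];
}

static void transpose_blocked(double *dst, const double *src, size_t n,
                              size_t blk)
{
    for (size_t ii = 0; ii < n; ii += blk)
        for (size_t jj = 0; jj < n; jj += blk)
            for (size_t i = ii; i < ii + blk && i < n; i++)
                for (size_t j = jj; j < jj + blk && j < n; j++)
                    dst[j * n + i] = src[i * n + j];
}

void transpose(double *dst, const double *src, size_t n, size_t cache_bytes)
{
    /* If both matrices fit comfortably in cache, blocking buys nothing. */
    if (2 * n * n * sizeof(double) <= cache_bytes / 2) {
        transpose_naive(dst, src, n);
        return;
    }
    /* Otherwise grow the block while a src tile plus a dst tile still
       fit in roughly half the cache. */
    size_t blk = 8;
    while (2 * (blk * 2) * (blk * 2) * sizeof(double) <= cache_bytes / 2)
        blk *= 2;
    transpose_blocked(dst, src, n, blk);
}

int main(void)
{
    enum { N = 512 };
    static double a[N * N], b[N * N];
    for (size_t i = 0; i < N * N; i++)
        a[i] = (double)i;
    transpose(b, a, N, 1 << 20);   /* pretend we detected a 1 MB cache */
    printf("%g\n", b[1]);          /* keep the work observable */
    return 0;
}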

If you have a normal form for descriptions of algorithms in whatever
language you are working in, and an algorithm for reducing
descriptions of algorithms to normal form (throwing away extraneous
information in the process, if necessary), the messiness need not
bother you in at least one respect: you should always be able to
determine whether two descriptions of the algorithm are equivalent in
whatever formal way you have determined is important.

Given that more exact description of what is required, it is easier to
imagine that such tools might either exist or be possible to create in
situations that matter.

RM

Robert Myers

unread,
Mar 5, 2004, 12:08:47 PM3/5/04
to

It is perhaps telling that you had to go to a situation where
hardware neutrality is a requirement to find an example?

As it is, I don't even know exactly what constitutes a CPU2000
distribution, but I think it a safe bet that hardware-specific
adjustments to the make file would be an inevitable part of
benchmarking. For your average Unix administrator doing an unzip and
install, that sort of thing would ordinarily be taken care of by the
autoconf step.

That doesn't begin to touch what kind of skullduggery might be
possible with header and library files, but I obviously don't know the
ground rules in any detail.

RM

Jan C. Vorbrüggen

unread,
Mar 5, 2004, 12:19:53 PM3/5/04
to
> As it is, I don't even know exactly what constitutes a CPU2000
> distribution, but I think it a safe bet that hardware-specific
> adjustments to the make file would be an inevitable part of
> benchmarking.

No, the only mods are made automatically to support various compilers
and their options. There are some pre-ordained - on a per-application
basis - defines you are allowed to use, and (especially in the integer
part) some reflect hardware properties; for instance, whether the system
is little- or big-endian. I consider these so-called portability flags
a kludge: they shouldn't exist, but then SPEC sometimes needs to take
what it can get. But these flags are few and far between.

Of course, the FP part is mostly in Fortran, which has no standard
notion of conditional compilation...

Jan

David Gay

unread,
Mar 5, 2004, 1:19:24 PM3/5/04
to

Robert Myers <rmy...@rustuck.com> writes:
> If you have a normal form for descriptions of algorithms in whatever
> language you are working in, and an algorithm for reducing
> descriptions of algorithms to normal form (throwing away extraneous
> information in the process, if necessary), the messiness need not
> bother you in at least one respect: you should always be able to
> determine whether two descriptions of the algorithm are equivalent in
> whatever formal way you have determined is important.

What you're asking for in the paragraph above will, I believe, never exist:
I believe that any useful, general-purpose algorithm description language
will be subject to all the usual undecidability problems (termination,
equivalence, etc, etc).

For sufficiently constrained domains, you might solve this.

--
David Gay
dg...@acm.org

Eugene Miya

unread,
Mar 5, 2004, 1:31:33 PM3/5/04
to
Just passing through.

It's been argued that PP will never be mainstream.
The big thing to remember is that right now parallelism is largely an
O(n) solution to problems with computational complexity in some cases
vastly exceeding O(N^3). It's all going to depend on your n.
The vast majority of people reading this (or not) fail to understand this.

Who: in the near future it will be written by people who write it now.
Excepting high visibility benchmarks and people out for attention,
these smart people will write the apps for their areas and remain
relatively anonymous except in their communities.

How: One line at a time. Right now it's mostly some version of MPI.
And they debug using print statements. I see no drop in parallelism.

A few in the masses may get access to libraries, but that's as close to
mainstream as you are going to get.

--

Nick Maclaren

unread,
Mar 5, 2004, 1:31:59 PM3/5/04
to

In article <s711xo7...@beryl.CS.Berkeley.EDU>,
David Gay <dg...@beryl.CS.Berkeley.EDU> writes:

|> Robert Myers <rmy...@rustuck.com> writes:
|> >
|> > If you have a normal form for descriptions of algorithms in whatever
|> > language you are working in, and an algorithm for reducing
|> > descriptions of algorithms to normal form (throwing away extraneous
|> > information in the process, if necessary), the messiness need not
|> > bother you in at least one respect: you should always be able to
|> > determine whether two descriptions of the algorithm are equivalent in
|> > whatever formal way you have determined is important.
|>
|> What you're asking for in the paragraph above will, I believe, never exist:
|> I believe that any useful, general-purpose algorithm description language
|> will be subject to all the usual undecidability problems (termination,
|> equivalence, etc, etc).

I am pretty certain that you are right. It isn't obviously derivable
from one of the standard insoluble problems, but it assuredly looks
at least as general as some of them!

If nothing else, the existence of such an algorithm would immediately
prove that there are no one-way functions :-)

|> For sufficiently constrained domains, you might solve this.

Yes. And it has been done for some VERY simple ones.


Regards,
Nick Maclaren.

Kai Harrekilde-Petersen

unread,
Mar 5, 2004, 3:17:19 PM3/5/04
to
Terje Mathisen <terje.m...@hda.hydro.com> writes:

That, or run an app that mallocs all memory on the machine and then
writes one byte to each page sequentially. I did one for Linux
(although the code is very generic) called 'swapout' for obvious
reasons. Dirty, but very very efficient at avoiding caching effects
:-)
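
Presumably something along these lines -- a sketch of the idea, not
Kai's actual code, with page and chunk sizes hard-coded as assumptions:

/* Sketch of a 'swapout'-style cache flusher: allocate until malloc
   fails, dirty one byte per page so the pages are really instantiated,
   and thereby push everything else out of RAM.  A real version might
   use sysconf(_SC_PAGESIZE) instead of assuming 4 KB pages. */
#include <stdio.h>
#include <stdlib.h>

#define CHUNK (16u * 1024 * 1024)   /* allocate in 16 MB pieces */
#define PAGE  4096u

int main(void)
{
    unsigned long total = 0;
    char *p;

    while ((p = malloc(CHUNK)) != NULL) {
        for (unsigned i = 0; i < CHUNK; i += PAGE)
            p[i] = 1;               /* one write per page is enough */
        total += CHUNK;
        fprintf(stderr, "\rdirtied %lu MB", total >> 20);
    }
    fprintf(stderr, "\nmalloc finally failed; caches should be cold now\n");
    return 0;
}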


/Kai

David C. DiNucci

unread,
Mar 5, 2004, 5:39:34 PM3/5/04
to
Eugene Miya wrote:
>
> Just passing through.
>
> It's been argued that PP will never be mainstream.
> The big thing to remember is that right now parallelism is largely an
> O(n) solution to problems with computational complexity in some cases
> vastly exceeding in some cases O(N^3). It's all going to depend on your n.
> The vast majority of people reading this (or not) fail to understand this.

I guess I'm one of them. For example, I don't understand where these
complexity assertions came from, or the possible significance of "n" vs.
"N". Maybe "O(n) solution" comes from some payback related to
functional decomposition? There are certainly ways to get more bang for
your buck, through data parallelism, parallel recursion, reuse, etc.
And I can only guess that the O(N^3) relates somehow to the analysis
required to decompose sequential code. But much of that is discovering
the complexity which arose from haphazardly adding inter-dependences
when *composing* the sequential code. By constructing the code in
parallel in the first place, with tools which help to organize each new
dependence, the construction process is hardly more difficult (and
certainly much more manageable) than sequential.

> Who: in the near future it will be written by people who write it now.
> Excepting high visibility benchmarks and people out for attention,
> these smart people will write the apps for their areas and remain
> relatively anonymous except in their communities.

Again, I disagree. If parallel tools provide a better environment for
writing even "sequential" code than existing tools do in important ways,
existing "sequential" programmers will migrate to those tools, with
potential parallelism in their applications being frosting rather than
the cake.

> How: One line at a time. Right now it's mostly some version of MPI.
> And they debug using print statements. I see no drop in parallelism.

Strike three. While I agree with how they do it now, I believe most
parallelism in the future will be block-by-block, module-by-module,
and/or app-by-app rather than line-by-line. Individual blocks, modules,
and in some cases apps, will look much like they do now, written in the
same languages with the same tools.

> A few in the masses may get access to libraries, but that's as close to
> mainstream as you are going to get.

Only if people are scared to death of implementing parallel code because
of assertions like you're making here! :-)

-- Dave
-----------------------------------------------------------------
David C. DiNucci Elepar Tools for portable grid,
da...@elepar.com http://www.elepar.com parallel, distributed, &
503-439-9431 Beaverton, OR 97006 peer-to-peer computing

David C. DiNucci

unread,
Mar 5, 2004, 5:16:39 PM3/5/04
to
Robert Myers wrote:
<snip>

> Such things are straightforwardly accomplshed in c and probably now in
> Fortran, but the code is messy.
>
> If you have a normal form for descriptions of algorithms in whatever
> language you are working in, and an algorithm for reducing
> descriptions of algorithms to normal form (throwing away extraneous
> information in the process, if necessary), the messiness need not
> bother you in at least one respect: you should always be able to
> determine whether two descriptions of the algorithm are equivalent in
> whatever formal way you have determined is important.

Take an extreme example. I choose my normal form to be just the
input-output mapping/table represented by the description of the
algorithm. The algorithm for reducing the description of the algorithm
to the normal form is to run the description, according to the stated
semantics of its representation, on all possible inputs to produce all
possible outputs, from which I construct that mapping/table. Can I now
just compare the normal forms of two descriptions to determine whether
they're equal? No, because it is (in general) undecideable whether I can
create even this simplistic normal form in finite time, even if I can
look into the inner workings of the algorithmic descriptions, if those
descriptions are partial recursive.

However, I want to make clear that your claim here is not what I was
claiming regarding SC optimization. For example, in my case, I don't
care whether a particular algorithm halts or not, or whether it is
equivalent to some other arbitrary algorithm. All I care is that I know
whether a certain optimization/transformation I perform to the algorithm
might alter its halting properties.

Nick Maclaren

unread,
Mar 5, 2004, 5:12:31 PM3/5/04
to
In article <404901A6...@elepar.com>,

David C. DiNucci <da...@elepar.com> wrote:
>Eugene Miya wrote:
>>
>> Just passing through.
>>
>> It's been argued that PP will never be mainstream.
>> The big thing to remember is that right now parallelism is largely an
>> O(n) solution to problems with computational complexity in some cases
>> vastly exceeding in some cases O(N^3). It's all going to depend on your n.
>> The vast majority of people reading this (or not) fail to understand this.
>
>I guess I'm one of them. For example, I don't understand where these
>complexity assertions came from, or the possible significance of "n" vs.
>"N". Maybe "O(n) solution" comes from some payback related to
>functional decomposition? There are certainly ways to get more bang for
>your buck, through data parallelism, parallel recursion, reuse, etc.
>And I can only guess that the O(N^3) relates somehow to the analysis
>required to decompose sequential code. But much of that is discovering
>the complexity which arose from haphazardly adding inter-dependences
>when *composing* the sequential code. By constructing the code in
>parallel in the first place, with tools which help to organize each new
>dependence, the construction process is hardly more difficult (and
>certainly much more manageable) than sequential.

At one level, I agree. The vast majority of "grand challenge" problems
are inherently parallel at the level of the basic physics, and it is
the mapping to a mathematically tractable form that introduces the
serialisation. Whether this means that a suitable parallel programming
paradigm would make it easy and natural to program them efficiently is
less than clear.

But it is assuredly a topic that needs some serious research.


Regards,
Nick Maclaren.

Sander Vesik

unread,
Mar 5, 2004, 5:14:51 PM3/5/04
to
Ken Hagan <K.H...@thermoteknix.co.uk> wrote:
> I *think* I've been mis-understood, so I'll pick the friendliest
> reply (!) and try to clarify. I was responding to Robert's remark
> that...
>
> > So I didn't pick the best example. I chose it as a task that's easy
> > to understand that can easily be made embarrassingly parallel.
>
> ...which I took to pre-suppose that embarrasingly parallel problems
> should be easy to speed up. I was trying to say why an unexceptional
> programmer like myself might find it harder than that.

Embarrassingly parallel problems are easy to speed up for even very
beginning programmers - while it's not commonly done AFAIK, many of
these are at a level that you could give as homework towards the end
of programming 101 and expect to get back working results. Sure, these
might not scale well beyond a handful of cpus, but it's not rocket
science either if you look for small scalability.

>
> Sander Vesik wrote:
> >
> > It won't kill every kind of hardware - hardware that has zillions
> > of processors or zillions of thread contexts (asuming a tera style
> > processor in SMP conf) will be very happy.
>
> True. I was (as might have been more obvious later) assuming that
> the platform was Linux-cum-Wintel shaped. (Probably not a fair
> assumption in this group!) So my arguments are based on a small
> number of fairly heavyweight threads (well, processes sharing an
> adderss space, really).

While it's true that using lightweight parallelism will allow you to
easily support many more threads, it's not true that supporting lots
of posix-style heavy threads running more or less concurrently is
impossible or exceptionally hard to do. It's just that so far most
emphasis has been on single-thread performance, but when that starts
to have increasingly diminishing returns, you can have lots of
parallelism easily from CMP, SMT and barrel processors (and as they
are not exclusive, you can have *LOTS* of it).

>
> John Jakson's reply explicitly challenged that assumption and I'd
> happily withdraw my conclusions if someone could deliver a much
> lighter threading primitive. Say, something lightweight enough
> that a compiler could unroll loops with it. :)

Compilers already can do - and do do - this with posix threads, and
there are codes that benefit well from this.
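
One widely shipped form of that is OpenMP-style directives, usually
built on top of posix threads underneath, where the compiler and its
runtime own the thread creation and the loop split. A minimal sketch
(a compiler without OpenMP simply ignores the pragma and runs the loop
sequentially, unchanged):

/* Sketch: letting the compiler split a loop across threads via an
   OpenMP directive.  Without OpenMP support the pragma is ignored and
   the same source builds as an ordinary sequential program. */
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++) {       /* set up some data */
        a[i] = i * 0.5;
        b[i] = i * 0.25;
    }

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)         /* compiler farms out chunks */
        sum += a[i] * b[i];

    printf("dot product = %g\n", sum);
    return 0;
}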

>
> > Right. Now what if such threads came for free and were widely
> > available?
>
> This was roughly where my closing remarks were going. I do expect
> to see future hardware being N-way parallel for small values of N
> even if there is only one chip in the box. (ie, for free)

as long as you are prepared to view 'dozens' as small it sounds ok ;-)

>
> But I'm getting ahead of myself. First, I made the point that
> actually producing a modestly parallel implementation (and I'm
> assuming existing tools, aka "worst case choice of language")
> is sufficiently hard that you'd need to be fairly confident of
> reaping the benefits. Problem: you can have no such confidence
> as long as your N-way CPU is connected to a single main memory
> because everything ends up waiting for the dat to turn up.

It depends. For codes that are hard limited on memory latency, this
is true for any single thread. But it may not be true for a bundle
of such threads if hardware was pro-actively designed to counter
such. Consider an architecture that involuntarily switched threads
on a cache miss and supported a number of inactive, waiting threads.
As fetching data from main memory will take the same time as hundreds
of instructions, you may be able to support many such threads.

IBM northstar was such a CPU with the number of active and inactive
threads being both 1.

>
> > As things stand, most things are far from I/O bound. Also,
> > making them less so is not that hard if you have more than
> > one PCI-X bus.
>
> I shouldn't have said "I/O bound (or more likely memory bound)".
> I should have said "memory bound" and left it at that. I Was
> distracted by the "grep" example which had already come up.

But is 'memory bound' in this case bandwidth or latency? A processor
with multiple threads of execution (no matter how implemented) and
a memory system that supports multiple outstanding memory requests
will be able to convert a portion of memory bandwidth into latency
tolerance beyond what a single-thread-only processor can.

>
> > Right, and now imagine if P4 has say 32 threads instead of two. How
> > wuould you change the software? You wouldn't really ignore the chance
> > to have extra worker threads, would you?
>
> This depends. If I have to split my work into a dozen pieces or
> more in order to approach the peak performance of the hardware,
> then...
>
> I will have a go. Right now that would be something of a
> re-write, but in a few years time it will be less so, as
> long as the intervening years saw me targetting 4-way or
> 8-way systems.
>
> I doubt whether customers will buy the chip.
>
> If the chip performance is pretty good with only a few threads
> active, I won't bother, the market won't exist for the chip and
> the likes of Intel won't make it.

in throughput-oriented chips, a program using M/N threads out of M
will get at most 1/N of the peak performance. Let's say M=8, something
that might easily be appearing in 3-4 years. If you can only use 2
threads, you won't get more than 1/4 of the available performance.

>
> Which brings me nicely to my "no revolutions" remark at the end.
>
> >> No revolutions, though. In 2010, most software will be written in
> >> C/C++, just like the operating systems they sit upon, and compiled
> >> to a machine language that the 8086 designers would recognise. If
> >
> > So? This in no way means threads - or even a large amount of threads
> > is unusable solution.
>
> That depends on what you call "large". You mentioned 32 and Rupert
> Pigott mentioned millions. I doubt the same hardware can do both,
> since otherwise we'd have heard. That being the case, the hardware
> can't change dramatically without losing its market.

32 is something that you will probably get vendors to admit is a
reasonable ballpark figure for a chip shipping in 10 years. Note that
a dual-core
Xeon would already have 4 active threads and reduce it to a question
of "will there be 8x more threads".

I don't see millions of heavy threads anytime soon - but I can see tens of
thousands per server in not more than 15 years. With 16 or 32 threads
per cpu, even quite modest 8-16 way SMPs will have a lot of parallel
threads in action.

>
> Another way of looking at the software problem is that almost every
> program has parts which are naturally N-way parallel. The problem is
> that the value of N varies from 1 to millions and back again in the
> space of a few instructions. That's why embarrasing parallelism
> isn't always terribly useful -- it is simply too hard to isolate a
> given stretch of N-way code and map it onto M-way hardware, except
> for the trivial case of M=1.

Umm... that's not embarrassingly parallel, at least not in the way
it's commonly understood.

[snip]


>
> If we insist on shipping precompiled applications, then parts of
> the mapping have to be done before the value of M is known. This
> appears to me to be harder still.

This is not quite true - if the problem is partitionable or has
a pipeline in it (etcetera) then you can more or less use any
number of threads from 2 to the partitioning limit without language
being a problem.

>
> As with "C-like syntax", we know the solution to this problem as
> well, since as far as I can tell it has been the standard way of
> distributing software since the dawn of time in every part of the
> industry except PCs. There is presumably some economic or political
> reason why PCs chose to do it differently, but VMs seem to be coming
> into vogue. (They probably aren't useful, since most of the high-
> level program knowledge has been discarded by the time you get to
> byte-code or MSIL, but they may get users used to the idea of
> compiling their own apps.)
>

A .jar file or ... whatever MSIL is distributed as ... is from the
user's point of view just the same as if they had been handed a binary.

TTK Ciar

unread,
Mar 5, 2004, 5:07:34 PM3/5/04
to

I believe that the obstacles to innovation in the underlying
architectures will also define the path that innovation will take
(ie, innovation is kept down, but not out, and it will eventually
trace the path of least resistance towards market-driven rewards).

More specifically, I refer to the vicious cycle familiar to many
of the comp.arch regulars: new processor paradigms fail to catch
on unless they can be programmed using established programming
practices, and programming practices are a function of how commodity
processors can be easily programmed.

One of the ways we have been able to utilize commodity processors
for high-end applications is by making them into clusters, and using
well-known client/server approaches to writing the software. For
the past few years, I have seen and participated in the in-house
development of various tools (most of them modest, but effective)
for making the development of well-behaved distributed applications
faster and easier. The most successful efforts I am aware of did
not try to strive for elegance or efficiency, but rather rapid
development without producing prohibitively bad solutions.

The place I'm working for now, the Internet Archive, uses a cute
little application called "p2" which the user invokes at the shell
prompt, passing it as arguments a shell command and the name of a
class of cluster nodes. It ssh's out to these machines in parallel,
causes them to execute the given command, and optionally performs
operations on the stdout such as a staged distributed sort. The
processed result is then written to stdout on the user's local node.
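
The core mechanism is nothing exotic; stripped of the node-class
handling and the output post-processing, it amounts to something like
this sketch in C (hostnames and command hard-coded for illustration --
this is the general idea, not p2's actual source):

/* Sketch of the core of a p2-style tool: fork one ssh per node, run
   the same command everywhere, wait for all of them.  The real tool
   also classes nodes and collates/sorts the collected output. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    const char *nodes[] = { "node001", "node002", "node003" };
    const char *cmd = "uptime";
    int n = sizeof nodes / sizeof nodes[0];

    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid == 0) {
            /* child: its stdout/stderr stay attached to ours */
            execlp("ssh", "ssh", nodes[i], cmd, (char *)NULL);
            perror("execlp");        /* only reached if exec failed */
            _exit(127);
        } else if (pid < 0) {
            perror("fork");
        }
    }

    int status;
    while (wait(&status) > 0)        /* block until every ssh has finished */
        ;
    return 0;
}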

This "p2" could be a lot better -- complaining about it is a
favorite pastime among the administrators and engineers who use it.
But it gets the job done. Applications have been written which use
exec() (or qx() as the case may be) of p2 as their sole means of job
distribution, and though these applications do not perform very well,
they scale well enough (the Archive cluster is about 770 nodes right
now) and were developed quickly and easily.

I could name other applications which I have encountered, but the
important thing is what they have in common:

* They do not hide the distributed nature of the system, but let
a little understanding on the user's part go a long way.

* While they do not produce great solutions, they do not produce
horribly suboptimal ones.

* People can use them to solve various real-life problems.

I believe that by not trying to fully solve the problem of
abstracting away the distributed nature of the system, and by not
trying to achieve perfect efficiency, the developers of these tools
are able to focus instead on making them flexible and robust. This
might not be a bad formula to follow when trying to develop new
tools for producing distributed applications.

Changing focus to the hardware side of things, it seems to me
that some of the more effective approaches to making scalable
distributed systems make the computer look more like a cluster.
AMD, for instance, uses effectively a network of parallel memory
busses on the motherboard to improve the performance of the memory
system, and places the processors in this network much in the way
nodes are placed in a conventional cluster. There are differences
between NUMA systems and cluster systems, but I think that from
the perspective of software engineering this is mostly a difference
in degree rather than kind (NUMA's apparently-unified address space
is a cute bit of syntactic sugar, but the same effect can be
accomplished on a cluster by making the programmers access data
through an interface which maps the virtual space to physical node
spaces).
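
To put the "syntactic sugar" point concretely: the whole trick is a
function from a global index to a (node, local offset) pair; whether
the fetch behind that function is a load instruction or a network
round-trip is an implementation detail. A toy sketch (untested;
fetch_from_node() is an imaginary stand-in for whatever remote-read
primitive, ssh, sockets, or otherwise, you actually have):

# Toy "global array" striped across nodes: global element i lives on
# node (i % nnodes) at local slot int(i / nnodes).
package GlobalArray;
use strict;

sub new {
    my ( $class, $nnodes ) = @_;
    return bless { nnodes => $nnodes }, $class;
}

sub locate {
    my ( $self, $i ) = @_;
    return ( $i % $self->{nnodes}, int ( $i / $self->{nnodes} ) );
}

sub get {
    my ( $self, $i ) = @_;
    my ( $node, $slot ) = $self->locate ( $i );
    # On a NUMA box this is just a (possibly slow) load; on a cluster
    # it is a remote read.  fetch_from_node() is a stub for either.
    return fetch_from_node ( $node, $slot );
}

sub fetch_from_node {     # stand-in for ssh/sockets/shared memory/etc.
    my ( $node, $slot ) = @_;
    return "value of slot $slot on node $node";
}

package main;
my $ga  = GlobalArray->new ( 4 );
my $val = $ga->get ( 10 );         # element 10 -> node 2, slot 2
print "$val\n";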

It seems reasonable to me, then, that software tools which are
being developed now to make life easier for the cluster software
engineer might be evolved to also fit the NUMA "cluster in a box"
problem-space.

For instance, let's say we write a meta-language which describes
various jobs and data structures which the programmer thinks can
be dissociated and run in parallel. The programmer would choose
the code and data boundaries by which the application is divided,
and the points at which synchronization needs to occur. It is left
to the language processor to do the grunt-work of dividing the jobs into
discrete processing entities, and selecting the means of sharing
data and synchronization.

Invoked in different ways, the language processor might generate
from this presented code an application which uses processes, pipes,
and SysV mutexes to implement the code on a single machine. Or it
might use POSIX threads and mutexes. Or it might use ssh, sockets,
and a distributed locking implementation so it can run on a cluster.

The transformations the processor would have to perform on the
presented code might not need to be too sophisticated, eg:

job gather [N]
{
    my $operation = main:$operations[$N];
    Jobs::block ( collector:$ready );
    my $data = pop ( collector[@_[0]]:@list );
    return ( operate ( $operation, $data ) );
}

job collector [N]
{
    my $ready = 0;
    my @list = keys ( %ENV );
    $ready = 1;
    return ( 0 );
}

job main
{
    my $n = $_[0];
    my @operations = ( 1, 1, 2, 2 );
    my @ready = ( 0, 0, 0, 0 );
    my @data = ( "foo", "bar", "yin", "yang" );
    my $j_col = Jobs::start ( "collector", $n );
    my $j_gat = Jobs::start ( "gather(int(rand($n)))", $n );
    $j_col->join();
    return ( $j_gat->join() );
}

my $j = Jobs::new("pthreads");
my @results = main(4);
$j->end(); # clean up
store_results ( @results );

This relatively simple code might get transformed into something
that the perl interpreter can then run, by matching jobname:label
patterns and other special regions (like the top of the "job" block,
where argument-passing needs to be handled) and encapsulating them
with appropriate code, like:

sub gather
{
    my ( $N, @args ) = @_;
    my $operation = Jobs::var_rd ( "main", "\$operations[$N]" );
    Jobs::block ( Jobs::var_rd ( "collector", "\$ready" ) );
    my $data = pop ( Jobs::var_rd ( "collector", "\@list", $args[0] ) ); # note: potentially very inefficient
    return ( operate ( $operation, $data ) );
}

sub collector
{
    my ( $N, @args ) = @_;

    my $ready_mutex = Jobs::mutex_new ( "\$ready" );
    $ready_mutex->lock();
    my $ready = 0;
    Jobs::var_register ( "\$ready", \$ready );
    $ready_mutex->unlock();

    my $list_mutex = Jobs::mutex_new ( "\@list" );
    $list_mutex->lock();
    my @list = keys ( %ENV );
    Jobs::var_register ( "\@list", \@list );
    $list_mutex->unlock();

    $ready = 1;
    return ( 0 );
}

sub main
{
    my ( @args ) = @_;

    my $n = $args[0];

    my $operations_mutex = Jobs::mutex_new ( "\@operations" );
    $operations_mutex->lock();
    my @operations = ( 1, 1, 2, 2 );
    Jobs::var_register ( "\@operations", \@operations );
    $operations_mutex->unlock();

    my $ready_mutex = Jobs::mutex_new ( "\@ready" );
    $ready_mutex->lock();
    my @ready = ( 0, 0, 0, 0 );
    Jobs::var_register ( "\@ready", \@ready );
    $ready_mutex->unlock();

    my $data_mutex = Jobs::mutex_new ( "\@data" );
    $data_mutex->lock();
    my @data = ( "foo", "bar", "yin", "yang" );
    Jobs::var_register ( "\@data", \@data );
    $data_mutex->unlock();

    my $j_col = Jobs::start ( "collector", $n );             # carried over
    my $j_gat = Jobs::start ( "gather(int(rand($n)))", $n ); # unchanged from the job block
    $j_col->join();
    return ( $j_gat->join() );
}

my $j = Jobs::new("pthreads");
my @results = main(4);
$j->end(); # clean up
store_results ( @results );

I hope the semantics of the (as yet unfinished) Jobs module's
methods are obvious ... if they are not, I'll explain them in a
later post if anyone's interested. Jobs::start() might do little
more than start up $n POSIX threads which initialize their own
local Jobs objects and call the named functions (which are known
to the Jobs module via the metalanguage processor).

Note the line that I commented in the post-processed gather()
subroutine: here, the "dumb" annotation of the code leads to a
potential gross inefficiency which will show up if deployed as
a cluster application, because it will involve reading an entire
list of values over the network when only one value in that list
is needed. But it should at least work, and the programmer can
go back later and optimize the code if the effort is warranted,
once a working application is deployed and bringing in company
profits.

This example is in perl, but with the notable exception of
handling variables so that a pointer passed into var_register()
doesn't become a dangling pointer when a function returns, the same
approach could be applied to C or C++. Perl's automatic garbage
collection just happens to make that part of the problem go away
(ie, the data held in the local variables are not deallocated
until the references in the Jobs module's internal data structure
are gone).

Obviously the meta-language processor isn't doing much, just
adding the routine keystrokes necessary for sharing data so that
the programmer doesn't have to, but even this modest contribution
is significant (especially for large projects). It allows the
original (meta-language) code to be simpler, more readable, and
easier to understand, saves the programmer time and effort, and
reduces the likelihood of bugs in the finished work (which is
exponentially proportional to the number of lines of code that
the programmer must produce -- looking at it another way, if
the programmer doesn't need to add all those mutexes manually,
s/he does not risk forgetting to write one and introducing a bug).

There are companies and institutions all over the globe coming
up with their own solutions to this class of problems. Whichever
one(s) catch on and become popular might then be adapted to more
fine-grained parallelism as seen in NUMA machines and multithread
microprocessors.

I'm inclined to believe that the technologies that will catch
on will probably come from companies which are always developing
new user services on rapid schedules which run in a distributed
environment such as Google, Yahoo, and Inktomi. Engineers in such
positions are pressured to produce useful code fast, and necessity
is the mother of invention.

-- TTK

Robert Myers

unread,
Mar 5, 2004, 5:50:36 PM3/5/04
to
On Fri, 05 Mar 2004 14:16:39 -0800, "David C. DiNucci"
<da...@elepar.com> wrote:

>Robert Myers wrote:
> <snip>
>
>> Such things are straightforwardly accomplished in C and probably now in
>> Fortran, but the code is messy.
>>
>> If you have a normal form for descriptions of algorithms in whatever
>> language you are working in, and an algorithm for reducing
>> descriptions of algorithms to normal form (throwing away extraneous
>> information in the process, if necessary), the messiness need not
>> bother you in at least one respect: you should always be able to
>> determine whether two descriptions of the algorithm are equivalent in
>> whatever formal way you have determined is important.
>
>Take an extreme example. I choose my normal form to be just the
>input-output mapping/table represented by the description of the
>algorithm. The algorithm for reducing the description of the algorithm
>to the normal form is to run the description, according to the stated
>semantics of its representation, on all possible inputs to produce all
>possible outputs, from which I construct that mapping/table. Can I now
>just compare the normal forms of two descriptions to determine whether
>they're equal? No, because it is (in general) undecidable whether I can
>create even this simplistic normal form in finite time, even if I can
>look into the inner workings of the algorithmic descriptions, if those
>descriptions are partial recursive.
>

We've actually been partway down this road before, in the thread in
which we first made our mutual acquaintance or one of its near
descendents. There are actually two questions: what constitutes
equivalence, and how do you establish it for a given pair of
representations of algorithms (or prove that the algorithms are not
equivalent or decide that the question is undecidable).

The first time I asked the question, I proposed the input-output test
as a test of equivalence. This time, not even remembering that first
encounter, I was at least bright enough to be more vague as to what I
meant by equivalence. I may not know how to cook, but I at least
don't grab the same hot pan handle twice.

The google search "Equivalence of Algorithms" turns up lots of neat
stuff, including

http://www.eecs.umich.edu/gasm/papers/equiv.html

Yuri Gurevich and James K. Huggins, "Equivalence Is In The Eye Of The
Beholder," Theoretical Computer Science (179) 1-2 (1997), 353--380.

in which the authors take on no less a person than Leslie Lamport.
Given the choice, I'd prefer to take on Einstein on the subject of
causality.

Being a practical person, though, I might well be happy with a tool
that told me: what you just did didn't change the meaning of the
algorithm in whatever sense of equivalence has been chosen, what you
did changed the meaning of the algorithm, I could work on this forever
and never tell you if it changed the meaning of the algorithm, or,
I've been working on this for an awfully long time and don't seem to
be getting anywhere, do you want me to continue?

In other words, I might settle for a less-than-perfect tool.
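
Even the dumbest conceivable version of such a tool (run both versions
of the code on the same inputs for a while, with a time limit) would
catch a fair fraction of editing mistakes, even though its silence
proves nothing. A sketch, with the inputs and time budget pulled out
of thin air:

use strict;
# Crude "did my edit change the algorithm?" checker: run the old and
# new versions on the same inputs, with a per-input time limit, and
# report the first disagreement. Finding none is NOT a proof.
sub probably_equivalent {
    my ( $old, $new, $inputs, $seconds ) = @_;
    for my $in ( @$inputs ) {
        my ( $out_old, $out_new );
        eval {
            local $SIG{ALRM} = sub { die "timeout\n" };
            alarm ( $seconds );
            $out_old = $old->($in);
            $out_new = $new->($in);
            alarm ( 0 );
        };
        if ( $@ ) {
            chomp ( my $err = $@ );
            return "gave up (input '$in': $err)";
        }
        return "changed meaning (input '$in': '$out_old' vs '$out_new')"
            unless $out_old eq $out_new;
    }
    return "no difference found on the tested inputs";
}

# Example: two ways of summing 1..n.
my $v1 = sub { my $n = shift; my $s = 0; $s += $_ for 1 .. $n; $s };
my $v2 = sub { my $n = shift; $n * ( $n + 1 ) / 2 };
print probably_equivalent ( $v1, $v2, [ 0, 1, 10, 1000 ], 5 ), "\n";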

RM

David C. DiNucci

unread,
Mar 5, 2004, 6:27:47 PM3/5/04
to

I also saw this heading in a familiar direction, but I think the I->O
test is more than just a random example. You can't make a simpler
normal form, unless you want to provide equivalence classes for the
output instead of actual values, with the broadest equivalence classes
being "no result" or "result", and even by doing that, you still haven't
made a dent on the problem. Any other (more specific) normal form is
going to have at least the same difficulties.

> The google search "Equivalence of Algorithms" turns up lots of neat
> stuff, including
>
> http://www.eecs.umich.edu/gasm/papers/equiv.html
>
> Yuri Gurevich and James K. Huggins, "Equivalence Is In The Eye Of The
> Beholder," Theoretical Computer Science (179) 1-2 (1997), 353--380.
>
> in which the authors take on no less a person than Leslie Lamport.
> Given the choice, I'd prefer to take on Einstein on the subect of
> causality.

So maybe they'll take issue with what I just said, above. I'll look.

> Being a practical person, though, I might well be happy with a tool
> that told me: what you just did didn't change the meaning of the
> algorithm in whatever sense of equivalence has been chosen, what you
> did changed the meaning of the algorithm, I could work on this forever
> and never tell you if it changed the meaning of the algorithm, or,
> I've been working on this for an awfully long time and don't seem to
> be getting anywhere, do you want me to continue?
>
> In other words, I might settle for a less-than-perfect tool.

And this sounds very much like the sort of tool I was describing.

Andrew Reilly

unread,
Mar 5, 2004, 7:22:37 PM3/5/04
to

Of those, only Java has any language-level specification of what to do
with multiple threads. So the question then expands to: what
library-level APIs/abstractions/tools will people use? Is MPI a good
enough answer? Raw POSIX threads? Single-threaded POSIX processes over
TCP/IP? Something different?

Cheers,

--
Andrew

Sander Vesik

unread,
Mar 6, 2004, 10:34:52 PM3/6/04
to
David C. DiNucci <da...@elepar.com> wrote:
> Eugene Miya wrote:
> >
> > Just passing through.
> >
> > It's been argued that PP will never be mainstream.
> > The big thing to remember is that right now parallelism is largely an
> > O(n) solution to problems with computational complexity in some cases
> > vastly exceeding in some cases O(N^3). It's all going to depend on your n.
> > The vast majority of people reading this (or not) fail to understand this.
>
> I guess I'm one of them. For example, I don't understand where these
> complexity assertions came from, or the possible significance of "n" vs.
> "N". Maybe "O(n) solution" comes from some payback related to

n is the number of processors and N is the number of data elements - it means
that adding more processors provides only a constant speedup compared to
O(N^3) increase in need of processing power when the dataset increases.
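
In back-of-the-envelope terms (pure arithmetic, and the starting
numbers are made up): if the work grows like N^3 and you want the
wall-clock time to stay put, the processor count has to grow like
N^3 as well:

use strict;
# work ~ N^3; to hold time ~ work/n constant, n must scale like N^3.
my $base_N = 1000;                  # made-up problem size
my $base_n = 16;                    # made-up processor count
for my $scale ( 1, 2, 4, 8 ) {
    printf "N = %6d  ->  need n = %8d processors\n",
        $base_N * $scale, $base_n * $scale**3;
}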

> functional decomposition? There are certainly ways to get more bang for
> your buck, through data parallelism, parallel recursion, reuse, etc.

As long as the size of your dataset remains static, yes.

[snip - sorry, but it didn't make too much sense]

>
> -- Dave
> -----------------------------------------------------------------
> David C. DiNucci Elepar Tools for portable grid,
> da...@elepar.com http://www.elepar.com parallel, distributed, &
> 503-439-9431 Beaverton, OR 97006 peer-to-peer computing

--

David C. DiNucci

unread,
Mar 6, 2004, 11:50:03 PM3/6/04
to
Sander Vesik wrote:
>
> David C. DiNucci <da...@elepar.com> wrote:
> > Eugene Miya wrote:
> > >
> > > Just passing through.
> > >
> > > It's been argued that PP will never be mainstream.
> > > The big thing to remember is that right now parallelism is largely an
> > > O(n) solution to problems with computational complexity in some cases
> > > vastly exceeding in some cases O(N^3). It's all going to depend on your n.
> > > The vast majority of people reading this (or not) fail to understand this.
> >
> > I guess I'm one of them. For example, I don't understand where these
> > complexity assertions came from, or the possible significance of "n" vs.
> > "N". Maybe "O(n) solution" comes from some payback related to
>
> n is the number of processors and N is the number of data elements - it means
> that adding more processors provides only a constant speedup compared to
> O(N^3) increase in need of processing power when the dataset increases.

Thanks for the explanation. I can see it as an argument for how one
might want to vary n as N varies if the goal is to keep a relatively
constant time to solution, but I don't find that goal stated here, and I
don't see why it makes any difference as to whether PP will ever be
mainstream.

If the argument is that O(n) speedups from parallelism just get lost in
the noise, then the same argument applies to ILP. I don't see chip
architects restricting # instructions in flight with "Well, yes, we
could make our chip go 4 times faster than this with ILP, but heck,
what's a factor of 4 in the context of your entire compute load
anyway?" (answer: 4.) And with parallel software, you're not
restricted by the hardware provider (except perhaps in terms of the
interconnect) with how big to grow the n.

To me, this argument only makes "sense" for PP in the context of an
unstated assumption: That parallel programming is HARD. In fact, it's
the same assumption being made in the "PP again" thread right now, where
the discussion has turned to finding the technology and the money to
continue single-thread performance improvements. My question is "why
would you WANT to keep pouring money down that hole, with less and less
performance and more and more heat to show for it?" I'm clearly out of
step with most of the gang here because I haven't been brainwashed by
the "PP is HARD" mantra, and I have very specific information to the
contrary.

Stefan Monnier

unread,
Mar 7, 2004, 4:15:51 PM3/7/04
to
> The big thing to remember is that right now parallelism is largely an
> O(n) solution to problems with computational complexity in some cases
> vastly exceeding in some cases O(N^3). It's all going to depend on your n.

The question then becomes: can we make `n' follow Moore's law?


Stefan

Nick Maclaren

unread,
Mar 8, 2004, 3:02:18 AM3/8/04
to

In article <jwv1xo4bbql.fsf-...@asado.iro.umontreal.ca>,

Yes, we can, at least while Moore's law continues to hold. The
problem is making use of parallelism that increases at 50% per
annum, WITHOUT a corresponding reduction in latency.

That is why people are STILL trying to squeeze the last drops out
of the serial performance bottle.


Regards,
Nick Maclaren.

Andy Glew

unread,
Mar 9, 2004, 10:46:21 AM3/9/04
to
> > If Gelsinger says we're soon going to have chips made up of lots of
> > little cores and almost all code now being written is single-threaded,
>
> [Who is this Gelsinger, and where did he say this? Not that I think he's
> wrong, mind.]

Pat Gelsinger, some VP, probably CEO in training, of Intel.
Exactly what he's a VP of I don't know.
"Pseudo-CTO?"

One of the co-fathers of the i386.

My biggest memory of Pat is trying to persuade him to
let us add a 64 bit extension to the P6: explaining
that it was unreasonable to expect any 64 bit extension,
RISC or otherwise, to gain more than 15% performance.
This was just prior to the original P7 RISCy 64 bit extension
being cancelled, and the VLIW 64 bit extension starting up.

Pat may have been right in not allowing 64 bits to be
added to P6 way back when - 1994? 1995? We were
probably talking about this in 1992 or 1993.
Adding 64 bits would probably have delayed P6.

However, adding 64 bits to a P6 derivative would have

(a) avoided the ugliness of the several different PAE,
physical address extension, modes that allow >32 bit
physical addresses to be used with <=32 bit virtual
addresses.
"Hmm, no, the original PAE with 8 byte page table
entries is not liked: let's instead use 4 byte page table entries,
but only allow >32 bits of physical address to be used
in 4M pages, using the unused bits 12..21 of the large
page table entry." etc. The public has not seen
the ugliest kluges that were considered.

(b) possibly, it might have avoided the Itanium debacle.

So what's happening now? Intel looks like it may be
copying AMD64.

---

This being said, I always thought Pat had a reasonably
good head on his shoulders. He wasn't a computer architect,
and he did not understand software, but he was a good
manager capable of thinking technically.


Andy Glew

unread,
Mar 9, 2004, 10:46:20 AM3/9/04
to
> Make is fairly close, if you look at it from far enough away... Works as
> an auto-parallelizer of compiles and other such dependency-list tasks.

John Ousterhout's new company,
http://www.electric-cloud.com
seems to have actually made make into an auto-parallelizer.

Make by itself isn't an auto-parallelizer: the user has
to write the make rules, and has to ensure that
all of the dependencies are indicated, that two rules
don't conflict on something like a broken intermediate
file name, etc.

I don't know what Electric Cloud does, but I can guess,
since I have described something that is a true auto-parallelizer,
I think, in this newsgroup:

Start off with something like a build script - whether a sequential
shell script, or a pseudo-parallel Makefile.

For each step, record the inputs and outputs.
E.g. do this by intercepting all, or a subset of, syscalls.
Having recorded the identities of inputs and outputs, you
can see the true dependencies.

Execute nearly everything in a "speculative" context:
don't overwrite files, keep the old versions around,
until some "oldest" step of the shell script has completed.
This allows you to be safe, when data dependent computation
uncovers dependencies that were not previously recorded.
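
A quick-and-dirty version of the recording step is easy to play with:
wrap each build step in strace and bucket the files it touched into
inputs and outputs. (A sketch only, and not necessarily what Electric
Cloud does; it skips failed opens, renames and unlinks, and naively
counts read/write opens as outputs, which, as noted below, is not
really sufficient.)

#!/usr/bin/perl -w
# record-deps: run one build step under strace and report which files
# it read (inputs) and which it wrote (outputs).
use strict;

die "usage: record-deps <command...>\n" unless @ARGV;
my $cmd   = join ( " ", @ARGV );
my $trace = "/tmp/deps.$$";

system ( "strace", "-f", "-e", "trace=open", "-o", $trace,
         "sh", "-c", $cmd ) == 0
    or warn "command exited nonzero\n";

my ( %reads, %writes );
open ( my $t, "<", $trace ) or die "no trace file: $!";
while ( my $line = <$t> ) {
    # lines look like:  1234  open("foo.c", O_RDONLY) = 3
    next unless $line =~ /open\("([^"]+)",\s*([A-Z_|]+)/;
    my ( $file, $flags ) = ( $1, $2 );
    next if $line =~ / = -1 /;      # the open failed; ignore it (crudely)
    if ( $flags =~ /O_WRONLY|O_RDWR|O_CREAT|O_TRUNC|O_APPEND/ ) {
        $writes{$file} = 1;         # O_RDWR may be a false "output" (see below)
    } else {
        $reads{$file} = 1;
    }
}
close ( $t );
unlink ( $trace );

print "inputs:\n",  map { "  $_\n" } sort keys %reads;
print "outputs:\n", map { "  $_\n" } sort keys %writes;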

---

I was led to this, most recently, about 1996 when I was
executing SPEC95 under SimpleScalar: I did not know
if the SPEC benchmarks were idempotent, i.e. if they
modified their input files or not, so I instrumented the
syscall stubs. Having done this, I realized that this
instrumentation was exactly what you need to
do true autoparallelization of make and shell scripts.

(By the way, simply tracking files open read-only
and writeably isn't sufficient: SPEC95 vortex opened
some files read/write, but was nevertheless idempotent.)

I believe that I encountered this idea even earlier, at
Gould circa 1986-1987, but there I can't claim credit:
it was probably one of the genii who live in the Little
Software House on the Prairie, Mike, Jody, or Ivan or...

Samuel Barber

unread,
Mar 9, 2004, 2:55:45 PM3/9/04
to
"Andy Glew" <glew2pub...@sbcglobal.net> wrote in message news:<hFl3c.7702$Gx1....@newssvr27.news.prodigy.com>...

> So what's happening now? Intel looks like it may be
> copying AMD64.

I'm curious: how does AMD64 compare with the not-implemented
proposal(s) inside Intel (predating AMD64/Prescott) for 64-bit
extensions to x86? Better, worse, essentially the same?

Sam

James Cownie

unread,
Mar 10, 2004, 4:11:30 AM3/10/04
to
Andy Glew wrote:

> Start off with something like a build script - whether a sequential
> shell script, or a pseudo-parallel Makefile.
>
> For each step, record the inputs and outputs.
> E.g. do this by intercepting all, or a subset of, syscalls.
> Having recorded the identities of inputs and outputs, you
> can see the true dependencies.
>

I think the Vesta system which DEC used did at least some of this
http://www.vestasys.org/

though it used the information to allow complete re-use of results, and
I'm not sure that it exploited the parallelism it had exposed.

--
-- Jim
--
James Cownie <jco...@etnus.com>
Etnus, LLC. +44 117 9071438
http://www.etnus.com

Russell Wallace

unread,
Mar 10, 2004, 6:18:46 AM3/10/04
to
On Sun, 7 Mar 2004 03:34:52 +0000 (UTC), Sander Vesik
<san...@haldjas.folklore.ee> wrote:

>n is the number of processors and N is the number of data elements - it means
>that adding more processors provides only a constant speedup compared to
>O(N^3) increase in need of processing power when the dataset increases.

The latter implies that the more data you have, the more computation
you want to do per datum. In other words, as time goes by we want more
computation relative to communication. Which is good news, not bad,
because computation is getting cheaper relative to communication.

--
"Sore wa himitsu desu."
To reply by email, remove
the small snack from address.
http://www.esatclear.ie/~rwallace

David Gay

unread,
Mar 10, 2004, 12:34:11 PM3/10/04
to

wallacet...@eircom.net (Russell Wallace) writes:
> On Sun, 7 Mar 2004 03:34:52 +0000 (UTC), Sander Vesik
> <san...@haldjas.folklore.ee> wrote:
>
> >n is the number of processors and N is the number of data elements - it means
> >that adding more processors provides only a constant speedup compared to
> >O(N^3) increase in need of processing power when the dataset increases.
>
> The latter implies that the more data you have, the more computation
> you want to do per datum. In other words, as time goes by we want more
> computation relative to communication. Which is good news, not bad,
> because computation is getting cheaper relative to communication.

Except for the minor detail that the original poster didn't mention what
the communication complexity was...

--
David Gay
dg...@acm.org

Eugene Miya

unread,
Mar 10, 2004, 1:10:28 PM3/10/04
to
In article <404901A6...@elepar.com>,

David C. DiNucci <da...@elepar.com> wrote:
>Eugene Miya wrote:
>> It's been argued that PP will never be mainstream.
>> The big thing to remember is that right now parallelism is largely an
>> O(n) solution to problems with computational complexity in some cases
>> vastly exceeding in some cases O(N^3). It's all going to depend on your n.
>> The vast majority of people reading this (or not) fail to understand this.
>
>I guess I'm one of them. For example, I don't understand where these
>complexity assertions came from, or the possible significance of "n" vs.
>"N". Maybe "O(n) solution" comes from some payback related to
>functional decomposition?

Naw, simpler: just add some n processors. How fast can you add them?
In some cases, with specialized architectures, you can add them at the rate of n^2,
which works for certain imagery applications, etc., but not the general case.

It doesn't have to be any n. At the time I left the high speed processor
group (before you entered, and also after Doug left), I was given an
O(NP) style benchmark which was no-floating point, small memory, and
not easily amenable to parallelism (however this isn't stopping people
from making progress): Golomb rulers, which some of our users were
considering as a useful tool. There's quite a few of these kinds of
hard problems. Ignore the special cases for business, but they don't go
away.


> There are certainly ways to get more bang for
>your buck, through data parallelism, parallel recursion, reuse, etc.
>And I can only guess that the O(N^3) relates somehow to the analysis
>required to decompose sequential code. But much of that is discovering
>the complexity which arose from haphazardly adding inter-dependences
>when *composing* the sequential code. By constructing the code in
>parallel in the first place, with tools which help to organize each new

That's a rub eh Dave?


>dependence, the construction process is hardly more difficult (and
>certainly much more manageable) than sequential.

I think in the near future many of our users are going to disagree with you.
The easiest examples that computer people glom on to are things like
weather codes (at least some think weather codes are at least somewhat
understandable).

>> Who: in the near future it will be written by people who write it now.
>> Excepting high visibility benchmarks and people out for attention,
>> these smart people will write the apps for their areas and remain
>> relatively anonymous except in their communities.
>
>Again, I disagree. If parallel tools provide a better environment for
>writing even "sequential" code than existing tools do in important ways,
>existing "sequential" programmers will migrate to those tools, with
>potential parallelism in their applications being frosting rather than
>the cake.

You have to convince the application developer community, and I don't
mean business, to adopt a more tool oriented approach. Right now, we
have a more library (package) oriented approach.

>> How: One line at a time. Right now it's mostly some version of MPI.
>> And they debug using print statements. I see no drop in parallelism.
>
>Strike three. While I agree with how they do it now, I believe most
>parallelism in the future will be block-by-block, module-by-module,
>and/or app-by-app rather than line-by-line. Individual blocks, modules,
>and in some cases apps, will look much like they do now, written in the
>same languages with the same tools.

Now or when?

>> A few in the masses may get access to libraries, but that's as close to
>> mainstream as you are going to get.
>
>Only if people are scared to death of implementing parallel code because
>of assertions like you're making here! :-)

Dave, it's not merely me. I can point you to references to all the
points you make above where others are argued both ways on each of our
points. We are retreading the same ground.


Gotta run off to work.

--
------------ And now a word from our sponsor ------------------
Want to have instant messaging, and chat rooms, and discussion
groups for your local users or business, you need dbabble!
-- See http://netwinsite.com/sponsor/sponsor_dbabble.htm ----

Iain McClatchie

unread,
Mar 10, 2004, 1:28:08 PM3/10/04
to
Andy> Start off with something like a build script - whether a sequential
Andy> shell script, or a pseudo-parallel Makefile.

Andy> For each step, record the inputs and outputs.
Andy> E.g. do this by intercepting all, or a subset of, syscalls.
Andy> Having recorded the identities of inputs and outputs, you
Andy> can see the true dependencies.

Jim> I think the Vesta system which DEC used did at least some of this
Jim> http://www.vestasys.org/

Jim> though it used the information to allow complete re-use of results, and
Jim> I'm not sure that it exploited the parallelism it had exposed.

This sounds very similar to ClearCase's "clearmake" command. It would
track dependencies and command line arguments, and globally check to see
if someone else had run that same build earlier, in which case it would
"wink in" the derived object.

Sounded great, but it was a total pain in the ass for building a CPU. I
wish we'd just used Perforce. The only thing we really benefitted from
was the

This notion of rerunning commands should a build script overwrite one of
the dependencies sounds good, but I'm not sure I'd like to use that for
revision control of a chip design. The typical bad problem there is how
to handle ECO routes: late in the design, the RTL and gate netlist and
layout are all checked in. If you change the RTL, you have to hand-tweak
the gate netlist (or otherwise get a small diff), and check that in.
Fine. Now, you run the router in ECO mode. The router (maybe with some
hand holding) rips up and reroutes the wires around the bit of the
netlist you just changed. Input: the layout database and the netlist.
Output: the layout database. Hmmm, we seem to have a cycle in the
dependency graph.

Cycles sound easy if they're one operation. But what if the cycle
involves running several commands, like:
    ECO synthesis
    iterate to convergence, a.k.a. "timing closure", in other words, while(1) {
        ECO route
        extract parasitics
        timing and ERC checks
        sizing tweaks
    }

Revision control on these big databases is tough.

Robert Myers

unread,
Mar 10, 2004, 2:45:46 PM3/10/04
to
On 10 Mar 2004 09:34:11 -0800, David Gay <dg...@beryl.CS.Berkeley.EDU>
wrote:

It really _was_ intended to be a question about computer architecture
and where the mainstream might be headed. There are a zillion ways
you can put low power cores together in any number from n=2 to n in
the tens of thousands (or more, as I'm sure we'll soon see).

I don't have to speculate about how n anywhere in that range might be
used by the kinds of applications I know anything about (technical
computing), but I'm less certain about the ability of broader markets
to put multiple threads to work.

The main obstacle I can foresee to mainstreaming parallel computation
is that it might as well be in a different universe as far as software
is concerned, judging by how little "end user" software there is that
can exploit even n=2, so I framed it as a question about software.

Make anything you like a free parameter. How are people going to make
or find a large market for multiply-threaded processors? The only
example we have readily at hand, Hyperthreading, still amounts to
little more than hypermarketing. Has the Power Mac G5 experience
offered other evidence that might be helpful in seeing through the
fog?

RM

Eugene Miya

unread,
Mar 10, 2004, 8:04:21 PM3/10/04
to
In article <10786304...@haldjas.folklore.ee>,

Sander Vesik <san...@haldjas.folklore.ee> wrote:
>n is the number of processors and N is the number of data elements - it means

Actually, I was using n as a case insensitive variable in an algebraic way.
You can use numbers of processors or numbers of data elements. N is linear.
It's not quadratic, etc.

Eugene Miya

unread,
Mar 10, 2004, 8:01:59 PM3/10/04
to
In article <c2au0f$3v$1...@pegasus.csx.cam.ac.uk>,
Nick Maclaren <nm...@cus.cam.ac.uk> wrote:
"born parallel code"

>At one level, I agree. The vast majority of "grand challenge" problems
>are inherently parallel at the level of the basic physics, and it is
>the mapping to a mathematically tractable form that introduces the
>serialisation.

Forced sequentialism/serialization (there is no common word for it) is a pretty powerful
algorithmic tool. The only thing I would add is that not every one
of those problems is a physics problem.


> Whether this means that a suitable parallel programming
>paradigm would make it easy and natural to program them efficiently is
>less than clear.

Pretty nicely stated Nick, on the whole.

>But it is assuredly a topic that needs some serious research.

We all agree.

Ken Hagan

unread,
Mar 11, 2004, 5:31:42 AM3/11/04
to
Robert Myers wrote:
>
> It really _was_ intended to be a question about computer
> architecture and where the mainstream might be headed.
> [...]

> I'm less certain about the ability of broader markets
> to put multiple threads to work.
> [...]

> How are people going to make or find a large market for
> multiply-threaded processors?

Slowly, but I'm optimistic, because I think I can see an
evolutionary path, so there are no economic obstacles.

> The only example we have readily at hand, Hyperthreading,
> still amounts to little more than hypermarketing.

Most people still don't have an SMT-enabled processor. When
they do, it will be cost-effective for pleb programmers like
me to think about using a few (heavyweight) threads. Then it
becomes "safer" for processors to increase "n" and shift the
performance sweet-spot towards those larger values.

It gets harder and harder for software to support larger values
of n as long as each thread has to be lovingly hand-crafted by
plebs like me in languages like C, but existing (and widespread)
middleware technilogies (hey, good typo, I'll leave it in!) make
it easier and there are people out there claiming their compilers
can do it automatically for embarrassingly parallel subroutines.

Over time, the software base gets less and less tied to a single
thread of execution.


Nick Maclaren

unread,
Mar 11, 2004, 5:37:58 AM3/11/04
to

In article <c2pf1j$kj8$1$8300...@news.demon.co.uk>,
"Ken Hagan" <K.H...@thermoteknix.co.uk> writes:

|> Robert Myers wrote:
|>
|> > The only example we have readily at hand, Hyperthreading,
|> > still amounts to little more than hypermarketing.
|>
|> Most people still don't have an SMT-enabled processor. When
|> they do, it will be cost-effective for pleb programmers like
|> me to think about using a few (heavyweight) threads. Then it
|> becomes "safer" for processors to increase "n" and shift the
|> performance sweet-spot towards those larger values.

Not significantly. Firstly, there is a VERY high chance that
the results of your thinking about it comes to the conclusion
"no way, Jose". Secondly, experience with parallelism is that
the change from 2-way to N-way is HARDER than the change from
1-way to 2-way. And, thirdly, it is unlikely that Eggers-style
SMT will ever be cost-effective at more than 4-way with x86.

There is a FAR higher chance that CMP will be the trigger.


Regards,
Nick Maclaren.

Sander Vesik

unread,
Mar 11, 2004, 7:39:26 AM3/11/04
to

And it's not safe at all to assume it would be low.
