The End of Not-Moore's Law?

Nick Maclaren

unread,

Sep 21, 2003, 9:33:53 AM9/21/03

to

Moore's Law is often misquoted to say that CPU performance doubles
every 1.5-2 years. Let's call that Not-Moore's Law, because it has
been close to true for a long while, and a lot of people are relying
on it. It has essentially been provided entirely by process sizes
dropping (i.e. the REAL Moore's Law), both by allowing higher clock
speeds and by allowing the use of more and fancier logic. Note that
I am talking about the serial performance here.

As I understand the problems with power consumption in the 90 nM
process, leakage loss has increased from a significant secondary
consumer of power to a primary one, and this can only get worse as
the process is scaled down to 65 NM. Now, IBM, AMD, Intel and others
have announced forthcoming improvements to alleviate this, but that
will only be alleviation. And, as with all such work, it becomes
harder and harder to do as time goes on. We are already seeing the
problem in the apparently relentless increase in power consumption
of the highest-performing CPUs.

Now, my question is whether we have seen the beginning of the end of
Not-Moore's Law?

Even if Moore's Law holds, and the process sizes continue to drop,
the need to keep the power consumption down will mean that CPUs have
to be run slower. So we could see the doubling time increase from
1.5-2 years to 3-4 years, and perhaps longer. But will we? Or are
there significant rabbits still hidden in the hat?

Regards,
Nick Maclaren.

John Dallman

unread,

Sep 21, 2003, 11:18:00 AM9/21/03

to

In article <bkk9c1$qdh$1...@pegasus.csx.cam.ac.uk>, nm...@cus.cam.ac.uk (Nick
Maclaren) wrote:

> Moore's Law is often misquoted to say that CPU performance doubles
> every 1.5-2 years. Let's call that Not-Moore's Law, because it has
> been close to true for a long while, and a lot of people are relying
> on it. It has essentially been provided entirely by process sizes
> dropping (i.e. the REAL Moore's Law), both by allowing higher clock
> speeds and by allowing the use of more and fancier logic. Note that
> I am talking about the serial performance here.

A separate problem, which isn't so very far away, is speed-of-light-
in-silicon. Take a McKinley, at just over 400mm^2. So, it's about 20mm
from side to side, and that's equivalent to 15GHz in vacuum, or 10GHz at
best in silicon.

Me, I reckon we're going to see a bifurcation in processor designs.
Multiple cores per chip is a current fashion. It started to get
established with Power4, and Sun and HP have made announcements.
Announcements at IDF included true dual-core Xeons (still with
HyperThreading, for 4 virtual processors) and 16 IA-64 cores in
Tanglewood. Yes, they'll require gigantic bus bandwidths. Bandwidth can be
obtained with money. Latency is harder.

Multi-core processors don't have to be tightly-coupled the units on a
serial processor do. So they can take advantage of larger dies in bigger
processes, and add cores, rather than speed up the clock. There's just
more headroom. Serial processors have to be close-coupled (with current
ideas) so the speed of light matters, and in a few years, their growth is
going to be limited to process improvements, as limited by leakage. So
which is going to run faster?

Multi-core processors are excellent for the current fashion in application
design (web serving). They don't help at all with pure serial software.
Many manufacturers are going to be trying hard to convince people that
threading is the answer to performance now - which it isn't, at present,
but it takes sufficiently long to add that they can hope that threaded
software might be available by the time their processors are on sale.
Besides, how many marketing people really understand these issues?

The salvation for serial processors, oddly enough, might be Microsoft.
Since the desktop apps that they do so well aren't nearly as amenable to
multi-threading (to speed up their eye candy) as server software is,
they're going to keep up the pressure for faster serial processors for a
while. They'll have to stop, and start improving efficiency at some point,
but that cultural change might be hard to institute.

Of course, if someone can invest a way to build a serial processor without
being speed-of-light limited, all such predictions are dust. Let's face
it, they usually are. But stimulating ideas is worthwhile.

---
John Dallman j...@cix.co.uk
"Any sufficiently advanced technology is indistinguishable from a
well-rigged demo"

Andy Glew

unread,

Sep 21, 2003, 1:05:08 PM9/21/03

to

> A separate problem, which isn't so very far away, is speed-of-light-
> in-silicon.
>

> Serial processors have to be close-coupled (with current
> ideas) so the speed of light matters, and in a few years, their growth is
> going to be limited to process improvements, as limited by leakage. So
> which is going to run faster?

Loosely coupled will always run faster.

But there's no reason that processors that execute a single logical
thread of execution cannot be loosely coupled.

Historical anecdote: Bob Colwell thought that one of the most
innovative things about the P6 was that it was a decentralized,
decoupled, design, with all of the different pipestages working
independently. No central control.

So long as branch prediction works and/or you can overlap
the time between an incorrect branch misprediction,
its detection, and its repar, you can make pipelines longer.
Each pipestage communicates locally, to its nearest pipeline
neighbours.

The problem with designs that are presently on the market
is that the individual pipestages are growing big. There are obvious
ways to attack that.

---

Multiple core designs have a lot to be said. But don't write off
speeding up uniprocessor yet.

John Dallman

unread,

Sep 21, 2003, 1:38:00 PM9/21/03

to

In article <8Tkbb.1925$mq2....@newssvr29.news.prodigy.com>,
andy-gle...@yahoo.com (Andy Glew) wrote:

> But there's no reason that processors that execute a single logical
> thread of execution cannot be loosely coupled.
>
> Historical anecdote: Bob Colwell thought that one of the most
> innovative things about the P6 was that it was a decentralized,
> decoupled, design, with all of the different pipestages working
> independently. No central control.

Ah! I didn't know about that detail. OK, so the speed-of-light business is
amenable to moderately clever design already. I'll stop worrying about it
for a few years.

David Magda

unread,

Sep 21, 2003, 2:59:39 PM9/21/03

to

j...@cix.co.uk (John Dallman) writes:
[...]

> The salvation for serial processors, oddly enough, might be
> Microsoft. Since the desktop apps that they do so well aren't
> nearly as amenable to multi-threading (to speed up their eye candy)
> as server software is, they're going to keep up the pressure for
> faster serial processors for a while. They'll have to stop, and
> start improving efficiency at some point, but that cultural change
> might be hard to institute.

[...]

Could multi-threading in the OS help though?

One(!) of the reasons given for BeOS's very good UI response was
supposedly the use of threads for just about every aspect of the GUI
(combined with the fact that BeBoxes were SMP).

Although it may not speed things up on stop-watch tests, the
perceived responsiveness could be raised. (Perception of things being
in many cases as important as what actually happens.)

--
David Magda <dmagda at ee.ryerson.ca>, http://www.magda.ca/
Because the innovator has for enemies all those who have done well under
the old conditions, and lukewarm defenders in those who may do well
under the new. -- Niccolo Machiavelli, _The Prince_, Chapter VI

Benjamin Ylvisaker

unread,

Sep 21, 2003, 7:18:07 PM9/21/03

to

On 21 Sep 2003 13:33:53 GMT
nm...@cus.cam.ac.uk (Nick Maclaren) wrote:

We are definitely seeing the beginning of the end of Not-Moore's law
for conventional microprocessor designs. A clear analysis:
http://citeseer.nj.nec.com/agarwal00clock.html

The authors basically say that we are beginning to see performance
scale sub-linearly with transistor size decreases, whereas it has
scaled super-linearly for the entire history of the microprocessor up
to this point. The basic problem is that architects are running out
of useful things for all those little transistors to do. It is clear
that highly parallel architectures will be needed in order to take
advantage of all the circuit space that will be available 5, 10, 20
years from now. Exactly how these "highly parallel architectures"
will work is another debate entirely (I have my bets).

Cheers,
Benjamin

Greg Lindahl

unread,

Sep 21, 2003, 8:03:08 PM9/21/03

to

>We are definitely seeing the beginning of the end of Not-Moore's law
>for conventional microprocessor designs. A clear analysis:
>http://citeseer.nj.nec.com/agarwal00clock.html

Figure 1 plots SPECint per Mhz, without mentioning that compiler
improvements play a large role in this chart. This is a common
problem, since vendors tend to not publish results with their new
compiler on old chips.

-- greg

Robert Myers

unread,

Sep 21, 2003, 8:17:53 PM9/21/03

to

On Sun, 21 Sep 2003 19:18:07 -0400, Benjamin Ylvisaker
<benj...@alumni.cmu.edu> wrote:

<snip>

>
>We are definitely seeing the beginning of the end of Not-Moore's law
>for conventional microprocessor designs. A clear analysis:
>http://citeseer.nj.nec.com/agarwal00clock.html
>
>The authors basically say that we are beginning to see performance
>scale sub-linearly with transistor size decreases, whereas it has
>scaled super-linearly for the entire history of the microprocessor up
>to this point. The basic problem is that architects are running out
>of useful things for all those little transistors to do.

I brought up this whole issue in another thread earlier this summer,
and another poster who was kind enough to take the time to educate me
privately basically said that the wire delay problem has been looming
for a long time and that circuit designers keep coming up with tricks
to circumvent it.

One that hasn't been used yet is diagonal routing. It's a one-time
shot, but a succession of one-time shots have kept things rumbling
along, and it would seem that predicting when, exactly, hardware
designers will run out of tricks is at the very least a risky
proposition.

The negative assertion made by the paper lends itself nicely to
analysis and publication in a paper of modest size. The engineering
cleverness that has forestalled the final day of reckoning sounds like
it would easily fill a book.

>It is clear
>that highly parallel architectures will be needed in order to take
>advantage of all the circuit space that will be available 5, 10, 20
>years from now. Exactly how these "highly parallel architectures"
>will work is another debate entirely (I have my bets).
>

Five, ten, twenty years is a long time. Probably while the rest of us
are still trying to figure out how to formulate the problem in
software, Andy Glew and others like him will have solved it in silicon
and chips of the (possibly not-too-distant future) will be spinning
off threads on the fly.

RM

Benjamin Ylvisaker

unread,

Sep 21, 2003, 9:33:03 PM9/21/03

to

On Sun, 21 Sep 2003 20:17:53 -0400
Robert Myers <rmy...@rustuck.com> wrote:

> On Sun, 21 Sep 2003 19:18:07 -0400, Benjamin Ylvisaker
> <benj...@alumni.cmu.edu> wrote:
>
> <snip>
> >
> >We are definitely seeing the beginning of the end of Not-Moore's
> >law for conventional microprocessor designs. A clear analysis:
> >http://citeseer.nj.nec.com/agarwal00clock.html
> >
> >The authors basically say that we are beginning to see performance
> >scale sub-linearly with transistor size decreases, whereas it has
> >scaled super-linearly for the entire history of the microprocessor
> >up to this point. The basic problem is that architects are running
> >out of useful things for all those little transistors to do.
>
> I brought up this whole issue in another thread earlier this summer,
> and another poster who was kind enough to take the time to educate
> me privately basically said that the wire delay problem has been
> looming for a long time and that circuit designers keep coming up
> with tricks to circumvent it.

Do you mean tricks like arranging buffers and wire sizes in some crazy
way to make a wire that naďvely should have a delay of 800 picoseconds
actually have a delay of 600 picoseconds? Or tricks like the whole
procession of micro-architectural enhancements (superscalar, branch
prediction, trace cache, etc.) that we have seen? If the former, then
I believe that the mileage we can get out of those tricks is
distinctly limited. If the latter, then it is much harder to predict
what creative people may come up with.

> The negative assertion made by the paper lends itself nicely to
> analysis and publication in a paper of modest size. The engineering
> cleverness that has forestalled the final day of reckoning sounds
> like it would easily fill a book.

Agreed.

> Five, ten, twenty years is a long time. Probably while the rest of
> us are still trying to figure out how to formulate the problem in
> software, Andy Glew and others like him will have solved it in
> silicon and chips of the (possibly not-too-distant future) will be
> spinning off threads on the fly.

I doubt that such a solution could scale very well. There is an
interesting parallel here with the superscalar versus VLIW debate.
The basic question when comparing those two ideas is, should it be the
programmer and/or compiler that figures out which instructions should
run in parallel or the hardware? Now we're wondering the exact same
thing, but about entire threads, not just a couple of adjacent (or
nearly so) instructions. I believe that the problem of deciding how
to spin off parallel threads from a single threaded instruction stream
and maintaining coherence between the threads is too hard to be solved
(with reasonable efficiency) in hardware. I would be fascinated if
someone can provide some solid counter-evidence against my belief.

Benjamin

Robert Myers

unread,

Sep 21, 2003, 11:24:08 PM9/21/03

to

On Sun, 21 Sep 2003 21:33:03 -0400, Benjamin Ylvisaker
<benj...@alumni.cmu.edu> wrote:

Benjamin Ylvisaker wrote:
> On Sun, 21 Sep 2003 20:17:53 -0400
> Robert Myers <rmy...@rustuck.com> wrote:
>
>
>>On Sun, 21 Sep 2003 19:18:07 -0400, Benjamin Ylvisaker
>><benj...@alumni.cmu.edu> wrote:
>>
>><snip>
>>
>>>We are definitely seeing the beginning of the end of Not-Moore's
>>>law for conventional microprocessor designs. A clear analysis:
>>>http://citeseer.nj.nec.com/agarwal00clock.html
>>>
>>>The authors basically say that we are beginning to see performance
>>>scale sub-linearly with transistor size decreases, whereas it has
>>>scaled super-linearly for the entire history of the microprocessor
>>>up to this point. The basic problem is that architects are running
>>>out of useful things for all those little transistors to do.
>>
>>I brought up this whole issue in another thread earlier this summer,
>>and another poster who was kind enough to take the time to educate
>>me privately basically said that the wire delay problem has been
>>looming for a long time and that circuit designers keep coming up
>>with tricks to circumvent it.
>
> Do you mean tricks like arranging buffers and wire sizes in some crazy

> way to make a wire that naïvely should have a delay of 800 picoseconds

> actually have a delay of 600 picoseconds?

I'm talking about physical device engineering: lower wire resistance,
low-k dielectrics, diagonal routing, more layers of metal. I had my
own pessimistic article to cite:

http://www.us.design-reuse.com/articles/article5786.html

"James Meindl at the Georgia Institute of Technology, who has become
an expert in predicting the impending impact of physical parameters on
future IC generations, has taken on the interconnect problem. His
analysis predicts 80 levels of metal by 2014 if no architectural
changes are made in circuit design."

I wondered when people would see the light, so to speak, and switch
from electrons to something sensible like photons. My private
correspondent didn't think that would happen anytime soon.

> Or tricks like the whole
> procession of micro-architectural enhancements (superscalar, branch
> prediction, trace cache, etc.) that we have seen? If the former, then
> I believe that the mileage we can get out of those tricks is
> distinctly limited. If the latter, then it is much harder to predict
> what creative people may come up with.
>

It's apparently been pretty hard to predict what creative people are
going to come up with in either case. Past ends-of-the-road foreseen
have come and gone with barely a blip.

<snip>

>
>>Five, ten, twenty years is a long time. Probably while the rest of
>>us are still trying to figure out how to formulate the problem in
>>software, Andy Glew and others like him will have solved it in
>>silicon and chips of the (possibly not-too-distant future) will be
>>spinning off threads on the fly.
>
>
> I doubt that such a solution could scale very well. There is an
> interesting parallel here with the superscalar versus VLIW debate.
> The basic question when comparing those two ideas is, should it be the
> programmer and/or compiler that figures out which instructions should
> run in parallel or the hardware? Now we're wondering the exact same
> thing, but about entire threads, not just a couple of adjacent (or
> nearly so) instructions. I believe that the problem of deciding how
> to spin off parallel threads from a single threaded instruction stream
> and maintaining coherence between the threads is too hard to be solved
> (with reasonable efficiency) in hardware. I would be fascinated if
> someone can provide some solid counter-evidence against my belief.
>

I speculated earlier this summer on comp.arch about speculative
threading (that's the subject line of the thread, and it's worth
finding, not because of my thoughts, but because of the responses I
got). Like almost any other idea you can think of, someone (at least
Andy Glew) had already thought of it before. Mike Haertel volunteered
a citation to a Ph.D. thesis:

"There is a published Ph.D. dissertation by Haitham Akkary, of
Portland State University, that gives some algorithms and speedups for
various forms of implicit multithreading. It looks quite promising.
Akkary reported average speedups of 20-30% and peak speedups of over
80% on some benchmarks."

There are plans in the works to use asymmetric helper threads on
actual processors. You can find stuff by googling on "helper thread."

The similarity to the VLIW problem is not accidental. The problems in
exposing and exploiting almost any kind of parallelism are pretty much
the same (I can thank David Gay for this insight) and that's what made
me decide the VLIW problem (and EPIC) were worth studying, whether
VLIW and EPIC were really good ideas or not.

The question remains: where best to expose the parallelism. My best
thought right now is for the programmer to give the compiler as much
of a head start and as many hints as possible, for the compiler to do
the same for the processor, and for the processor to provide at least
some last-instant flexibility. That pretty much covers all the bases,
doesn't it? :-).

People usually save the "I don't think it's going to scale" argument
as their last line of defense. You might have gone there a little
prematurely. ;-).

RM

Dennis M. O'Connor

unread,

Sep 22, 2003, 12:08:15 AM9/22/03

to

"Benjamin Ylvisaker" <benj...@alumni.cmu.edu> wrote

> Do you mean tricks like arranging buffers and wire sizes in some crazy
> way to make a wire that naďvely should have a delay of 800 picoseconds
> actually have a delay of 600 picoseconds? Or tricks like the whole
> procession of micro-architectural enhancements (superscalar, branch
> prediction, trace cache, etc.) that we have seen? If the former, then
> I believe that the mileage we can get out of those tricks is
> distinctly limited. If the latter, then it is much harder to predict
> what creative people may come up with.

The latter may become a victim of the increasing prominence
of wire delay as process geometry shrinks. When it takes
a relatively long time to simply cross a complex logic block,
and the block puts a dent in your power budget just by existing,
as well, being clever may not pay off like it used to. The meek
may inherit the micro-architectural Earth. KISS may rule all,
in the end; the best tool is often the simplest one.

Of course, that's just for serialized algorithms and
tightly-couple parallel algorithms. TLP may be the
work-around, but the software base needs to be
re-architected to do that. And maybe that's the
real important legacy-to-be of the Pentium(R) 4 processor's
Hyper-Threading technology: giving ISV's and OSV's
an incentive (sometimes including direct assistance
from Intel) to prepare the desktop and the laptop
for the future of computer architecture.

Or maybe not. It's a complicated game, and a
heavily-favored horse can still break a leg.
--
Dennis M. O'Connor dm...@primenet.com
Opinions expressed are not, AFAIK, those of my employer.

Dennis M. O'Connor

unread,

Sep 22, 2003, 12:12:20 AM9/22/03

to

"Robert Myers" <rmy...@rustuck.com> wrote ...

> I wondered when people would see the light, so to speak, and switch
> from electrons to something sensible like photons.

First someone should create a container
that can store photons, that can be
economically replicated tens of thousands
of times for mere tens of micro-dollars.

Oh, and that price has to include the mechanisms
for putting photons in and out, whenever desired.

There's a fine basic patent in there for someone.

David Wang

unread,

Sep 22, 2003, 12:32:58 AM9/22/03

to

Robert Myers <rmy...@rustuck.com> wrote:

> I'm talking about physical device engineering: lower wire resistance,
> low-k dielectrics, diagonal routing, more layers of metal. I had my
> own pessimistic article to cite:

> http://www.us.design-reuse.com/articles/article5786.html

> "James Meindl at the Georgia Institute of Technology, who has become
> an expert in predicting the impending impact of physical parameters on
> future IC generations, has taken on the interconnect problem. His
> analysis predicts 80 levels of metal by 2014 if no architectural
> changes are made in circuit design."

> I wondered when people would see the light, so to speak, and switch
> from electrons to something sensible like photons. My private
> correspondent didn't think that would happen anytime soon.

Something sensible like photons?

Just for curiosity sake, how would you propose to make such a
creature work? I'm not trying to slam you, just curious as to
what you have in mind, if you have anything in mind.

I gather you are trying to replace parts of the on-chip
interconnect infrastructure? How would you use Photons to
replace/augment such an infrastructure?

Line of sight doesn't seem practicable, so I'll set it aside
for now, so transmitter/receiver based on mirrored reflections?
Or do we build optical conduits for on chip applications?

More important to me is the question of what the latencies are
through the electrical-optical transmitter/receivers. It seems
that it may be profitable to go to optical system in the case
of off chip interconnects, since the assumed latencies are at
least an order of magnitude worse anyways, but my impression
of the problem here has been the latency through the electro-
optical interface. Has there been recent advances in circuits
that latencies through these devices are now next to nothing?

--
davewang202(at)yahoo(dot)com

Robert Myers

unread,

Sep 22, 2003, 1:19:02 AM9/22/03

to

On Mon, 22 Sep 2003 04:32:58 +0000 (UTC), David Wang <f...@bar.invalid>
wrote:

>Robert Myers <rmy...@rustuck.com> wrote:
>
>> I had my own pessimistic article to cite:
>
>> http://www.us.design-reuse.com/articles/article5786.html
>

<snip>

>
>> I wondered when people would see the light, so to speak, and switch
>> from electrons to something sensible like photons. My private
>> correspondent didn't think that would happen anytime soon.
>
>Something sensible like photons?
>
>Just for curiosity sake, how would you propose to make such a
>creature work? I'm not trying to slam you, just curious as to
>what you have in mind, if you have anything in mind.
>
>I gather you are trying to replace parts of the on-chip
>interconnect infrastructure? How would you use Photons to
>replace/augment such an infrastructure?
>
>Line of sight doesn't seem practicable, so I'll set it aside
>for now, so transmitter/receiver based on mirrored reflections?
>Or do we build optical conduits for on chip applications?

To cite from the same article:

"Another possibility in this direction is the introduction of photonic
waveguides for long interconnect lines. There is some hope here. With
recent work on building photonic bandgap structures into silicon
circuits, this might become a practical option for designers. Photonic
structures can now be defined on-chip with the same lithographic
processes used in CMOS manufacturing. Photonic interconnects do not
carry the RC delay penalty that creates so many problems for wire
inter connects."

I thought it would be obvious that my use of the term "something
sensible" wasn't entirely serious, since I took my correspondent's
appraisal of the current situation as the most realistic. That is to
say, the use of photons for on-chip communication is a subject for
laboratory research and little more at the moment. I don't think it
will stay that way forever. One imagines something like optical trunk
lines for long-haul data transmission.

>
>More important to me is the question of what the latencies are
>through the electrical-optical transmitter/receivers. It seems
>that it may be profitable to go to optical system in the case
>of off chip interconnects, since the assumed latencies are at
>least an order of magnitude worse anyways, but my impression
>of the problem here has been the latency through the electro-
>optical interface. Has there been recent advances in circuits
>that latencies through these devices are now next to nothing?

The google search

photonics nanosecond "switching speed"

turns up lots of neat stuff. The link

http://www.eetimes.com/at/news/OEG20000807S0052

talks about the problems of using very short wavelength photons for
optical circuits, but all you need to do to solve the chip real-estate
problem is to modulate and detect on nanosecond time-scales so you can
move data over long paths, and if there is some fundamental physics
that makes that impossible, I'm not aware of it.

RM

Terje Mathisen

unread,

Sep 22, 2003, 1:40:15 AM9/22/03

to

Robert Myers wrote:
> The question remains: where best to expose the parallelism. My best
> thought right now is for the programmer to give the compiler as much
> of a head start and as many hints as possible, for the compiler to do
> the same for the processor, and for the processor to provide at least
> some last-instant flexibility. That pretty much covers all the bases,
> doesn't it? :-).

It does.

Determining these possibilities as early as possible, i.e. by the
programmer before writing a single line of code is going to stay the
most efficient way to handle the problem.

OTOH, I have developed a healthy respect for all the ways in which hw
can do stuff relatively easily (or at least very fast!), that would take
a lot of effort/programming.

I am thinking in particular about all the various forms of aliasing that
hinders performance of C code.

The EPIC solution of exposing all these features to the programmer,
letting her insert early loads and explicit checks for aliasing has the
advantage of making the intent very clear, but some form of speculative
execution might be able to match or improve on it, without actually
increasing ifetch bandwidth too badly.

Terje
--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

Nick Maclaren

unread,

Sep 22, 2003, 6:53:38 AM9/22/03

to

In article <memo.2003092...@jgd.compulink.co.uk>,

j...@cix.co.uk (John Dallman) writes:
|>
|> A separate problem, which isn't so very far away, is speed-of-light-

|> in-silicon. Take a McKinley, at just over 400mm^2. So, it's about 20mm
|> from side to side, and that's equivalent to 15GHz in vacuum, or 10GHz at
|> best in silicon.

Yes, indeed. See below.

|> The salvation for serial processors, oddly enough, might be Microsoft.
|> Since the desktop apps that they do so well aren't nearly as amenable to
|> multi-threading (to speed up their eye candy) as server software is,
|> they're going to keep up the pressure for faster serial processors for a
|> while. They'll have to stop, and start improving efficiency at some point,
|> but that cultural change might be hard to institute.

In my view, that is the reason that we haven't moved already!

In article <20030921191807.1...@alumni.cmu.edu>,

Benjamin Ylvisaker <benj...@alumni.cmu.edu> writes:
|>
|> We are definitely seeing the beginning of the end of Not-Moore's law
|> for conventional microprocessor designs. A clear analysis:
|> http://citeseer.nj.nec.com/agarwal00clock.html

Interesting. Whether or not it is right, it shows that some people closer
to the action share my suspicions :-)

|> Do you mean tricks like arranging buffers and wire sizes in some crazy

|> way to make a wire that naďvely should have a delay of 800 picoseconds
|> actually have a delay of 600 picoseconds? Or tricks like the whole

|> procession of micro-architectural enhancements (superscalar, branch
|> prediction, trace cache, etc.) that we have seen? If the former, then
|> I believe that the mileage we can get out of those tricks is
|> distinctly limited. If the latter, then it is much harder to predict
|> what creative people may come up with.

I agree with that, and the remark applies more generally. Most of the
tricks used to reduce power and improve effective performance (whether
in hardware or software) have limited scope. It is not as if nobody
has tackled the problems before!

|> > Five, ten, twenty years is a long time. Probably while the rest of
|> > us are still trying to figure out how to formulate the problem in
|> > software, Andy Glew and others like him will have solved it in
|> > silicon and chips of the (possibly not-too-distant future) will be
|> > spinning off threads on the fly.
|>
|> I doubt that such a solution could scale very well. There is an
|> interesting parallel here with the superscalar versus VLIW debate.
|> The basic question when comparing those two ideas is, should it be the
|> programmer and/or compiler that figures out which instructions should
|> run in parallel or the hardware? Now we're wondering the exact same
|> thing, but about entire threads, not just a couple of adjacent (or
|> nearly so) instructions. I believe that the problem of deciding how
|> to spin off parallel threads from a single threaded instruction stream
|> and maintaining coherence between the threads is too hard to be solved
|> (with reasonable efficiency) in hardware. I would be fascinated if

|> someone can provide some solid counter-evidence against my belief....

I doubt that anyone can, as I think you are right. Where Robert Myers's
solution could work is when the compiler generates a large number of very
lightweight actions, and the hardware schedules them.

In article <ijqsmvsiiaka1i2hr...@4ax.com>,

Robert Myers <rmy...@rustuck.com> writes:
|>
|> I wondered when people would see the light, so to speak, and switch
|> from electrons to something sensible like photons. My private
|> correspondent didn't think that would happen anytime soon.

Yes. The interest in photoelectronics in silicon and hybrid materials
is at least partially fuelled by the desire to be able to use light
guides for long-distance (on-chip) communication.

|> It's apparently been pretty hard to predict what creative people are
|> going to come up with in either case. Past ends-of-the-road foreseen
|> have come and gone with barely a blip.

And others have been predicted to be soluble, but have not been, often
with a denial that the cause of the direction change was the problem.

|> "There is a published Ph.D. dissertation by Haitham Akkary, of
|> Portland State University, that gives some algorithms and speedups for
|> various forms of implicit multithreading. It looks quite promising.
|> Akkary reported average speedups of 20-30% and peak speedups of over
|> 80% on some benchmarks."

Not very hopeful, I am afraid. For a scalable solution, we need an
average of 2-3 times, at least.

|> There are plans in the works to use asymmetric helper threads on
|> actual processors. You can find stuff by googling on "helper thread."

It is an old trick.

|> The similarity to the VLIW problem is not accidental. The problems in
|> exposing and exploiting almost any kind of parallelism are pretty much
|> the same (I can thank David Gay for this insight) and that's what made
|> me decide the VLIW problem (and EPIC) were worth studying, whether
|> VLIW and EPIC were really good ideas or not.

Why? The problems were already well understood - just not by the people
promoting those technologies.

|> People usually save the "I don't think it's going to scale" argument
|> as their last line of defense. You might have gone there a little
|> prematurely. ;-).

It's known as foresight :-)

The reason that I don't like Eggers's SMT model is that the scalability
restrictions mean that it will never deliver enough of an advantage to
counteract its problems, at least in my area. Worse, it means that it
will give continuing improvements for only a few years, whereupon the
changing circumstances could well make it redundant.

This is similar. I don't see the use of hardware threads to speed up
serial execution of current codes as scalable enough to be worth the
effort of developing. But I do see them as worthwhile if we can start
to use programming paradigms that would be more suited to them.

Regards,
Nick Maclaren.

Robert Myers

unread,

Sep 22, 2003, 9:32:09 AM9/22/03

to

On 22 Sep 2003 10:53:38 GMT, nm...@cus.cam.ac.uk (Nick Maclaren) wrote:

<snip>

>
>|> The similarity to the VLIW problem is not accidental. The problems in
>|> exposing and exploiting almost any kind of parallelism are pretty much
>|> the same (I can thank David Gay for this insight) and that's what made
>|> me decide the VLIW problem (and EPIC) were worth studying, whether
>|> VLIW and EPIC were really good ideas or not.
>
>Why? The problems were already well understood - just not by the people
>promoting those technologies.
>

You and I must have different notions of what it means for something
to be well understood, or there may be some vast body of knowledge
about parallelism that I just don't know about.

<snip>

>
>This is similar. I don't see the use of hardware threads to speed up
>serial execution of current codes as scalable enough to be worth the
>effort of developing. But I do see them as worthwhile if we can start
>to use programming paradigms that would be more suited to them.
>

You are trying to look too far ahead. One step at a time.

RM

Benjamin Ylvisaker

unread,

Sep 22, 2003, 9:56:40 AM9/22/03

to

On Sun, 21 Sep 2003 23:24:08 -0400
Robert Myers <rmy...@rustuck.com> wrote:

> It's apparently been pretty hard to predict what creative people are
> going to come up with in either case. Past ends-of-the-road
> foreseen have come and gone with barely a blip.

I'm not an expert when it comes to physical design issues, so I won't
speculate. I'll just say that I'm sceptical that circuit designers
can save us from what I see as the need for some fundamental
architecture changes (that is, nontrivial changes to ISAs) within the
decade.

> I speculated earlier this summer on comp.arch about speculative
> threading (that's the subject line of the thread, and it's worth
> finding, not because of my thoughts, but because of the responses I
> got). Like almost any other idea you can think of, someone (at
> least Andy Glew) had already thought of it before. Mike Haertel
> volunteered a citation to a Ph.D. thesis:
>
> "There is a published Ph.D. dissertation by Haitham Akkary, of
> Portland State University, that gives some algorithms and speedups
> for various forms of implicit multithreading. It looks quite
> promising. Akkary reported average speedups of 20-30% and peak
> speedups of over 80% on some benchmarks."

When I have a bit more time I'll read up on the thread and the
dissertation. However, as Nick said, 20-30% does not seem terribly
impressive.

> The similarity to the VLIW problem is not accidental. The problems
> in exposing and exploiting almost any kind of parallelism are pretty
> much the same (I can thank David Gay for this insight) and that's
> what made me decide the VLIW problem (and EPIC) were worth studying,
> whether VLIW and EPIC were really good ideas or not.

I disagree heartily. My current world view involves three different
classes of parallelism: local instruction parallelism, data streaming
parallelism, and concurrent thread parallelism. The difference
between the three is that very different hardware is required to
exploit them to their fullest. I think it is reasonable to have
hardware try to automagically exploit local instruction parallelism.
However, exploiting the other two kinds requires a great deal more
non-local knowledge and in my opinion more foresight than a piece of
hardware could reasonably have.

> People usually save the "I don't think it's going to scale" argument
> as their last line of defense. You might have gone there a little
> prematurely. ;-).

I hope that Nick is right that jumping to this conclusion is foresight
on my part and not a blind spot.

Cheers,
Benjamin

Nick Maclaren

unread,

Sep 22, 2003, 10:30:07 AM9/22/03

to

In article <14ttmvcl8jjg9c7dk...@4ax.com>,

Robert Myers <rmy...@rustuck.com> writes:
|> >
|> >|> The similarity to the VLIW problem is not accidental. The problems in
|> >|> exposing and exploiting almost any kind of parallelism are pretty much
|> >|> the same (I can thank David Gay for this insight) and that's what made
|> >|> me decide the VLIW problem (and EPIC) were worth studying, whether
|> >|> VLIW and EPIC were really good ideas or not.
|> >
|> >Why? The problems were already well understood - just not by the people
|> >promoting those technologies.
|> >
|> You and I must have different notions of what it means for something
|> to be well understood, or there may be some vast body of knowledge
|> about parallelism that I just don't know about.

Vast? No. Previous? Yes. Perhaps the difference is that I
knew about it, and (clearly) the EPIC designers didn't. Some of
the results were quite old, some were well-hidden, and some were
scattered so widely that only a generalist would know them. But
enough were widely known that I correctly predicted why and how
EPIC software would have trouble. Examples:

It was well-known that the problem with optimising for parallel
execution was more in the program analysis than the code generation,
and that it was fairly architecture-independent.

It was well-known that most unmodified programs did not optimise
for code movement well (essentially what EPIC relies on), that this
was a language dependent issue (with C being one of the worst), and
that this was effectively an insoluble problem.

It was known that one of the few techniques that did work well
was global data flow analysis, but it worked well for parallelism
only if the program was compiled as a whole from the source.

It was well-known that hundreds of thousands of man-years of
research had already been put into these areas, by some of the
best software houses and computer science departments around, and
that a lot of such experience had led to the conclusion that any
viable solution needed better source languages.

All of that was widely known by 1980, except for the global data
flow analysis conclusion, which was a bit later.

|> >This is similar. I don't see the use of hardware threads to speed up
|> >serial execution of current codes as scalable enough to be worth the
|> >effort of developing. But I do see them as worthwhile if we can start
|> >to use programming paradigms that would be more suited to them.
|> >
|> You are trying to look too far ahead. One step at a time.

In this case, I don't think that you can solve the problem
incrementally, because you need BOTH components to get enough gain
to counteract the (considerable) pain. Either change on its own
will be rejected.

Regards,
Nick Maclaren.

Hank Oredson

unread,

Sep 22, 2003, 11:20:38 AM9/22/03

to

"David Wang" <f...@bar.invalid> wrote in message
news:bklu1q$mnj$1...@grapevine.wam.umd.edu...

Long ago we went through a similar discussion, and a good deal
of research, about using magnetics on chip. The same issues arise.
At some point you need to convert those magnetic fields to
electric signals. Conversion delay and power use were the issues
that killed "on chip thin magnetic film memory".

Seems to me the same concepts apply to photonics.
As long as the whole chip (processor(s) and cache) is
photonic, perhaps the idea has merit. Otherwise you
lose due to intra-chip conversion issues.

--

... Hank

Hank: http://horedson.home.att.net
W0RLI: http://w0rli.home.att.net

Nick Maclaren

unread,

Sep 22, 2003, 11:43:38 AM9/22/03

to

In article <arEbb.149016$0v4.11...@bgtnsc04-news.ops.worldnet.att.net>,
"Hank Oredson" <hore...@att.net> writes:
|>
|> > Something sensible like photons?

|> >
|> > More important to me is the question of what the latencies are

|> > through the electrical-optical transmitter/receivers. ...

|>
|> Long ago we went through a similar discussion, and a good deal
|> of research, about using magnetics on chip. The same issues arise.
|> At some point you need to convert those magnetic fields to
|> electric signals. Conversion delay and power use were the issues
|> that killed "on chip thin magnetic film memory".
|>
|> Seems to me the same concepts apply to photonics.
|> As long as the whole chip (processor(s) and cache) is
|> photonic, perhaps the idea has merit. Otherwise you
|> lose due to intra-chip conversion issues.

There is some active work, and hopeful progress, in combining the
two technologies. It doesn't necessarily need the whole chip to
be photonic, but it DOES need a very efficient interconnexion.
I have not the slightest idea whether the researchers will succeed
in achieving that.

Regards,
Nick Maclaren.

Tim

unread,

Sep 22, 2003, 12:51:53 PM9/22/03

to

"Nick Maclaren" <nm...@cus.cam.ac.uk> wrote in message
news:bkn5ba$5o5$1...@pegasus.csx.cam.ac.uk...

If CPUs were simpler (no registers, no pipeline, no ALU) and used serial
data instead of wide data, then you couldn't you use superior interconnect;
something like point-to-point LVDS? In other words, use electric field
rather than electric current (since the former propogates at the speed of
light). This would still be photonic; signals are being exchanged in a field
after all.

Tim

Zak

unread,

Sep 22, 2003, 2:03:19 PM9/22/03

to

Hank Oredson wrote:

> Long ago we went through a similar discussion, and a good deal
> of research, about using magnetics on chip. The same issues arise.
> At some point you need to convert those magnetic fields to
> electric signals. Conversion delay and power use were the issues
> that killed "on chip thin magnetic film memory".

Electric signals, magnetic signmals are 'quite related' to say the least.

> Seems to me the same concepts apply to photonics.
> As long as the whole chip (processor(s) and cache) is
> photonic, perhaps the idea has merit. Otherwise you
> lose due to intra-chip conversion issues.

Well... someone invented the light-amplifying optical fiber. It was a
clever idea.

I propose a mechanism where the transmission line itself amplifies its
signal. Note that for speed it does not really matter whether the signal
is electical or optical: it is only a different frequency on which the
refractive index of the material is relevant.

I can imagine an 'insulator' with avalanche properties that amplifies
signals passing through. The idea of solid state version of the
traveling wave tube comes to mind.

Now the issue is to find an amplifying material for these low frequency
photons that travel past these aluminum or copper lines.

I guess it will take a while.

Thomas

Del Cecchi

unread,

Sep 22, 2003, 2:25:06 PM9/22/03

to

In article <arEbb.149016$0v4.11...@bgtnsc04-news.ops.worldnet.att.net>,
"Hank Oredson" <hore...@att.net> writes:
|>

|> "David Wang" <f...@bar.invalid> wrote in message
|> news:bklu1q$mnj$1...@grapevine.wam.umd.edu...
|> > Robert Myers <rmy...@rustuck.com> wrote:
|> >

snip

snip

|> of research, about using magnetics on chip. The same issues arise.
|> At some point you need to convert those magnetic fields to
|> electric signals. Conversion delay and power use were the issues
|> that killed "on chip thin magnetic film memory".
|>
|> Seems to me the same concepts apply to photonics.
|> As long as the whole chip (processor(s) and cache) is
|> photonic, perhaps the idea has merit. Otherwise you
|> lose due to intra-chip conversion issues.
|>

If we were to switch to a material that, unlike silicon, was good at making
photons, GaAs for example, one cold imagine shooting photons through the air
somehow. Photons are always the cure all. But it never turns out that way.

There have been some interesting proposals for using optical signaling for chip
to chip signaling. And the military likes it because fibre is more EMP
resistant. Although if we have to actually deal with EMP we have a lot bigger
problem than booting our PC. :-)

--

Del Cecchi
cec...@us.ibm.com
Personal Opinions Only

Hank Oredson

unread,

Sep 22, 2003, 5:11:32 PM9/22/03

to

"Zak" <Z...@spam.invalid> wrote in message
news:HPGbb.6194$P51.10622@amstwist00...

> Hank Oredson wrote:
>
> > Long ago we went through a similar discussion, and a good deal
> > of research, about using magnetics on chip. The same issues arise.
> > At some point you need to convert those magnetic fields to
> > electric signals. Conversion delay and power use were the issues
> > that killed "on chip thin magnetic film memory".
>
> Electric signals, magnetic signmals are 'quite related' to say the least.

Magnetic signals?
These were magnetic memory elements.
I think you are confused about how they work.

Robert Myers

unread,

Sep 22, 2003, 7:04:44 PM9/22/03

to

On 22 Sep 2003 14:30:07 GMT, nm...@cus.cam.ac.uk (Nick Maclaren) wrote:

> It was well-known that the problem with optimising for parallel
>execution was more in the program analysis than the code generation,
>and that it was fairly architecture-independent.
>

It is really very hard to believe that the designers of EPIC didn't
understand that.

> It was well-known that most unmodified programs did not optimise
>for code movement well (essentially what EPIC relies on), that this
>was a language dependent issue (with C being one of the worst), and
>that this was effectively an insoluble problem.
>

From static analysis, yes. When you run the code, the ambiguities go
away. The first time, you guess. If you guess wrong, you have to fix
the mistake. The second time, you have a *much* smaller likelihood of
guessing wrong about data ambiguities. If you didn't do anything
else, FDO would be worth it for this advantage alone.

But what about all the problems with FDO we've talked about, and
haven't I expressed doubts about whether FDO will ever be acceptable
for routine use? Yes, I have. What, then, is my real opinion? It
depends on what day you ask me.

It is *much* easier to tell what a code actually does if you run it
(duh). If Intel implements the fast and slow scheduling window scheme
they've talked about openly (and the scheme looks complicated enough
that it seems likes they've invested a fair bit into it), the problems
with cache size and memory latency uncertainty become much smaller.
The FDO doesn't have to do optimal scheduling. It only has to resolve
control and data ambiguities (at least in a probabalistic sense).

> It was known that one of the few techniques that did work well
>was global data flow analysis, but it worked well for parallelism
>only if the program was compiled as a whole from the source.
>

As discussed in a recent thread, a sufficiently expressive
intermediate representation would probably be adequate.

> It was well-known that hundreds of thousands of man-years of
>research had already been put into these areas, by some of the
>best software houses and computer science departments around, and
>that a lot of such experience had led to the conclusion that any
>viable solution needed better source languages.
>

Mostly what all that research showed is that no amount of static
analysis with an off-the-shelf program would yield results that were
worth the trouble.

>All of that was widely known by 1980, except for the global data
>flow analysis conclusion, which was a bit later.
>
>|> >This is similar. I don't see the use of hardware threads to speed up
>|> >serial execution of current codes as scalable enough to be worth the
>|> >effort of developing. But I do see them as worthwhile if we can start
>|> >to use programming paradigms that would be more suited to them.
>|> >
>|> You are trying to look too far ahead. One step at a time.
>
>In this case, I don't think that you can solve the problem
>incrementally, because you need BOTH components to get enough gain
>to counteract the (considerable) pain. Either change on its own
>will be rejected.
>

In this particular case, I don't think it matters much what either one
of us thinks because I think the hardware guys are going to do
whatever they can with whatever they can jam into the transistor
budget, anyway.

RM

Robert Myers

unread,

Sep 22, 2003, 11:44:41 PM9/22/03

to

On Mon, 22 Sep 2003 09:56:40 -0400, Benjamin Ylvisaker
<benj...@alumni.cmu.edu> wrote:

>On Sun, 21 Sep 2003 23:24:08 -0400
>Robert Myers <rmy...@rustuck.com> wrote:

>
>> The similarity to the VLIW problem is not accidental. The problems
>> in exposing and exploiting almost any kind of parallelism are pretty
>> much the same (I can thank David Gay for this insight) and that's
>> what made me decide the VLIW problem (and EPIC) were worth studying,
>> whether VLIW and EPIC were really good ideas or not.
>
>I disagree heartily. My current world view involves three different
>classes of parallelism: local instruction parallelism, data streaming
>parallelism, and concurrent thread parallelism. The difference
>between the three is that very different hardware is required to
>exploit them to their fullest. I think it is reasonable to have
>hardware try to automagically exploit local instruction parallelism.
>However, exploiting the other two kinds requires a great deal more
>non-local knowledge and in my opinion more foresight than a piece of
>hardware could reasonably have.
>

We're talking past each other a bit. The analysis required to expose
instruction level, thread level, or streaming parallelism is the same.
Operations that are independent of one another can be done in parallel
or overlapped if both have the necessary data available to proceed.
In all three cases, you wind up asking yourself the same questions:
how does the programmer best identify opportunities for parallelism to
the compiler, how does the compiler identify opportunities for
parallelism to the processor, and how much can you get the hardware to
do at runtime.

We've pretty much talked instruction level parallelism to death.
Streaming parallelism is not all that interesting to talk about. You
don't think much can be done at runtime to identify and exploit
thread-level parallelism. It's not the first way I'd think of doing
business, but I wouldn't be prepared to predict that we'll never have
a hardware system that can chomp a dusty deck and run it effectively
on a processor that consists of many processors. Forty years of
research on software has given us some clumsy theories and dozens of
clumsy, inadequate, and incompatible software models. It's hard to
imagine that progress in hardware could be any slower.

RM

Paul A. Clayton

unread,

Sep 23, 2003, 3:29:30 AM9/23/03

to

In article <bkk9c1$qdh$1...@pegasus.csx.cam.ac.uk>,
nm...@cus.cam.ac.uk (Nick Maclaren) wrote:

>Now, my question is whether we have seen the beginning of the end of
>Not-Moore's Law?

In addition to power consumption, wire delay, and reduced
return on investment in seeking ILP (i.e., technical problems),
there might be economic problems. Keeping die size the
same (i.e., using all of the feature shrink to expand features)
while increasing the fixed costs of manufacturing (cost of
fab, of masks [especially as their number increases], of
design [likely to increase from increased complexity]) requires
an increase in volume to maintain a performance/price
doubling. It seems quite possible that the market for
peak ILP processors might not be growing fast enough
to support increasing fixed costs.

>Even if Moore's Law holds, and the process sizes continue to drop,
>the need to keep the power consumption down will mean that CPUs have
>to be run slower. So we could see the doubling time increase from
>1.5-2 years to 3-4 years, and perhaps longer. But will we? Or are
>there significant rabbits still hidden in the hat?

As you know, not all of the performance gain is from
increasing clock frequency. Wider issue/SIMD, larger
on-die caches, 'better' software, OoOE, etc. have
contributed their share.

It is also interesting to ask how general and how
useful performance increases will be. It seems
that memory latency might become increasingly
painful (so _general_ performance might tend to
suffer). OTOH, some of the applications limited
by processor performance might be friendly to
hardware improvements like SIMD, so _useful_
performance might increase (i.e., if the 'fast
enough' applications remain fast enough and
most other applications follow the exponential
curve).

Paul A. Clayton
just a former McD's grill worker and technophile

Nick Maclaren

unread,

Sep 23, 2003, 4:39:29 AM9/23/03

to

In article <iguumvgefcgj2jtok...@4ax.com>,

Robert Myers <rmy...@rustuck.com> writes:
|> On 22 Sep 2003 14:30:07 GMT, nm...@cus.cam.ac.uk (Nick Maclaren) wrote:
|>
|> > It was well-known that the problem with optimising for parallel
|> >execution was more in the program analysis than the code generation,
|> >and that it was fairly architecture-independent.
|> >
|> It is really very hard to believe that the designers of EPIC didn't
|> understand that.

It is, indeed, hard to believe that - but the evidence is that they
didn't! My guess is that they were AWARE of the problem, but did
not UNDERSTAND it in detail and consequences.

|> > It was well-known that most unmodified programs did not optimise
|> >for code movement well (essentially what EPIC relies on), that this
|> >was a language dependent issue (with C being one of the worst), and
|> >that this was effectively an insoluble problem.
|> >
|> From static analysis, yes. When you run the code, the ambiguities go
|> away. The first time, you guess. If you guess wrong, you have to fix
|> the mistake. The second time, you have a *much* smaller likelihood of
|> guessing wrong about data ambiguities. If you didn't do anything
|> else, FDO would be worth it for this advantage alone.

That applies ONLY if you are running the IDENTICAL operations a
second time - and that is of interest solely in benchmarketing.
The limitations of feedback-directed optimisation were why it was
abandoned in the 1970s, and not because the technology wasn't up
to it or because it wasn't needed.

In particular, while you can get some useful information from it,
there are few ways that you can get any RELIABLE information about
data ambiguities (mainly aliasing) from it. And, as getting the
aliasing wrong leads to wrong answers with architectures like EPIC,
you cannot use unreliable information.

This was all known by 1980, though perhaps not by the designers of
EPIC.

|> But what about all the problems with FDO we've talked about, and
|> haven't I expressed doubts about whether FDO will ever be acceptable
|> for routine use? Yes, I have. What, then, is my real opinion? It
|> depends on what day you ask me.

Quite.

|> It is *much* easier to tell what a code actually does if you run it
|> (duh). If Intel implements the fast and slow scheduling window scheme
|> they've talked about openly (and the scheme looks complicated enough
|> that it seems likes they've invested a fair bit into it), the problems
|> with cache size and memory latency uncertainty become much smaller.
|> The FDO doesn't have to do optimal scheduling. It only has to resolve
|> control and data ambiguities (at least in a probabalistic sense).

And that is precisely what it doesn't do, deterministically, and using
only probabilistic results works only if an error leads merely to loss
of efficiency and not wrong answers. I.e. not on EPIC.

|> > It was known that one of the few techniques that did work well
|> >was global data flow analysis, but it worked well for parallelism
|> >only if the program was compiled as a whole from the source.
|> >
|> As discussed in a recent thread, a sufficiently expressive
|> intermediate representation would probably be adequate.

Dare I mention ANDF?

|> > It was well-known that hundreds of thousands of man-years of
|> >research had already been put into these areas, by some of the
|> >best software houses and computer science departments around, and
|> >that a lot of such experience had led to the conclusion that any
|> >viable solution needed better source languages.
|> >
|> Mostly what all that research showed is that no amount of static
|> analysis with an off-the-shelf program would yield results that were
|> worth the trouble.

Absolutely NOT. There was also the evidence that dynamic analysis
could not help much with such programs, except when used for
benchmarketing, because it could not resolve the ambiguities
reliably. I knew about those results - why didn't they?

|> In this particular case, I don't think it matters much what either one
|> of us thinks because I think the hardware guys are going to do
|> whatever they can with whatever they can jam into the transistor
|> budget, anyway.

True. But I am interested in seeing which way the wind is blowing.

Regards,
Nick Maclaren.

Nick Maclaren

unread,

Sep 23, 2003, 4:44:35 AM9/23/03

to

In article <20030923032930...@mb-m16.aol.com>, memory...@aol.com (Paul A. Clayton) writes:
|> In article <bkk9c1$qdh$1...@pegasus.csx.cam.ac.uk>,
|> nm...@cus.cam.ac.uk (Nick Maclaren) wrote:
|>
|> >Now, my question is whether we have seen the beginning of the end of
|> >Not-Moore's Law?
|>
|> In addition to power consumption, wire delay, and reduced
|> return on investment in seeking ILP (i.e., technical problems),
|> there might be economic problems. Keeping die size the

|> same ...

A good point.

|> As you know, not all of the performance gain is from
|> increasing clock frequency. Wider issue/SIMD, larger
|> on-die caches, 'better' software, OoOE, etc. have
|> contributed their share.

But they are running out of steam, too! The initial gains are
almost always the easiest.

Regards,
Nick Maclaren.

Robert Myers

unread,

Sep 23, 2003, 8:45:22 AM9/23/03

to

On 23 Sep 2003 08:39:29 GMT, nm...@cus.cam.ac.uk (Nick Maclaren) wrote:

>
>In article <iguumvgefcgj2jtok...@4ax.com>,
>Robert Myers <rmy...@rustuck.com> writes:

<snip>

>|> >
>|> From static analysis, yes. When you run the code, the ambiguities go
>|> away. The first time, you guess. If you guess wrong, you have to fix
>|> the mistake. The second time, you have a *much* smaller likelihood of
>|> guessing wrong about data ambiguities. If you didn't do anything
>|> else, FDO would be worth it for this advantage alone.
>
>That applies ONLY if you are running the IDENTICAL operations a
>second time - and that is of interest solely in benchmarketing.
>The limitations of feedback-directed optimisation were why it was
>abandoned in the 1970s, and not because the technology wasn't up
>to it or because it wasn't needed.
>
>In particular, while you can get some useful information from it,
>there are few ways that you can get any RELIABLE information about
>data ambiguities (mainly aliasing) from it. And, as getting the
>aliasing wrong leads to wrong answers with architectures like EPIC,
>you cannot use unreliable information.
>

Well, but you can. That's what the Advanced Load Address Table is
for: to keep track of loads that might be wrong, so they can be
corrected. Loading speculatively (and possibly using the load) makes
no sense if the expected cost of fixing misspeculation exceeds the
cost of forcing the load to wait. You can't estimate that expected
cost without running the code, and it's not nearly as fragile as you
seem to think it is.

I'm not sure where the disconnect is, because you surely must be aware
of the ability of Itanium to speculate and to correct the speculation
if it is wrong. It might turn out not to be all that useful, or it
might turn out to be more trouble than it's worth, but the facility
exists.

<snip>

>
>|> > It was known that one of the few techniques that did work well
>|> >was global data flow analysis, but it worked well for parallelism
>|> >only if the program was compiled as a whole from the source.
>|> >
>|> As discussed in a recent thread, a sufficiently expressive
>|> intermediate representation would probably be adequate.
>
>Dare I mention ANDF?
>

Sure you can. I wasn't aware of it, so I'm glad you did. I'd say
it's at least as on-topic as the interminable discussions of
high-level languages.

>|> > It was well-known that hundreds of thousands of man-years of
>|> >research had already been put into these areas, by some of the
>|> >best software houses and computer science departments around, and
>|> >that a lot of such experience had led to the conclusion that any
>|> >viable solution needed better source languages.
>|> >
>|> Mostly what all that research showed is that no amount of static
>|> analysis with an off-the-shelf program would yield results that were
>|> worth the trouble.
>
>Absolutely NOT. There was also the evidence that dynamic analysis
>could not help much with such programs, except when used for
>benchmarketing, because it could not resolve the ambiguities
>reliably. I knew about those results - why didn't they?
>

I haven't seen results that used dynamic analysis other than
passively; i.e., to determine success or failure. That doesn't mean
the work wasn't done or published, it just means I haven't seen it.
I'll look harder.

>|> In this particular case, I don't think it matters much what either one
>|> of us thinks because I think the hardware guys are going to do
>|> whatever they can with whatever they can jam into the transistor
>|> budget, anyway.
>
>True. But I am interested in seeing which way the wind is blowing.
>

Me, too, and for more than just idle curiosity.

RM

Nick Maclaren

unread,

Sep 23, 2003, 9:41:13 AM9/23/03

to

In article <6te0nvsu4h7e64ucd...@4ax.com>,

Robert Myers <rmy...@rustuck.com> writes:
|> >|> >
|> >|> From static analysis, yes. When you run the code, the ambiguities go
|> >|> away. The first time, you guess. If you guess wrong, you have to fix
|> >|> the mistake. The second time, you have a *much* smaller likelihood of
|> >|> guessing wrong about data ambiguities. If you didn't do anything
|> >|> else, FDO would be worth it for this advantage alone.
|> >

|> >In particular, while you can get some useful information from it,
|> >there are few ways that you can get any RELIABLE information about
|> >data ambiguities (mainly aliasing) from it. And, as getting the
|> >aliasing wrong leads to wrong answers with architectures like EPIC,
|> >you cannot use unreliable information.
|>
|> Well, but you can. That's what the Advanced Load Address Table is
|> for: to keep track of loads that might be wrong, so they can be
|> corrected. Loading speculatively (and possibly using the load) makes
|> no sense if the expected cost of fixing misspeculation exceeds the
|> cost of forcing the load to wait. You can't estimate that expected
|> cost without running the code, and it's not nearly as fragile as you
|> seem to think it is.

But you're NOT estimating the cost - which is what feedback-directed
optimisation does - and you are NOT doing what you describe in your
first paragraph above. I agree that the ALAT technique is better
suited to unpredictable codes than feedback-directed optimisation.

|> >Absolutely NOT. There was also the evidence that dynamic analysis
|> >could not help much with such programs, except when used for
|> >benchmarketing, because it could not resolve the ambiguities
|> >reliably. I knew about those results - why didn't they?
|> >
|> I haven't seen results that used dynamic analysis other than
|> passively; i.e., to determine success or failure. That doesn't mean
|> the work wasn't done or published, it just means I haven't seen it.
|> I'll look harder.

They might be a trifle hard to find by simply searching, and you
would do best to start by asking people who were active in compiler
optimisation research in the 1970s and 1980s. They aren't hard
to repeat, in simple cases. Here are a few of them.

Take a look at the alias detection problem, as a static problem.
Now consider that you are running the program, and have perfect
information on whether two potentially aliased objects do, in fact,
alias one another. Now work out how much information the latter
gives you to reduce the uncertainties in the former - and it is
usually very little.

The point here is that static analysis is what is needed to be
certain that two objects will NEVER alias one another - yes, there
are circumstances under which that is statically possible but
dynamically impossible, but they aren't the most common. And it
that information that you need for optimisation.

Similarly, the dynamic approach suffers from combinatoric explosion
with more than a couple of objects involved, though it was and is
very heavily used on vector systems, SMP systems and EPIC. It is
also assumed by (and usually used in) the level 2 and 3 BLAS. ALAT
helps a little, but only with the simpler constructions.

In this, I am ignoring the serious problem with the IA-64 that the
compiler has to generate alternate code paths, one of which may be
almost never taken, which is a recipe for the foulest sort of code
generation error.

Regards,
Nick Maclaren.

Robert Myers

unread,

Sep 23, 2003, 10:09:18 AM9/23/03

to

On 23 Sep 2003 13:41:13 GMT, nm...@cus.cam.ac.uk (Nick Maclaren) wrote:

>
>In article <6te0nvsu4h7e64ucd...@4ax.com>,
>Robert Myers <rmy...@rustuck.com> writes:
>|> >|> >
>|> >|> From static analysis, yes. When you run the code, the ambiguities go
>|> >|> away. The first time, you guess. If you guess wrong, you have to fix
>|> >|> the mistake. The second time, you have a *much* smaller likelihood of
>|> >|> guessing wrong about data ambiguities. If you didn't do anything
>|> >|> else, FDO would be worth it for this advantage alone.
>|> >
>|> >In particular, while you can get some useful information from it,
>|> >there are few ways that you can get any RELIABLE information about
>|> >data ambiguities (mainly aliasing) from it. And, as getting the
>|> >aliasing wrong leads to wrong answers with architectures like EPIC,
>|> >you cannot use unreliable information.
>|>
>|> Well, but you can. That's what the Advanced Load Address Table is
>|> for: to keep track of loads that might be wrong, so they can be
>|> corrected. Loading speculatively (and possibly using the load) makes
>|> no sense if the expected cost of fixing misspeculation exceeds the
>|> cost of forcing the load to wait. You can't estimate that expected
>|> cost without running the code, and it's not nearly as fragile as you
>|> seem to think it is.
>
>But you're NOT estimating the cost - which is what feedback-directed
>optimisation does - and you are NOT doing what you describe in your
>first paragraph above. I agree that the ALAT technique is better
>suited to unpredictable codes than feedback-directed optimisation.
>

You've completely lost me here. You talk about the ALAT and
feedback-directed optimization as if they were competing methods.
Feedback-directed optimization allows you to make more intelligent use
of speculation. Maybe you're saying that it's supposed to do that,
but that it can't, or that in practice it isn't all that helpful.
Jury still out on that as far as I'm concerned; that's why you get
different opinions from me on different days.

<snip>

>
>The point here is that static analysis is what is needed to be
>certain that two objects will NEVER alias one another - yes, there
>are circumstances under which that is statically possible but
>dynamically impossible, but they aren't the most common. And it
>that information that you need for optimisation.
>

You don't need to guarantee that two objects will never alias one
another. Based on static analysis or profiling information, you guess
whether two objects will alias one another or not. If you guess that
they won't and you guess wrong, you fix your mistake. It sounds
complicated, and it is, but that's what the game is about.

>Similarly, the dynamic approach suffers from combinatoric explosion
>with more than a couple of objects involved, though it was and is
>very heavily used on vector systems, SMP systems and EPIC. It is
>also assumed by (and usually used in) the level 2 and 3 BLAS. ALAT
>helps a little, but only with the simpler constructions.

The standard way of avoiding combinatoric explosion in optimizing
compilers is to make arbitrary guesses about what code transformations
will be helpful and then to make even more arbitrary decisions about
whether they are helpful or not by seeing how they do on some standard
set of tests. There are so many things wrong with that approach that
I don't even know where to begin.

The idea of deciding what the compiler should do based on what happens
with *your* code, rather than on what happens with some arbitrary test
suite is so appealing to me that I am willing to be stubborn in the
face of skepticism and previous failures.

RM

Nick Maclaren

unread,

Sep 23, 2003, 10:19:17 AM9/23/03

to

In article <voj0nvc6mlfbtpnag...@4ax.com>,

Robert Myers <rmy...@rustuck.com> writes:
|> >
|> You've completely lost me here. You talk about the ALAT and
|> feedback-directed optimization as if they were competing methods.
|> Feedback-directed optimization allows you to make more intelligent use
|> of speculation. Maybe you're saying that it's supposed to do that,
|> but that it can't, or that in practice it isn't all that helpful.
|> Jury still out on that as far as I'm concerned; that's why you get
|> different opinions from me on different days.

Oh, right, I see what you mean. But that isn't the only purpose
of feedback-directed optimisation, or even the main one in many
compilers. To a great extent it and the ALAT are orthogonal
optimisation methods, and it can be used to enhance register use
efficiency, speculation, common expression elimination, data
reordering and many other methods, just as it can the use of the
ALAT.

|> >The point here is that static analysis is what is needed to be
|> >certain that two objects will NEVER alias one another - yes, there
|> >are circumstances under which that is statically possible but
|> >dynamically impossible, but they aren't the most common. And it
|> >that information that you need for optimisation.
|> >

|> You don't need to guarantee that two objects will never alias one
|> another. Based on static analysis or profiling information, you guess
|> whether two objects will alias one another or not. If you guess that
|> they won't and you guess wrong, you fix your mistake. It sounds
|> complicated, and it is, but that's what the game is about.

Right. Techniques like the ALAT, Alpha's load conditional and even
handling IEEE denormalised numbers by interrupt make it practical to
use unreliable information - but at a cost! If you get it more than
slightly wrong, the expense of recovering from the mistake can be
larger than the expense of not optimising.

|> The standard way of avoiding combinatoric explosion in optimizing
|> compilers is to make arbitrary guesses about what code transformations
|> will be helpful and then to make even more arbitrary decisions about
|> whether they are helpful or not by seeing how they do on some standard
|> set of tests. There are so many things wrong with that approach that
|> I don't even know where to begin.

Quite. And the same applies when it is done in hardware.

|> The idea of deciding what the compiler should do based on what happens
|> with *your* code, rather than on what happens with some arbitrary test
|> suite is so appealing to me that I am willing to be stubborn in the
|> face of skepticism and previous failures.

Yes, but that only applies when you are repeatedly doing almost
identical things! If you think that your test conditions are
representative, and are wrong, you can be worse off than not
optimising. I see this every day in operating systems :-(

Regards,
Nick Maclaren.

Benjamin Ylvisaker

unread,

Sep 23, 2003, 11:12:31 AM9/23/03

to

On Mon, 22 Sep 2003 23:44:41 -0400
Robert Myers <rmy...@rustuck.com> wrote:

> Operations that are independent of one another can be done in
> parallel or overlapped if both have the necessary data available to
> proceed. In all three cases, you wind up asking yourself the same
> questions: how does the programmer best identify opportunities for
> parallelism to the compiler, how does the compiler identify
> opportunities for parallelism to the processor, and how much can you
> get the hardware to do at runtime.

I agree. However, even though you ask yourself the same questions,
the answers are very different for each different kind of parallelism.
Let's just look at the programmer's responsibilities for a second. I
don't think programmers need to worry about local ILP that much;
optimizing compilers should mostly take care of that. However, the
sorts of programs that can easily be written to take advantage of
streaming data parallelism (e.g. DSP filters) and the sorts of
programs that can easily be written to take advantage of threading
parallelism (e.g. internet servers) require a very different
programmer mindset. I think that similar arguments apply the whole
way down the stack; different compiler optimizations and hardware
structures are needed to best take advantage of each kind of
parallelism.

> We've pretty much talked instruction level parallelism to death.

And architects have researched it to death. I believe that current
architectures are pretty close to the maximum economical level of ILP
exploitation.

> Streaming parallelism is not all that interesting to talk about.

If you say so. This is exactly what my current research is about.
Reconfigurable computing projects have shown that algorithms with a
high degree of data streaming parallelism can be run 1-2 orders of
magnitude more efficiently on architectures designed for such programs
compared to conventional microprocessors. The tricky part is
intergrating these data stream oriented architectures with more
traditional architectural ideas. Maybe in a couple of years I'll be
able to reference my dissertation on the topic.

> You don't think much can be done at runtime to identify and exploit
> thread-level parallelism. It's not the first way I'd think of doing
> business, but I wouldn't be prepared to predict that we'll never have
> a hardware system that can chomp a dusty deck and run it effectively
> on a processor that consists of many processors. Forty years of
> research on software has given us some clumsy theories and dozens of
> clumsy, inadequate, and incompatible software models. It's hard to
> imagine that progress in hardware could be any slower.

If you're interested in an agile, adequate, universal software model
of concurrency, I recommend checking out the pi calculus literature.
Good stuff.

Cheers,
Benjamin

Robert Myers

unread,

Sep 23, 2003, 11:15:48 AM9/23/03

to

On 23 Sep 2003 14:19:17 GMT, nm...@cus.cam.ac.uk (Nick Maclaren) wrote:

>
>In article <voj0nvc6mlfbtpnag...@4ax.com>,
>Robert Myers <rmy...@rustuck.com> writes:
>|> >
>|> You've completely lost me here. You talk about the ALAT and
>|> feedback-directed optimization as if they were competing methods.
>|> Feedback-directed optimization allows you to make more intelligent use
>|> of speculation. Maybe you're saying that it's supposed to do that,
>|> but that it can't, or that in practice it isn't all that helpful.
>|> Jury still out on that as far as I'm concerned; that's why you get
>|> different opinions from me on different days.
>
>Oh, right, I see what you mean. But that isn't the only purpose
>of feedback-directed optimisation, or even the main one in many
>compilers. To a great extent it and the ALAT are orthogonal
>optimisation methods, and it can be used to enhance register use
>efficiency, speculation, common expression elimination, data
>reordering and many other methods, just as it can the use of the
>ALAT.
>

I've confused the issue by switching back and forth between talking
about what happens with a proprietary compiler like icc and what could
be done with an open source compiler. If I had limitless time, I
might spend some of it trying to reverse engineer icc, at least for
simple situations, but I don't have limitless time.

One of the advantages of an open-source compiler is that things like
feedback-directed optimization mean what you think they should mean,
rather than what the owner of a proprietary compiler thinks they
should mean. :-).

<snip>

>|> You don't need to guarantee that two objects will never alias one
>|> another. Based on static analysis or profiling information, you guess
>|> whether two objects will alias one another or not. If you guess that
>|> they won't and you guess wrong, you fix your mistake. It sounds
>|> complicated, and it is, but that's what the game is about.
>
>Right. Techniques like the ALAT, Alpha's load conditional and even
>handling IEEE denormalised numbers by interrupt make it practical to
>use unreliable information - but at a cost! If you get it more than
>slightly wrong, the expense of recovering from the mistake can be
>larger than the expense of not optimising.
>

And the only way to find out for sure whether something is worth the
cost or not is to try it.

<snip>

>
>Yes, but that only applies when you are repeatedly doing almost
>identical things! If you think that your test conditions are
>representative, and are wrong, you can be worse off than not
>optimising. I see this every day in operating systems :-(

I'll leave operating systems to the truly brave souls of this world.
Resolving control and data ambiguities (at least in terms of
probabilities) by actually running the code are what I see as the
low-hanging fruit of feedback-directed-optimization.

Some things, like trying to substitute FDO for OoO to hide memory
latency are just too sensitive to unpredictable details and just
aren't going to work reliably. I think Intel has already figured that
out. It doesn't follow, though, that FDO is useless.

RM

Robert Myers

unread,

Sep 23, 2003, 11:50:58 AM9/23/03

to

On Tue, 23 Sep 2003 11:12:31 -0400, Benjamin Ylvisaker
<benj...@alumni.cmu.edu> wrote:

>On Mon, 22 Sep 2003 23:44:41 -0400
>Robert Myers <rmy...@rustuck.com> wrote:

<snip>

>
>> Streaming parallelism is not all that interesting to talk about.
>
>If you say so. This is exactly what my current research is about.
>Reconfigurable computing projects have shown that algorithms with a
>high degree of data streaming parallelism can be run 1-2 orders of
>magnitude more efficiently on architectures designed for such programs
>compared to conventional microprocessors. The tricky part is
>intergrating these data stream oriented architectures with more
>traditional architectural ideas. Maybe in a couple of years I'll be
>able to reference my dissertation on the topic.
>

Let me be in haste to correct myself. Streaming parallelism isn't all
that interesting to talk about in the context of current
general-purpose microprocessors.

I will readily agree that you could do much better with different
hardware. I hope you can talk somebody into building some, since most
of my work lends itself very nicely to streaming.

<snip>

>If you're interested in an agile, adequate, universal software model
>of concurrency, I recommend checking out the pi calculus literature.

I will do that. Anyone else who wanted to make similar promises about
a model for concurrency would get my interest, too. Thanks.

RM

Christopher Brian Colohan

unread,

Sep 23, 2003, 12:01:06 PM9/23/03

to

Robert Myers <rmy...@rustuck.com> writes:
> You
> don't think much can be done at runtime to identify and exploit
> thread-level parallelism. It's not the first way I'd think of doing
> business, but I wouldn't be prepared to predict that we'll never have
> a hardware system that can chomp a dusty deck and run it effectively
> on a processor that consists of many processors.

Actually, there has been at least one paper that has evaluated the
idea of dividing up a program into speculative threads at backwards
branches and at function call points. It isn't that bad.
Unfortunately, it requires hardware which can handle exposed registers
(either through forwarding of produced values, good value predictors,
a shared register file, or some combination) and that hardware can be
non-trivial and/or non-scalable. But if you are willing to invest in
the hardware, it looks like the performance can be pretty good.

I personally like to get the compiler involved in the thread selection
process, 'cause it can then go and perform optimizations and
transformations to improve the performance of the resulting threads.
At this point I think the best solution will involve a combination of
the compiler and some run-time techniques. Once I have a really good
answer for this, my advisor might even let me graduate. :-)

You would think I have a reference handy (I am currently trying to
write a thesis on the problem of finding speculative threads), but I
don't have it off the top of my head, and no time to go digging
through my pile of papers now. :-) In particular, many of the papers
which have evaluated architectures for thread-level speculation (and
other names for similar ideas) have involved either no compiler or
minimal compilers (identifying thread boundaries and exposed register
sets, without changing the generated code). Some of these papers show
promising results, even without an intelligent compiler -- so there is
hope yet! :-)

Chris
--
Chris Colohan Email: ch...@colohan.ca PGP: finger col...@cs.cmu.edu
Web: www.colohan.com Phone: (412)268-4751

Thomas Lindgren

unread,

Sep 23, 2003, 4:12:43 PM9/23/03

to

Robert Myers <rmy...@rustuck.com> writes:

> The standard way of avoiding combinatoric explosion in optimizing
> compilers is to make arbitrary guesses about what code transformations
> will be helpful and then to make even more arbitrary decisions about
> whether they are helpful or not by seeing how they do on some standard
> set of tests. There are so many things wrong with that approach that
> I don't even know where to begin.

I feel your pain, I really do.

<http://www.google.com/groups?q=group:comp.arch+author:Thomas+author:Lindgren&start=10&hl=en&lr=&ie=UTF-8&oe=UTF-8&selm=lzn0samsgt.fsf%40no-longer-at.bluetail.com&rnum=19>

Best,
Thomas
--
Thomas Lindgren
"It's becoming popular? It must be in decline." -- Isaiah Berlin

David C DiNucci

unread,

Sep 23, 2003, 5:57:36 PM9/23/03

to

Robert Myers wrote:
>
> On Mon, 22 Sep 2003 09:56:40 -0400, Benjamin Ylvisaker
> <benj...@alumni.cmu.edu> wrote:
>
> >On Sun, 21 Sep 2003 23:24:08 -0400
> >Robert Myers <rmy...@rustuck.com> wrote:
>
> >
> >> The similarity to the VLIW problem is not accidental. The problems
> >> in exposing and exploiting almost any kind of parallelism are pretty
> >> much the same (I can thank David Gay for this insight) and that's
> >> what made me decide the VLIW problem (and EPIC) were worth studying,
> >> whether VLIW and EPIC were really good ideas or not.
> >
> >I disagree heartily. My current world view involves three different
> >classes of parallelism: local instruction parallelism, data streaming
> >parallelism, and concurrent thread parallelism. The difference
> >between the three is that very different hardware is required to

> >exploit them to their fullest....

> >
> We're talking past each other a bit. The analysis required to expose
> instruction level, thread level, or streaming parallelism is the same.

...

>
> We've pretty much talked instruction level parallelism to death.
> Streaming parallelism is not all that interesting to talk about.

I'm not sure what are you guys calling "streaming parallelism". A
google search returns about 5 diverse hits. Same with "stream level
parallelism". I could interpret it to mean the ability to break a
computation down into a pipeline (e.g. at the software level), I could
interpret it to mean parallelism that can be extracted from the
instruction stream by a general purpose processor (crossing into OoO and
ILP in general), or that which occurs in a specialized processor like a
GPU, or maybe even dataflow in general. ("Sisal", for example, stands
for "Streams and iteration in a single assignment language.) Some of
these may benefit radically from special purpose processors, some not so
much.

I'm sure that people would look at the brand of parallelism I propose
and say it deals with streams, but I still consider it thread-level
parallelism (or I guess I'd call it "threadlet-level", or "transaction
level"). In it, transactions (or threadlets) can certainly be set up in
a pipelined fashion to hand data off in a unidirectional order, thereby
forming pipelines, but the same sorts of transactions can be set up in
non-pipelined patterns, so I see little merit in naming it for just one
thing that it can do at the exclusion of others. Then again, if the
transactions or threadlets are essentially one (or a few) instructions
each, and there is special hardware to handle their scheduling and
handing data off to one another efficiently, I can see why people might
want to give it a special category. In that context, streaming
parallelism requires different hardware *by definition*.

> You
> don't think much can be done at runtime to identify and exploit
> thread-level parallelism. It's not the first way I'd think of doing
> business, but I wouldn't be prepared to predict that we'll never have
> a hardware system that can chomp a dusty deck and run it effectively
> on a processor that consists of many processors.

I think it's a sad statement. First we make compilers so they don't
really spit out the same program we put in, in some sense. Then we make
machines so they don't even run the same program we give them. If a
hardware system like you mention is ever created, I for one would not
consider it progress. It is apparently motivated by, and perpetuates,
this stereotype that multi-threaded code is too complex for programmers
to handle. I agree that it is if threads persist through interactions
with other threads, but with proper programming methodologies, that
doesn't have to be so, and threadlet programming becomes a natural part
of software engineering. In other words, it would make much more sense
to convert that dusty deck at the software level, not only for the
benefit of the machine, but for the benefit of the people maintaining
and debugging it.

-- Dave
-----------------------------------------------------------------
David C. DiNucci Elepar Tools for portable grid,
da...@elepar.com http://www.elepar.com parallel, distributed, &
503-439-9431 Beaverton, OR 97006 peer-to-peer computing

Robert Myers

unread,

Sep 23, 2003, 8:33:10 PM9/23/03

to

On Tue, 23 Sep 2003 14:57:36 -0700, David C DiNucci <da...@elepar.com>
wrote:

<snip>

>I'm not sure what are you guys calling "streaming parallelism".

If you have an array f(i,j) and you want to compute the 2-D FFT on a
Cray 1, you do the x-transform by making j the inner loop index and
performing the actual operations of the FFT as the outer loop. In
doing so, you give up data locality, so it's not a good strategy for a
machine that uses cache to get reasonable access to memory, but it
worked beautifully on the Cray-1, where, if all things were arranged
properly, everthing was pipelined so you could produce and store one
new FMAC per clock.

In concept, you are feeding an endless stream of data through a
functional pipeline that does the same thing to each item it is
presented with (even though on the Cray, you actually had to do things
64 items at a time, a detail the compiler would take care of for you
if you were writing in Fortran). I would call that streaming
parallelism.

Endless elaborations of the basic scheme can be imagined, and I
imagine some have actually been implemented in hardware. If you had
enough FMAC units that could all be chained, for example, you could
unroll the outer loop and produce one new bundle of FFT's every clock.
(The picture in my mind is right, but I'm not sure if I just described
it correctly in words).

>A
>google search returns about 5 diverse hits. Same with "stream level
>parallelism". I could interpret it to mean the ability to break a
>computation down into a pipeline (e.g. at the software level), I could
>interpret it to mean parallelism that can be extracted from the
>instruction stream by a general purpose processor (crossing into OoO and
>ILP in general), or that which occurs in a specialized processor like a
>GPU, or maybe even dataflow in general.

Not unless you have a large number of identical problems you wish to
feed through your dataflow machine, but maybe Benjamin has something
else in mind.

<snip>

>
>> You
>> don't think much can be done at runtime to identify and exploit
>> thread-level parallelism. It's not the first way I'd think of doing
>> business, but I wouldn't be prepared to predict that we'll never have
>> a hardware system that can chomp a dusty deck and run it effectively
>> on a processor that consists of many processors.
>
>I think it's a sad statement. First we make compilers so they don't
>really spit out the same program we put in, in some sense. Then we make
>machines so they don't even run the same program we give them. If a
>hardware system like you mention is ever created, I for one would not
>consider it progress. It is apparently motivated by, and perpetuates,
>this stereotype that multi-threaded code is too complex for programmers
>to handle. I agree that it is if threads persist through interactions
>with other threads, but with proper programming methodologies, that
>doesn't have to be so, and threadlet programming becomes a natural part
>of software engineering. In other words, it would make much more sense
>to convert that dusty deck at the software level, not only for the
>benefit of the machine, but for the benefit of the people maintaining
>and debugging it.
>

I think I've just been accused of approving of a lowering of
expectations with respect to what is to be expected from programmers.
The problem with software is that the quality of the software is no
better than the skills of the least competent programmer to work on a
given program.

BAH isn't here to slam me again for comparing the rigor of software
engineering (if, indeed, there is such a discipline) unfavorably to
the rigor of hardware engineering, but it is what it is.

RM

Robert Myers

unread,

Sep 23, 2003, 9:42:42 PM9/23/03

to

On Tue, 23 Sep 2003 20:12:43 GMT, Thomas Lindgren
<***********@*****.***> wrote:

>
>Robert Myers <rmy...@rustuck.com> writes:
>
>> The standard way of avoiding combinatoric explosion in optimizing
>> compilers is to make arbitrary guesses about what code transformations
>> will be helpful and then to make even more arbitrary decisions about
>> whether they are helpful or not by seeing how they do on some standard
>> set of tests. There are so many things wrong with that approach that
>> I don't even know where to begin.
>
>I feel your pain, I really do.
>
><http://www.google.com/groups?q=group:comp.arch+author:Thomas+author:Lindgren&start=10&hl=en&lr=&ie=UTF-8&oe=UTF-8&selm=lzn0samsgt.fsf%40no-longer-at.bluetail.com&rnum=19>
>

It's hard to believe the same observation hasn't been made over and
over again.

It's a big problem for open source, the linchpin of which is gcc. One
compiler, itself compiled from source, is supposed to work for an
almost endless list of target platforms. While you don't *have* to
use gcc, using it is as close to a guarantee of portability as you can
get.

There is no generally recognized overarching design strategy, an
excess of possible figures of merit, and things seem to get broken
rather easily. This is not the place for an airing of these issues,
but a processor without a compiler that can exploit its wonderful
capabilities is of limited usefulness.

RM

Terje Mathisen

unread,

Sep 24, 2003, 1:58:29 AM9/24/03

to

Nick Maclaren wrote:
> Robert Myers <rmy...@rustuck.com> writes:
> |> You don't need to guarantee that two objects will never alias one
> |> another. Based on static analysis or profiling information, you guess
> |> whether two objects will alias one another or not. If you guess that
> |> they won't and you guess wrong, you fix your mistake. It sounds
> |> complicated, and it is, but that's what the game is about.
>
> Right. Techniques like the ALAT, Alpha's load conditional and even
> handling IEEE denormalised numbers by interrupt make it practical to
> use unreliable information - but at a cost! If you get it more than
> slightly wrong, the expense of recovering from the mistake can be
> larger than the expense of not optimising.

I'm actually in Intel's camp here:

Real aliasing problems are _really_ rare in my experience, it is just
that when they do occur, they can be hard to debug/fix.

For a mostly-working program, where almost all operations are
non-aliasing, the cost of making an occasional bad guess, discovering it
via the ALAT, and then branching to a fixup snippet, sounds very workable.

OTOH, I also very much fear for bugs within those almost-never executed
pieces of fixup code. :-(

Terje

--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

Jan C. Vorbrüggen

unread,

Sep 24, 2003, 3:51:45 AM9/24/03

to

> Ah! I didn't know about that detail. OK, so the speed-of-light business is
> amenable to moderately clever design already. I'll stop worrying about it
> for a few years.

The problem with this approach is that you have to planarize - embed in a
two-dimensional plane - the stage interconnection graph. That restricts
what you can do.

Jan

Jan C. Vorbrüggen

unread,

Sep 24, 2003, 3:57:03 AM9/24/03

to

> I believe that the problem of deciding how
> to spin off parallel threads from a single threaded instruction stream
> and maintaining coherence between the threads is too hard to be solved
> (with reasonable efficiency) in hardware. I would be fascinated if
> someone can provide some solid counter-evidence against my belief.

What exactly do you mean by "maintaining coherence between threads"?
Depending on the answer, the CSP/transputer/occam2 approach might be
applicable, or not.

Jan

Nick Maclaren

unread,

Sep 24, 2003, 4:00:52 AM9/24/03

to

In article <bkrbq6$avd$1...@osl016lin.hda.hydro.com>,

Terje Mathisen <terje.m...@hda.hydro.com> writes:
|> Nick Maclaren wrote:
|> > Robert Myers <rmy...@rustuck.com> writes:
|> > |> You don't need to guarantee that two objects will never alias one
|> > |> another. Based on static analysis or profiling information, you guess
|> > |> whether two objects will alias one another or not. If you guess that
|> > |> they won't and you guess wrong, you fix your mistake. It sounds
|> > |> complicated, and it is, but that's what the game is about.
|> >
|> > Right. Techniques like the ALAT, Alpha's load conditional and even
|> > handling IEEE denormalised numbers by interrupt make it practical to
|> > use unreliable information - but at a cost! If you get it more than
|> > slightly wrong, the expense of recovering from the mistake can be
|> > larger than the expense of not optimising.
|>
|> I'm actually in Intel's camp here:
|>
|> Real aliasing problems are _really_ rare in my experience, it is just
|> that when they do occur, they can be hard to debug/fix.
|>
|> For a mostly-working program, where almost all operations are
|> non-aliasing, the cost of making an occasional bad guess, discovering it
|> via the ALAT, and then branching to a fixup snippet, sounds very workable.

Yes and no. To a great extent, that is a politician's costing (i.e.
"How likely is it I will get away with it? Fixing up failure will
be Someone Else's Problem.") rather than an engineer's (i.e. game
theory, a.k.a. cost-benefit analysis).

A low-probability, data-dependent slowdown of a factor of ten is bad
news in HPC, but catastrophic in process control (e.g. chemical plant,
nuclear reactor, aircraft control). You can't even avoid the problem
by using three computers in parallel, because they will usually all
have the problem, or none :-(

Is a 10% performance increase worth a 0.001% chance of the program
slowing down by a factor of 10 sometime within the next 10 years?
In HPC, unquestionably. In process control, I am not so sure.

|> OTOH, I also very much fear for bugs within those almost-never executed
|> pieces of fixup code. :-(

Yes. And, again, think of the cost of encountering such things. My
experience is that you are very lucky indeed if you can resolve such
a problem for man-weeks of work, and the number of people capable of
tracking such things down is reducing as we die off, retire and so on.
Very, very few people under 50 have ever done such things.

I have certainly had to recommend users to abandon complete projects
because they had such bugs that even I couldn't track down, and am
seeing an increasing number of them in vendors' code. And that is
even on RISC and x86 systems :-(

Regards,
Nick Maclaren.

Nick Maclaren

unread,

Sep 24, 2003, 4:10:48 AM9/24/03

to

In article <3F715003...@mediasec.de>, Jan C. =?iso-8859-1?Q?Vorbr=FCggen?= <jvorbr...@mediasec.de> writes:
|> > Yes. The interest in photoelectronics in silicon and hybrid materials
|> > is at least partially fuelled by the desire to be able to use light
|> > guides for long-distance (on-chip) communication.
|>
|> Why use light guides? It's usually cited as an advatage of photonic
|> communication that you can use free space, that the different paths
|> don't interact, and that you can do multicasting "for free" if you want.
|> All that goes away if you use light guides.

The use I saw was as a straight replacement for long-distance wiring,
to eliminate the RC delay, and I probably assumed light guides.
Certainly, to prevent different paths interacting, you need either
very tight and controllable frequency differentiation or coherence,
and I am under the impression that current integrated circuit
photoemission provides neither.

But this is not something that I know anything much about; all I
really know is that it is having serious money put into it, and one
of the reasons is long-distance on-chip communication.

Regards,
Nick Maclaren.

Jan C. Vorbrüggen

unread,

Sep 24, 2003, 4:15:08 AM9/24/03

to

> As discussed in a recent thread, a sufficiently expressive
> intermediate representation would probably be adequate.

As I see it, the largest problem is information loss in program
transformation - and the very first transformation that usually
looses the most information is expressing the programmer's intention
as (source) code. Note that "loss" here might also involve having
to make an arbitrary decision that precludes different solutions
that would be equally valid in the light of the programmer's intention.

In a sense, we need a "do what I mean" programming language. Good luck
with that!

Jan

Jan C. Vorbrüggen

unread,

Sep 24, 2003, 4:18:54 AM9/24/03

to

> Yes, but that only applies when you are repeatedly doing almost
> identical things! If you think that your test conditions are
> representative, and are wrong, you can be worse off than not
> optimising. I see this every day in operating systems :-(

In the OS case, I think feedback-directed optimization is usually
done manually: the vendor's OS groups receives feedback from a
(usually important 8-)) customer about pathological behaviour,
(tries to) reproduce the problem (often the most difficult step),
analyses the reasons for the behaviour, and finds a way to get
rid of it - hopefully without introducing pathological behaviour
in other places.

It's a difficult way to work, with the added complication that the
wheel is re-invented a lot in OS implementation, for a variety of
reasons (only some of which are marginally acceptable).

Jan

Terje Mathisen

unread,

Sep 24, 2003, 7:56:41 AM9/24/03

to

Nick Maclaren wrote:

> Terje Mathisen <terje.m...@hda.hydro.com> writes:
> |> For a mostly-working program, where almost all operations are
> |> non-aliasing, the cost of making an occasional bad guess, discovering it
> |> via the ALAT, and then branching to a fixup snippet, sounds very workable.
>
> Yes and no. To a great extent, that is a politician's costing (i.e.
> "How likely is it I will get away with it? Fixing up failure will
> be Someone Else's Problem.") rather than an engineer's (i.e. game
> theory, a.k.a. cost-benefit analysis).

Rather the opposite:

This was an engineer's "average runtime" argument. Sometimes (hard
realtime process control) worst case is much more important, but I can't
foresee any time in the future where IA64 is a reasonable choice for
those tasks. :-)

Nick Maclaren

unread,

Sep 24, 2003, 8:19:00 AM9/24/03

to

In article <bks0pq$njf$1...@osl016lin.hda.hydro.com>,

Hmm. I suggest that you need to do the analysis rather more
carefully :-)

Let us stick to performance, and assume that it is irrelevant in
49.99% of cases, gives a 10% improvement in 50% of cases, and
causes a 10 fold slowdown in 0.01% of cases. A simple analysis
would imply a 4.91% overall improvement.

Let us assume that the 10 fold slowdown is unacceptable in 50%
of the cases. We now have the costs of either using a different
platform for those cases or employing an extremely skilled person
for an open-ended project to find and remove the cause. In both
cases, you are adding potentially much larger costs than the mere
lost computer time.

Are you a gane theorist? If so, you will recognise this as a
classic problem with a non-linear cost function, such as is so
common in real life. I have encountered the problem of slowdowns
going over a flatly unacceptable barrier more than once - one
classic example is in weather forecasting, where a daily forecast
that takes a complete day to produce is NBG.

Regards,
Nick Maclaren.

Benjamin Ylvisaker

unread,

Sep 24, 2003, 8:47:05 AM9/24/03

to

The Big Transition that is underway in computer architecture, is that
communication is gradually becoming more expensive than computation.
This transition can be seen at the low level where circuit designers
are starting to worry more about wires than transistors and at the
high level where architects can afford to throw down bunches of ALUs,
but the communication overhead needed to keep them doing useful work
is prohibitively high.

The reason multi-threaded programs are attractive (from the
instructions per second-joule perspective) is that we can let two more
or less independent streams of computation go on with very little
communication. However, it is rare to find two threads of computation
that can go along their merry way without communicating with each
other at all. By maintaining coherence between threads I meant
providing the communication necessary to let each thread know what it
needs to know about the other thread's state at the appropriate time.
I have no idea if this is proper use of terminology.

The reason that I believe that automagically spinning concurrent
threads out of a single threaded program will never be an economically
sound implementation plan is that the amount of communication needed
to keep those threads abreast of what all the other threads are doing
in their registers and in memory is too large.

unread,

Sep 24, 2003, 11:01:16 AM9/24/03

to

In article <vgm1nvku11j8a2e65...@4ax.com>,
Robert Myers <rmy...@rustuck.com> wrote:
...

> BAH isn't here to slam me again for comparing the rigor of software

> engineering (if, indeed, there is such a discipline) ...

There is such a discipline. When I was in college I worked for a
Software Engineer, and also watched him run a large software / hardware
project. That project was completed on time and in budget. Real
Software Engineers are real Engineers (and not at all like the average
Engineering school graduate one encounters). I was impressed. He'd
started out as a Process Control EE, and gone back to school to be a
Software Engineer. OTOH, AFAICT, the Wang School taught graduate
"Software Engineering" to programmers as if it were a collection of
software to run.
____________________________________________________________________
TonyN.:' tony...@shore.net
'

Tony Nelson

unread,

Sep 24, 2003, 11:01:23 AM9/24/03

to

In article <bkrivk$jt8$1...@pegasus.csx.cam.ac.uk>,
nm...@cus.cam.ac.uk (Nick Maclaren) wrote:

> In article <bkrbq6$avd$1...@osl016lin.hda.hydro.com>,
> Terje Mathisen <terje.m...@hda.hydro.com> writes:

...

> |> For a mostly-working program, where almost all operations are
> |> non-aliasing, the cost of making an occasional bad guess, discovering it
> |> via the ALAT, and then branching to a fixup snippet, sounds very workable.

> |> OTOH, I also very much fear for bugs within those almost-never executed

> |> pieces of fixup code. :-(
>
> Yes. And, again, think of the cost of encountering such things. My
> experience is that you are very lucky indeed if you can resolve such
> a problem for man-weeks of work, and the number of people capable of
> tracking such things down is reducing as we die off, retire and so on.
> Very, very few people under 50 have ever done such things.

...

The ALAT fixup code could be a much easier problem. If it is really
rarely used, it's use could be affordably logged, so there would be a
trail to follow.
____________________________________________________________________
TonyN.:' tony...@shore.net
'

Robert Myers

unread,

Sep 24, 2003, 11:07:50 AM9/24/03

to

On Wed, 24 Sep 2003 09:07:41 -0400, Benjamin Ylvisaker
<benj...@alumni.cmu.edu> wrote:

>... As I mentioned in a previous post, it is possible to

>run certain portions of algorithms more efficiently (measured in
>instructions per second-joule) on a reconfigurable computer than a
>conventional microprocessor by factors of 1-2 orders of magnitude.
>

One to two orders of magnitude better performance is easy to believe.
What is getting hard to believe is that the concept will ever turn
into useable hardware. People recognized the potential of pipelining
and chaining a long time ago, but Japan has the vector processor and
the US is fumbling around with ever more elaborate Rube Goldberg
contraptions.

RM

Nick Maclaren

unread,

Sep 24, 2003, 11:25:57 AM9/24/03

to

In article <tonynlsn-7B690D...@news.primus.ca>,

|> The ALAT fixup code could be a much easier problem. If it is really
|> rarely used, it's use could be affordably logged, so there would be a
|> trail to follow.

I am afraid that it isn't that simple. Firstly, logging its use
turns a high overhead into a huge one, so a program that previously
suffered no impact now will be badly degraded. Secondly, merely
logging its use doesn't help enough to be worthwhile, because
typical problems will not show up until much later in the code.

For decent error detection, you really need either to log its
input and output (to check it by hand), or to run such bits of code
in the ALAT-hit and ALAT-miss paths in parallel, and check them.
This is why it is generally better to repeat the same code, slowly,
than to have alternate code.

As far as I know, there are no good tools to check the alternate
paths for consistency, which does confirm that it isn't a trivial
thing to do. And ALAT misses are not the only place where the
IA-64 has them :-(

Regards,
Nick Maclaren.

unread,

Sep 24, 2003, 2:41:50 PM9/24/03

to

Tom Gardner wrote:
> There's a perennial discussion about a related topic:
> checked exceptions (compiler insists it is caught or
> rethrown) vs unchecked exceptions (compiler allows
> programmer to ignore them). Hejlsberg (of C# and

> ..NET fame) doesn't like checked exceptions,

Wow, have so much time passed!?!

Anders Hejlsberg should be famous for Turbo (nee Compas) Pascal, the
rest is mostly a way to stay employed. :-)

> Gosling (of Java and UniPress Emacs fame) does
> like them.
>
> I'm horrified by the number of softies that prefer
> unchecked exceptions, particularly if they are also
> writing long-lived distributed server-based
> applications.

I do believe there's good reasons for having unchecked exceptions, but
only as an exceptional state, i.e. checking should be the default.

Nick Maclaren

unread,

Sep 24, 2003, 2:52:07 PM9/24/03

to

In article <bkso36$5k3$1...@osl016lin.hda.hydro.com>,

Terje Mathisen <terje.m...@hda.hydro.com> wrote:
>Benjamin Ylvisaker wrote:
>> complete system. As I mentioned in a previous post, it is possible to
>> run certain portions of algorithms more efficiently (measured in
>> instructions per second-joule) on a reconfigurable computer than a
>> conventional microprocessor by factors of 1-2 orders of magnitude.
>
>Do you have one or two examples?
>
>Please!!!

Hmm. I can provide examples where that is true, but they could equally
well be satisfied by an ISA that gave the software access to primitives
that are currently available only to hardware. I.e. I don't need the
reconfigurability - it is merely the deficiencies of current ISAs that
are my problem :-(

One is, of course, probabilistic floating-point rounding, which is a
right royal pain in the arse to have to emulate, but could be implemented
fairly efficiently in hardware.

Another is the fairly large number of applications that are dominated
by uses of the error function. This is almost equally bad to have to
implement using current ISA primitives - though it is none too easy,
even given a free hand with the hardware.

I should be EXTREMELY interested in an example where mere ISA extension
is not enough, and reconfigurability is what is needed. To repeat your
request, please!!!

Regards,
Nick Maclaren.

Nick Maclaren

unread,

Sep 24, 2003, 2:54:37 PM9/24/03

to

In article <bksohe$63t$1...@osl016lin.hda.hydro.com>,

Terje Mathisen <terje.m...@hda.hydro.com> wrote:
>
>I do believe there's good reasons for having unchecked exceptions, but
>only as an exceptional state, i.e. checking should be the default.

While I agree with you (surprise, surprise!), this attitude is close
to heretical nowadays :-(

Regards,
Nick Maclaren.

Benjamin Ylvisaker

unread,

Sep 24, 2003, 3:13:48 PM9/24/03

to

On 24 Sep 2003 18:52:07 GMT
nm...@cus.cam.ac.uk (Nick Maclaren) wrote:

http://www.ece.cmu.edu/~herman/publications/ISCA99.pdf

Though this is/was an academic project, a chip was actually fabricated
and ran at something close to the advertised speed. I was involved in
a failed attempt to commercialize this technology. I came away
convinced that the primary problem is the lack of integration with a
conventional sequential microprocessor that I refered to earlier.

Benjamin

Terje Mathisen

unread,

Sep 24, 2003, 4:21:58 PM9/24/03

to

Benjamin Ylvisaker wrote:
> nm...@cus.cam.ac.uk (Nick Maclaren) wrote:
>>In article <bkso36$5k3$1...@osl016lin.hda.hydro.com>,
>>Terje Mathisen <terje.m...@hda.hydro.com> wrote:
>>>Do you have one or two examples?
>>>
>>>Please!!!

[snip]

>>I should be EXTREMELY interested in an example where mere ISA
>>extension is not enough, and reconfigurability is what is needed.
>>To repeat your request, please!!!
>
>
> http://www.ece.cmu.edu/~herman/publications/ISCA99.pdf

Oops, another victim of 'the attack of the killer micros':

To go through the examples in the paper:

FIR filter:

This compares ~100 Mhz PipeRench (their term) and FPGA/Xilink with a 200
Mhz DSP, while forgetting that the target to beat (by one to two orders
of magnitude!) is a 2-3 GHz P4.

Custom instruction: PopCount, when used in streaming mode:

Naive bit-wise counting loop compared with binary stack of adders, while
state of the art code for this particular application is Robert Harvey's
64-bit bitslice code that counts 15 64-bit words in parallel, at an
effective speed of around a cycle/byte, by first doing bitslice
compression of the 15 input words into a vertical stack of 4 words, then
count the bits in these in parallel using a set of shift/mask/add
operations.

Their other examples seem to be in the same ballpark, i.e. comparing
tuned, small word-size, implementations against generic C code running
on a 300 Mhz Sparc.

In fact, the best of their 'real-world' examples (IDEA as part of PGP)
achieved just 12% speedup, which is a _long_ way from 'one to two orders
of magnitude'.

>
> Though this is/was an academic project, a chip was actually fabricated
> and ran at something close to the advertised speed. I was involved in
> a failed attempt to commercialize this technology. I came away
> convinced that the primary problem is the lack of integration with a
> conventional sequential microprocessor that I refered to earlier.

I/O is a real killer even if you manage to get those 10-100x speedups. :-(

David Gay

unread,

Sep 24, 2003, 4:23:33 PM9/24/03

to

Tom Gardner <tomDELET...@logicacmg.com> writes:
> nm...@cus.cam.ac.uk (Nick Maclaren) wrote in
> news:bks5bp$5eq$1...@pegasus.csx.cam.ac.uk:

> > With these, I agree with you. I am perpetually at loggerheads
> > with standards bodies and others on the merits of mandatory
> > warnings - experience with using compilers is that they can help
> > the programmer a great deal.
>
> There's a perennial discussion about a related topic:
> checked exceptions (compiler insists it is caught or
> rethrown) vs unchecked exceptions (compiler allows
> programmer to ignore them). Hejlsberg (of C# and
> .NET fame) doesn't like checked exceptions,
> Gosling (of Java and UniPress Emacs fame) does
> like them.

I had a quick glance through the C# arguments for unchecked exceptions, and
they made no sense to me... Especially as the Java language allows unchecked
exceptions if you make them subclasses of RuntimeException or Error) - I
do agree that requiring catching exceptions that you believe can't happen
would rapidly become burdeonsome.

--
David Gay
dg...@acm.org

Tom Gardner

unread,

Sep 24, 2003, 5:22:00 PM9/24/03

to

Benjamin Ylvisaker

unread,

Sep 24, 2003, 5:55:51 PM9/24/03

to

On Wed, 24 Sep 2003 22:21:58 +0200
Terje Mathisen <terje.m...@hda.hydro.com> wrote:

> Benjamin Ylvisaker wrote:
> > nm...@cus.cam.ac.uk (Nick Maclaren) wrote:
> >>In article <bkso36$5k3$1...@osl016lin.hda.hydro.com>,
> >>Terje Mathisen <terje.m...@hda.hydro.com> wrote:
> >>>Do you have one or two examples?
> >>>
> >>>Please!!!
> [snip]
> >>I should be EXTREMELY interested in an example where mere ISA
> >>extension is not enough, and reconfigurability is what is needed.
> >>To repeat your request, please!!!
> >
> >
> > http://www.ece.cmu.edu/~herman/publications/ISCA99.pdf
>
> Oops, another victim of 'the attack of the killer micros':
>
> To go through the examples in the paper:
>
> FIR filter:
>
> This compares ~100 Mhz PipeRench (their term) and FPGA/Xilink with a
> 200 Mhz DSP, while forgetting that the target to beat (by one to two
> orders of magnitude!) is a 2-3 GHz P4.

Well, you need to keep in mind that the paper was published in 1999
and written in 1998 about a chip that was dreamed up in 1997. So
comparing against a 2-3 GHz P4 would have been challenging.

> I/O is a real killer even if you manage to get those 10-100x
> speedups. :-(

I/O is certainly a big problem, and something that we have focused on
quite a lot more in our current research.

Benjamin

Ronald H. Nicholson, Jr.

unread,

Sep 24, 2003, 9:04:50 PM9/24/03

to

In article <3F714D11...@mediasec.de>,
Jan C. =?iso-8859-1?Q?Vorbr=FCggen?= <jvorbr...@mediasec.de> wrote:
>The problem with this approach is that you have to planarize - embed in a
>two-dimensional plane - the stage interconnection graph. That restricts
>what you can do.

Maybe Seymore wasn't so crazy after all, shaving down and stacking up
those chips for the Cray 4...

The original statement of Moore's law (as opposed to Not-Moore's law)
was something about the number of transistors per "device" over time.
If the devices becomes a three dimensional constructs, for transistors
as well as wires, then you can get a lot of doublings before the
transistor stack height gets anywhere near the current horizontal
density.

Now the problem becomes how do you keep these hot cubes from melting?
At some power density, even immersion of diamond film laminates in
liquid nitrogen won't be enough.

IMHO. YMMV.
--
Ron Nicholson rhn AT nicholson DOT com http://www.nicholson.com/rhn/
#include <canonical.disclaimer> // only my own opinions, etc.

Andy Glew

unread,

Sep 25, 2003, 12:37:58 AM9/25/03

to

"Robert Myers" <rmy...@rustuck.com> wrote in message
> I brought up this whole issue in another thread earlier this summer,
> and another poster who was kind enough to take the time to educate me
> privately basically said that the wire delay problem has been looming
> for a long time and that circuit designers keep coming up with tricks
> to circumvent it.

I'm not scared of wire delay when all it means
is that it takes lots of clocks to cross a chip.

I get a bit scared of wire delay when it means the number
of transistors that can be reached in same scaled time,
such as fastest switching time tau, declines.
Mead Conway predicted this years ago.
It will eventually happen.

After this I get worried when a 16 entry 32 bit
RAM takes more than 1 clock.

But I only really get worried when a carry save
adder bypassed back to itself takes more than 1 clock.

But then - life gets really fun and interesting.
More stuff for architects to work with.

Dynamic execution microarchitectures have more stuff
to work with the more variability there is in execution
time.

Terje Mathisen

unread,

Sep 25, 2003, 2:08:18 AM9/25/03

to

Benjamin Ylvisaker wrote:

> Terje Mathisen <terje.m...@hda.hydro.com> wrote:
>>Oops, another victim of 'the attack of the killer micros':
>>
>>To go through the examples in the paper:
>>
>>FIR filter:
>>
>>This compares ~100 Mhz PipeRench (their term) and FPGA/Xilink with a
>>200 Mhz DSP, while forgetting that the target to beat (by one to two
>>orders of magnitude!) is a 2-3 GHz P4.
>
> Well, you need to keep in mind that the paper was published in 1999
> and written in 1998 about a chip that was dreamed up in 1997. So
> comparing against a 2-3 GHz P4 would have been challenging.

The problem is still very valid though:

The example (popcount) code I mentioned was written with an ~1999 Alpha
in mind, it still ran at the stated speed (about a cycle+ per byte), and
any kind of reconfigurable logic would be hard put to beat it. I won't
even talk about the likelyhood of '1-2 orders of magnitude'.

Besides, the current target to beat for applications like that is a SIMD
engine (MMX/SSE) doing 8 to 16 such operations in parallel on 64 to
128-bit short vector registers.

>>I/O is a real killer even if you manage to get those 10-100x
>>speedups. :-(
>
>
> I/O is certainly a big problem, and something that we have focused on
> quite a lot more in our current research.

I'm looking forward to see whatever you do come up with, even if I seem
to have turned somewhat cynical in my more 'mature' years, I'd love to
see a few more real revolutions. :-)

Sander Vesik

unread,

Sep 25, 2003, 8:42:05 AM9/25/03

to

John Dallman <j...@cix.co.uk> wrote:
> In article <bkk9c1$qdh$1...@pegasus.csx.cam.ac.uk>, nm...@cus.cam.ac.uk (Nick
> Maclaren) wrote:
>
>> Moore's Law is often misquoted to say that CPU performance doubles
>> every 1.5-2 years. Let's call that Not-Moore's Law, because it has
>> been close to true for a long while, and a lot of people are relying
>> on it. It has essentially been provided entirely by process sizes
>> dropping (i.e. the REAL Moore's Law), both by allowing higher clock
>> speeds and by allowing the use of more and fancier logic. Note that
>> I am talking about the serial performance here.
>
> A separate problem, which isn't so very far away, is speed-of-light-
> in-silicon. Take a McKinley, at just over 400mm^2. So, it's about 20mm
> from side to side, and that's equivalent to 15GHz in vacuum, or 10GHz at
> best in silicon.

Well... Why is that a problem? Surely you don't need more than say just
a couple of pipeline stages to be within the propagation delay of one cycle?
That is - surely one can uild circuits that are not limited by prpagation
of one cycle time throughthe pipeline but by clock skew over the length of
the pipeline and the lock cycles that are "inside" the pipeline at any one
moment of time?

And its not asif anybody was doing recticle limited chips in 130 or 90nm...

>
> Me, I reckon we're going to see a bifurcation in processor designs.
> Multiple cores per chip is a current fashion. It started to get
> established with Power4, and Sun and HP have made announcements.
> Announcements at IDF included true dual-core Xeons (still with
> HyperThreading, for 4 virtual processors) and 16 IA-64 cores in
> Tanglewood. Yes, they'll require gigantic bus bandwidths. Bandwidth can be
> obtained with money. Latency is harder.
>

But are these really being done because of "the propagation during
one clock cycle" reasons or because the amount of on-die cache has
hit the limit of dimishing returns and including multiple cores is
simple the easiest way to make sure transistors aren't going to waste,
that also only needs limited redesign?

>
> ---
> John Dallman j...@cix.co.uk
> "Any sufficiently advanced technology is indistinguishable from a
> well-rigged demo"

--
Sander

+++ Out of cheese error +++

Sander Vesik

unread,

Sep 25, 2003, 8:50:33 AM9/25/03

to

Benjamin Ylvisaker <benj...@alumni.cmu.edu> wrote:
>
> We are definitely seeing the beginning of the end of Not-Moore's law
> for conventional microprocessor designs. A clear analysis:
> http://citeseer.nj.nec.com/agarwal00clock.html

Problem - if anything, computer speed has been increasing faster during
the past 3 years than before.

>
> The authors basically say that we are beginning to see performance
> scale sub-linearly with transistor size decreases, whereas it has
> scaled super-linearly for the entire history of the microprocessor up
> to this point. The basic problem is that architects are running out
> of useful things for all those little transistors to do. It is clear

I don't see any evidence of that. There is evidence that in a down market,
people are doing some simple calculations on performance delivered per
machine and going "hey! lets just duplicate the core!", but that doesn't
mean there isn't a use for the transistors in a non-parralel design.

> that highly parallel architectures will be needed in order to take
> advantage of all the circuit space that will be available 5, 10, 20
> years from now. Exactly how these "highly parallel architectures"
> will work is another debate entirely (I have my bets).
>
> Cheers,
> Benjamin
>
> [-- application/pgp-signature, encoding 7bit, 8 lines --]

Benjamin Ylvisaker

unread,

Sep 25, 2003, 8:51:59 AM9/25/03

to

On Thu, 25 Sep 2003 08:08:18 +0200
Terje Mathisen <terje.m...@hda.hydro.com> wrote:

> Benjamin Ylvisaker wrote:
> > Terje Mathisen <terje.m...@hda.hydro.com> wrote:
> >>Oops, another victim of 'the attack of the killer micros':
> >>
> >>To go through the examples in the paper:
> >>
> >>FIR filter:
> >>
> >>This compares ~100 Mhz PipeRench (their term) and FPGA/Xilink with
> >>a 200 Mhz DSP, while forgetting that the target to beat (by one to
> >>two orders of magnitude!) is a 2-3 GHz P4.
> >
> > Well, you need to keep in mind that the paper was published in
> > 1999 and written in 1998 about a chip that was dreamed up in 1997.
> > So comparing against a 2-3 GHz P4 would have been challenging.
>
> The problem is still very valid though:
>
> The example (popcount) code I mentioned was written with an ~1999
> Alpha in mind, it still ran at the stated speed (about a cycle+ per
> byte), and any kind of reconfigurable logic would be hard put to
> beat it. I won't even talk about the likelyhood of '1-2 orders of
> magnitude'.

You may be right. Since I have not personally done much of this sort
of benchmarking I'll concede for now. Just two more points you may
want to think about: My original claim was 1-2 orders of magnitude
measured in instructions per second-joule. It's a lot easier for
reconfigurable computers to beat traditional microprocessors (at the
algorithms that RC's do well) when power consumption is factored in.
An important reason is that RC's move data from one operator to
another nearly directly, through a very small number of relatively
small muxes. Conventional microprocessors move data from one operator
to another at best through a forwarding network, more likely through a
register file, and yet more likely (if the data set we're interested
in is large enough) through a hierarchy of caches. All of those
storage elements take a heavy toll on the power budget.

Also, we think it should be relatively easy for a compiler to produce
"good enough" code for the kind of architectures we are investigating.
We have already worked out some of the core algorithms necessary. My
understanding is that it is very hard for compilers to produce good
(or perhaps any) MMX/SSE/AltiVec/3DNow! code.

> I'm looking forward to see whatever you do come up with, even if I
> seem to have turned somewhat cynical in my more 'mature' years, I'd
> love to see a few more real revolutions. :-)

I'm doing my damnedest. (:

Benjamin

Sander Vesik

unread,

Sep 25, 2003, 9:02:10 AM9/25/03

to

Robert Myers <rmy...@rustuck.com> wrote:
>
> I've confused the issue by switching back and forth between talking
> about what happens with a proprietary compiler like icc and what could
> be done with an open source compiler. If I had limitless time, I
> might spend some of it trying to reverse engineer icc, at least for
> simple situations, but I don't have limitless time.
>
> One of the advantages of an open-source compiler is that things like
> feedback-directed optimization mean what you think they should mean,
> rather than what the owner of a proprietary compiler thinks they
> should mean. :-).

And you are claiming that this will take any less limitless time than
the previous? If you throw enough money at gcc you can make it do what
you want - if you throw enough money at Intel, they'll make icc do what
you want. Its not entirely clear the first will necessarily be cheaper.

>
> RM

Sander Vesik

unread,

Sep 25, 2003, 9:23:51 AM9/25/03

to

Jan C. Vorbr?ggen <jvorbr...@mediasec.de> wrote:
>> As discussed in a recent thread, a sufficiently expressive
>> intermediate representation would probably be adequate.
>
> As I see it, the largest problem is information loss in program
> transformation - and the very first transformation that usually
> looses the most information is expressing the programmer's intention
> as (source) code. Note that "loss" here might also involve having
> to make an arbitrary decision that precludes different solutions
> that would be equally valid in the light of the programmer's intention.
>
> In a sense, we need a "do what I mean" programming language. Good luck
> with that!

you can limit some of the source->binary losses by doing as many transofrms
as possible as high as possible - whetever as source to source transforms
or at least (this doesn't reduce the importance of previous) using a high
level intermediate language (ANDF or similar). For languages like C,
source to source transforms are hard - for languages like scheme, they are
simple.

While do_what_I_mean would be good, we can reduce present losses, both
in software already written and in software to be written.

>
> Jan

Sander Vesik

unread,

Sep 25, 2003, 9:26:48 AM9/25/03

to

Tony Nelson <tony...@shore.net> wrote:
> In article <3F71528C...@mediasec.de>,
> Jan C. Vorbruggen <jvorbr...@mediasec.de> wrote:
>
>> > As discussed in a recent thread, a sufficiently expressive
>> > intermediate representation would probably be adequate.
>>
>> As I see it, the largest problem is information loss in program
>> transformation - and the very first transformation that usually
>> looses the most information is expressing the programmer's intention
>> as (source) code. Note that "loss" here might also involve having
>> to make an arbitrary decision that precludes different solutions
>> that would be equally valid in the light of the programmer's intention.
>>
>> In a sense, we need a "do what I mean" programming language. Good luck
>> with that!
>
> Something much simpler and easier would work. When compiler writers
> notice that an optimization is hard or iffy because there isn't enough
> information available, they should add something to the language to
> allow that information to be supplied. In C/C++, it can be done with
> #pragmas, and, if a particular bit of information becomes widely used,
> possibly added to the language itself. There's no "do what I mean",
> instead there's "you need to know this". This doesn't require an
> Artificial Intelligence in the compiler, which is a good thing, as AI
> can't read minds the way humans, and to a lesser extent, other animals
> do.
>

Using #pragma-s will probably get you to 'do what I don't mean or want at
all' very easily. Compare the use of say OpenMP to the use of pthreads.

Tim Olson

unread,

Sep 25, 2003, 9:35:21 AM9/25/03

to

In article <bku0oi$qjg$1...@osl016lin.hda.hydro.com>,
Terje Mathisen <terje.m...@hda.hydro.com> wrote:

| The example (popcount) code I mentioned was written with an ~1999 Alpha
| in mind, it still ran at the stated speed (about a cycle+ per byte), and
| any kind of reconfigurable logic would be hard put to beat it. I won't
| even talk about the likelyhood of '1-2 orders of magnitude'.
|
| Besides, the current target to beat for applications like that is a SIMD
| engine (MMX/SSE) doing 8 to 16 such operations in parallel on 64 to
| 128-bit short vector registers.

The Intrinsity FastMATH processor has a vector popcount instruction
which performs 64 parallel byte popcount operations on a 512-bit wide
vector/matrix register in a single (0.5ns) cycle.

-- Tim Olson

Bengt Larsson

unread,

Sep 25, 2003, 9:30:31 AM9/25/03

to

nm...@cus.cam.ac.uk (Nick Maclaren) wrote:

>As far as I know, there are no good tools to check the alternate
>paths for consistency, which does confirm that it isn't a trivial
>thing to do. And ALAT misses are not the only place where the
>IA-64 has them :-(

It would be useful to be able turn off the ALAT in such a way that a
load-with-check always misses in order to run debugging. Ie, all the
fixup code would run then.

Terje Mathisen

unread,

Sep 25, 2003, 10:51:48 AM9/25/03

to

Tim Olson wrote:

This is still a trivial operation, it is only when done as part of a
serious amount of crunching that speeds like this matter.

I.e. as long as a generic cpu can saturate your memory bandwidth, the
actual cost of running those instructions is just the somewhat higher
energy use. If you must have the general cpu anyway, and that cpu will
either be idle (while waiting for your reconfigurable hw to do its
magic), or run optimized versions of the same algorithm, it becomes very
hard to beat it.

AS I said to Benjamin, I'd love to see stuff like this take off, but I'm
not holding my breath.

I see more of a chance in riding the coattails of the GPU revolution,
using graphics processors to run more or less general algorithms.

Jan C. Vorbrüggen

unread,

Sep 25, 2003, 10:54:54 AM9/25/03

to

> you can limit some of the source->binary losses by doing as many transofrms
> as possible as high as possible - whetever as source to source transforms
> or at least (this doesn't reduce the importance of previous) using a high
> level intermediate language (ANDF or similar). For languages like C,
> source to source transforms are hard - for languages like scheme, they are
> simple.

As I said, a large - the largest? - loss already occurs in the transformation
from mental process to "source". That is, is there a language in which I can
say, "please convolve this image with this set of filters, and give me the
magnitudes of the filter responses per pixel" - with the implication that I
don't care how you implement it? Languages such as IDL/PV-WAVE are quite near,
but not well known for performance...

Jan

Robert Myers

unread,

Sep 25, 2003, 11:22:58 AM9/25/03

to

As things stand right now, a very smart human can almost always beat a
compiler. I don't think that any plausible amount of money thrown at
any undertaking is going to change that reality in the forseeable
future.

While I have at other places in this thread talked about gcc, in this
particular instance I referred only to an open-source compiler and not
specifically to gcc. For better or worse, gcc is a production tool.
There are other open-source compilers available that are much better
suited to being used as research tools.

Many claims have been made about VLIW and about Itanium, and I am at
least skeptical about many of them. As another of my posts in this
thread shows, I don't even agree with claims about what previous
research has shown.

I am an applied scientist and not an academic, but my experience has
taught me that if you want to learn how to do something properly, it
never pays to build on someone else's muck, even if that's what
everyone else is doing.

Some people will happily fiddle and tinker with the mysterious
optimization flags of a proprietary compiler, trying to get it to do
what they want. I would prefer to work with a compiler with no
mysteries.

Fruitful line of inquiry or quixotic waste of time? Only people
funding government research grants think you should know the answers
to such questions ahead of time, and if you do know the answers, what
you are doing is probably not research.

RM

Greg Lindahl

unread,

Sep 25, 2003, 2:34:57 PM9/25/03

to

In article <20030925085159.7...@alumni.cmu.edu>,
Benjamin Ylvisaker <benj...@alumni.cmu.edu> wrote:

>Also, we think it should be relatively easy for a compiler to produce
>"good enough" code for the kind of architectures we are investigating.
>We have already worked out some of the core algorithms necessary. My
>understanding is that it is very hard for compilers to produce good
>(or perhaps any) MMX/SSE/AltiVec/3DNow! code.

Vectorizing compilers have exited for decades. In the current era,
it's considered fairly easy to generate SIMD code for fp programs, and
Intel's compiler generates some integer SSE2 instructions for the
SPECint benchmark suite.

greg

Kral Stefan

unread,

Sep 25, 2003, 3:30:17 PM9/25/03

to

Greg Lindahl (lin...@pbm.com) wrote:
: In article <20030925085159.7...@alumni.cmu.edu>,
: Benjamin Ylvisaker <benj...@alumni.cmu.edu> wrote:
: >[...] My understanding is that it is very hard for compilers

: >to produce good (or perhaps any) MMX/SSE/AltiVec/3DNow! code.

Basically, I agree. (Although there are some codes that may be
vectorizable with little or almost no effort.)

: Vectorizing compilers have existed for decades. In the current

: era, it's considered fairly easy to generate SIMD code for fp
: programs, and Intel's compiler generates some integer SSE2
: instructions for the SPECint benchmark suite.

I think that some programs can easily be automatically vectorized,
while vectorizing others poses a much tougher challange.

Some time ago, our group tested if FFTs could be vectorized by
using C compilers (Intel C compiler with vectorization support,
Codeplay Vector C, and others). Unfortunately, none of the
compilers we used was able to do a good job in reasonable time.
A similar problem exists for vectorizing code for digital signal
transform algorithms (DCTs, DSTs, etc).

The bottom line of that story is: We had to write our own
vectorizing compiler (for FFTs) to do the job (FFTW-GEL,
available at URL http://www.complang.tuwien.ac.at/skral/fftwgel.html)
and I know that some other projects (e.g., SPIRAL at CMU) have
chosen a similar way...

Regards,
Stefan

--
Stefan Kral http://www.complang.tuwien.ac.at/skral/

Greg Lindahl

unread,

Sep 25, 2003, 4:38:05 PM9/25/03

to

In article <bkvfo9$md2$1...@news.tuwien.ac.at>,
Kral Stefan <sk...@mips.complang.tuwien.ac.at> wrote:

>The bottom line of that story is: We had to write our own
>vectorizing compiler (for FFTs) to do the job (FFTW-GEL,
>available at URL http://www.complang.tuwien.ac.at/skral/fftwgel.html)
>and I know that some other projects (e.g., SPIRAL at CMU) have
>chosen a similar way...

Which is no surprise, because FFTW (and similar systems such as ATLAS)
use a large number of unusual algorithms to speed computations at
various sizes. I can't see why you would ever expect a compiler to be
able to do that automatically; compilers don't invent new
cache-blocking algorithms, they only apply them.

I think the problem described in the thread was much simpler:
generating SIMD code for the algorithm actually present in the user
program.

-- greg

Benjamin Ylvisaker

unread,

Sep 25, 2003, 5:19:49 PM9/25/03

to

From H&P 3rd Edition p. 127:
-----------------------------------------------------------------
Compiler Support (or Lack Thereof) for Multimedia Instructions

Alas, the designers of the SIMD instructions that operate on several
narrow data items in a single clock cycle consciously ignored the
previous subsection. These instructions tend to be solutions, not
primitives; they are short of registers; and the data types do not
match existing programming languages. Architects hoped to find an
inexpensive solution that would help some users, but in reality, only
a few low-level graphics library routines use them.

The SIMD instructions are really an abbreviated version of an elegant
architecture style that has its own compiler technology. As explained
in Appendix G, vector architectures operate on vectors of data.
Invented originally for scientific codes, multimedia kernels are often
vectorizable as well. Hence, we can think of Intel's MMX or PowerPC's
AltiVec as simply short vector computers: MMX with vectors of eight
8-bit elements, four 16-bit elements, or two 32-bit elements, and
AltiVec with vectors twice that length. They are implemented as
simply adjacent, narrow elements in wide registers.

These abbreviated architectures build the vector register size into
the architecture: the sum of the sizes of the elements is limited to
64 bits for MMX and 128 bits for AltiVec. When Intel decided to
expand to 128-bit vectors, it added a whole new set of instructions,
called Streaming SIMD Extension (SSE).

The missing elegance from these architectures involves the
specification of the vector length and the memory addressing modes.
By making the vector width variable, these vectors seamlessly switch
between different data widths simply by increasing the number of
elements per vector. For example, vectors could have, say, 32 64-bit
elements, 64 32-bit elements, 128 16-bit elements and 256 8-bit
elements. Another advantage is that the number of elements per vector
register can vary between generations while remaining binary
compatible. One generation might have 32 64-bit elements per vector
register, and the next have 64 64-bit elements. (The number of
elements per register is located in a status register.) The number of
elements executed per clock cycle is also implementation dependent,
and all run the same binary code. Thus, one generation might operate
at 64 bits per clock cycle, and another at 256 bits per clock cycle.

...

-----------------------------------------------------------------

It seems to me that you are in disagreement with H&P. Who is wrong
and/or misunderstanding something?

Benjamin

Greg Lindahl

unread,

Sep 25, 2003, 6:30:49 PM9/25/03

to

In article <20030925171949.6...@alumni.cmu.edu>,
Benjamin Ylvisaker <benj...@alumni.cmu.edu> wrote:

>It seems to me that you are in disagreement with H&P. Who is wrong
>and/or misunderstanding something?

I don't see any disagreement. H&P say that the design of current SIMD
instructions is lame and stupid; I said that current compilers can
generate code for them in some cases, using known technology from the
vector days. No disagreement. I happen to agree with their analysis,
and they would probably agree with my factoid.

-- greg

Sander Vesik

unread,

Sep 25, 2003, 6:56:19 PM9/25/03

to

Jan C. Vorbr?ggen <jvorbr...@mediasec.de> wrote:

>> you can limit some of the source->binary losses by doing as many transofrms
>> as possible as high as possible - whetever as source to source transforms
>> or at least (this doesn't reduce the importance of previous) using a high
>> level intermediate language (ANDF or similar). For languages like C,
>> source to source transforms are hard - for languages like scheme, they are
>> simple.
>
> As I said, a large - the largest? - loss already occurs in the transformation
> from mental process to "source". That is, is there a language in which I can

Sure. We are in agreement on that part.

> say, "please convolve this image with this set of filters, and give me the
> magnitudes of the filter responses per pixel" - with the implication that I
> don't care how you implement it? Languages such as IDL/PV-WAVE are quite near,
> but not well known for performance...
>

UNless I'm mistaken PV-WAVE deals with only the visualisation domain?

One way for such is to separate the semantics and incarnation of software -
you first specify the vocabulary and grammar to be used in defining a piece
of software (its not important what the piece is - library, program or part
thereof) - while not specifying a binding to implementation for any such piece
and then at a later time apply a binding to go from abstract description
to a concrete piece.

(define-syntax convolve
(syntax-rules (per-pixel-filter-responses ...)
...
((convolve image per-pixel-filter-responses (filterlist))
(per-pixel-filter-response-colvolver image (filterlist)))
...
))

It is opaq as to what per-pixel-filter-response-colvolver does or is -
it might be a function or it might be another syntax definition. Its opaque -
you essentialy define your own language and to an extenrt the level of
abstraction and how things are specified is up to you. And you can get
efficent translation to C or assembler.

Peter Boyle

unread,

Sep 25, 2003, 8:18:55 PM9/25/03

to

On Thu, 25 Sep 2003, Greg Lindahl wrote:

> In article <20030925085159.7...@alumni.cmu.edu>,
> Benjamin Ylvisaker <benj...@alumni.cmu.edu> wrote:
>
> >Also, we think it should be relatively easy for a compiler to produce
> >"good enough" code for the kind of architectures we are investigating.
> >We have already worked out some of the core algorithms necessary. My
> >understanding is that it is very hard for compilers to produce good
> >(or perhaps any) MMX/SSE/AltiVec/3DNow! code.
>
> Vectorizing compilers have exited for decades. In the current era,

^^^^^^

Was that deliberate?
Given the history of the vector market it's quite appropriate.
Peter

Robert Myers

unread,

Sep 25, 2003, 9:11:23 PM9/25/03

to

On Thu, 25 Sep 2003 22:56:19 +0000 (UTC), Sander Vesik
<san...@haldjas.folklore.ee> wrote:

<snip>

>
>It is opaq as to what per-pixel-filter-response-colvolver does or is -
>it might be a function or it might be another syntax definition. Its opaque -
>you essentialy define your own language and to an extenrt the level of
>abstraction and how things are specified is up to you. And you can get
>efficent translation to C or assembler.
>

But you are trivializing the hard part of the problem, which is to
exploit knowledge of program structure and programmer intent in
translating software into executable machine code.

Arbitrary abstractions are a dime a dozen. Abstractions that map
efficiently and naturally onto hardware are another story.

RM

Sander Vesik

unread,

Sep 25, 2003, 11:42:52 PM9/25/03

to

Robert Myers <rmy...@rustuck.com> wrote:
> On Thu, 25 Sep 2003 22:56:19 +0000 (UTC), Sander Vesik
> <san...@haldjas.folklore.ee> wrote:
>
> <snip>
>>
>>It is opaq as to what per-pixel-filter-response-colvolver does or is -
>>it might be a function or it might be another syntax definition. Its opaque -
>>you essentialy define your own language and to an extenrt the level of
>>abstraction and how things are specified is up to you. And you can get
>>efficent translation to C or assembler.
>>
> But you are trivializing the hard part of the problem, which is to
> exploit knowledge of program structure and programmer intent in
> translating software into executable machine code.

No, not at all - in the statement, there was only a limited amount of
exploitable information (at least until you get down to unstated
specifics). If you were dealing with a staticly typed language you could
put in a tiny bit more information - but not that much.

Sure, you will have to some level define the context sensitive meaning
for the verbs - unles sof course "convolve" has a predifined menaing in
the base.

>
> Arbitrary abstractions are a dime a dozen. Abstractions that map
> efficiently and naturally onto hardware are another story.

Which hardware? A dsp? Superscalar OoO processor? a SMP? A NUMA?
A dataflow machine? A cluster of machines? A grid? A seti@home style
distributed set of unreliable computers?

Any particular reason you want to *start* with ruling some of these
out? This is supposed to be "do what I mean" - why do you care about
hardware specifics (yet)?

>
> RM

Robert Myers

unread,

Sep 26, 2003, 1:45:37 AM9/26/03

to

On Fri, 26 Sep 2003 03:42:52 +0000 (UTC), Sander Vesik
<san...@haldjas.folklore.ee> wrote:

>Robert Myers <rmy...@rustuck.com> wrote:
>> On Thu, 25 Sep 2003 22:56:19 +0000 (UTC), Sander Vesik
>> <san...@haldjas.folklore.ee> wrote:
>>
>> <snip>
>>>
>>>It is opaq as to what per-pixel-filter-response-colvolver does or is -
>>>it might be a function or it might be another syntax definition. Its opaque -
>>>you essentialy define your own language and to an extenrt the level of
>>>abstraction and how things are specified is up to you. And you can get
>>>efficent translation to C or assembler.
>>>
>> But you are trivializing the hard part of the problem, which is to
>> exploit knowledge of program structure and programmer intent in
>> translating software into executable machine code.
>
>No, not at all - in the statement, there was only a limited amount of
>exploitable information (at least until you get down to unstated
>specifics). If you were dealing with a staticly typed language you could
>put in a tiny bit more information - but not that much.
>
>Sure, you will have to some level define the context sensitive meaning
>for the verbs - unles sof course "convolve" has a predifined menaing in
>the base.
>

Carefully-designed special-purpose languages are a good solution to
many of the problems of software development, and what you wrote is a
fine start.

>>
>> Arbitrary abstractions are a dime a dozen. Abstractions that map
>> efficiently and naturally onto hardware are another story.
>
>Which hardware? A dsp? Superscalar OoO processor? a SMP? A NUMA?
>A dataflow machine? A cluster of machines? A grid? A seti@home style
>distributed set of unreliable computers?
>
>Any particular reason you want to *start* with ruling some of these
>out? This is supposed to be "do what I mean" - why do you care about
>hardware specifics (yet)?
>

No, you absolutely do not want to rule out any specific hardware in
the software specification until you absolutely have to. It may be
that my expectations are unrealistic, but at some point you will have
to come to grips with those details, and then you will be back into
all the mess of trying to write in a "high level" language like c so
that it will somehow map onto a variety of hardware targets without
constantly being rewritten.

The "and you can get efficient translation into c or assembler" could
mean that same as "and in the end you will be left to hand code
specifically for target hardware (perhaps using ugly language adjuncts
like pragmas) and with no way to pass information about program
structure to the compiler or to the run-time hardware."

What you've proposed is fine as far as it goes. It just doesn't
propose a way of getting intent into a form that wouldn't require
constant tweaking and rewriting for specific target hardware,
something that I've spent way too much time doing, and I doubt if I am
alone.

RM

Eugene Miya

unread,

Sep 26, 2003, 2:28:31 AM9/26/03

to

In article <Pine.GSO.4.58.03...@holyrood.ed.ac.uk>,

Peter Boyle <pbo...@holyrood.ed.ac.uk> wrote:
>On Thu, 25 Sep 2003, Greg Lindahl wrote:
>> Vectorizing compilers have exited for decades. In the current era,
> ^^^^^^
>Was that deliberate?
>Given the history of the vector market it's quite appropriate.

It depends how you view that history.

Perlis Epigram 68. If we believe in data structures, we must believe in
independent (hence simultaneous) processing.
For why else would we collect items within a structure?
Why do we tolerate languages that give us the one without the other?

But then I just saved a couple of ICL DAPs, but more people heard of its
competitor.

Eugene Miya

unread,

Sep 26, 2003, 2:38:09 AM9/26/03

to

That depends on the language and the application program.

The preliminary experience of the Natl. Labs was 10% improvement for
drop in vectorization. With time, experienced application programmmers
and talented users were able to get that up to 80-90%. However,
there are warped, detail issues which will try the average programmer
like exception handling and vector branching. There are conflicting
issues here.

I would not use the term "fairly easy". A lot of people paid for Convex's
compiler technology rather than do the homework and roll their own.
You can simple code generation, but you still pay the good compiler
writers a lot of money, because it's far from open source: it's family jewels.

Nick Maclaren

unread,

Sep 26, 2003, 3:42:37 AM9/26/03

to

On the other hand, they were rude about the design of SSE/SSE2,
and did not justify it - that is sloppy.

My belief is that such designs should be justified by producing
code generation strategies for the following (and perhaps a few
other cores):

Complex arithmetic, compiler generated
The level 1 (sic) BLAS, when inlined, compiler generated
A simple, clean FFT core, compiler generated
3- and 4-D matrix multiplication, coded by hand
The BLAS functions DGEMM and ZGEMM, coded by hand
A simple, clean FFT core, coded by hand

Unless they can produce significant gains for those operations,
then the designs are clearly not worth the effort. If they can,
they they are at least potentially worthwhile.

On the matter of vectorising FFTs, my experience on vector and
pseduo-vector systems is that FFTW can be very slow indeed; it is
too clever by half, and makes quite a few unreasonable assumptions.
I was getting significant (and often large - factors of 5-10)
improvement by compiling just a simple, clean FFT with full
optimisation. This effect is not unusual.

Regards,
Nick Maclaren.