Have the Itanium critics all been proven wrong?

Mark Thorson

Aug 6, 2012, 9:05:31 PM

Here we are, over 10 years later, and Intel
is still producing new Itanium designs.
Is there any reason to believe that won't
continue for the indefinite future?

If the critics were correct, wouldn't someone
have to be nuts to join the Itanium design team?
And yet, that doesn't seem to have been a problem.
Generation after generation of new designs have
been produced, and more are on the way. (Oddly
enough, the next one is named after an infamous
hacker -- what's up with that?)

What's clear is that this wasn't a repeat of
the iAPX432 experience. That processor was
completely unsaleable. But billions of dollars
of Itanium-based equipment is sold each year,
and Itanium is profitable for Intel. As long
as that continues to be true, why would they
ever stop?

van...@vsta.org

Aug 6, 2012, 8:37:20 PM

Mark Thorson <nos...@sonic.net> wrote:
> Here we are, over 10 years later, and Intel
> is still producing new Itanium designs.
> Is there any reason to believe that won't
> continue for the indefinite future?

http://channelnomics.com/2012/08/06/hps-integrity-itanium-future-remains-unclear/

I bet nobody had to put a (figurative) gun to Oracle's head to get an AMD64
port.

--
Andy Valencia
Home page: http://www.vsta.org/andy/
To contact me: http://www.vsta.org/contact/andy.html

Doug McIntyre

Aug 6, 2012, 11:36:16 PM

Mark Thorson <nos...@sonic.net> writes:
>Here we are, over 10 years later, and Intel
>is still producing new Itanium designs.
>Is there any reason to believe that won't
>continue for the indefinite future?

>If the critics were correct, wouldn't someone
>have to be nuts to join the Itanium design team?
>And yet, that doesn't seem to have been a problem.
>Generation after generation of new designs have
>been produced, and more are on the way. (Oddly
>enough, the next one is named after an infamous
>hacker -- what's up with that?)

Because HP keeps paying them millions and millions of $$ to
keep the Itanium design process going. They can't let go of
their legacy RISC roots involved in the process.

nm...@cam.ac.uk

Aug 7, 2012, 3:35:29 AM

In article <502069DB...@sonic.net>,
Mark Thorson <nos...@sonic.net> wrote:
>
> . . .

None of those points are relevant to any of the ones that the
informed critics made, and you have omitted the 'minor' detail
that Intel's plans for the Itanic were VERY different and FAR
more grandiose. They failed. The fact that it has found a market
does not negate the fact that the project has been a near-complete
failure in terms of its original objectives.

To the best of my knowledge, nobody outside Intel and HP (and
quite possibly nobody inside, either) knows whether the Itanic
project has been a profit-maker or loss-maker overall. At one
point, Intel were gambling enough on it to risk the company,
but either got cold feet or listened to wiser heads and backed
off. And, as has been posted, their competitors gave them the
time to do so.

But, yes, it will continue being produced and even developed
while HP want it and the Intel/HP contract continues. But, if
HP go belly-up or lose interest, or that contract times out,
I wouldn't expect it to continue.

Regards,
Nick Maclaren.

MitchAlsup

Aug 7, 2012, 10:53:07 AM

to nos...@sonic.net

On Monday, August 6, 2012 8:05:31 PM UTC-5, Mark Thorson wrote:
> But billions of dollars of Itanium-based equipment is sold each year, and Itanium is profitable for Intel. As long as that continues to be true, why would they ever stop?

It is not clear if cash flow from Itanium could support a design team were it not for the largess of Intel and HP.

Mitch

Anton Ertl

Aug 7, 2012, 11:04:03 AM

MitchAlsup <Mitch...@aol.com> writes:
>It is not clear if cash flow from Itanium could support a design team were it not for the largess of Intel and HP.

I don't think largesse is the right word for the motive of a
profit-oriented company to support something that generates a loss in
the short term. If Itanium generates a loss and they still support
it, the reason is probably that they want to be seen as reliable
companies that support their products long-term; i.e., marketing.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

jacko

Aug 7, 2012, 11:43:01 AM

Shhhhhh... Itanium. Don't tell ;D

Is the court battle loss the start of an annus horribilis at Oracle?

Mark Thorson

Aug 7, 2012, 2:18:33 PM

nm...@cam.ac.uk wrote:
>
> In article <502069DB...@sonic.net>,
> Mark Thorson <nos...@sonic.net> wrote:
> >
> > . . .
>
> None of those points are relevant to any of the ones that the
> informed critics made, and you have omitted the 'minor' detail

Wasn't the main technical criticism that compilers
weren't available to exploit the parallelism of the
architecture, and hasn't that criticism been refuted?

nm...@cam.ac.uk

Aug 7, 2012, 1:56:10 PM

In article <50215BF9...@sonic.net>,

Mark Thorson <nos...@sonic.net> wrote:
>>
>> None of those points are relevant to any of the ones that the
>> informed critics made, and you have omitted the 'minor' detail
>
>Wasn't the main technical criticism that compilers
>weren't available to exploit the parallelism of the
>architecture, and hasn't that criticism been refuted?

That was probably the main criticism, yes, except that it was also
stated by Those With Clue that it was extremely unlikely that such
compilers would be developed. But, so far from being refuted,
those criticisms have been proven to be absolutely correct, not
just in principle but in detail, and the IA64 proponents have been
conclusively proven wrong.

The fact that compilers have been developed to produce efficient
execution for vectorisable codes and other such easy tasks isn't
relevant - we knew how to do that in the 1970s, and it is equally
easy for almost any architecture. The IA64 was designed as a
general-purpose architecture, and the compilers singularly fail
to deliver on the vast majority of codes - which, as we know, are
typically C++ or similarly tangled spaghetti.

As usual, the benchmarketing figures are intentionally misleading,
because the 'solution' to the dire performance was to provide
a huge amount of cache and related bandwidth, and all evidence is
that the same (or more!) benefits would show up on any other
architecture if it were treated in the same way.

Regards,
Nick Maclaren.

Quadibloc

Aug 7, 2012, 2:48:05 PM

On Aug 6, 7:05 pm, Mark Thorson <nos...@sonic.net> wrote:

> If the critics were correct, wouldn't someone
> have to be nuts to join the Itanium design team?

Hey, Intel pays good money, and jobs are hard to come by in a slow
economy.

After all, given the dominance of the x86, it's not as if there are
lots of openings on the 68080 design team, or even the PowerPC design
team or the SPARC design team.

Being on the Itanium design team looks better on a resume than
flipping burgers.

John Savard

nm...@cam.ac.uk

Aug 7, 2012, 3:09:39 PM

In article <93f95a21-1d27-4f8a...@i10g2000pbh.googlegroups.com>,
Quadibloc <jsa...@ecn.ab.ca> wrote:

>On Aug 6, 7:05 pm, Mark Thorson <nos...@sonic.net> wrote:
>
>> If the critics were correct, wouldn't someone
>> have to be nuts to join the Itanium design team?
>
>Hey, Intel pays good money, and jobs are hard to come by in a slow
>economy.

Well, yes.

>After all, given the dominance of the x86, it's not as if there are
>lots of openings on the 68080 design team, or even the PowerPC design
>team or the SPARC design team.

Don't bet on it for SPARC. Look at the Top500 list. How good is
your Japanese? :-)

Regards,
Nick Maclaren.

Mark Thorson

Aug 7, 2012, 5:23:27 PM

nm...@cam.ac.uk wrote:
>
> In article <50215BF9...@sonic.net>,
> Mark Thorson <nos...@sonic.net> wrote:
> >>
> >> None of those points are relevant to any of the ones that the
> >> informed critics made, and you have omitted the 'minor' detail
> >
> >Wasn't the main technical criticism that compilers
> >weren't available to exploit the parallelism of the
> >architecture, and hasn't that criticism been refuted?
>
> That was probably the main criticism, yes, except that it was also
> stated by Those With Clue that it was extremely unlikely that such
> compilers would be developed. But, so far from being refuted,
> those criticisms have been proven to be absolutely correct, not
> just in principle but in detail, and the IA64 proponents have been
> conclusively proven wrong.

In theory, it seems logical that Itanium could outperform
OoO x86 architectures because the compiler has information
which the hardware does not. The hardware only sees the
code emitted by the compiler, while the compiler sees the
source code. The compiler could exploit parallelism that
the hardware can't because the compiler knows more about
what optimizations can and can't be done. Are you saying
this isn't true?

What got me thinking about Itanium was these passages from
the interview with Bob Colwell:

$ This was probably about 1994 or so. The presenter happened
$ to be the same guy who was in the front of the car from when
$ I interviewed with the Santa Clara design team; same guy.
$ He's presenting and he's predicting some performance numbers
$ that looked astronomically too high to me. I did not know
$ anything about how they expected to get there, I just knew
$ what I thought was reasonable, what would be an aggressive
$ boost forward and what would be just wishful thinking.
$ The predictions being shown were in the ludicrous camp as
$ far as I could tell. So I'm sitting and staring at this
$ presentation, wondering what are they doing, how is it
$ humanly possible to get what he's promising. And if it is,
$ is it possible for this particular design team to do it.
$ I was intensely thinking about what's happening here.
$ Finally I just couldn't stand it anymore and I put my hand
$ up. There was some discussion, but you have to realize none
$ of these people were really chip designers or computer
$ architects, with the exception of Gelsinger and Dadi
$ Perlmutter.

and a little later:

$ Anyway this chip architect guy is standing up in front of
$ this group promising the moon and stars. And I finally put
$ my hand up and said I just could not see how you're proposing
$ to get to those kind of performance levels. And he said well
$ we've got a simulation, and I thought Ah, ok. That shut me up
$ for a little bit, but then something occurred to me and I
$ interrupted him again. I said, wait I am sorry to derail
$ this meeting. But how would you use a simulator if you don't
$ have a compiler? He said, well that's true we don't have a
$ compiler yet, so I hand assembled my simulations. I asked
$ "How did you do thousands of line of code that way?" He said
$ "No, I did 30 lines of code". Flabbergasted, I said, "You're
$ predicting the entire future of this architecture on 30 lines
$ of hand generated code?" [chuckle], I said it just like that,
$ I did not mean to be insulting but I was just thunderstruck.

It would be nice to know what those 30 lines of code were, and
who gave this presentation. (Anybody who posts here?) There
must have been some coherent, logical argument that the Itanium
approach would deliver these performance gains, based on these
30 lines of code. Does anybody know where I could find that?
It would be nice to compare the plan against what actually
happened.

Tom Gardner

Aug 7, 2012, 4:48:17 PM

Mark Thorson wrote:
> In theory, it seems logical that Itanium could outperform
> OoO x86 architectures because the compiler has information
> which the hardware does not. The hardware only sees the
> code emitted by the compiler, while the compiler sees the
> source code. The compiler could exploit parallelism that
> the hardware can't because the compiler knows more about
> what optimizations can and can't be done.

The hardware "knows" what the code is actually doing,
whereas the compiler has to take account of all the things
the code might conceivably do. Although the techniques are
different, the speedups described below are, to me,
surprising. N.B. "binaries running under Dynamo" effectively
means the binaries are being interpreted/emulated.

Transparent Dynamic Optimization: The Design and Implementation
of Dynamo

Abstract: Dynamic optimization refers to the runtime
optimization of a native program binary. This report
describes the design and implementation of Dynamo, a
prototype dynamic optimizer that is capable of optimizing
a native program binary at runtime. Dynamo is a realistic
implementation, not a simulation, that is written entirely
in user-level software, and runs on a PA-RISC machine
under the HPUX operating system. Dynamo does not depend
on any special programming language, compiler, operating
system or hardware support. Contrary to intuition, we
demonstrate that it is possible to use a piece of
software to improve the performance of a native,
statically optimized program binary, while it is executing.
Dynamo not only speeds up real application programs, its
performance improvement is often quite significant. For
example, the performance of many +O2 optimized SPECint95
binaries running under Dynamo is comparable to the
performance of their +O4 optimized version running without Dynamo.

http://www.hpl.hp.com/techreports/1999/HPL-1999-78.html
and
http://archive.arstechnica.com/reviews/1q00/dynamo/dynamo-1.html

Overall it sounds like the techniques in Java's HotSpot
applied to C/C++ binaries.

EricP

Aug 7, 2012, 5:01:32 PM

That is what I found interesting from that quote I posted earlier.

Itanium - A System Implementor's Tale (2005)
http://static.usenix.org/event/usenix05/tech/general/gray.html

"...and a final execution time of 36 cycles, or 24ns
on a 1.5GHz Itanium 2. This is extremely fast, in fact
unrivalled on any other architecture. In terms of cycle
times this is about a factor of two faster than the fastest
RISC architecture (Alpha 21264) to which the kernel has
been ported so far, and in terms of absolute time it is well
beyond anything we have seen so far."

I am curious if there is anything to salvage.
Your answer would seem to be No but it looks like there
is at least a factor of 2 better in there.
I'd like to know where that comes from.

Now maybe that's all due to the huge cache,
which they can afford because yield is much less of a concern
and they have higher heat dissipation abilities.

Let's assume that, after 15 years, the HP/Intel Itanium
compilers are now as good as they are ever going to get.
If we replace the back end with an OoO one which doesn't stall,
or get bubbles, and eliminate all the Noops, and the need for
performance analysis to statically optimize code,
then the dynamic scheduling problems that the compilers are
unable to tackle can be handled better.

I wonder if that gets anything?

Eric

nm...@cam.ac.uk

Aug 7, 2012, 6:19:48 PM

In article <5021874F...@sonic.net>,

Mark Thorson <nos...@sonic.net> wrote:
>
>In theory, it seems logical that Itanium could outperform
>OoO x86 architectures because the compiler has information
>which the hardware does not. The hardware only sees the
>code emitted by the compiler, while the compiler sees the
>source code. The compiler could exploit parallelism that
>the hardware can't because the compiler knows more about
>what optimizations can and can't be done. Are you saying
>this isn't true?

Yes, and it has been known to be false since the 1970s.

The fundamental problem is determining what aliasing and
which execution paths are possible, which is independent
of the architecture. Unfortunately, it is an intractable
problem theoretically and in general, and becomes feasible
only for some codes and in some languages. In particular,
it is relatively feasible for clean Fortran (especially
modern Fortran), and almost impossible even for the cleanest
C and C++.

You are looking at precisely the wrong end of optimisation.
It never was the code generation that was the issue, but the
analysis - and the problem on ALL architectures always has
been that most codes in most languages don't provide enough
information.
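
A minimal sketch of the aliasing problem, written in Python
rather than C so that it runs as-is. The function names and the
list standing in for memory are illustrative assumptions. It shows
that hoisting a load above a store, the kind of code motion a
static scheduler wants, preserves the result only when the two
references do not alias, which is what the compiler usually
cannot prove.

```python
def in_order(mem, p, q):
    """Reference semantics: store through p, then load through q."""
    mem[p] = 42          # store
    return mem[q] + 1    # load happens after the store


def load_hoisted(mem, p, q):
    """The same code after a scheduler has moved the load above the store."""
    tmp = mem[q]         # load hoisted to hide its latency
    mem[p] = 42          # store
    return tmp + 1


def results(p, q):
    """Run both versions on identical fresh memory and return both results."""
    return in_order([0] * 4, p, q), load_hoisted([0] * 4, p, q)


no_alias = results(p=0, q=1)   # distinct locations: reordering is harmless
alias = results(p=2, q=2)      # same location: reordering changes the answer

print("no alias:", no_alias)
print("alias:   ", alias)
```

Unless the compiler can show that p and q never coincide, it has
to keep the original order.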

>What got me thinking about Itanium was these passages from
>the interview with Bob Colwell:
>

>$ Anyway this chip architect guy is standing up in front of
>$ this group promising the moon and stars. And I finally put
>$ my hand up and said I just could not see how you're proposing
>$ to get to those kind of performance levels. And he said well
>$ we've got a simulation, and I thought Ah, ok. That shut me up
>$ for a little bit, but then something occurred to me and I
>$ interrupted him again. I said, wait I am sorry to derail
>$ this meeting. But how would you use a simulator if you don't
>$ have a compiler? He said, well that's true we don't have a
>$ compiler yet, so I hand assembled my simulations. I asked
>$ "How did you do thousands of line of code that way?" He said
>$ "No, I did 30 lines of code". Flabbergasted, I said, "You're
>$ predicting the entire future of this architecture on 30 lines
>$ of hand generated code?" [chuckle], I said it just like that,
>$ I did not mean to be insulting but I was just thunderstruck.
>
>It would be nice to know what those 30 lines of code were, and
>who gave this presentation. (Anybody who posts here?) There
>must have been some coherent, logical argument that the Itanium
>approach would deliver these performance gains, based on these
>30 lines of code. Does anybody know where I could find that?
>It would be nice to compare the plan against what actually
>happened.

Why are you assuming that? Those of us With Clue knew perfectly
well at the time that there could be no coherent, logical argument,
because the conclusion was known to be wrong.

I know what some of the proponents said at the time and it went
like this:

Proponent: We are going to deliver an average IPC of 5+ on
unmodified, general-purpose codes.

Me: How are you going to do that? You know that the smartest
computer scientists in the world have been trying for 25 years,
and can't get above 2, and usually can't get even that?

Proponent: There are some very smart people in the compiler
team in HP.

Me: < stunned silence >
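
To make the IPC ceiling concrete, here is a toy list scheduler.
It is a sketch under generous assumptions (unit latency, perfect
dependence information, no branches, no cache misses) and does
not model the Itanium or any real compiler; all names are
hypothetical. Even so, it shows that issue width only helps when
the code has independent instructions to fill it.

```python
def schedule_cycles(deps, width):
    """Greedy list scheduling with unit latency.

    deps[i] is the set of earlier instruction indices that i depends on.
    Returns the number of cycles needed when at most `width`
    instructions can issue per cycle.
    """
    issue_cycle = {}   # instruction index -> cycle in which it issues
    issued_in = {}     # cycle -> number of instructions already placed there
    for i in range(len(deps)):
        # Earliest cycle after every producer has completed.
        c = max((issue_cycle[d] + 1 for d in deps[i]), default=0)
        # Slide forward until a cycle with a free issue slot is found.
        while issued_in.get(c, 0) >= width:
            c += 1
        issue_cycle[i] = c
        issued_in[c] = issued_in.get(c, 0) + 1
    return max(issue_cycle.values()) + 1


def ipc(deps, width):
    """Instructions per cycle achieved by the schedule above."""
    return len(deps) / schedule_cycles(deps, width)


chain = [set(), {0}, {1}, {2}, {3}, {4}]          # each op needs the previous one
independent = [set() for _ in range(6)]           # no dependences at all
two_chains = [set(), set(), {0}, {1}, {2}, {3}]   # two interleaved chains of three

for name, deps in [("chain", chain), ("independent", independent),
                   ("two_chains", two_chains)]:
    print(name, ipc(deps, width=6))
```

On a six-wide machine the dependent chain stays at an IPC of 1
and the two interleaved chains reach 2; only the fully
independent code fills the width.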

Regards,
Nick Maclaren.

nm...@cam.ac.uk

Aug 7, 2012, 6:26:15 PM

In article <0nfUr.48057$7y4....@newsfe23.iad>,

EricP <ThatWould...@thevillage.com> wrote:
>
>That is what I found interesting from that quote I posted earlier.
>
>Itanium - A System Implementor's Tale (2005)
>http://static.usenix.org/event/usenix05/tech/general/gray.html
>
>"...and a final execution time of 36 cycles, or 24ns
>on a 1.5GHz Itanium 2. This is extremely fast, in fact
>unrivalled on any other architecture. In terms of cycle
>times this is about a factor of two faster than the fastest
>RISC architecture (Alpha 21264) to which the kernel has
>been ported so far, and in terms of absolute time it is well
>beyond anything we have seen so far."
>
>I am curious if there is anything to salvage.
>Your answer would seem to be No but it looks like there
>is at least a factor of 2 better in there.
>I'd like to know where that comes from.

Well, yes, the Itanium can outperform all other architectures
on some code fragments and some programs. So could the Hitachi
SR2201, the ICL DAP and lots of other extreme systems. That's
not the point, nor ever was.

The point is that it can't handle the 'average' general-purpose
codes, the fragments that it handles well don't dominate more
than a few applications, and the codes it is best suited for
are only a very small proportion of HPC ones. That doesn't
make it a viable architecture.

Regards,
Nick Maclaren.

Mark Thorson

Aug 7, 2012, 7:43:04 PM

nm...@cam.ac.uk wrote:
>
> In article <5021874F...@sonic.net>,
> Mark Thorson <nos...@sonic.net> wrote:
> >
> >It would be nice to know what those 30 lines of code were, and
> >who gave this presentation. (Anybody who posts here?) There
> >must have been some coherent, logical argument that the Itanium
> >approach would deliver these performance gains, based on these
> >30 lines of code. Does anybody know where I could find that?
> >It would be nice to compare the plan against what actually
> >happened.
>
> Why are you assuming that? Those of us With Clue knew perfectly
> well at the time that there could be no coherent, logical argument,
> because the conclusion was known to be wrong.

Well, okay. But there must have been _an_ argument based
on the 30 lines, and I'd like to see those lines and hear
the argument.

> I know what some of the proponents said at the time and it went
> like this:
>
> Proponent: We are going to deliver an average IPC of 5+ on
> unmodified, general-purpose codes.
>
> Me: How are you going to do that? You know that the smartest
> computer scientists in the world have been trying for 25 years,
> and can't get above 2, and usually can't get even that?
>
> Proponent: There are some very smart people in the compiler
> team in HP.
>
> Me: < stunned silence >

I assume they didn't just pull that number out of
thin air. There must have been _something_ on which
they based that number.

van...@vsta.org

Aug 7, 2012, 7:51:55 PM

nm...@cam.ac.uk wrote:
> The point is that it can't handle the 'average' general-purpose
> codes, the fragments that it handles well don't dominate more
> than a few applications, and the codes it is best suited for
> are only a very small proportion of HPC ones. That doesn't
> make it a viable architecture.

I'm impressed that it's made it as long as it has. I remember when a vendor
(who I'd best not mention by name) was all ramped up to cut over to Itanium
based systems. I mean, a full-bore platform transition of the company. And
the following week Intel said "Nope, we're going to slip a year. Maybe more.
We'll see.". Boom, platform strategy shot.

Right there I pretty much concluded it really was The Itanic. And yet it
still exists all these years later. Not flourishing, but not gone yet.

MitchAlsup

Aug 7, 2012, 10:59:30 PM

to nos...@sonic.net

On Tuesday, August 7, 2012 4:23:27 PM UTC-5, Mark Thorson wrote:
> In theory, it seems logical that Itanium could outperform OoO x86 architectures because the compiler has information which the hardware does not.

The hardware ALSO has information that the compiler does not. And in particular, both the branch predictor and the data cache, in general, override the compiler because the HW can actually know what IS going on. Advantage: draw.

> The hardware only sees the code emitted by the compiler, while the compiler sees the source code.

The compiler, also, HAS to make conservative decisions as to what code to emit. Whereas, the HW can make aggressive decisions, and then microfault and undo the damage should the prediction be incorrect. Advantage: HW

> The compiler could exploit parallelism that the hardware can't because the compiler knows more about what optimizations can and can't be done.

The compiler, now, has to target machines of different dispatch widths (and potentially different latencies), whereas the HW only has to deal with exactly one--the size it dispatches and the latencies that it produces. Advantage: HW.

Mitch

Joe Pfeiffer

Aug 7, 2012, 11:27:21 PM

MitchAlsup <Mitch...@aol.com> writes:

> On Tuesday, August 7, 2012 4:23:27 PM UTC-5, Mark Thorson wrote:
>> In theory, it seems logical that Itanium could outperform OoO x86 architectures because the compiler has information which the hardware does not.
>
> The hardware ALSO has information that the compiler does not. And in particular, both the branch predictor and the data cache, in general, override the compiler because the HW can actually know what IS going on. Advantage: draw.

To me it sounds like advantage: HW. Compiler can give a hint for the
first pass; after that the HW knows better.

>> The hardware only sees the code emitted by the compiler, while the compiler sees the source code.
>
> The compiler, also, HAS to make conservative decisions as to what code to emit. Whereas, the HW can make aggressive decisions, and then microfault and undo the damage should the prediction be incorrect. Advantage: HW
>
>> The compiler could exploit parallelism that the hardware can't because the compiler knows more about what optimizations can and can't be done.
>
> The compiler, now, has to target machines of different dispatch widths (and potentially different latencies), whereas the HW only has to deal with exactly one--the size it dispatches and the latencies that it produces. Advantage: HW.

In principle, the compiler can optimize for the particular
implementation, and produce the "right" code as appropriate. I'd call
this one a draw.

William Clodius

Aug 7, 2012, 11:57:52 PM

Mark Thorson <nos...@sonic.net> wrote:

> <snip>

It is obvious from the context that Bob Colwell at the time of the
presentation did not believe that there was any possible coherent
logical argument that the hand optimization of any single set of thirty
lines of code (it reads as if Bob thought that was of assembler not
source) could represent the performance of a compiler. Now at the time
the presenter might have been intimidated out of giving his
rationalization by Bob's comments, but even under the best of conditions
it is too small a sample to represent more than a small fraction
of optimization problems. In particular, it would be very difficult to
avoid a selection bias towards highly optimizable code.

Andy (Super) Glew

Aug 8, 2012, 2:06:26 AM

On 8/6/2012 6:05 PM, Mark Thorson wrote:
> Here we are, over 10 years later, and Intel
> is still producing new Itanium designs.
> Is there any reason to believe that won't
> continue for the indefinite future?
>
> If the critics were correct, wouldn't someone
> have to be nuts to join the Itanium design team?

No.

I know some now former Itanium architects, who started after Itanium was
clearly deprecated. It was good experience - they got far more
responsibility on the Itanium team, than they would have gotten on an
x86 team, where everyone is jostling for position.

As for design engineers, look at how Sam Naffziger went from doing
Itanium to AMD Llano.

Sometimes it's good to be big fish in a small pond.

Terje Mathisen

Aug 8, 2012, 2:36:21 AM

EricP wrote:
> "...and a final execution time of 36 cycles, or 24ns
> on a 1.5GHz Itanium 2. This is extremely fast, in fact
> unrivalled on any other architecture. In terms of cycle
> times this is about a factor of two faster than the fastest
> RISC architecture (Alpha 21264) to which the kernel has
> been ported so far, and in terms of absolute time it is well
> beyond anything we have seen so far."
>
> I am curious if there is anything to salvage.
> Your answer would seem to be No but it looks like there
> is at least a factor of 2 better in there.
> I'd like to know where that comes from.

That article was in fact two things:

a) an example of a specific hw feature (fast but extremely limited
syscall) and

b) an example of how the Itanium compilers couldn't even come close to
actually exploiting the "all but the kitchen sink" features of the cpu
in a synergistic (sorry about the buzzword bingo) manner; handwritten
asm code was needed.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Anton Ertl

Aug 8, 2012, 2:26:26 AM

MitchAlsup <Mitch...@aol.com> writes:
>On Tuesday, August 7, 2012 4:23:27 PM UTC-5, Mark Thorson wrote:
>> In theory, it seems logical that Itanium could outperform OoO x86 architectures because the compiler has information which the hardware does not.
>
>The hardware ALSO has information that the compiler does not. And in particular, both the branch predictor and the data cache, in general, override the compiler because the HW can actually know what IS going on. Advantage: draw.

Hardware or compiler is a false dichotomy. If the hardware has
information that the compiler does not and the compiler has
information that the hardware does not, one should make use of both;
and that's what was done in the RISC revolution. And IA-64 is a
continuation of that, it's just that they misjudged what OoO would do
for the hardware. Hmm, makes me wonder how better judgement would
have changed the IA-64 architecture or if it would just have changed
the implementation.

>> The hardware only sees the code emitted by the compiler, while the compiler sees the source code.
>
>The compiler, also, HAS to make conservative decisions as to what code to emit. Whereas, the HW can make aggressive decisions, and then microfault and undo the damage should the prediction be incorrect. Advantage: HW

The compiler can also make aggressive decisions and generate code to
undo the damage if the prediction is incorrect. The compiler has the
disadvantage of being worse at prediction, but the advantage that it
can do aggressive things that are beyond what hardware can do. IA-64
has architectural support for such speculation.
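
A rough sketch of what compiler-directed speculation with
recovery code looks like, loosely modelled on IA-64's advanced
load (ld.a) and its later check (chk.a) against the Advanced
Load Address Table. The class and function names are
hypothetical, and a Python list stands in for memory; this is an
illustration of the idea, not of the actual instruction
semantics.

```python
class Alat:
    """Toy stand-in for a table tracking speculatively loaded addresses."""

    def __init__(self):
        self.valid = set()

    def advanced_load(self, mem, addr):
        self.valid.add(addr)        # remember that addr was loaded early
        return mem[addr]

    def store(self, mem, addr, value):
        mem[addr] = value
        self.valid.discard(addr)    # a store to the same address kills the entry

    def check(self, addr):
        return addr in self.valid   # True means the early load is still good


def speculative(mem, p, q):
    """Hoist the load of mem[q] above a store to mem[p], then verify it."""
    alat = Alat()
    tmp = alat.advanced_load(mem, q)   # load moved above the store
    alat.store(mem, p, 42)             # store that may or may not alias
    recovered = False
    if not alat.check(q):              # speculation failed
        tmp = mem[q]                   # recovery code: redo the load
        recovered = True
    return tmp + 1, recovered


print(speculative([0] * 4, p=0, q=1))   # no conflict, fast path
print(speculative([0] * 4, p=2, q=2))   # conflict, recovery path taken
```

The result is correct either way; the bet is only about how
often the recovery path runs.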

>> The compiler could exploit parallelism that the hardware can't because the compiler knows more about what optimizations can and can't be done.
>
>The compiler, now, has to target machines of different dispatch widths (and potentially different latencies), whereas the HW only has to deal with exactly one--the size it dispatches and the latencies that it produces. Advantage: HW.

JIT compilers also know exactly which hardware they produce code for,
and off-line compilers typically have switches for generating code for
different micro-architectures and different architecture variants.
While they are usually conservative by default about architectural
extensions (and that's good), targeting a particular
micro-architecture is fine (the code will still run on the others).

nm...@cam.ac.uk

Aug 8, 2012, 3:54:15 AM

In article <5021A808...@sonic.net>,

Mark Thorson <nos...@sonic.net> wrote:
>> >
>> >It would be nice to know what those 30 lines of code were, and
>> >who gave this presentation. (Anybody who posts here?) There
>> >must have been some coherent, logical argument that the Itanium
>> >approach would deliver these performance gains, based on these
>> >30 lines of code. Does anybody know where I could find that?
>> >It would be nice to compare the plan against what actually
>> >happened.
>>
>> Why are you assuming that? Those of us With Clue knew perfectly
>> well at the time that there could be no coherent, logical argument,
>> because the conclusion was known to be wrong.
>
>Well, okay. But there must have been _an_ argument based
>on the 30 lines, and I'd like to see those lines and hear
>the argument.
>

>I assume they didn't just pull that number out of
>thin air. There must have been _something_ on which
>they based that number.

Right. I will go with that. But I also side with Bob Colwell,
from the software rather than hardware perspective. We knew
that the claim was bogus, and I knew that it was impossible;
whether he did as well, I can't say.

Regards,
Nick Maclaren.

nm...@cam.ac.uk

Aug 8, 2012, 4:04:01 AM

In article <a8do0r...@mid.individual.net>, <van...@vsta.org> wrote:
>
>> The point is that it can't handle the 'average' general-purpose
>> codes, the fragments that it handles well don't dominate more
>> than a few applications, and the codes it is best suited for
>> are only a very small proportion of HPC ones. That doesn't
>> make it a viable architecture.
>
>I'm impressed that it's made it as long as it has. I remember when a vendor
>(who I'd best not mention by name) was all ramped up to cut over to Itanium
>based systems. I mean, a full-bore platform transition of the company. And
>the following week Intel said "Nope, we're going to slip a year. Maybe more.
>We'll see.". Boom, platform strategy shot.

At least one company went belly-up purely because it made that
mistake, and it was instrumental in several others doing so.

>Right there I pretty much concluded it really was The Itanic. And yet it
>still exists all these years later. Not flourishing, but not gone yet.

Slipping a year isn't all that rare, but the Itanic actually shipped
about 4 years after the originally scheduled date.

Regards,
Nick Maclaren.

nm...@cam.ac.uk

unread,

Aug 8, 2012, 4:27:09 AM8/8/12

to

In article <1bwr1ar...@pfeifferfamily.net>,

That is a mistaken analysis - not that it's wrong, but it misses
the major difference.

Where compilation scores big time is when it can do high-level
analysis and do major code movement (e.g. preloading), elimination
of 'executed' code, vectorisation and parallelisation.
But, to do that, it needs a suitable language and programming
paradigm to work with, which is why well-written modern Fortran
beats the living daylights out of even well-written C++, let alone
the typically ghastly code.

Where hardware scores big time is precisely when the compiler is
defeated, because intractable aliasing and control-flow analysis
becomes feasible once you have put actual values into the variables.
It's the difference between doing numerical analysis on the basic
formulae, which is very powerful when you can do it, and running
interval arithmetic on a specific problem.

Now, as usual, real life isn't as clear-cut as that, which is why
realistic optimisation involves some rather messy collaborative
hacking in both the compiler and hardware. But it is important
to realise that they are approaching the issue from radically
different directions.

Regards,
Nick Maclaren.

nedbrek

unread,

Aug 8, 2012, 8:55:30 AM8/8/12

to

On 08/07/2012 05:23 PM, Mark Thorson wrote:
> It would be nice to know what those 30 lines of code were, and
> who gave this presentation. (Anybody who posts here?) There
> must have been some coherent, logical argument that the Itanium
> approach would deliver these performance gains, based on these
> 30 lines of code. Does anybody know where I could find that?
> It would be nice to compare the plan against what actually
> happened.

Most Itanium presentations used 8 queens. They would start with a big
spaghetti mess, and reduce it down to a software pipelined loop.

Ned

BGB

unread,

Aug 8, 2012, 10:59:15 AM8/8/12

to

some of this also sort of applies to a VM context as well:
the high level language and bytecode may actually be fairly vague about
what exactly is going on (I am thinking more here like ECMAScript levels);
however, a VM's JIT can generate code for the particular variations,
figuring out what types the variables hold, how objects are laid out in
memory, ... and generate code which is more reasonably competitive with
a more static language (granted, static types still have advantages for
things like static type-checking and similar though, so I more prefer a
hybrid strategy).

meanwhile, if something like ECMAScript or similar were compiled
directly to machine code without being able to exploit any run-time
features (only using things statically provable from the source code),
performance would likely be considerably worse.

EricP

unread,

Aug 8, 2012, 11:33:24 AM8/8/12

to

Yes I understand that, but there is scant info available on how the Itanium
HW/compiler pair actually performs. So I'll take whatever we have.
In this case even though it was a very specific syscall,
an interprocess messaging call, it would be more similar
to commercial code than vector.

So I find it interesting that the best hand assembled Itanium
code could beat the best hand assembled Alpha code by quite a bit.
It _suggests_ that there is a higher top end on commercial style
code if only the compiler could exploit it.

In this case their compiler was GCC, which was known to be
substantially inferior to the HP/Intel compilers.
GCC-Itanium is probably not even maintained anymore.

So what I was tossing out as an idea is: ok we take the HP/Intel compiler
'as is'. We have a back end which, it appears, may have a higher
top end for commercial code if only it could be exploited.

Could adding HW fill in the holes that the compiler can't?
For example, that paper focuses on instruction scheduling.
Well, that is something that HW could help with.
Have the compiler do the coarse grain scheduling,
and let the HW do the fine grain.

By backing off of the original philosophy of VLIW and
'very smart compiler does everything', to a more compromise position,
is there something to be salvaged? In the same sense that people
backed off the RISC philosophy of 'all instructions take 1 clock'
to get something more useful.

Eric

Tim McCaffrey

unread,

Aug 8, 2012, 11:48:59 AM8/8/12

to

In article <5021A808...@sonic.net>, nos...@sonic.net says...

There was an article about Itanium in EETimes years ago where the author (who
was employed by Intel) went through various transformations to show how a
loop could be pipelined. I don't know whether that is the 30 lines quoted
above, but it wouldn't surprise me.

(Of course, the EETimes site is down right now...)

- Tim

Terje Mathisen

unread,

Aug 8, 2012, 4:29:34 PM8/8/12

to

I'll bet you I can write x86/x64 asm code which will do pretty much
equally well, and when it works I can convert it back to plain portable
C which will compile to near-optimal code.

8 queens isn't really a hard problem, and with 64 spaces a bitmap
representation is pretty obvious.

One register to hold the current board, a 64-entry lookup table to find
out which positions would be attacked by a queen in the next spot, and a
simple TEST RAX,table[EBX] to check if this is a legal spot.

With 128-bit code I can do even better, by keeping 8 guard positions
between each row: This allows me to shift the remaining board down so
that the current position is always bit 0, allowing a single register to
hold the queen attack mask and avoid all table lookups.

Split the code into 4/8 threads, one for each first-row starting point,
and you can keep all 4/8 cores busy.
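
A minimal sketch of the bitmap approach described above, written in Python rather than asm or C purely for illustration. The names (`ATTACKS`, `attack_mask`, `count_solutions`) are made up for this example; the board is one 64-bit integer, the 64-entry table holds the squares attacked from each position, and the legality check is the Python equivalent of the single TEST against the table entry.

```python
N = 8  # board is N x N, so the whole position fits in a 64-bit bitmap


def attack_mask(sq):
    """Bitmap of every square attacked by a queen standing on square `sq`."""
    r, c = divmod(sq, N)
    mask = 0
    for other in range(N * N):
        if other == sq:
            continue
        r2, c2 = divmod(other, N)
        if r2 == r or c2 == c or abs(r2 - r) == abs(c2 - c):
            mask |= 1 << other
    return mask


# The 64-entry lookup table: one attack bitmap per square.
ATTACKS = [attack_mask(sq) for sq in range(N * N)]


def count_solutions(row=0, board=0):
    """Count completions of `board` (bitmap of placed queens), filling row by row."""
    if row == N:
        return 1
    total = 0
    for col in range(N):
        sq = row * N + col
        # Legal if no queen already on the board sits on a square this one attacks.
        if (board & ATTACKS[sq]) == 0:
            total += count_solutions(row + 1, board | (1 << sq))
    return total


# One independent sub-search per first-row starting point, as suggested
# for splitting the work across threads or cores.
per_start = [count_solutions(1, 1 << col) for col in range(N)]
```

Here `count_solutions()` gives the well-known total of 92 for the 8x8 board, and `per_start` holds the eight independent pieces of work whose sum is the same total. The 128-bit guard-position variant is not shown.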

nm...@cam.ac.uk

unread,

Aug 8, 2012, 5:46:23 PM8/8/12

to

In article <gquaf9-...@ntp6.tmsw.no>,
Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:

>nedbrek wrote:
>>
>> Most Itanium presentations used 8 queens. They would start with a big
>> spaghetti mess, and reduce it down to a software pipelined loop.
>
>I'll bet you I can write x86/x64 asm code which will do pretty much
>equally well, and when it works I can convert it back to plain portable
>C which will compile to near-optimal code.
>
>8 queens isn't really a hard problem, and with 64 spaces a bitmap
>representation is pretty obvious.

What amazed me then, and still does, is how many people swallowed
that sales pitch, hook, line and sinker. In addition to your
points, it is about as unrepresentative of most real, important
optimisation problems as it is possible to be.

When they used it initially, I was unimpressed, but thought little
of it. When it was still being used two years later, I realised
that my most negative expectations were also likely to be my most
correct ones!

Regards,
Nick Maclaren.

Mark Thorson

unread,

Aug 8, 2012, 10:32:15 PM8/8/12

to

EricP wrote:
>
> Could adding HW fill in the holes that the compiler can't?
> For example, that paper focuses on instruction scheduling.
> Well, that is something that HW could help with.
> Have the compiler do the coarse grain scheduling,
> and let the HW do the fine grain.
>
> By backing off of the original philosophy of VLIW and
> 'very smart compiler does everything', to a more compromise position,
> is there something to be salvaged? In the same sense that people
> backed off the RISC philosophy of 'all instructions take 1 clock'
> to get something more useful.

You mean hybridizing EPIC and OoO? Wouldn't that
be like hybridizing AC and DC?

Anton Ertl

unread,

Aug 9, 2012, 1:54:23 AM8/9/12

to

Mark Thorson <nos...@sonic.net> writes:
>You mean hybridizing EPIC and OoO? Wouldn't that
>be like hybridizing AC and DC?

EPIC is an architecture style, while OoO is a microarchitectural
feature. An OoO implementation of an EPIC architecture is certainly
possible, unlike a hybrid of AC and DC. The question is how big, if
any, the benefits are over an OoO implementation of a more
conventional architecture like AMD64.

Terje Mathisen

unread,

Aug 9, 2012, 2:26:01 AM8/9/12

to

Anton Ertl wrote:
> Mark Thorson <nos...@sonic.net> writes:
>> You mean hybridizing EPIC and OoO? Wouldn't that
>> be like hybridizing AC and DC?
>
> EPIC is an architecture style, while OoO is a microarchitectural
> feature. An OoO implementation of an EPIC architecture is certainly
> possible, unlike a hybrid of AC and DC. The question is how big, if
> any, the benefits are over an OoO implementation of a more
> conventional architecture like AMD64.

In an OoO Itanium the sw preload/check sequences would effectively
become a regular prefetch/load pair.

Besides this I don't remember any other specific hw features to work
around the missing OoO hardware?

Predicate masks, rotating register sets etc seem like they could be
implemented just as well in an OoO microarchitecture...

Michael S

unread,

Aug 9, 2012, 4:44:01 AM8/9/12

to

The fact that the absolute register names used in an instruction are not
known immediately after decoding could be problematic.
Of course, the microarchitecture can use the famous
predict-speculate-replay_on_miss pattern. However, it is not obvious to me
that it is easy to predict the name of a stacked GP register near a call,
or the name of a rotated FP register at the last iteration
of the loop.

As to predicates, in a brute-force implementation the predicate register is
yet another input for the OoO scheduler to track. Probably o.k. on the FP
side, but problematic on the integer side, where you want the scheduler loop
to be as tight as possible.
Maybe they can build a 2/3rds OoO, dynamically renaming GPRs and FPRs
but leaving predicate registers non-renamed? Hopefully, with 64
predicate registers in hand and with legacy code scheduled for
in-order, stalls due to false dependencies on predicate registers will be
very rare.

Another problematic Itanium feature is the very widely used post-increment
addressing. That's particularly problematic for integer loads, because
an integer load instruction generates 2 outputs. Your typical OoO
implementation doesn't like 2 outputs. The normal solution for such a case is
either cracking the offending instruction into two uOPs or issuing it twice
simultaneously through a couple of execution ports. Both solutions
are fine when, as in the case of Power, instructions generating 2 GPR
outputs are rare. But on Itanium an integer load with post-increment is
not rare at all.

Michael S

unread,

Aug 9, 2012, 5:08:07 AM8/9/12

to

On Aug 9, 8:54 am, an...@mips.complang.tuwien.ac.at (Anton Ertl)
wrote:

> Mark Thorson <nos...@sonic.net> writes:
> >You mean hybridizing EPIC and OoO? Wouldn't that
> >be like hybridizing AC and DC?
>
> EPIC is an architecture style

Architecture style or marketable name for VL-VLIW? Methinks the
latter, same as Texas Instruments' VelociTI.

And just as TI de-emphasised the name VelociTI, Intel/HP seem to
de-emphasise the name EPIC in favor of Itanium/Integrity.

nedbrek

unread,

Aug 9, 2012, 6:18:27 AM8/9/12

to

On 08/09/2012 04:44 AM, Michael S wrote:
> On Aug 9, 9:26 am, Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>
> The fact that absolute register names used in instruction are not
> known immediately after decoding could be problematic.
> Of course, microarchitecture can use famous predict-speculate-
> replay_on_miss pattern. However it is not obvious to me that it is
> easy to predict the name of stacked GP register near call and that it
> is easy to predict the name of rotated FP register at last iteration
> of the loop.

It amounts to an add before you hit the rename map. The offset is
speculatively updated when you see a call/alloc or call.ret and
recovered on branch mispredict.
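
To make the "an add before the rename map" point concrete, here is a rough sketch under simplifying assumptions: r0-r31 are static, r32-r127 are stacked, and the stacked part is modelled as a 96-entry ring. The class and method names (`StackedNameUnit`, `on_call`, `checkpoint`, and so on) are invented for the example and do not describe any real design; a real implementation would have more physical stacked registers plus RSE spill/fill.

```python
NUM_STATIC = 32    # r0-r31: not stacked, names pass through unchanged
NUM_STACKED = 96   # r32-r127: relative to the current frame base


class StackedNameUnit:
    """Turns frame-relative register numbers into absolute names."""

    def __init__(self):
        self.base = 0           # speculative frame offset
        self.checkpoints = {}   # branch id -> offset saved at prediction time

    def absolute_name(self, reg):
        # The "add": offset the stacked register number, wrap around the ring.
        if reg < NUM_STATIC:
            return reg
        return NUM_STATIC + (self.base + reg - NUM_STATIC) % NUM_STACKED

    def on_call(self, caller_locals):
        # Speculatively advance the base past the caller's local registers.
        self.base = (self.base + caller_locals) % NUM_STACKED

    def on_return(self, caller_locals):
        # Undo the advance when the matching return is seen.
        self.base = (self.base - caller_locals) % NUM_STACKED

    def checkpoint(self, branch_id):
        # Remember the offset at each predicted branch.
        self.checkpoints[branch_id] = self.base

    def recover(self, branch_id):
        # On a mispredict, restore the offset saved at that branch.
        self.base = self.checkpoints.pop(branch_id)
```

The output of `absolute_name` is what would then index the ordinary rename map, so the rest of the OoO machinery only ever sees fixed names.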

> As to predicates, in brute force implementation predicate register is
> yet another input to track by OoO scheduler. Probably o.k. on the FP
> side, but problematic on integer side where you want scheduler loop to
> be as tight as possible.
> May be, they can build 2/3rd OoO dynamically renaming GPRs and FPRs
> but leaving predicate registers non-renamed? Hopefully, with 64
> predicate registers in hand and with legacy code scheduled for in-
> order, stalls due to false dependencies on predicate register will be
> very rare.

There were a number of solutions that we looked at:
1) Uop
2) 3rd input
3) Prediction

I liked #1, but it does have some performance impact on heavily
predicated code (which McKinley optimized code tended to be).

There were some fans of #3, but it creates another source of flushes.

> Another problematic Itanium feature is the very widely used post-increment
> addressing. That's particularly problematic for integer loads, because
> an integer load instruction generates 2 outputs. Your typical OoO
> implementation doesn't like 2 outputs. The normal solution for such a case is
> either cracking the offending instruction into two uOPs or issuing it twice
> simultaneously through a couple of execution ports. Both solutions
> are fine when, as in the case of Power, instructions generating 2 GPR
> outputs are rare. But on Itanium an integer load with post-increment is
> not rare at all.

Uops. I forget the uop rate... We had a lot of different performance
metrics:
1) Bundles per cycle
2) Instructions per cycle
3) Non-nop IPC
4) Non-predicate false IPC
5) Uops per cycle

The post-increment ones are easier than x86 ld-op or ld-op-st because
they are independent.

Ned

nm...@cam.ac.uk

unread,

Aug 9, 2012, 6:52:43 AM8/9/12

to

In article <k002pl$b7t$1...@dont-email.me>, nedbrek <ned...@yahoo.com> wrote:
>
>> Another problematic Itanium feature is the very widely used post-increment
>> addressing. That's particularly problematic for integer loads, because
>> an integer load instruction generates 2 outputs. Your typical OoO
>> implementation doesn't like 2 outputs. The normal solution for such a case is
>> either cracking the offending instruction into two uOPs or issuing it twice
>> simultaneously through a couple of execution ports. Both solutions
>> are fine when, as in the case of Power, instructions generating 2 GPR
>> outputs are rare. But on Itanium an integer load with post-increment is
>> not rare at all.
>
>Uops. I forget the uop rate... We had a lot of different performance
>metrics:
>1) Bundles per cycle
>2) Instructions per cycle
>3) Non-nop IPC
>4) Non-predicate false IPC
>5) Uops per cycle

Well, yes, but those are just tuning tools, not real performance
measurements - when they are claimed to be the latter, it's just
one set of people claiming bragging rights over another. This issue
is a complete irrelevance to real codes, anyway, because computers
haven't been limited by instruction issue rate in decades, though
I accept that it was treated as important by the IA64 designers.

As I have posted before, a far more interesting design would be
one that concentrated on the ISA providing more information to
the hardware, to allow it to do much more in the way of OoO data
access, and preferably to minimise the amount of data access.
Now, this WAS something where the IA64 had some useful features,
but the most useful of all was dropped early on, apparently
because some of the involved parties couldn't make it work.

Regards,
Nick Maclaren.

j...@cix.compulink.co.uk

unread,

Aug 9, 2012, 6:59:06 AM8/9/12

to

In article <5021874F...@sonic.net>, nos...@sonic.net (Mark
Thorson) wrote:

> In theory, it seems logical that Itanium could outperform
> OoO x86 architectures because the compiler has information
> which the hardware does not. The hardware only sees the
> code emitted by the compiler, while the compiler sees the
> source code. The compiler could exploit parallelism that
> the hardware can't because the compiler knows more about
> what optimizations can and can't be done.

This seemed obvious at the time to people at the correct level of
ignorance. I was one of them. I have learned not to trust my reactions
on these matters, but to look for evidence. I wish the lesson had been
less painfully learned.

--
John Dallman, j...@cix.co.uk, HTML mail is treated as probable spam.

nedbrek

unread,

Aug 9, 2012, 7:03:30 AM8/9/12

to

On 08/09/2012 02:26 AM, Terje Mathisen wrote:

> Anton Ertl wrote:
>>
>> EPIC is an architecture style, while OoO is a microarchitectural
>> feature. An OoO implementation of an EPIC architecture is certainly
>> possible, unlike a hybrid of AC and DC. The question is how big, if
>> any, the benefits are over an OoO implementation of a more
>> conventional architecture like AMD64.
>
> In an OoO Itanium the sw preload/check sequences would effectively
> become a regular prefetch/load pair.
>
> Besides this I don't remember any other specific hw features to work
> around the missing OoO hardware?
>
> Predicate masks, rotating register sets etc seems like they could be
> implemented just as well in an OoO microarchitecture...
>

The biggest problems were Itanium (rather than EPIC/VLIW) specific:
1) Huge architected register file (144 int regs, 128 fp) (which drove us
to investigate register file caching)
2) 1/16/64 wide predicate writes (see patent 7,428,631)

The other issues (and our recommended solutions) were:
1) RSE - uops
2) ALAT - deprecate (ld.a and ld.c become regular loads, chk.a always fails)
3) Predicated instructions - uop
4) Funky instructions (Parallel compare, post-increment, etc) - uop
5) NAT - have to eat that one
6) Too many nops (33% int, 50% FP) - see patent 7,111,154

We also had some architecture requests:
1) Deprecate stop bits
2) Add GGG (easy) or even GGGG (hard, could remove predicate bits)
bundle template
3) Int and FP divide
4) Reduce amount of predication (save it for really hard branches)

Ned

Anton Ertl

unread,

Aug 9, 2012, 8:46:18 AM8/9/12

to

Michael S <already...@yahoo.com> writes:
>On Aug 9, 8:54 am, an...@mips.complang.tuwien.ac.at (Anton Ertl)
>wrote:
>> Mark Thorson <nos...@sonic.net> writes:
>> >You mean hybridizing EPIC and OoO? Wouldn't that
>> >be like hybridizing AC and DC?
>>
>> EPIC is an architecture style
>
>Architecture style or marketable name for VL-VLIW?

Would that be "variable length VLIW"? That would also be an
architecture style, and whether they are the same style is irrelevant
for my point.

Anyway, IA-64 has a number of features that are included in
"Explicitly parallel instruction computing" and that are beyond
classical VLIW:

* Instruction groups (that have no register dependencies and stop bits
between them) are different from bundles (grouping of instructions
for encoding and addressing purposes). That's what I would expect
from something that might be called variable length VLIW, but
there's more:

* There may be memory dependencies within an instruction group.

* There are additional features for speculative execution (in
particular, delayed exceptions and speculative loads).

* One may also consider predication as a feature designed for
increasing ILP.

Sure, if you don't look closely, you can consider it a VLIW or a RISC
with an unusual encoding, and if you look closely, you can just call
it IA-64 or Itanium; there is only one significant architecture that's
called EPIC, so maybe we don't need the EPIC term after all.
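
The group/bundle distinction above can be illustrated with a small sketch. This is a simplified model, not the IA-64 rules in full: instructions are reduced to sets of destination and source registers plus a stop bit, only register RAW and WAW dependencies are checked, and the architecture's special cases are ignored. All names (`Insn`, `split_groups`, `group_is_legal`, `bundles`) are made up for the example.

```python
from typing import NamedTuple


class Insn(NamedTuple):
    dests: frozenset      # registers written
    srcs: frozenset       # registers read
    stop: bool = False    # stop bit after this instruction ends the group


def split_groups(insns):
    """Instruction groups: runs of instructions delimited by stop bits."""
    groups, current = [], []
    for insn in insns:
        current.append(insn)
        if insn.stop:
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups


def group_is_legal(group):
    """No instruction may read or write a register written earlier in its group."""
    written = set()
    for insn in group:
        if written & (insn.srcs | insn.dests):
            return False
        written |= insn.dests
    return True


BUNDLE_SLOTS = 3


def bundles(insns):
    """Bundles: fixed three-slot units used for encoding and addressing only."""
    return [insns[i:i + BUNDLE_SLOTS] for i in range(0, len(insns), BUNDLE_SLOTS)]


prog = [
    Insn(frozenset({"r1"}), frozenset({"r2", "r3"})),
    Insn(frozenset({"r4"}), frozenset({"r5"})),
    Insn(frozenset({"r6"}), frozenset({"r7"})),
    Insn(frozenset({"r8"}), frozenset({"r9"}), stop=True),
    Insn(frozenset({"r10"}), frozenset({"r1", "r4"}), stop=True),
]
```

In `prog` the first instruction group has four instructions and so spans two bundles, which is the sense in which groups and bundles are independent of each other. The last instruction reads r1 and r4, so it has to sit behind a stop bit.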

>Intel/HP seem to
>de-emphasise the name EPIC in favor of Itanium/Integrity

Which speaks against your theory of it being a marketing name.

Or maybe it was a marketing name for marketing the concept to the
research community and technophiles, whereas current marketing efforts
center more on "suits", who are more interested in other things than
instruction sets, and for whom "EPIC" is therefore irrelevant.

Anton Ertl

unread,

Aug 9, 2012, 9:36:51 AM8/9/12

to

nedbrek <ned...@yahoo.com> writes:
>On 08/09/2012 04:44 AM, Michael S wrote:
>> As to predicates, in brute force implementation predicate register is
>> yet another input to track by OoO scheduler. Probably o.k. on the FP
>> side, but problematic on integer side where you want scheduler loop to
>> be as tight as possible.
>> May be, they can build 2/3rd OoO dynamically renaming GPRs and FPRs
>> but leaving predicate registers non-renamed? Hopefully, with 64
>> predicate registers in hand and with legacy code scheduled for in-
>> order, stalls due to false dependencies on predicate register will be
>> very rare.
>
>There were a number of solutions that we looked at:
>1) Uop
>2) 3rd input
>3) Prediction
>
>I liked #1, but it does have some performance impact on heavily
>predicated code (which McKinley optimized code tended to be).

It's not exactly clear what you mean by #1, but from context I suspect
that you split a predicated instruction into an unpredicated
instruction and a conditional move or somesuch, right?

Paul A. Clayton

unread,

Aug 9, 2012, 9:46:27 AM8/9/12

to

On Thursday, August 9, 2012 7:03:30 AM UTC-4, nedbrek wrote:
[snip]

> The biggest problems were Itanium (rather than EPIC/VLIW) specific:
> 1) Huge architected register file (144 int regs, 128 fp) (which drove us
> to investigate register file caching)

I thought the huge RF was an EPIC feature to facilitate
register prefetch and broad scheduling for a high degree
of parallelism (which can require a larger number of live
values). In addition, if one plans to use a register
stack with a largish minimum number of 'extra' registers,
it is probably very tempting to just allow direct access
to those (minimum) extra registers.

The provision of 16 shadow registers seems at least very
consistent with EPIC (just as the register stack avoids
loads and stores for function calls, the shadow registers
avoid such for interrupts and syscalls). (I like shadow
registers, but I think their allocation should be more
flexible and that [like MIPS MT ASE] they should be
reusable as--or better [?], synonymous with--thread
contexts.)

> 2) 1/16/64 wide predicate writes (see patent 7,428,631)
>
> The other issues (and our recommended solutions) were:
> 1) RSE - uops

I wish the RSE was better optimized--not just asynchronous
but also exploiting wider accesses, cache block allocation,
and known access patterns. The new Itanium 9500 does add
32 registers to the register stack--as suggested by at
least one paper--to reduce spills and fills. Such might
be a step towards support for asynchronous save/restore
since such could be used as a buffer.

> 2) ALAT - deprecate (ld.a and ld.c become regular loads, chk.a always fails)

I wonder if a predictor would be useful for guessing the
nature of the speculation (e.g., whether the load result
is speculatively used) and the probability of
misspeculation. Having chk.a always fail bothers me.

> 3) Predicated instructions - uop

I would probably wish for prediction with low-confidence
predictions falling back to predication (with perhaps
medium-low confidence biasing operation). What is four
times the complexity in an already nearly hopelessly
complex project? :-\

> 4) Funky instructions (Parallel compare, post-increment, etc) - uop
> 5) NAT - have to eat that one
> 6) Too many nops (33% int, 50% FP) - see patent 7,111,154

Nops should be easy to handle. You have previously
indicated that relaxing template restrictions would have
significantly reduced nop count.

[BEGIN RANT]
I am annoyed that the Itanium 9500 is considered 12
wide issue when one of the issue ports is for nops--
and up to 12 nops can be issued per cycle (unless
there is some other constraint, this would seem to
imply that one could issue 23 instructions per cycle;
claiming 23-wide issue--even if possible--would
probably have been recognized as somehow deceptive
by outside observers).

Perhaps counting the nop pipeline makes sense, but it
seems to me like a way to artificially increase the
number, very possibly with the intent to encourage
people to think it is a four bundle wide implementation.
[END RANT]

> We also had some architecture requests:
> 1) Deprecate stop bits
> 2) Add GGG (easy) or even GGGG (hard, could remove predicate bits)

G?

> bundle template
> 3) Int and FP divide

At least 32-bit source 64-bit result integer (GPR)
multiply has been added.

> 4) Reduce amount of predication (save it for really hard branches)

Wouldn't this be a compiler feature rather than an Architecture
feature? Or were you referring to something more like the
ARM AArch64 (which limits predication to a select with optional
negate, invert, or increment, IIRC)?

Michael S

unread,

Aug 9, 2012, 10:25:57 AM8/9/12

to

On Aug 9, 4:36 pm, an...@mips.complang.tuwien.ac.at (Anton Ertl)
wrote:

> nedbrek <nedb...@yahoo.com> writes:
> >On 08/09/2012 04:44 AM, Michael S wrote:
> >> As to predicates, in brute force implementation predicate register is
> >> yet another input to track by OoO scheduler. Probably o.k. on the FP
> >> side, but problematic on integer side where you want scheduler loop to
> >> be as tight as possible.
> >> May be, they can build 2/3rd OoO dynamically renaming GPRs and FPRs
> >> but leaving predicate registers non-renamed? Hopefully, with 64
> >> predicate registers in hand and with legacy code scheduled for in-
> >> order, stalls due to false dependencies on predicate register will be
> >> very rare.
>
> >There were a number of solutions that we looked at:
> >1) Uop
> >2) 3rd input
> >3) Prediction
>
> >I liked #1, but it does have some performance impact on heavily
> >predicated code (which McKinley optimized code tended to be).
>
> It's not exactly clear what you mean by #1, but from context I suspect
> that you split a predicated instruction into an unpredicated
> instruction and a conditional move or somesuch, right?
>

The problem is - conditional move itself takes 3 inputs.

Anton Ertl

unread,

Aug 9, 2012, 11:18:03 AM8/9/12

to

nedbrek <ned...@yahoo.com> writes:
>The biggest problems were Itanium (rather than EPIC/VLIW) specific:
>1) Huge architected register file (144 int regs, 128 fp) (which drove us
>to investigate register file caching)

That's interesting, because the Pentium 4 had 128 physical integer
registers. Why is such a large architected register file a problem
and a physical register file not?

Anton Ertl

unread,

Aug 9, 2012, 11:22:01 AM8/9/12

to

Michael S <already...@yahoo.com> writes:
>On Aug 9, 4:36 pm, an...@mips.complang.tuwien.ac.at (Anton Ertl)
>wrote:

>> It's not exactly clear what you mean by #1, but from context I suspect
>> that you split a predicated instruction into an unpredicated
>> instruction and a conditional move or somesuch, right?
>>
>
>The problem is - conditional move itself takes 3 inputs.

Only two inputs from the integer register file, just like an
unconditional add. By contrast, a conditional add would take three
inputs from the integer registers: the two addends, plus the old value
of the destination register.

Hmm, makes me wonder how viable/useful OoO without register renaming
would be. That could help with predication and could also reduce
problems from the large architected register files.

Michael S

unread,

Aug 9, 2012, 11:43:34 AM8/9/12

to

On Aug 9, 6:18 pm, an...@mips.complang.tuwien.ac.at (Anton Ertl)
wrote:

Non-SMT OoO Itanium is probably not that interesting. And with 2-way
SMT you are at 288 architected registers. Plus, you want something
like 60-to-100 renaming registers. That brings you close to 400-entry
integer PRF.
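
The arithmetic behind that estimate, as a small sketch. The 144-register figure and the 60-to-100 rename range are taken from the posts in this thread; the function name `prf_entries` is made up for the example.

```python
ARCH_INT_REGS = 144       # architected integer registers per thread (figure from the thread)
SMT_THREADS = 2           # 2-way SMT
RENAME_RANGE = (60, 100)  # suggested range of extra renaming registers


def prf_entries(arch_regs, threads, rename_regs):
    """Physical register file size: every thread's architected state plus rename pool."""
    return arch_regs * threads + rename_regs


low, high = (prf_entries(ARCH_INT_REGS, SMT_THREADS, r) for r in RENAME_RANGE)
```

That gives 288 architected registers and a total of 348 to 388 entries, which is the "close to 400" above.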

nedbrek

unread,

Aug 9, 2012, 11:57:53 AM8/9/12

to

It requires extra bits in the data field...

Given: p1.op r1 = r2, r3
(that reads "if predicate one is true, perform the op of r2 and r3 into r1")

You emit:
1) op tmp = r2, r3
2) append tmp = tmp, p1
3) cmov r1 = r1, tmp
(where cmov drives the mux off the extra bit in tmp)

This folds the dependency graph for the old r1 and new r1.
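
A rough functional model of that three-uop sequence, to show it matches the predicated semantics while no uop takes more than two inputs. This is only a sketch: registers and predicates are plain dicts, the "extra bit in the data field" is modelled as a (value, bit) tuple, and the function names are hypothetical.

```python
def predicated_op(regs, preds, op, dst, src1, src2, p):
    """Architectural semantics of `p.op dst = src1, src2`."""
    if preds[p]:
        regs[dst] = op(regs[src1], regs[src2])


def cracked_op(regs, preds, op, dst, src1, src2, p):
    """The same instruction as three uops, each with at most two inputs."""
    # uop 1: op tmp = src1, src2 (unconditional, result goes to a temporary)
    tmp_value = op(regs[src1], regs[src2])
    # uop 2: append tmp = tmp, p (attach the predicate as the extra bit)
    tmp = (tmp_value, preds[p])
    # uop 3: cmov dst = dst, tmp (the mux is driven by the extra bit in tmp)
    value, take = tmp
    regs[dst] = value if take else regs[dst]


def same_result(pred_value):
    """Run both forms on identical state and compare the register files."""
    def add(a, b):
        return a + b
    a = {"r1": 7, "r2": 10, "r3": 20}
    b = dict(a)
    preds = {"p1": pred_value}
    predicated_op(a, preds, add, "r1", "r2", "r3", "p1")
    cracked_op(b, preds, add, "r1", "r2", "r3", "p1")
    return a == b, b["r1"]
```

With the predicate true r1 becomes 30 in both forms; with it false r1 keeps its old value of 7, which is why the cmov needs the old r1 as one of its two inputs.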

Ned

nm...@cam.ac.uk

unread,

Aug 9, 2012, 12:02:36 PM8/9/12

to

In article <2012Aug...@mips.complang.tuwien.ac.at>,

Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>nedbrek <ned...@yahoo.com> writes:
>
>>The biggest problems were Itanium (rather than EPIC/VLIW) specific:
>>1) Huge architected register file (144 int regs, 128 fp) (which drove us
>>to investigate register file caching)
>
>That's interesting, because the Pentium 4 had 128 physical integer
>registers. Why is such a large architected register file a problem
>and a physical register file not?

1) Interrupts. It HAD to push lazy saving to the limit which, as
I have posted, is very tricky to code and notorious for causing
the foulest sort of bug.

2) Swapping. Even a logical swap (e.g. timeslice) needs to save
the 4+ KB of register state, or confound swapping and lazy saving,
the thought of which is enough to make strong men blench.

Regards,
Nick Maclaren.

nedbrek

unread,

Aug 9, 2012, 12:15:39 PM8/9/12

to

On 08/09/2012 09:46 AM, Paul A. Clayton wrote:
> On Thursday, August 9, 2012 7:03:30 AM UTC-4, nedbrek wrote:
> [snip]
>> The biggest problems were Itanium (rather than EPIC/VLIW) specific:
>> 1) Huge architected register file (144 int regs, 128 fp) (which drove us
>> to investigate register file caching)
>
> I thought the huge RF was an EPIC feature to facilitate
> register prefetch and broad scheduling for a high degree
> of parallelism (which can require a larger number of live
> values). In addition, if one plans to use a register
> stack with a largish minimum number of 'extra' registers,
> it is probably very tempting to just allow direct access
> to those (minimum) extra registers.

On the FP side, it might make a certain amount of sense.

On the Int side, yes - you want a lot of regs for the register stack.
But, there is no reason to make them visible to instructions.

> The provision of 16 shadow registers seems at least very
> consistent with EPIC (just as the register stack avoids
> loads and stores for function calls, the shadow registers
> avoid such for interrupts and syscalls). (I like shadow
> registers, but I think their allocation should be more
> flexible and that [like MIPS MT ASE] they should be
> reusable as--or better [?], synonymous with--thread
> contexts.)

I'd have to hear from an OS guy if it actually helps them.

>> The other issues (and our recommended solutions) were:
>> 1) RSE - uops
>
> I wish the RSE was better optimized--not just asynchronous
> but also exploiting wider accesses, cache block allocation,
> and known access patterns. The new Itanium 9500 does add
> 32 registers to the register stack--as suggested by at
> least one paper--to reduce spills and fills. Such might
> be a step towards support for asynchronous save/restore
> since such could be used as a buffer.

There is a control bit to make the RSE operate async. We had enough
problems that we didn't dig into it :) RSE loads and stores seemed to
do pretty well in the L1/L2.

>> 2) ALAT - deprecate (ld.a and ld.c become regular loads, chk.a always fails)
>
> I wonder if a predictor would be useful for guessing the
> nature of the speculation (e.g., whether the load result
> is speculatively used) and the probability of
> misspeculation. Having chk.a always fail bothers me.

The binaries we had hardly used it (and they were super-optimized).

>> 3) Predicated instructions - uop
>
> I would probably wish for prediction with low-confidence
> predictions falling back to predication (with perhaps
> medium-low confidence biasing operation). What is four
> times the complexity in an already nearly hopelessly
> complex project? :-\

It could make sense in a short pipe. We were looking at a deep pipe. I
prefer data dependencies to control dependencies.

> [BEGIN RANT]
> I am annoyed that the Itanium 9500 is considered 12
> wide issue when one of the issue ports is for nops--
> and up to 12 nops can be issued per cycle (unless
> there is some other constraint, this would seem to
> imply that one could issue 23 instructions per cycle;
> claiming 23-wide issue--even if possible--would
> probably have been recognized as somehow deceptive
> by outside observers).
>
> [END RANT]

Nops issued in Merced and McKinley. I remember a slide set which had
"Beware the NOP.f" (which was ironic, because Merced code loved nop.f)

>> We also had some architecture requests:
>> 1) Deprecate stop bits
>> 2) Add GGG (easy) or even GGGG (hard, could remove predicate bits)
>> bundle template
>
> G?

Generic. The templates had letters MIBFL to indicate the type of
syllables (instructions) encoded within the bundle.

One of the biggest causes of nops was the lack of templates. There was
MII and MMI but nothing starting with I or ending with M. Finding the
right set of stop bits complicated this problem (GGG would not require
stop bits).
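As a rough illustration of how the limited template set produces nops, here is a minimal Python sketch. It assumes a greedy in-order packer, ignores dependencies and stop bits entirely, and leaves out the MLX template because its L+X pair spans two slots. The `pack` helper and the `G` (generic) slot handling are hypothetical, made up only for this example.

```python
# Three-slot templates only; MLX is omitted (assumption for this sketch).
TEMPLATES = ["MII", "MMI", "MFI", "MMF", "MIB", "MBB", "BBB", "MMB", "MFB"]


def pack(stream, templates):
    """Greedily pack an in-order stream of unit types (M, I, F, B) into bundles.

    A slot takes the next instruction if the types match (a 'G' slot takes
    anything); otherwise the slot is padded with a nop.
    Returns (bundles, nop_count).
    """
    bundles, nops, pos = [], 0, 0
    while pos < len(stream):
        best_used, best_slots = 0, None
        for template in templates:
            p, slots = pos, []
            for slot in template:
                if p < len(stream) and slot in ("G", stream[p]):
                    slots.append(stream[p])
                    p += 1
                else:
                    slots.append("nop")
            # Keep the first template that consumes the most instructions.
            if p - pos > best_used:
                best_used, best_slots = p - pos, slots
        if best_slots is None:
            raise ValueError(f"no template can place {stream[pos]!r}")
        bundles.append(best_slots)
        nops += best_slots.count("nop")
        pos += best_used
    return bundles, nops


real_bundles, real_nops = pack("IIIMMM", TEMPLATES)
generic_bundles, generic_nops = pack("IIIMMM", ["GGG"])
print(len(real_bundles), real_nops)        # 4 6
print(len(generic_bundles), generic_nops)  # 2 0
```

With nothing starting with I or ending with M, three I ops followed by three M ops need four bundles and six nops in this toy model, while a generic GGG template would fit them in two bundles with none. A real compiler can reorder independent ops, so the actual penalty is smaller than this worst-case ordering suggests.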

>> 4) Reduce amount of predication (save it for really hard branches)
>
> Wouldn't this be a compiler feature rather than an Architecture
> feature? Or were you referring to something more like the
> ARM AArch64 (which limits predication to a select with optional
> negate, invert, or increment, IIRC)?

Yes, dealing with the architecture committee usually overlapped with
dealing with the compiler people :)

However, the GGGG proposal would have required reducing the number of
predicate registers accessible by the generic instructions.

Ned

Tim McCaffrey

unread,

Aug 9, 2012, 12:17:57 PM8/9/12

to

In article <k004pr$8aa$1...@needham.csi.cam.ac.uk>, nm...@cam.ac.uk says...
>
[snip]

>As I have posted before, a far more interesting design would be
>one that concentrated on the ISA providing more information to
>the hardware, to allow it to do much more in the way of OoO data
>access, and preferably to minimise the amount of data access.
>Now, this WAS something where the IA64 had some useful features,
>but the most useful of all was dropped early on, apparently
>because some of the involved parties couldn't make it work.
>
>

Of course, you would need a programming language that allowed the programmer
to communicate that information in the first place...

- Tim

nedbrek

unread,

Aug 9, 2012, 12:20:30 PM8/9/12

to

On 08/09/2012 11:18 AM, Anton Ertl wrote:
> nedbrek <ned...@yahoo.com> writes:
>> The biggest problems were Itanium (rather than EPIC/VLIW) specific:
>> 1) Huge architected register file (144 int regs, 128 fp) (which drove us
>> to investigate register file caching)
>
> That's interesting, because the Pentium 4 had 128 physical integer
> registers. Why is such a large architected register file a problem
> and a physical register file not?
>

It's because of the renamer.

The renamer is basically a massively ported array:
1) 1 read per source arch reg
2) 1 write per instruction dispatched (assuming single dest uops)
3) 1 entry per architected reg

For a 6 wide machine with 1 dest and 2 src regs, you have 12 read ports
and 6 write ports.

On x86, you have 16 (or so) entries - not too bad, but a big power
hungry chunk of hardware.

On Itanium, you have 9x (144) that...
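The arithmetic above can be written out as a small sketch. The function name `renamer_shape` is made up for illustration; the figures (6-wide, 2 sources, 1 destination, 16 vs. 144 entries) are the ones quoted in this post.

```python
def renamer_shape(width, srcs_per_op, dests_per_op, arch_regs):
    """Rough size of a rename map: ports scale with width, entries with arch regs."""
    return {
        "read_ports": width * srcs_per_op,    # one read per source arch reg
        "write_ports": width * dests_per_op,  # one write per dispatched op
        "entries": arch_regs,                 # one entry per architected reg
    }


x86 = renamer_shape(width=6, srcs_per_op=2, dests_per_op=1, arch_regs=16)
ia64 = renamer_shape(width=6, srcs_per_op=2, dests_per_op=1, arch_regs=144)

print(x86)   # {'read_ports': 12, 'write_ports': 6, 'entries': 16}
print(ia64["entries"] // x86["entries"])  # 9
```

The port counts are identical in both cases; what grows is the number of entries that every one of those 18 ports has to reach.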

Ned

nm...@cam.ac.uk

unread,

Aug 9, 2012, 12:50:15 PM8/9/12

to

In article <k00nrl$is2$1...@USTR-NEWS.TR.UNISYS.COM>,

Tim McCaffrey <timca...@aol.com> wrote:
>>
>>As I have posted before, a far more interesting design would be
>>one that concentrated on the ISA providing more information to
>>the hardware, to allow it to do much more in the way of OoO data
>>access, and preferably to minimise the amount of data access.
>>Now, this WAS something where the IA64 had some useful features,
>>but the most useful of all was dropped early on, apparently
>>because some of the involved parties couldn't make it work.
>
>Of course, you would need a programming language that allowed the programmer
>to communicate that information in the first place...

Right. Modern Fortran makes a worthy start, but words fail me
about the usual culprits ....

Regards,
Nick Maclaren.

Paul A. Clayton

unread,

Aug 9, 2012, 1:04:24 PM8/9/12

to

On Thursday, August 9, 2012 12:15:39 PM UTC-4, nedbrek wrote:
[snip]

> On the Int side, yes - you want a lot of regs for the register stack.
> But, there is no reason to make them visible to instructions.

No reason? While there are few Int side operations with
latency greater than one cycle, I would have guessed that
compiler people would prefer more registers. If the
physical registers must be present anyway and code size
does not matter much (and OoO does not make sense--one of
the motivations behind EPIC), then exposing such registers
seems quite reasonable.

[snip]

> There is a control bit to make the RSE operate async. We had enough
> problems that we didn't dig into it :)

Yes, it is an architectural feature that has never been
supported.

> RSE loads and stores seemed to
> do pretty well in the L1/L2.

Contaminating the (already small) L1 cache seems
inappropriate. Saves and restores also provide a
substantial opportunity for optimization because of
the known access pattern.

[snip ld.a deprecation]

> The binaries we had hardly used it (and they were super-optimized).

That fits with your much earlier comment that the suggestion
of removal did not receive a lot of push-back.

Speculative loads seemed interesting. (Were they only
intended for handling unlikely aliasing or were they also
used for hoisting above branches?)

[snip prediction + predication handling of predication]

> It could make sense in a short pipe. We were looking at a deep pipe. I
> prefer data dependencies to control dependencies.

I suspect the length would depend on the accuracy of the
predictor (both in the guess and in the confidence estimate).
Erring on the side of low confidence (executing as
predication) would tend to be better for longer pipelines.

[snip rant on counting nops in issue width]

> Nops issued in Merced and McKinley. I remember a slide set which had
> "Beware the NOP.f" (which was ironic, because Merced code loved nop.f)

This kind of makes sense for an old-style VLIW mindset
where minimizing control is considered extremely important,
but in the absence of nops the peak issue rate for
McKinley would be the same (which is not the case for
Poulson/Itanium 9500). Claiming 12-wide issue for the
Itanium 9500 seems somewhat misleading.

nedbrek

unread,

Aug 9, 2012, 7:07:11 PM8/9/12

to

On 08/09/2012 01:04 PM, Paul A. Clayton wrote:
> On Thursday, August 9, 2012 12:15:39 PM UTC-4, nedbrek wrote:
>
> Speculative loads seemed interesting. (Were they only
> intended for handling unlikely aliasing or were they also
> used for hoisting above branches?)
>

For hoisting above branches, there was "ld.s". This would set the NAT
bit if it failed. It was followed by a chk.s after the branch, which
was effectively a branch driven by the NAT bit.

It could be combined with data speculation with "ld.sa" followed by chk.a

ld.s was probably one of the better ideas, although to really gain over
prefetching it needs to pull along some dependent instructions which
lead to another memory access (kick in some memory level parallelism).
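A minimal software model of the ld.s / chk.s pairing described above, assuming a dictionary as "memory" and a Python boolean standing in for the NaT bit. The names `ld_s`, `ld` and `load_if_valid` are hypothetical helpers for this sketch, not real instructions or APIs.

```python
MEMORY = {0x1000: 42}  # toy memory: only this address is mapped


def ld_s(addr):
    """Speculative load: a would-be fault is deferred by setting NaT instead."""
    if addr in MEMORY:
        return MEMORY[addr], False
    return 0, True  # (dummy value, NaT set)


def ld(addr):
    """Ordinary load: faults immediately on an unmapped address."""
    if addr not in MEMORY:
        raise MemoryError(f"fault at {addr:#x}")
    return MEMORY[addr]


def load_if_valid(ptr):
    value, nat = ld_s(ptr)  # load hoisted above the branch
    if ptr == 0:            # the original branch
        return None         # speculated value unused, NaT never checked
    if nat:                 # chk.s: effectively a branch on the NaT bit
        value = ld(ptr)     # recovery code redoes the load non-speculatively
    return value


print(load_if_valid(0x1000))  # 42
print(load_if_valid(0))       # None
```

The null-pointer case shows the point of deferring the fault: the hoisted load of address 0 "fails" quietly, and since the branch goes the other way nobody ever looks at the NaT bit.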

Ned

Paul A. Clayton

unread,

Aug 9, 2012, 7:19:37 PM8/9/12

to

On Thursday, August 9, 2012 12:20:30 PM UTC-4, nedbrek wrote:
[snip problem of huge Architected RF]

> It's because of the renamer.
>
> The renamer is basically a massively ported array:
> 1) 1 read per source arch reg
> 2) 1 write per instruction dispatched (assuming single dest uops)
> 3) 1 entry per architected reg
>
> For a 6 wide machine with 1 dest and 2 src regs, you have 12 read ports
> and 6 write ports.

Quibble: It seems that one could get away with less wide
renaming, especially if one is targeting commercial server
workloads (which have notoriously low ILP).

> On x86, you have 16 (or so) entries - not too bad, but a big power
> hungry chunk of hardware.
>
> On Itanium, you have 9x (144) that...

Well, the shadow registers are generally not accessed
at the same time as the corresponding ordinary registers,
so a banking optimization could be applied.

(It is a bit surprising that Itanium did not provide
suggested banking optimizations. Other than the
_requirement_ that the FP load (store?) pair instructions
target post rotation registers whose names differ in the
least significant bit--which banking-friendly requirement
does not seem to have been exploited--, I do not recall
any encouragement to facilitate banking optimizations.)

For a shallow OoO window, one could use a CAM or cache-
like mechanism (something like the early OoO mechanism in
PowerPC). I am not certain if it would be helpful to use
a scoreboard-like mechanism where the "scoreboard" filters
CAM accesses, perhaps such a scoreboard could also be
used to indicate a version number (where each cell might
indicate "committed to architected register", "use CAM/cache
to find name", "version 1", or "version 2"), supporting
something like virtual physical registers.

Predecoding could presumably simplify the renaming
hardware somewhat.

There are presumably other methods of handling renaming
in a large namespace. It might be possible to exploit
statistical access patterns in Itanium code to simplify
renaming while maintaining adequate performance.

(I do not mean to suggest that a large flat-ish RF does
not have significant issues for an OoO implementation--
though I think one could increase register-like state
without making renaming insanely difficult--, but it
seems that cleverness could reduce the performance/power
pain even with a flat-ish RF.)

BGB

unread,

Aug 9, 2012, 11:01:44 PM8/9/12

to

I remember many years ago:
Intel was promoting Itanium;
I knew a teacher in one of my classes who thought it was going to be big;
I had heard some of AMD working on x86-64 as well;
I figured x86-64 was going to beat the crap out of IA-64;
I was right...

it was not that I was thinking so much about raw performance (although I
did imagine generating code for the architecture would be a pain, so I
had some hesitation here as well), but rather that a full scale jump of
the sort Intel was imagining (and with limited backwards compatibility)
would likely have been absurdly expensive overall, and that an
incremental transition (due to backwards compatibility) would be much
less costly, and much more likely to succeed.

observe that with 64 bit chips, it has taken nearly a decade thus far to
make the transition, and still it is underway (there is general 64-bit
OS support, but large amounts of 32-bit software is still in use, and
much new software is still 32-bit as well). so, 32-bit probably still
has some life left.

Robert Wessel

unread,

Aug 9, 2012, 11:14:23 PM8/9/12

to

On Thu, 09 Aug 2012 22:01:44 -0500, BGB <cr8...@hotmail.com> wrote:

>observe that with 64 bit chips, it has taken nearly a decade thus far to
>make the transition, and still it is underway (there is general 64-bit
>OS support, but large amounts of 32-bit software is still in use, and
>much new software is still 32-bit as well). so, 32-bit probably still
>has some life left.

Well, all the big 64-bit OS's run 32-bit applications with little or
no handicap, and for most applications compiling for 64 bits has
little or no benefit. On most platforms rebuilding most applications
for 64-bit mode makes them a bit slower because the size and memory
bandwidth requirements increase - here x86-64 is better than most
other platforms, since register pressure is so high in 32-bit mode the
extra registers can be a good sized win. So there's little point in
building most applications 64-bit.

BGB

unread,

Aug 10, 2012, 1:12:48 AM8/10/12

to

and all the more reason that people would be less reasonably motivated
to jump ship to an entirely new/different architecture.

notable drawbacks with little real gain are not likely a market win.

nm...@cam.ac.uk

unread,

Aug 10, 2012, 3:11:36 AM8/10/12

to

In article <jru828pfh1aml4r4p...@4ax.com>,

Actually, there are two major benefits:

One is that the problem of how large to make the various areas
largely goes away, which vastly reduces the amount of system hacking
needed to get perfectly ordinary applications to work. Yes, this
affects only those that use significant amounts of memory (say,
256+ MB).

The second is that it means that there is a VASTLY higher chance
of a bad index, uninitialised pointer or similar error being trapped,
rather than just trashing some other part of the application. That
can save a LOT of effort and increase RAS considerably.
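To put a rough number on the second point, here is a back-of-the-envelope sketch. It assumes a wild pointer is uniformly random over the address space (real bugs are not that tidy), 1 GiB of mapped memory, and a 48-bit virtual address space for the 64-bit case; all three are assumptions chosen for illustration.

```python
GIB = 2**30


def hit_probability(mapped_bytes, address_bits):
    """Chance that a uniformly random address lands in mapped memory."""
    return mapped_bytes / 2**address_bits


p32 = hit_probability(GIB, 32)
p48 = hit_probability(GIB, 48)

print(p32)        # 0.25
print(p48)        # 3.814697265625e-06
print(p32 / p48)  # 65536.0
```

Under these assumptions a stray pointer silently lands in valid memory one time in four on a 32-bit system, versus roughly four times in a million with 48 bits of address space; the rest of the time it traps.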

Regards,
Nick Maclaren.

j...@cix.compulink.co.uk

unread,

Aug 10, 2012, 4:58:31 AM8/10/12

to

In article <k01to6$kf7$1...@news.albasani.net>, cr8...@hotmail.com (BGB)
wrote:

> I knew a teacher in one of my classes who thought it was going to be
> big;

My ex-boss also thought it was going to be big, on the grounds that
Intel's producer power was such as to enable them to force it. Didn't
work: producer power can overcome some lack of appeal, but this was
clearly beyond it.

> observe that with 64 bit chips, it has taken nearly a decade thus far
> to make the transition, and still it is underway (there is general
> 64-bit OS support, but large amounts of 32-bit software is still in
> use, and much new software is still 32-bit as well). so, 32-bit
> probably still has some life left.

In the mathematical modelling I work on, 64-bit started growing in the
customer base with DEC Alpha in the mid-nineties. Windows and Mac have
been the slowest platforms to move, for reasons others have explained
well.

Stephen Sprunk

unread,

Aug 10, 2012, 9:50:22 AM8/10/12

to

On 10-Aug-12 03:58, j...@cix.compulink.co.uk wrote:
> In article <k01to6$kf7$1...@news.albasani.net>, cr8...@hotmail.com (BGB)
> wrote:
>> I knew a teacher in one of my classes who thought it was going to be
>> big;
>
> My ex-boss also thought it was going to be big, on the grounds that
> Intel's producer power was such as to enable them to force it. Didn't
> work: producer power can overcome some lack of appeal, but this was
> clearly beyond it.

There's a saying, "with sufficient thrust, even pigs can fly." That is
often applied to x86 to explain the market success of an ISA that is
widely considered "inferior" due to the sheer amount of money that Intel
(and AMD) can apply.

However, even Intel and HP do not appear to be capable of applying
enough thrust to Itanic to get that pig to fly--and that should tell us
all something about exactly how bad an idea it was.

S

--
Stephen Sprunk "God does not play dice." --Albert Einstein
CCIE #3723 "God is an inveterate gambler, and He throws the
K5SSS dice at every possible opportunity." --Stephen Hawking

Stephen Sprunk

unread,

Aug 10, 2012, 9:52:57 AM8/10/12

to

It's too bad that x64's ABI designers didn't develop a code model that
allows use of the extra registers, etc. without the cost of 64-bit pointers.

Mikael Pettersson

unread,

Aug 10, 2012, 10:33:34 AM8/10/12

to

In article <k033np$5lb$2...@dont-email.me>,

Stephen Sprunk <ste...@sprunk.org> wrote:
>On 09-Aug-12 22:14, Robert Wessel wrote:
>> On Thu, 09 Aug 2012 22:01:44 -0500, BGB <cr8...@hotmail.com> wrote:
>>> observe that with 64 bit chips, it has taken nearly a decade thus far to
>>> make the transition, and still it is underway (there is general 64-bit
>>> OS support, but large amounts of 32-bit software is still in use, and
>>> much new software is still 32-bit as well). so, 32-bit probably still
>>> has some life left.
>>
>> Well, all the big 64-bit OS's run 32-bit applications with little or
>> no handicap, and for most applications compiling for 64 bits has
>> little or no benefit. On most platforms rebuilding most applications
>> for 64-bit mode makes them a bit slower because the size and memory
>> bandwidth requirements increase - here x86-64 is better than most
>> other platforms, since register pressure is so high in 32-bit mode the
>> extra registers can be a good sized win. So there's little point in
>> building most applications 64-bit.
>
>It's too bad that x64's ABI designers didn't develop a code model that
>allows use of the extra registers, etc. without the cost of 64-bit pointers.

They didn't originally, but since then the "x32" ABI has been developed.
There's support for it in recent versions of glibc, gcc, binutils, and the
Linux kernel. I don't know of any Linux distro using it yet, however.

I had hoped that ARM would have done the same for ARMv8, i.e. support
32-bit via a variant of AArch64, but they elected to specify optional
hardware support for user-mode ARMv7. I don't think that's optimal.

Andy (Super) Glew

unread,

Aug 10, 2012, 10:58:11 AM8/10/12

to Mikael Pettersson

On 8/10/2012 7:33 AM, Mikael Pettersson wrote:
> In article <k033np$5lb$2...@dont-email.me>,
> Stephen Sprunk <ste...@sprunk.org> wrote:

>> It's too bad that x64's ABI designers didn't develop a code model that
>> allows use of the extra registers, etc. without the cost of 64-bit pointers.
>
> They didn't originally, but since then the "x32" ABI has been developed.
> There's support for it in recent versions of glibc, gcc, binutils, and the
> Linux kernel. I don't know of any Linux distro using it yet, however.

At one point in the development of 64 bit at Intel, that was a
fear shared by myself and other Intel architects: that AMD would
enable not just x86-64, which we called REX-64, but also REX-32,
giving access to the extra registers.

The math ran roughly like this:

* extra registers, 5-15% perf win, with some outliers much higher.

* performance loss for 64 bit pointers, typically 5-15%, often higher

* occasional big wins for 64 bit virtual memory.
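A quick sketch of how those ranges combine, assuming the two effects compose multiplicatively (an assumption made here for illustration; the post only gives the individual ranges). The function name `net_speedup` is made up.

```python
def net_speedup(register_gain, pointer_loss):
    """Relative performance vs. plain 32-bit x86 (1.0 = no change)."""
    return (1 + register_gain) * (1 - pointer_loss)


# REX-64: extra registers, but paying for 64-bit pointers.
rex64_worst = net_speedup(0.05, 0.15)
rex64_best = net_speedup(0.15, 0.05)

# REX-32: extra registers, 32-bit pointers kept.
rex32_worst = net_speedup(0.05, 0.0)
rex32_best = net_speedup(0.15, 0.0)

print(f"REX-64: {rex64_worst:.2f}x to {rex64_best:.2f}x")  # REX-64: 0.89x to 1.09x
print(f"REX-32: {rex32_worst:.2f}x to {rex32_best:.2f}x")  # REX-32: 1.05x to 1.15x
```

On these numbers the 64-bit mode is roughly a wash for typical code, while a 32-bit mode with the extra registers would have been a consistent if modest win.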

Plus, one of the biggest wins was for running large memory 32 bit
apps on a 64 bit OS - allowing the app to use all 4G of its
address space, without the typical reservation of half of the
virtual address space for the OS.

---

However, it turns out that the OS folk at Microsoft, but
especially on Linux and other Unixes, did not want to have to
deal with such an intermediate mode.

They may well have been right: perhaps supporting REX-32 as well
as legacy x86 and REX-64 would have delayed the release of OSes that
supported REX-64 by a few years.

But, if that had not been a concern, I do highly suspect that if
AMD had gotten REX-32 available at a time when Intel had no
equivalent EMT-64, i.e. when Intel had only Itanium to compete,
for extra registers and/or 64 bit addresses - that AMD might have
been able to get a more important market position. Beating Intel
on 32 bit benchmarks, admittedly 32 bit benchmarks recompiled to
use REX32, without the lossage of 64 bit pointers, would have
helped their marketing position.

Andy (Super) Glew

unread,

Aug 10, 2012, 11:25:39 AM8/10/12

to

On 8/8/2012 7:32 PM, Mark Thorson wrote:
> EricP wrote:

> You mean hybridizing EPIC and OoO? Wouldn't that
> be like hybridizing AC and DC?
>

To religious zealots, perhaps.

I wanted to reply to some other post something like "One of the things I
regret about Itanium's lack of market success is that it killed off
research in VLIW for many years. It also killed off OOO research." Here
I am not just talking about combining OOO and VLIW, but also "purist"
VLIW.

I think there were many good and interesting ideas in Itanium.

For example:

==> Register renaming *is* a problem for very wide OOO machines. A
moderate degree of explicit parallelism in the instruction set - 2-wide
or 4-wide - can reduce such hardware costs.

This may not be worth thinking about when you have a trace cache that
contains pre-renamed instructions - instead of renaming all
instructions, you only need to rename live-ins and live-outs that go
outside the block - but I think that it is always better to do something
once and for all in the compiler and jitter, than to have to do and redo
it every time the trace cache misses. Can't help but save power, so
long as the extra bits needed don't waste more power.

==> Predication - although branch predictors are good, they are not
always accurate. There are many things one can do to combine
predication with prediction, so that you do not always waste time
fetching and executing instructions that your predictor accurately predicts
will not need to be executed.

==> Rotating registers - this is a very nice way of creating efficient
software pipelined loops, without code size explosion. I *often* find
that I can come much closer to saturating the machine if I heavily
software pipeline, but the code size cost of the loop prologue and
epilogue make it not pay off, and/or the costs of the branches that try
to predict which version of a loop you need to exit, and/or repair when
you exit early.

==> Sheer number of registers. The sheer number of registers can help
some codes. But the sheer number of registers can be painful for OOO to
deal with. I often wonder about following in the footsteps of the Cray-1, and
creating 2 levels of register file: a first, smaller, RF that is easy
for OOO to deal with, and a larger L2 RF that may be less aggressively
renamed.
In some ways, some OOO hardware is already renaming into two levels
of physical register file. This might just be exposing such to compilers.

==> non-32 bit sized instructions. Other ISAs are going that way now.
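The rotating-register point in the list above can be sketched in a few lines. This is a toy model under stated assumptions: a four-entry rotating file, a two-stage pipeline, and plain `if` tests standing in for the stage predicates. `RotatingFile` and `pipelined_increment` are hypothetical names for this illustration.

```python
class RotatingFile:
    """Toy rotating register file: logical names are offset by a rotating base."""

    def __init__(self, size):
        self.phys = [0] * size
        self.rrb = 0  # rotating register base

    def _index(self, logical):
        return (logical + self.rrb) % len(self.phys)

    def read(self, logical):
        return self.phys[self._index(logical)]

    def write(self, logical, value):
        self.phys[self._index(logical)] = value

    def rotate(self):
        # After rotation, the value written as logical r reads back as r + 1.
        self.rrb = (self.rrb - 1) % len(self.phys)


def pipelined_increment(xs):
    """Two-stage software pipeline: stage 1 loads x[i], stage 2 adds 1 to x[i-1]."""
    regs = RotatingFile(4)
    out = []
    n = len(xs)
    for i in range(n + 1):      # one extra trip drains the pipeline
        if i >= 1:              # stage 2 predicate: there is a previous load
            out.append(regs.read(1) + 1)
        if i < n:               # stage 1 predicate: there is still input
            regs.write(0, xs[i])
        regs.rotate()           # done by the loop branch in hardware
    return out


print(pipelined_increment([10, 20, 30]))  # [11, 21, 31]
```

The loop body is a single copy: the same code serves as prologue, kernel and epilogue, with the stage tests switching work on and off. That is the code size saving being described.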

--

Mark Thorson, who seems to diss hybridizing EPIC and OOO, elsewhere
talks about adding a level of indirection to the instruction set. This
is something I have often thought about.

The GPUs have shown that for certain applications large numbers of
registers - 128, 256, 8 bit fields - and really wide instructions - 64
bit, sometimes wider - can really help. Often with lots of specialized
widget instructions that *could* be replaced by simple sequences of
instructions, but where simple combinatorial hardware can do in one
instruction cycle what would take

But for many applications the code size increase of using such large
instructions everywhere is unacceptable.

I wonder if we should not have a compact 32 bit wide instruction mode
(or even 16/32 bit mode) for use in most places. But for tight loops
expose a wider VLIW inspired instruction set - many registers, wide
instructions, rotating registers, predication.

We can imagine the compact instructions as being a specific subset of
the wide instructions.

We can imagine switching microarchitectures from OOO to VLIW in such
tight loops.

Heck, we can imagine using the level of indirection everywhere
- so instead of having 64 or 128 bit instructions in the tight loop
fill up a loop buffer, instead map smaller instructions, 16 or 32, to
the wider canonical 64 or wider instructions.

Andy (Super) Glew

unread,

Aug 10, 2012, 11:29:49 AM8/10/12

to

On 8/9/2012 1:44 AM, Michael S wrote:
> On Aug 9, 9:26 am, Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>> Anton Ertl wrote:
>>> Mark Thorson <nos...@sonic.net> writes:
>>>> You mean hybridizing EPIC and OoO? Wouldn't that
>>>> be like hybridizing AC and DC?
>>
>>> EPIC is an architecture style, while OoO is a microarchitectural
>>> feature. An OoO implementation of an EPIC architecture is certainly
>>> possible, unlike a hybrid of AC and DC. The question is how big, if
>>> any, the benefits are over an OoO implementation of a more
>>> conventional architecture like AMD64.
>>
>> In an OoO Itanium the sw preload/check sequences would effectively
>> become a regular prefetch/load pair.
>>
>> Besides this I don't remember any other specific hw features to work
>> around the missing OoO hardware?
>>
>> Predicate masks, rotating register sets etc seems like they could be
>> implemented just as well in an OoO microarchitecture...
>>
>> Terje
>> --
>> - <Terje.Mathisen at tmsw.no>
>> "almost all programming can be viewed as an exercise in caching"
>
> The fact that absolute register names used in instruction are not
> known immediately after decoding could be problematic.

Intel legacy x87 floating point has exactly that same problem.

> Another problematic Itanium feature is very widely used post-increment
> addressing. That's particularly problematic for integer loads, because
> integer load instruction generates 2 outputs. Your typical OoO
> implementation doesn't like 2 outputs. Normal solution for such case is
> either cracking offending instruction in two uOPs or issuing it twice
> simultaneously through a couple of execution ports. Both solutions
> are fine when, as in case of Power, instructions generating 2 GPR
> outputs are rare. But on Itanium integer load with post-increment is
> not rare at all.
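For readers unfamiliar with cracking, here is a minimal sketch of the first option mentioned in the quoted text: splitting a post-increment load into two uops that each write one register. The `Uop` record and the `crack_post_increment_load` helper are hypothetical, invented for this example, and do not describe any actual Intel design.

```python
from collections import namedtuple

# One destination per uop, which is what a typical OoO scheduler prefers.
Uop = namedtuple("Uop", "op dest srcs imm")


def crack_post_increment_load(dest, base, increment):
    """Split 'ld8 dest = [base], increment' into two single-destination uops."""
    return [
        Uop("load", dest, (base,), None),      # dest <- mem[base] (old base value)
        Uop("add", base, (base,), increment),  # base <- base + increment
    ]


uops = crack_post_increment_load("r1", "r2", 8)
for uop in uops:
    print(uop)
print(len(uops))  # 2
```

The cost is visible straight away: every such load takes two rename and issue slots instead of one, which matters when the instruction is as common as claimed.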

I used to be afraid of instructions with more than two inputs and one
output for OOO.
Now I am no longer afraid.

Andy (Super) Glew

unread,

Aug 10, 2012, 11:40:27 AM8/10/12

to

On 8/9/2012 8:18 AM, Anton Ertl wrote:
> nedbrek <ned...@yahoo.com> writes:
>> The biggest problems were Itanium (rather than EPIC/VLIW) specific:
>> 1) Huge architected register file (144 int regs, 128 fp) (which drove us
>> to investigate register file caching)
>
> That's interesting, because the Pentium 4 had 128 physical integer
> registers. Why is such a large architected register file a problem
> and a physical register file not?

Renaming.

Especially if you are thinking about OOO machines that use checkpoints
of the renamer map to do misprediction recovery.

You are forced to start looking at solutions where you don't necessarily
rename all architectural registers. Or where you have a limited number
of renames - two, four (i.e. 1 or 2, a few) bits in the renamer for most
registers, essentially limited associativity renaming, possibly with a
dynamically changing subset that are more fully renamed.
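To give a feel for the checkpoint cost, here is a rough sizing sketch. It assumes a checkpoint is a full copy of the rename map, one physical-register tag per architected register. The 16-entry/128-register case uses figures mentioned in this thread; the 256 physical registers for the Itanium case is purely an assumed number for illustration.

```python
def checkpoint_bits(arch_regs, phys_regs):
    """Bits in one full rename-map snapshot: one physical tag per arch register."""
    tag_bits = (phys_regs - 1).bit_length()  # bits needed to name a physical reg
    return arch_regs * tag_bits


small = checkpoint_bits(arch_regs=16, phys_regs=128)
large = checkpoint_bits(arch_regs=144, phys_regs=256)  # 256 is an assumption

print(small)  # 112
print(large)  # 1152
```

Roughly ten times the state per checkpoint, multiplied by however many in-flight branches you want to be able to recover from, is what pushes you toward the partial-renaming schemes described above.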

It would be unfortunate if you start having to think about copying
register values from a ROB to an RRF, P6-style, rather than just leaving
them in a single place, PRF style, Sandybridge style. Copying ROB->RRF
wastes power.

(Unfortunate, but you don't need to go there.)

Andy (Super) Glew

unread,

Aug 10, 2012, 11:51:47 AM8/10/12

to

On 8/9/2012 2:08 AM, Michael S wrote:
> On Aug 9, 8:54 am, an...@mips.complang.tuwien.ac.at (Anton Ertl)
> wrote:
>> Mark Thorson <nos...@sonic.net> writes:
>>> You mean hybridizing EPIC and OoO? Wouldn't that
>>> be like hybridizing AC and DC?
>>
>> EPIC is an architecture style
>
> Architecture style or marketable name for VL-VLIW?

Many VLIW bigots are adamant that Itanium is not VLIW.

E.g. I once shared a car ride with some ex-Multiflow guys - not Colwell
and Papworth, others - who were adamant that Itanium was not a VLIW
because it did not expose pipeline latencies. I.e. their concept of
VLIW was that the compiler *knew* about instruction latencies, and
*knew* that it could use the old value of a register that had just seen
an instruction write to it, for the next N instructions. Whereas
Itanium required hardware to detect this and stall (so that you had a
chance of (a) varying instruction latencies while still running existing
binaries, and (b) handling traps and exceptions transparently).

"Explicitly parallel" characterizes Itanium nicely, because in some ways
that is just what it is: a group of simple, RISC-like, instructions, of
variable number (up to the stop bit) in a PIG (Parallel Instruction Group).

Itanium does not expose all of the pipeline details. Pipeline registers.
Itanium has pipeline interlocks.
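The difference between an exposed pipeline and an interlocked one can be shown with a toy simulator. This is a sketch under heavy assumptions: one instruction issued per cycle, only two made-up ops (`li` with a given latency, `mov` reading one register), and a `run` function invented for this example.

```python
def run(program, initial, exposed):
    """Execute (op, dest, operand, latency) tuples, one issue per cycle.

    exposed=True:  reads see whatever is in the register right now.
    exposed=False: a read of a register with a write in flight stalls.
    """
    regs = dict(initial)
    pending = []  # in-flight writes: (ready_cycle, dest, value)
    cycle = 0

    def land():
        nonlocal pending
        for ready, dest, value in sorted(pending):
            if ready <= cycle:
                regs[dest] = value
        pending = [w for w in pending if w[0] > cycle]

    for op, dest, operand, latency in program:
        land()
        if op == "mov" and not exposed:
            # Interlock: wait until the source has no write in flight.
            while any(d == operand for _, d, _ in pending):
                cycle += 1
                land()
        value = operand if op == "li" else regs[operand]
        pending.append((cycle + latency, dest, value))
        cycle += 1

    while pending:  # drain remaining writes
        cycle += 1
        land()
    return regs


program = [("li", "r1", 5, 3), ("mov", "r2", "r1", 1)]
start = {"r1": 1, "r2": 0}
print(run(program, start, exposed=True))   # {'r1': 5, 'r2': 1}
print(run(program, start, exposed=False))  # {'r1': 5, 'r2': 5}
```

In the exposed model the `mov` legitimately picks up the old value of r1, so the result depends on the latency baked into the binary. With interlocks the answer is the same whatever the latency, which is what lets existing binaries run on a later implementation.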

I would like to reform the terminology, although that is unlikely:

LIW (Long instruction word) - "instructions" that have multiple
independently specifiable opcodes or subinstructions.

Usually LIW are EPIC, Explicitly Parallel.

But there are LIWs that are the opposite of EPIC, i.e. not explicitly parallel.
Some LIW proposals require instructions to be dependent on each other,
tree-wise or even linearly.

Exposing the pipeline is another, orthogonal, property.

BGB

unread,

Aug 10, 2012, 12:35:49 PM8/10/12

to

I would rather have liked a 32-bit REX.

it would have been nice, but probably would have been more costly, vs
making a more significant change to the ISA as they did in 64-bit mode
(unless the ISA changes were propagated back to 32-bits, but then it
isn't really "clearly" different from x32 or similar I guess...).

I guess another risk would be that it could have led to a window of
several years where the registers existed but the OS's wouldn't preserve
them, making them unsafe to use anyways.

this brings up a thought, I would rather have a "x32 cdecl" ABI, which
would basically be like x32, except still use cdecl and similar more
like 32-bit x86 (and also have it work on Windows...).
(I guess it could be done on current Windows by having explicit compiler
support and hacking WoW64, but this would be ugly).

I had an x86 interpreter which actually added the extended registers
(and optional 64-bit math) as an extension though (among other things),
namely a so-called "PREX" prefix. unlike the normal REX though, it was a
2 byte prefix (similar to the 2-byte forms of the VEX and XOP prefixes).
(so, code compiled with an unaware compiler would still work).

functionally, my assembler was also partly extended to support it,
namely spitting out a PREX if building for a "32-bit extended" mode, or
REX for 64-bit mode.

not that I expect it to actually catch on or anything though...

an alternate encoding is also possible though (for a 2-byte PREX), like
say interpreting another prefix as a "special" escape for a REX prefix.

say, for example, we have:
BRT REX ...
so:
3E 4x ...

since neither branch-prediction nor a segment override makes much sense
for a single byte inc/dec register instruction.

or, alternatively:
ADDRSZ REX ...
67 4x ...

with similar reasoning (can't have a 16-bit memory access, for a register).
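a rough sketch of how a decoder might recognise the escaped form described above, in Python for brevity. the escape-prefix idea is only the hypothetical scheme from this post, and `decode_escaped_rex` is a made-up helper; the only real fact relied on is the standard REX layout, 0100WRXB.

```python
# Candidate escape bytes from the discussion above (hypothetical use).
ESCAPES = {0x3E: "branch-hint/DS prefix", 0x67: "address-size prefix"}


def decode_escaped_rex(code):
    """Return REX fields if code starts with <escape> <0x40..0x4F>, else None."""
    if len(code) < 2 or code[0] not in ESCAPES:
        return None
    rex = code[1]
    if rex & 0xF0 != 0x40:  # REX bytes are 0100WRXB
        return None
    return {
        "escape": ESCAPES[code[0]],
        "W": (rex >> 3) & 1,  # 64-bit operand size
        "R": (rex >> 2) & 1,  # extends ModRM.reg
        "X": (rex >> 1) & 1,  # extends SIB.index
        "B": rex & 1,         # extends ModRM.rm / SIB.base
    }


print(decode_escaped_rex(bytes([0x3E, 0x44, 0x01, 0xC8])))
# {'escape': 'branch-hint/DS prefix', 'W': 0, 'R': 1, 'X': 0, 'B': 0}
print(decode_escaped_rex(bytes([0x40])))        # None
print(decode_escaped_rex(bytes([0x3E, 0x90])))  # None
```

an unaware 32-bit decoder would still see a prefixed inc/dec here, which is the compatibility trade-off the encoding relies on.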

or such...

BGB

unread,

Aug 10, 2012, 12:43:39 PM8/10/12

to

On 8/10/2012 8:50 AM, Stephen Sprunk wrote:
> On 10-Aug-12 03:58, j...@cix.compulink.co.uk wrote:
>> In article <k01to6$kf7$1...@news.albasani.net>, cr8...@hotmail.com (BGB)
>> wrote:
>>> I knew a teacher in one of my classes who thought it was going to be
>>> big;
>>
>> My ex-boss also thought it was going to be big, on the grounds that
>> Intel's producer power was such as to enable them to force it. Didn't
>> work: producer power can overcome some lack of appeal, but this was
>> clearly beyond it.
>
> There's a saying, "with sufficient thrust, even pigs can fly." That is
> often applied to x86 to explain the market success of an ISA that is
> widely considered "inferior" due to the sheer amount of money that Intel
> (and AMD) can apply.
>
> However, even Intel and HP do not appear to be capable of applying
> enough thrust to Itanic to get that pig to fly--and that should tell us
> all something about exactly how bad an idea it was.
>

in many ways I like x86 more, and find it more cleanly designed, than,
say, modern forms of ARM (especially the Thumb/Thumb2/ThumbEE funkiness).

if the legacy and OS-level functionality is ignored, it isn't really all
that bad in comparison.

it would be a little better though if there were more x86 chips on the
market though, say, if a larger portion of the x86 ISA could be used
without worries over patents or similar, ...

Anton Ertl

unread,

Aug 10, 2012, 1:11:28 PM8/10/12

to

Stephen Sprunk <ste...@sprunk.org> writes:
>There's a saying, "with sufficient thrust, even pigs can fly." That is
>often applied to x86 to explain the market success of an ISA that is
>widely considered "inferior" due to the sheer amount of money that Intel
>(and AMD) can apply.
>
>However, even Intel and HP do not appear to be capable of applying
>enough thrust to Itanic to get that pig to fly--and that should tell us
>all something about exactly how bad an idea it was.

For "performance" rather than "market success", that is also true for
IA-32, but it took 17 years after the release of the 8086 and 10 years
after the 80386 until the Pentium Pro started to take off (and the
Athlon and Pentium 4 really flew several years later).

It seems to me that that much thrust has not been applied to IA-64
yet. There have been two designs, Merced and McKinley, and since then
everything seems to be derived from McKinley (not sure about the
upcoming Poulson); maybe McKinley is the best possible IA-64
implementation design, but maybe they just wanted to cut their losses
by limiting the design effort they apply.

Whether getting the pig to fly performancewise would translate to
flight in terms of market success at this point is questionable, which
may be another reason not to invest too much into thrust.

MitchAlsup

Aug 10, 2012, 1:47:53 PM

to

On Friday, August 10, 2012 8:52:57 AM UTC-5, Stephen Sprunk wrote:
> It's too bad that x64's ABI designers didn't develop a code model that allows use of the extra registers, etc. without the cost of 64-bit pointers.

We did, and the gates were there in the early Opterons to enable it. It was called REX-32 in house.

The cost to software would have been high (new compilers and environments), the benefits to applications rather low (a few percent), and the worst part was that it would have delayed the transition to 64 bits (the purported Holy Grail). So we, along with a push from MS, decided not to go there. It was discussed many times with vociferous arguments on both sides.

Mitch

Quadibloc

Aug 10, 2012, 2:24:20 PM

to

n...@cam.ac.uk wrote:

> Now, as usual, real life isn't as clear-cut as that, which is why
> realistic optimisation involves some rather messy collaborative
> hacking in both the compiler and hardware. But it is important
> to realise that they are approaching the issue from radically
> different directions.

Which is, of course, why Intel should have implemented the Itanium as
an out-of-order machine if they were serious about maximizing
performance.

John Savard

MitchAlsup

Aug 10, 2012, 1:55:24 PM

to an...@spam.comp-arch.net

On Friday, August 10, 2012 10:29:49 AM UTC-5, Andy (Super) Glew wrote:

> Intel legacy x87 floating point has exactly that same problem.

Some implementations of x87 are supplied with more than 200 floating point registers, hidden behind the x87 stack and also mapped over the SSE register space, uniformly.

>> Another problematic Itanium feature is very widely used post-increment addressing. <snip> outputs are rare. But on Itanium integer load with post-increment is not rare at all.

> I used to be afraid of instructions with more than two inputs and one output for OOO. Now I am no longer afraid.

Back in the early 1990s, it was hard both academically and industrially; now it is nothing more than a small addition to the renaming issue set (at least industrially).

Mitch

Quadibloc

Aug 10, 2012, 2:21:13 PM

to

Mark Thorson wrote:
> nm...@cam.ac.uk wrote:

> > I know what some of the proponents said at the time and it went
> > like this:

> > Proponent: We are going to deliver an average IPC of 5+ on
> > unmodified, general-purpose codes.

> > Me: How are you going to do that? You know that the smartest
> > computer scientists in the world have been trying for 25 years,
> > and can't get above 2, and usually can't get even that?

> > Proponent: There are some very smart people in the compiler
> > team in HP.

> > Me: < stunned silence >

> I assume they didn't just pull that number out of
> thin air. There must have been _something_ on which
> they based that number.

No doubt, but his point is valid - even if new inventions *are*
created all the time that do things that the smartest people in the
world didn't know how to do until one person found the way.

John Savard

Quadibloc

Aug 10, 2012, 2:27:57 PM

to

Anton Ertl wrote:

> That's interesting, because the Pentium 4 had 128 physical integer
> registers. Why is such a large architected register file a problem
> and a physical register file not?

If the architecture calls for 128 registers, you have to save and
restore them on every interrupt and possibly most subroutine calls.

If the architecture calls for 8 registers, you only have to save and
restore that many.

And if there happen to be physically 128 registers, that just means
you don't even have to save and restore 8 registers when switching
between 16 favored processes which have been assigned portions of that
register space.

John Savard

James Van Buskirk

Aug 10, 2012, 5:18:26 PM

to

<nm...@cam.ac.uk> wrote in message news:k02c78$cs1$1...@needham.csi.cam.ac.uk...

> In article <jru828pfh1aml4r4p...@4ax.com>,
> Robert Wessel <robert...@yahoo.com> wrote:

>>On Thu, 09 Aug 2012 22:01:44 -0500, BGB <cr8...@hotmail.com> wrote:

>>>observe that with 64 bit chips, it has taken nearly a decade thus far to
>>>make the transition, and still it is underway (there is general 64-bit
>>>OS support, but large amounts of 32-bit software is still in use, and
>>>much new software is still 32-bit as well). so, 32-bit probably still
>>>has some life left.

>>Well, all the big 64-bit OS's run 32-bit applications with little or
>>no handicap, and for most applications compiling for 64 bits has
>>little or no benefit. On most platforms rebuilding most applications
>>for 64-bit mode makes them a bit slower because the size and memory
>>bandwidth requirements increase - here x86-64 is better than most
>>other platforms, since register pressure is so high in 32-bit mode the
>>extra registers can be a good sized win. So there's little point in
>>building most applications 64-bit.

Can you support the word "because" in the above paragraph? It
seems to me that if one cares about performance of any application,
it will have been optimized to a certain extent for the available
compiler. Given a new platform for which optimization considerations
have not yet been made and a new compiler which doesn't have the
benefit of experience to guide optimization efforts, doesn't that
by itself imply a performance loss due to any transition, other
things being equal?

> Actually, there are two major benefits:

> One is that the problem of how large to make the various areas
> largely goes away, which vastly reduces the amount of system hacking
> needed to get perfectly ordinary applications to work. Yes, this
> affects only those that use significant amounts of memory (say,
> 256+ MB).

> The second is that it means that there is a VASTLY higher chance
> of a bad index, uninitialised pointer or similar error being trapped,
> rather than just trashing some other part of the application. That
> can save a LOT of effort and increase RAS considerably.

Also having to play extra games when data size grows too large
leads to programs that exhibit problems at the point of transition
to large data that don't always get caught during testing.

Windows benefits tremendously because there are enough registers
that there is no longer a need for a zoo of calling conventions
so it's easier for code written with different tools to
interoperate.

It's also way easier to load REAL(KIND=C_DOUBLE) constants as
immediates. At least Itanium allowed 64-bit immediates where
Alpha had a rather lengthy sequence to achieve this.

--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end

Ivan Godard

Aug 10, 2012, 5:54:25 PM

to

Unfortunate indeed. And yes, you don't have to go there. Our machine has
no general registers, and so no ROB. It saves nothing in anticipation of
a mispredict, nor anything on an occurrence. The state to save on an
interrupt is roughly the equivalent of ~40 GRs, and save is all finished
before the interrupt handler is done with decode.

Robert Wessel

Aug 10, 2012, 6:59:50 PM

to

Experience on other 32->64 bit transitions. The larger pointers and
the larger registers needing saving causes about a 10% hit (note that
Andy Glew posted similar numbers elsewhere in this thread).

As I mentioned x86 is a bit special because 64 bit mode adds extra
registers, which helps performance on most codes. Although as Andy
mentioned, the benefit from the extra registers is about the same as
the hit from the larger pointers and registers.

Obviously some stuff benefits greatly from a 64 bit ISA - bignum
arithmetic and applications needing a lot of memory, for example.
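
The footprint side of the pointer-size cost can be put in rough numbers. What follows is a back-of-envelope sketch, not a measurement: `footprint_ratio` and `pointer_fraction` are names made up for this illustration, and alignment padding is ignored. If a fraction p of a program's 32-bit data footprint is pointers and only those pointers double in size, the footprint scales by 1 + p.

```python
def footprint_ratio(pointer_fraction: float) -> float:
    """Ratio of 64-bit to 32-bit data footprint when only pointers grow.

    pointer_fraction is the share of the 32-bit footprint occupied by
    4-byte pointers (a hypothetical input; measure it for a real program).
    Alignment padding is ignored.
    """
    if not 0.0 <= pointer_fraction <= 1.0:
        raise ValueError("pointer_fraction must be between 0 and 1")
    # Non-pointer data keeps its size; pointer data doubles.
    return (1.0 - pointer_fraction) + 2.0 * pointer_fraction


for p in (0.0, 0.1, 0.25, 0.5, 1.0):
    print(f"{p:4.0%} pointers -> {footprint_ratio(p):.2f}x footprint")
```

Under this model a full 2x only appears when the data is nothing but pointers; a program whose data is one quarter pointers grows by about 25%.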

ChrisQ

Aug 10, 2012, 7:07:09 PM

to

On 08/10/12 18:27, Quadibloc wrote:
> Anton Ertl wrote:
>
>> That's interesting, because the Pentium 4 had 128 physical integer
>> registers. Why is such a large architected register file a problem
>> and a physical register file not?
>
> If the architecture calls for 128 registers, you have to save and
> restore them on every interrupt and possibly most subroutine calls.
>
> If the architecture calls for 8 registers, you only have to save and
> restore that many.
>

That's not quite correct, in that it's a function of how the software is
written. If you know that you will only ever use a subset of the register
set, then that's all you need to save.

Context switching is not so easy, however...

Regards,

Chris

Quadibloc

Aug 10, 2012, 10:39:03 PM

to

You're quite right, but I didn't want to get into details not
concerned with the question: why it is architectural registers, not
physical registers, that can be a problem.

John Savard

nm...@cam.ac.uk

Aug 11, 2012, 3:45:05 AM

to

In article <bb4b289u6th9ucusp...@4ax.com>,

Robert Wessel <robert...@yahoo.com> wrote:
>On Fri, 10 Aug 2012 15:18:26 -0600, "James Van Buskirk"
><not_...@comcast.net> wrote:
>
>>>>Well, all the big 64-bit OS's run 32-bit applications with little or
>>>>no handicap, and for most applications compiling for 64 bits has
>>>>little or no benefit. On most platforms rebuilding most applications
>>>>for 64-bit mode makes them a bit slower because the size and memory
>>>>bandwidth requirements increase - here x86-64 is better than most
>>>>other platforms, since register pressure is so high in 32-bit mode the
>>>>extra registers can be a good sized win. So there's little point in
>>>>building most applications 64-bit.
>>
>>Can you support the word "because" in the above paragraph? It
>>seems to me that if one cares about performance of any application,
>>it will have been optimized to a certain extent for the available
>>compiler. Given a new platform for which optimization considerations
>>have not yet been made and a new compiler which doesn't have the
>>benefit of experience to guide optimization efforts, doesn't that
>>by itself imply a performance loss due to any transition, other
>>things being equal?
>
>Experience on other 32->64 bit transitions. The larger pointers and
>the larger registers needing saving causes about a 10% hit (note that
>Andy Glew posted similar numbers elsewhere in this thread).

Right. But the operative term is "a bit" - and it isn't true for
x86, anyway, where the associated improvements can make 64-bit
massively faster (3-4 times in some cases). In the original
32=>64 bit transitions, there were screams of a factor of two
from the opponents, but few applications exceeded 10%.

But your last sentence doesn't follow because of the points I
made (less segment conflict and a higher chance of errors being
detected).

Regards,
Nick Maclaren.

Anton Ertl

Aug 11, 2012, 5:13:29 AM

to

Actually the Linux people are doing it now, as mentioned above. And I
think now is the time to do such things (if at all), not then. Now
nearly all IA-32 hardware that's still in use with these OSs is
actually AMD-64 hardware, so now the effort of building stuff for x32
will have much more benefit than when AMD-64 was introduced, so
application developers or packagers are more likely to do it. And
even now there is the question of whether a few percent in
performance are worth the cost of either having another platform to
build for or abandoning the remaining IA-32-only systems.

Of course, from a benchmarketing perspective, x32 then would have
produced better SPEC CPU numbers for Opteron and Athlon 64 CPUs, but
hardly anybody would have used x32 in production.

>But, if that had not been a concern, I do highly suspect that if
>AMD had gotten REX-32 available at a time when Intel had no
>equivalent EMT-64, i.e. when Intel had only Itanium to compete,
>for extra registers and/or 64 bit addresses - that AMD might have
>been able to get a more important market position. Beating Intel
>on 32 bit benchmarks, admittedly 32 bit benchmarks recompiled to
>use REX32, without the lossage of 64 bit pointers, would have
>helped their marketing position.

On SPEC CPU, yes, but on typical Windows-based benchmarks (games,
office software, encoding; i.e., what sites like anandtech.com mostly
present), no. Given that Opteron and Athlon 64 were very competitive
anyway performancewise, I don't think it would have mattered.

- anton (typing this on a 2003-vintage Athlon 64 3200+ :-)

Anton Ertl

Aug 11, 2012, 8:07:56 AM

to

Quadibloc <jsa...@ecn.ab.ca> writes:
>Anton Ertl wrote:
>
>> That's interesting, because the Pentium 4 had 128 physical integer
>> registers. Why is such a large architected register file a problem
>> and a physical register file not?
>
>If the architecture calls for 128 registers, you have to save and
>restore them on every interrupt

No, only those that you use.

> and possibly most subroutine calls.

At least on the integer side IA-64 has the register stack for that.
And even if it had not, not everything (or even a large part) has to
be saved on calls; calling conventions are designed to require
relatively few saves and restores on most calls.
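
To make the "only those that you use" point concrete, here is a minimal sketch of how a calling convention bounds the save/restore work. The register names and the choice of callee-saved set are hypothetical, picked only for illustration; they are not a statement of the real IA-64 convention. A function has to save just the registers that are both preserved by convention and actually written by that function.

```python
def registers_to_save(clobbered: set, callee_saved: set) -> list:
    """Registers a function must save in its prologue and restore on exit."""
    return sorted(clobbered & callee_saved)


# Hypothetical convention on a 128-register machine: only r4..r7 are preserved.
CALLEE_SAVED = {"r4", "r5", "r6", "r7"}

# A small leaf function that only touches scratch registers.
leaf_clobbers = {"r8", "r9", "r14"}
# A larger function that also needs two preserved registers.
big_clobbers = {"r4", "r5", "r8", "r9", "r10", "r11"}

print(registers_to_save(leaf_clobbers, CALLEE_SAVED))  # []
print(registers_to_save(big_clobbers, CALLEE_SAVED))   # ['r4', 'r5']
```

The size of the architected register file does not appear in the result at all; what matters is the intersection. A full context switch is a different case, since the OS cannot know which registers the interrupted code considers live.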

- anton

BGB

Aug 11, 2012, 9:26:55 AM

to

I think it depends some on performance or memory use.
AFAIK, the impact on memory use is a bit larger than the impact on the
performance.

although, 2x would still likely be a bit steep (and unlikely) for memory
use as well, as probably most apps store stuff in memory besides
pointers as well.

>
> Regards,
> Nick Maclaren.
>

nm...@cam.ac.uk

Aug 11, 2012, 10:16:07 AM

to

In article <k05mo7$c8c$1...@news.albasani.net>, BGB <cr8...@hotmail.com> wrote:
>>>>>>
>>>>>> So there's little point in
>>>>>> building most applications 64-bit.
>>

>> But your last sentence doesn't follow because of the points I
>> made (less segment conflict and a higher chance of errors being
>> detected).
>
>I think it depends some on performance or memory use.
>AFAIK, the impact on memory use is a bit larger than the impact on the
>performance.

That is true, but your last sentence STILL doesn't follow, because
of the points I made - which had nothing to do with performance,
but a great deal to do with RAS.

Regards,
Nick Maclaren.

nm...@cam.ac.uk

Aug 11, 2012, 10:19:37 AM

to

In article <2012Aug1...@mips.complang.tuwien.ac.at>,

Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>Quadibloc <jsa...@ecn.ab.ca> writes:
>>Anton Ertl wrote:
>>
>>> That's interesting, because the Pentium 4 had 128 physical integer
>>> registers. Why is such a large architected register file a problem
>>> and a physical register file not?
>>
>>If the architecture calls for 128 registers, you have to save and
>>restore them on every interrupt
>
>No, only those that you use.
>
>> and possibly most subroutine calls.
>
>At least on the integer side IA-64 has the register stack for that.
>And even if it had not, not everything (or even a large part) has to
>be saved on calls; calling conventions are designed to require
>relatively few saves and restores on most calls.

As I pointed out, that has been known to be an engineering mistake
for the past 50 years. Designing a system for the easy cases,
and assuming that the hard cases don't matter, is a classic error.
Even with the very restricted forms of lazy saving used on more
conventional architectures, it has caused continual trouble; with
the extreme form used by the Itanic, it was a recipe for disaster.

Regards,
Nick Maclaren.

ChrisQ

Aug 11, 2012, 11:19:54 AM

to

On 08/11/12 02:39, Quadibloc wrote:

> You're quite right, but I didn't want to get into details not
> concerned with the question: why it is architectural registers, not
> physical registers, that can be a problem.
>
> John Savard

I guess they are all physical registers in the end, if they exist at
all, though your point was:

> If the architecture calls for 128 registers, you have to save and
> restore them on every interrupt and possibly most subroutine calls.
>

If there's something going on under the hood that's not visible to
the programmer, then fair enough, but otherwise, no.

It was easy in the old days, where a lot of system code was written
in asm and you could make tradeoffs by limiting the number of registers
available to called functions or interrupt handlers. Even with hll's,
there's no reason why you couldn't have compiler switches to do the same
thing...

Regards,

Chris

BGB

Aug 11, 2012, 12:27:13 PM

to

if you mean the statement made at the top of the post:
I didn't write that statement.

look back a ways, and it will be noted that someone else wrote that
statement. further evidence: note the use of capitalization.

not that I entirely disagree with it though:
many apps have a memory footprint considerably smaller than the 2-3GB
limit, so compiling them as 64 bits will not gain much (but will make
them use more memory and potentially run slower).

likewise, despite the much larger address space, common OS's, such as
Windows, will still tend to cluster everything into the low 4GB until
things will no longer fit, and then move past the 4GB mark (ignoring
Linux for a moment, which has different behavior, 1).

so, it is only really likely to make much of a difference in cases where
pointers are overwritten with ASCII data. typically, this will not make
as much of a difference for apps with relatively small memory use either
(say, <100-200MB or so), since most ASCII characters would lead into the
512MB-2GB range, which is mostly unoccupied in this case apart from
system DLLs mapped near the 2GB mark.
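
the ASCII range can be checked directly. a small sketch, assuming a little-endian 32-bit target and printable ASCII only (0x20 through 0x7E); `as_pointer` is just a helper name for this example:

```python
def as_pointer(four_bytes: bytes) -> int:
    """Interpret 4 bytes as a little-endian 32-bit address, as x86 would."""
    if len(four_bytes) != 4:
        raise ValueError("need exactly 4 bytes")
    return int.from_bytes(four_bytes, "little")


MIB = 1 << 20

lowest = as_pointer(b"    ")    # four spaces, 0x20 each
highest = as_pointer(b"~~~~")   # four tildes, 0x7E each
sample = as_pointer(b"text")    # an arbitrary piece of string data

for label, addr in (("lowest", lowest), ("sample", sample), ("highest", highest)):
    print(f"{label:8s} 0x{addr:08X}  ~{addr / MIB:.0f} MiB")
```

four spaces give 0x20202020 (about 514 MiB) and four tildes give 0x7E7E7E7E (just under 2 GiB), so any pointer overwritten with printable text lands between those two addresses.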

most small integers will tend to fall into inaccessible memory as well
(either the NULL page, below the stack, or in the top of the OS-reserved
region), and most floats will tend to fall into a similar area as ASCII
data (actually, the typical range is much narrower than ASCII).
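
the same check works for small integers and floats. this is a sketch that assumes 32-bit pointers and IEEE 754 single-precision values; the sample values (0.01, 1.0, 100.0 and a few small integers) are arbitrary picks, and `float_bits` is a helper name for this example:

```python
import struct


def float_bits(x: float) -> int:
    """Bit pattern of x stored as an IEEE 754 single, read as a 32-bit integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


MIB = 1 << 20

for value in (0.01, 1.0, 100.0):
    addr = float_bits(value)
    print(f"float {value:>6}: 0x{addr:08X}  ~{addr / MIB:.0f} MiB")

# Small integers: non-negative ones land in the first page, negative ones
# wrap around to the very top of the 32-bit address space.
for value in (0, 16, -1, -16):
    addr = value & 0xFFFFFFFF
    print(f"int   {value:>6}: 0x{addr:08X}")
```

singles between 0.01 and 100 all fall between roughly 960 MiB and 1070 MiB, which is a much narrower band than the ASCII case.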

so, in most cases, for an app with a moderately small memory footprint,
most cases of a bad pointer due to overflows/... will still tend to be
caught by the usual memory-protection features (though this will drop
off sharply as an app approaches the address-space limits).

presumably though, this would lead to a scenario where most small-apps
and tools continue to use 32-bits, but 64-bits becomes more popular for
larger and more memory-hungry apps (such as games and web-browsers and
similar...). (although, granted, say, having FireFox crash whenever it
runs out of memory does have the merit of avoiding the case where the
whole computer bogs down due to FireFox or similar growing to an
enormous memory footprint, so it is sort of a tradeoff).

as I look over at FF and notice again its memory use steadily creeping
up and just short of the 3GB mark, yeah, if not closed it will probably
crash soon.

1: for Linux, it may make a bigger difference, given that Linux seems to
like to map things at seemingly arbitrary locations, rather than the use
of linear scanning.

ChrisQ

Aug 11, 2012, 12:54:00 PM

to

On 08/11/12 16:27, BGB wrote:

> as I look over at FF and notice again its memory use steadily creeping
> up and just short of the 3GB mark, yeah, if not closed it will probably
> crash soon.
>

Later versions of FF have at last addressed the gross memory leak problems
that have plagued that series of browsers ever since the early netscape
days.

FF has been running for weeks on a couple of machines here and the memory
footprint is not bad: 450MB size and 205MB resident on the sparc / sol10
machine and ~460MB on the xp machine. Both have 6-10 tabs open at any given
time. Older version on xp used to creep up to 700 or 800MB.

Where is the 3GB coming from, and on what machine / OS? Just curious...

Regards,

Chris

BGB

Aug 11, 2012, 1:48:34 PM

to

I am not sure what is the reason for the 3GB, but FF still seems to have
memory-leak issues.

FF version is 14.0.1 (32-bit), OS is Windows 7 x64 (Athlon II X4).

this usually happens with tabs left on sites using a lot of fancy UI
features (probably due to JavaScript?), notable examples would include
OkCupid, Facebook, MySpace, ... but also includes YouTube and the Google
search page, and to a lesser extent LinkedIn, ...

the Flash plugin also crashes a lot, but this runs in its own process on
newer FF, and typically crashes a lot if doing stuff on sites like
YouTube and similar.

maybe 20-30 tabs or so isn't really all that uncommon (but, memory use
will steadily creep up even when it is left idle, and I have often
observed that opening a new tab on many sites will use about 10-40MB).

usually, if it is being used on fairly "clean" sites, like Wikipedia or
TVTropes or similar, or other sites using plain HTML, there doesn't seem
to be much of an issue here (memory use is more stable, and the per-tab
cost is much lower).

or such...

nm...@cam.ac.uk

Aug 11, 2012, 2:59:11 PM

to

In article <k061a9$tm$1...@news.albasani.net>, BGB <cr8...@hotmail.com> wrote:
>>>>>>>>
>>>>>>>> So there's little point in
>>>>>>>> building most applications 64-bit.
>>>>
>>>> But your last sentence doesn't follow because of the points I
>>>> made (less segment conflict and a higher chance of errors being
>>>> detected).
>>>
>>> I think it depends some on performance or memory use.
>>> AFAIK, the impact on memory use is a bit larger than the impact on the
>>> performance.
>>
>> That is true, but your last sentence STILL doesn't follow, because
>> of the points I made - which had nothing to do with performance,
>> but a great deal to do with RAS.
>
>if you mean the statement made at the top of the post:
>I didn't write that statement.
>
>look back a ways, and it will be noted that someone else wrote that
>statement. further evidence: note the use of capitalization.

Sorry.

>not that I entirely disagree with it though:
>many apps have a memory footprint considerably smaller than the 2-3GB
>limit, so compiling them as 64 bits will not gain much (but will make
>them use more memory and potentially run slower).

It's STILL wrong! For the third time, there is a very good reason
to do that - increased RAS in the ways I mentioned. Whether you
regard that as important in any particular case is your call, but
stating that there is very little reason to recompile most
applications is factually wrong.

Regards,
Nick Maclaren.

BGB

Aug 11, 2012, 5:03:15 PM

to

most likely, for many of these applications, this will not matter much,
but the app running slower or having a larger memory footprint may be
much more relevant.

a lot probably depends on expectations and how a given application or
computer is being used, ...

like, reliability may be more important if people expect long up-times
or similar, but less important for normal computers, where it is usually
expected that the user will reboot every-day to every few days anyways
(and where going much longer than about 1-2 weeks will often cause the
OS to start getting increasingly buggy and usually end up crashing anyways).

Robert Wessel

Aug 11, 2012, 6:32:57 PM

to

I don't disagree, but I thought we were speaking in terms of
performance. (Of course an application that doesn't run successfully
or reliably has no performance to speak of).

MitchAlsup

Aug 11, 2012, 7:48:17 PM

to nm...@cam.ac.uk

I have to fall in line with Nick about RAS.

In fact my current instruction set has OPcode
faults for zeros, small negative numbers, and
for the floating point numbers between 0.01
and 100. Positives close to zero (in a 32-bit
or 64-bit sense) are illegal opcodes, negatives
of similar ilk are illegal opcodes, and the
typical range of FP data is also illegal opcodes.

The paging model provides independent enumeration
of read, write, and execute. Constant data is
fetched out of code space rather than being
loaded from data space. So there is essentially
no reason to have a page be both writable and
fetchable (executable) at the same time.
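
As an illustration of that policy (a sketch only: the `Perm` flags, the section names, and the permissions assigned to each section are assumptions made for this example, not details of the actual design), a loader could refuse any mapping that is writable and executable at once:

```python
from enum import Flag


class Perm(Flag):
    """Independent per-page permissions (hypothetical encoding)."""
    NONE = 0
    READ = 1
    WRITE = 2
    EXECUTE = 4


def violates_wx(perm: Perm) -> bool:
    """True if a page would be both writable and executable."""
    return bool(perm & Perm.WRITE) and bool(perm & Perm.EXECUTE)


# Hypothetical section layout for an ordinary program. Constants live with
# the code, so the code section needs no data-side access in this sketch.
sections = {
    "code+constants": Perm.EXECUTE,
    "data": Perm.READ | Perm.WRITE,
    "stack": Perm.READ | Perm.WRITE,
    "writable code (bad)": Perm.READ | Perm.WRITE | Perm.EXECUTE,
}

for name, perm in sections.items():
    status = "REJECT" if violates_wx(perm) else "ok"
    print(f"{name:22s} {status}")
```

Only the last entry is rejected; everything an ordinary program needs fits within the policy.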

So, not only can you not jump into data areas
(pageFault) but if you arrive, there is the
greatest probability that you won't get anywhere
(codeFault), unless the code actually smells like
code.

Given that one has a real 64-bit virtual address
space where any linkable section can be placed
anywhere in virtual space, it becomes
increasingly unlikely to stumble upon data when
the CPU thinks it is really code.

Given a mapping scheme where one does not have to
waste pages for tables in the hierarchy with no
essential utility, the simplest apps
can be mapped with a single page of translation
overhead. Root pointer->page.physical[512] are
the 512 PTEs. So, there is no longer any page table
overhead to map the huge 64-bit space. Should the
application dynamically attach more modules and
exceed the map table of a single page, another layer
in the hierarchy can be added (at any layer desired.)
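
A toy model of the table-page overhead, for comparison with a fixed-depth tree. The assumptions are loud ones: one contiguous, aligned region; 512 entries per table page (for example a 4 KiB page with 8-byte entries, which the post does not state); and a fixed depth of 4 for the conventional case. The function names are invented here, and the real scheme may count differently.

```python
from math import ceil

FANOUT = 512  # entries per table page (assumed)


def fixed_depth_table_pages(mapped_pages: int, levels: int = 4) -> int:
    """Table pages for one contiguous, aligned region in a fixed-depth tree.

    Every level needs at least one table page, however small the region is.
    """
    return sum(max(1, ceil(mapped_pages / FANOUT ** k))
               for k in range(1, levels + 1))


def minimal_depth_table_pages(mapped_pages: int) -> int:
    """Same region, but with only as many levels as the region requires."""
    levels = 1
    while FANOUT ** levels < mapped_pages:
        levels += 1
    return sum(ceil(mapped_pages / FANOUT ** k) for k in range(1, levels + 1))


for pages in (100, 512, 513, 100_000):
    print(f"{pages:>7} pages: fixed={fixed_depth_table_pages(pages)}"
          f" minimal={minimal_depth_table_pages(pages)}")
```

In this model an application of up to 512 pages costs one table page instead of four, and the second level only appears once the 513th page is mapped.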

Mitch

van...@vsta.org

Aug 11, 2012, 8:53:23 PM

to

BGB <cr8...@hotmail.com> wrote:
> although, 2x would still likely be a bit steep (and unlikely) for memory
> use as well, as probably most apps store stuff in memory besides
> pointers as well.

For some knowledge mining applications on old Opteron systems under Python,
we saw well beyond 2x slowdown when using 64 bits. This "cliff" was not seen
with newer Intel systems. Our guess was having too much of our memory
reference shadow exceed the L2 cache.
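
A sketch of how such a cliff can arise, with made-up numbers throughout: a C-like node of three pointers plus a 4-byte payload, a hot set of 50,000 nodes, and an assumed 1 MiB L2. None of these figures come from the workload above, and real Python objects are laid out quite differently; the point is only that doubling the pointer size can push a working set from inside the cache to outside it.

```python
L2_BYTES = 1 << 20  # assumed 1 MiB L2; check the actual part


def node_bytes(pointer_size: int, n_pointers: int = 3, payload: int = 4) -> int:
    """Size of a hypothetical node, padded up to pointer alignment."""
    raw = n_pointers * pointer_size + payload
    return -(-raw // pointer_size) * pointer_size  # round up to a multiple


def working_set(n_nodes: int, pointer_size: int) -> int:
    """Bytes touched when every node in the hot set is visited."""
    return n_nodes * node_bytes(pointer_size)


N_NODES = 50_000  # made-up hot set size
for bits, psize in ((32, 4), (64, 8)):
    ws = working_set(N_NODES, psize)
    verdict = "fits in" if ws <= L2_BYTES else "exceeds"
    print(f"{bits}-bit: node={node_bytes(psize)} B, "
          f"working set={ws / 1024:.0f} KiB, {verdict} L2")
```

With these numbers the node grows from 16 to 32 bytes, so the same hot set goes from fitting in the cache to being about one and a half times its size.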

I'm still puzzled by everybody's rush to move to 64 bits. Out of all the
activities I've measured on a number of Ubuntu systems, nothing comes
anywhere near exceeding 31 bits of addressing. Because of my knowledge
engine work, I'm certainly grateful that the 64 bit platforms are available
at commodity prices! But for almost everybody, it feels rather like that
volume knob that goes to 11.

--
Andy Valencia
Home page: http://www.vsta.org/andy/
To contact me: http://www.vsta.org/contact/andy.html

Stephen Sprunk

Aug 11, 2012, 9:39:19 PM

to

On 11-Aug-12 16:03, BGB wrote:
> reliability may be more important if people expect long up-times or
> similar, but less important for normal computers, where it is usually
> expected that the user will reboot every-day to every few days anyways
> (and where going much longer than about 1-2 weeks will often cause the
> OS to start getting increasingly buggy and usually end up crashing
> anyways).

That was my experience up through Windows 2000/NT4, but since Windows
XP/Server 2003, I've never had any significant _OS_ stability problems,
just bad hardware and/or drivers. Servers can run 24x7, and my laptops
and desktops are put into "sleep" or "hibernate" modes between uses.
The only reboots are to apply patches, since Microsoft somehow still
hasn't figured out how to do so without a reboot. (I've seen countless
Linux systems with uptimes measured in years; reboots are only required
for a kernel upgrade, and that became rare with the advent of modules.)

S

--
Stephen Sprunk "God does not play dice." --Albert Einstein
CCIE #3723 "God is an inveterate gambler, and He throws the
K5SSS dice at every possible opportunity." --Stephen Hawking