
Suggestions for Intel (was: No visible activity)


John Savard

Sep 26, 2004, 3:17:14 AM
On Sat, 25 Sep 2004 08:55:07 GMT, Brian Inglis
<Brian....@SystematicSW.Invalid> wrote, in part:

>there are also 6 integer units, 4 FMAC units (2 DP + 2 SP),
>4 MMX units, 3 branch units, 2 load + 2 store units, IA-32 decode/
>control unit, Advanced Load Address Table, 128 integer + 128 FP
>registers, and 8 branch registers. And they're adding OoO execution.

Ah. So the Itanium is really, really superscalar.

And if they put all of *that* in a chip that instead supported the EMT64
type of architecture, it would be even bigger.

Hence, although an Itanium can execute x86 code, it does it with a fast
486 on the die. So that a JIT compiler for x86 code to Itanium code can
actually be faster.

All right, Intel. Listen up.

The Itanium uses the same floating-point formats, the same MMX, and a
whole lot else that it shares with the 386 architecture.

Design a *small* Itanium, not as powerful as the real Itanium. So that
there is room for an EMT64 control unit on the same die that directly
accesses the same superscalar pipelined arithmetic units as the Itanium
control unit. (Incidentally, this means extending the Itanium
architecture to embrace SSE2. Use both of 2 MMX units at once.)

When running Itanium code, it will be *slightly* faster now. (Basically
because making the EMT64 control unit complex enough to perform optimal
pipelining of micro-ops won't be done.)

Your chip will knock anything AMD has out of the water in performance.
AND, because it becomes the new standard and gives the opportunity to
increase performance further, if incrementally, by switching to Itanium
code, people will do it.

And, incidentally, don't throw performance away. The chip might have
only half of the ALUs that an Itanium 2 is blessed with - but give it a
128-bit wide data bus to main memory. Your existing packages have enough
pins for that. They used to make microprocessors with 40 pins, so a
microprocessor with a 64-bit address and a 128-bit data bus needs:

64 pins for the address (the lowest 4 are only needed when narrower
memory is connected)

128 pins for the data

16 pins for ECC

16 pins for byte select

and at most 40 other pins for control signals.
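The pin tally above can be sanity-checked with a little arithmetic; the figures are the post's own back-of-the-envelope budget, not a real pinout:

```python
# Hypothetical pin budget for the proposed chip (illustrative only).
pins = {
    "address": 64,      # low 4 lines unused with wider memory
    "data": 128,
    "ecc": 16,
    "byte_select": 16,
    "control": 40,      # stated upper bound for control signals
}
total = sum(pins.values())
print(total)  # 264
```

which matches the "264 pins" figure used for comparison below.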

OK, so on modern chips you need multiple pins for signal and ground, and
there are some other advanced capabilities of the chipset that a
straightforward microprocessor architecture of the old type can't
handle. But your current packages go well over 264 pins. Well over.
There must be some fat you can get rid of.

People will thank you for it, because it will be easier to design
systems around your chip.


I may also note that in my studies of computer architecture, I now
realize that it is *not* a gimmick on Intel's part to use twice as many
pipeline stages as AMD.

If it takes just as long to do a floating-point multiplication on one
chip as another, but another chip has twice as many pipeline stages, so
its frequency is twice as many megahertz, it certainly seems like the
chip only *sounds* twice as fast.

But the thing about a pipeline is that, if you have twice as many
pipeline stages, your chip can be working on twice as many
floating-point multiplications at the same time.

This is how the Cray computers did vector calculations; they used a fast
and highly-pipelined ALU to perform a vector operation on 64 numbers at
once, not a bank of 64 ALUs in parallel.
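The pipelining argument boils down to simple throughput arithmetic. A toy model illustrates it; the 10 ns multiply latency is a made-up number for the sake of the example:

```python
# Toy model: same total FP-multiply latency, different pipeline depths.
latency_ns = 10.0  # assumed total time for one multiply (illustrative)

def throughput(stages):
    """Results completed per ns once the pipeline is full:
    the clock period is latency/stages, and one result retires
    per cycle, so throughput scales with the number of stages."""
    stage_time = latency_ns / stages
    return 1.0 / stage_time

shallow = throughput(5)   # fewer stages, slower clock
deep = throughput(10)     # twice the stages, twice the clock
print(deep / shallow)     # 2.0 -- twice the throughput, same latency
```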

John Savard
http://home.ecn.ab.ca/~jsavard/index.html

Kai Harrekilde-Petersen

Sep 26, 2004, 4:09:25 AM
jsa...@excxn.aNOSPAMb.cdn.invalid (John Savard) writes:

[only posting to comp.arch; my newsserver refuses to post to a.f.c]


> All right, Intel. Listen up.
>
> The Itanium uses the same floating-point formats, the same MMX, and a
> whole lot else that it shares with the 386 architecture.
>
> Design a *small* Itanium, not as powerful as the real Itanium. So that
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


> there is room for an EMT64 control unit on the same die that directly
> accesses the same superscalar pipelined arithmetic units as the Itanium
> control unit. (Incidentally, this means extending the Itanium
> architecture to embrace SSE2. Use both of 2 MMX units at once.)
>
> When running Itanium code, it will be *slightly* faster now. (Basically
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


> because making the EMT64 control unit complex enough to perform optimal
> pipelining of micro-ops won't be done.)

You seem to be contradicting yourself here. I must admit I did not
read most of the thread here, so I would have missed any 'golden
clues' as to why you aren't contradicting yourself.

Regards,


Kai
--
Kai Harrekilde-Petersen <khp(at)harrekilde(dot)dk>

Grumble

Sep 27, 2004, 11:12:22 AM
John Savard wrote:

> Brian Inglis wrote:
>
>> there are also 6 integer units, 4 FMAC units (2 DP + 2 SP),
>> 4 MMX units, 3 branch units, 2 load + 2 store units, IA-32 decode/
>> control unit, Advanced Load Address Table, 128 integer + 128 FP
>> registers, and 8 branch registers. And they're adding OoO execution.
>
> Ah. So the Itanium is really, really superscalar.
>
> And if they put all of *that* in a chip that instead supported the EMT64
> type of architecture, it would be even bigger.
>
> Hence, although an Itanium can execute x86 code, it does it with a fast
> 486 on the die. So that a JIT compiler for x86 code to Itanium code can
> actually be faster.

A fast 486... How misinformed are you?

http://google.com/groups?selm=ci9nql$lhu$1...@news-rocq.inria.fr

> All right, Intel. Listen up.

Right :-)

> The Itanium uses the same floating-point formats, the same MMX, and
> a whole lot else that it shares with the 386 architecture.

Could you please explicitly state what IPF and IA-32 share?

--
Regards, Grumble

John Savard

Sep 27, 2004, 1:28:20 PM
On Mon, 27 Sep 2004 17:12:22 +0200, Grumble <dev...@kma.eu.org> wrote,
in part:

>Could you please explicitly state what IPF and IA-32 share?

They both have stacks.

They both use the same data formats: IEEE-759 floating-point numbers and
two's complement integers. (Of course, _this_ is shared with the Power
PC, the Motorola 68000, and virtually all modern computers.)

They both have instructions that divide a 64-bit operand into multiple
parts for parallel arithmetic - the MMX instruction set is present on
the native Itanium architecture.
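The "divide a 64-bit operand into multiple parts" idea can be sketched in plain software. This is a SWAR-style (SIMD-within-a-register) illustration of a packed-byte add, not actual MMX or Itanium code:

```python
# SWAR sketch: add eight unsigned bytes packed into one 64-bit word,
# suppressing carries between lanes -- the effect MMX-style packed-add
# hardware provides for free.
def paddb(a, b):
    """Lane-wise 8x8-bit add with per-lane wraparound."""
    result = 0
    for lane in range(8):
        shift = lane * 8
        x = (a >> shift) & 0xFF
        y = (b >> shift) & 0xFF
        result |= ((x + y) & 0xFF) << shift
    return result

# 0x01 + 0xFF wraps to 0x00 in lane 0 without carrying into lane 1.
print(hex(paddb(0x0101, 0x01FF)))  # 0x200
```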

They both support little-endian operand storage, although the Itanium
adds the ability to switch into big-endian operation.

You may well say this isn't much, but it's enough that the Itanium is a
superset of the 386. There aren't any features of the Pentium that it
leaves out that would make it awkward to use the back end of an Itanium
chip as the back end of a Pentium.

True, it leaves out some of the memory model stuff, and it has extra
operations such as population count.
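For what it's worth, population count just tallies the set bits of a word; a software equivalent of what such an instruction computes (here via Kernighan's bit trick) looks like:

```python
def popcount(x: int) -> int:
    """Count set bits: each x &= x - 1 clears the lowest set bit,
    so the loop iterates once per 1-bit in x."""
    count = 0
    while x:
        x &= x - 1
        count += 1
    return count

print(popcount(0b10110100))  # 4
```

A hardware popcount does this in one instruction, which is why its absence from x86 at the time was notable.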

Also, it uses sixteen registers at a time instead of eight.

Of course, there are major differences at the *front* end.

The instruction format is very different: 386 architecture instructions
are built one byte at a time, while Itanium instructions have a fixed
length of 41 bits, being organized three to a 128-bit block, which also
includes five bits to specify parallelism and instruction group.
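That bundle layout can be illustrated by unpacking one in software; the field order used here (a 5-bit template in the low bits, then three 41-bit slots) follows the description above, and the slot values are made up:

```python
# Unpack a 128-bit IA-64-style bundle: 5-bit template + three 41-bit slots.
SLOT_MASK = (1 << 41) - 1

def decode_bundle(bundle):
    """Split a 128-bit bundle into (template, [slot0, slot1, slot2])."""
    template = bundle & 0x1F
    slots = [(bundle >> (5 + 41 * i)) & SLOT_MASK for i in range(3)]
    return template, slots

# Round-trip check with arbitrary slot values.
template, slots = decode_bundle((7 << 87) | (3 << 46) | (1 << 5) | 0x10)
print(template, slots)  # 16 [1, 3, 7]
```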

You could claim that all computers pretty much do the same thing, so
that saying the instructions do the same kind of things on a Pentium and
an Itanium doesn't make them any more similar than any two other
computers taken at random.

But this isn't true. Take a Pentium and a System/360. Different
floating-point formats. No stack on the 360. No MMX on the 360. No
variable-length packed decimal instructions on the Pentium.

But then there is one significant difference in the middle as well: the
Itanium only has a flat memory model.

John Savard
http://home.ecn.ab.ca/~jsavard/index.html

Colin Andrew Percival

Sep 27, 2004, 1:48:57 PM
In comp.arch John Savard <jsa...@excxn.anospamb.cdn.invalid> wrote:
> They both use the same data formats: IEEE-759 floating-point numbers and
> two's complement integers.

I don't think the "759-1984 IEEE Standard Test Procedures for
Semiconductor X-Ray Energy Spectrometers" standard says anything about
floating-point numbers.

You're probably thinking of the "754-1985 IEEE Standard for Binary
Floating-Point Arithmetic".

Colin Percival

Morten Reistad

Sep 27, 2004, 5:00:35 PM
In article <4158482f...@news.ecn.ab.ca>,

John Savard <jsa...@excxn.aNOSPAMb.cdn.invalid> wrote:
>On Mon, 27 Sep 2004 17:12:22 +0200, Grumble <dev...@kma.eu.org> wrote,
>in part:
>
>>Could you please explicitly state what IPF and IA-32 share?
>
>They both have stacks.
>
>They both use the same data formats: IEEE-759 floating-point numbers and
>two's complement integers. (Of course, _this_ is shared with the Power
>PC, the Motorola 68000, and virtually all modern computers.)
>
>They both have instructions that divide a 64-bit operand into multiple
>parts for parallel arithmetic - the MMX instruction set is present on
>the native Itanium architecture.
>
>They both support little-endian operand storage, although the Itanium
>adds the ability to switch into big-endian operation.
>
>You may well say this isn't much, but it's enough that the Itanium is a
>superset of the 386. There aren't any features of the Pentium that it
>leaves out that would make it awkward to use the back end of an Itanium
>chip as the back end of a Pentium.

Or build a 386-to-Itanium compiler.

>True, it leaves out some of the memory model stuff, and it has extra
>operations such as population count.
>
>Also, it uses sixteen registers at a time instead of eight.
>
>Of course, there are major differences at the *front* end.
>
>The instruction format is very different: 386 architecture instructions
>are built one byte at a time, while Itanium instructions have a fixed
>length of 41 bits, being organized three to a 128-bit block, which also
>includes five bits to specify parallelism and instruction group.

Here is a challenge for compiler-writers: take 386 code,
(re)build parse trees, do the normal optimizations, identify
parallelism, and generate Itanium code.

The point that data representations remain the same removes a
lot of the complexity of this task.
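The kind of scheduling step this challenge implies can be sketched with a toy greedy bundler; the three-operand IR and the scheduling policy here are invented for illustration, not real 386 or Itanium code (it also ignores anti-dependences, as a real translator could not):

```python
# Toy translation pass: pack a linear instruction stream into bundles
# of up to three mutually independent operations (VLIW-style).
def schedule(instrs):
    """instrs: list of (dest, src1, src2) register names.
    Greedy list scheduling on true (read-after-write) dependences."""
    ready_at = {}   # register -> index of bundle producing it
    bundles = []
    for dest, s1, s2 in instrs:
        # earliest bundle strictly after all inputs are produced
        earliest = max([ready_at.get(r, -1) for r in (s1, s2)]) + 1
        # find a bundle with a free slot at or after that point
        while earliest < len(bundles) and len(bundles[earliest]) == 3:
            earliest += 1
        if earliest == len(bundles):
            bundles.append([])
        bundles[earliest].append((dest, s1, s2))
        ready_at[dest] = earliest
    return bundles

# a = b+c and d = e+f are independent; g = a+d must wait for both.
prog = [("a", "b", "c"), ("d", "e", "f"), ("g", "a", "d")]
print(len(schedule(prog)))  # 2
```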


>You could claim that all computers pretty much do the same thing, so
>that saying the instructions do the same kind of things on a Pentium and
>an Itanium doesn't make them any more similar than any two other
>computers taken at random.
>
>But this isn't true. Take a Pentium and a System/360. Different
>floating-point formats. No stack on the 360. No MMX on the 360. No
>variable-length packed decimal instructions on the Pentium.
>
>But then there is one significant difference in the middle as well: the
>Itanium only has a flat memory model.

Intel finally learned?

-- mrr

Yousuf Khan

Sep 28, 2004, 1:54:38 AM
John Savard wrote:
> On Mon, 27 Sep 2004 17:12:22 +0200, Grumble <dev...@kma.eu.org>
> wrote, in part:
>
>> Could you please explicitly state what IPF and IA-32 share?
>
> They both have stacks.
>
> They both use the same data formats: IEEE-759 floating-point numbers
> and two's complement integers. (Of course, _this_ is shared with the
> Power PC, the Motorola 68000, and virtually all modern computers.)
>
> They both have instructions that divide a 64-bit operand into multiple
> parts for parallel arithmetic - the MMX instruction set is present on
> the native Itanium architecture.

I doubt that MMX is present in the native Itanium instruction set; it's
probably just there as part of its x86 emulation suite. After all, why
would Itanium need MMX natively? Its native instructions can pretty much do
everything MMX can do and then some. MMX was a kludge on top of the x86
instruction set to allow it to do SIMD operations. IA-64 started out right
from the beginning as a MIMD (multiple-instruction, multiple-data)
operator; that would mean it can do SIMD as a subset of its MIMD.

> They both support little-endian operand storage, although the Itanium
> adds the ability to switch into big-endian operation.

Yup, but many architectures have this ability too, and no one accuses them
of being supersets of x86.

> You may well say this isn't much, but it's enough that the Itanium is
> a superset of the 386. There aren't any features of the Pentium that
> it leaves out that would make it awkward to use the back end of an
> Itanium chip as the back end of a Pentium.
>
> True, it leaves out some of the memory model stuff, and it has extra
> operations such as population count.
>
> Also, it uses sixteen registers at a time instead of eight.
>
> Of course, there are major differences at the *front* end.
>
> The instruction format is very different: 386 architecture
> instructions are built one byte at a time, while Itanium instructions
> have a fixed length of 41 bits, being organized three to a 128-bit
> block, which also includes five bits to specify parallelism and
> instruction group.
>
> You could claim that all computers pretty much do the same thing, so
> that saying the instructions do the same kind of things on a Pentium
> and an Itanium doesn't make them any more similar than any two other
> computers taken at random.
>
> But this isn't true. Take a Pentium and a System/360. Different
> floating-point formats. No stack on the 360. No MMX on the 360. No
> variable-length packed decimal instructions on the Pentium.
>
> But then there is one significant difference in the middle as well:
> the Itanium only has a flat memory model.

What exactly are you trying to say here? First you start out saying that an
Itanium is just an x86 superset, and then you proceed to list just about
all of the *major* architectural differences that make it obviously not an
x86 superset.

I know that you're an Itanium fan, and would dearly love to see it continue.
But everything you're suggesting to make Itanium more x86-compatible would
require a major internal redesign of the processor, plus it would add
circuitry that could only increase the die size of the Itanium, which would
make it even more uneconomical to build. I have no idea how you think such a
beast could even begin to compete against a real x86 processor on
performance, let alone on price/performance.

Isn't it at all conceivable that the perfect upgrade path from x86 came from
AMD and not Intel this time? AMD's 64-bit modifications added slightly less
than 5% to the overall die size. In the meantime, AMD went ahead and added
next-generation features such as an internal memory controller and
HyperTransport to help increase performance. The Itanium seems to be all
instruction set and cache.

Unfortunately, most of Itanium's future was already glaringly obvious a few
years before it ever came out. The path taken was wrong.

Yousuf Khan


John Savard

Sep 28, 2004, 4:21:08 AM
On Mon, 27 Sep 2004 21:00:35 GMT, Morten Reistad
<firs...@lastname.pr1v.n0> wrote, in part:

>Intel finally learned?

It could be.

Also, I do see a mistake in my post: the fields that specify a register
on the Itanium are usually seven bits long, not four.

John Savard
http://home.ecn.ab.ca/~jsavard/index.html

Maynard Handley

Sep 28, 2004, 4:33:30 AM
In article <XridneHV29Q...@rogers.com>,
"Yousuf Khan" <bbb...@ezrs.com> wrote:

> Unfortunately, most of Itanium's future was already glaringly obvious a few
> years before it ever came out. The path taken was wrong.
>
> Yousuf Khan

One could argue that Itanium failed because
(1) the architecture (meaning the whole thing: ISA, OS model, idea of
pushing lotsa hard work into the compiler) is stupid, at least for the
environment of now and the foreseeable future OR
(2) the architecture is stupid, for now, but makes sense for say ten
years from now, it was just, sadly, too soon, OR
(3) the architecture is actually reasonable, even for now, but has been
horribly implemented, (eg [a] the idea of pushing this as a very very
expensive chip rather than shipping cheap low-end variants, or [b]
perhaps an argument that this is yet one more example of how, if you get
in bed with MS you'll wake up with AIDS; that if Intel hadn't allowed MS
to achieve so overwhelming a monopoly position in OSs, they wouldn't
have been so at the mercy of whether MS decided when and how to support
Itanium, that Linux support by Intel, nice as it is, does just not
affect a large enough segment of the market to change things)

Now Yousuf is simply asserting that the problem was situation (1),
something I don't think is justified. My opinion, given everything I've
read, is that I've never seen a convincing case for any particular one
of these arguments. (By which I mean, yeah, it sure looks like
Itanium is a failure, but WHY did it fail?)

Maynard

Morten Reistad

Sep 28, 2004, 7:30:36 AM
In article <name99-F94513.01331728092004@localhost>,

I see the problem a little differently; it is the x86 jail that
locks people in, and how we get out of it.

No matter what we say about Microsoft; they CANNOT be in the forefront
of an ISA change. Too many customers and too much at stake. They
can support, but not lead.

This means that steps out must be evolutionary, not revolutionary.
Bridges have to exist back and forth for a long time.

The Itanium has a 486 backwards mode; but you need a whole migration
plan to go from '386 to <whatever>. The designers of the Alpha saw this.
This means we need a large tool chest: compiler support, OS support,
binary translation support, etc.

Without OS support it is probably hard; but you can do incremental
steps there as well. Transmeta has built in a code morpher; such a
beast could be a solution to run legacy code; and have a "native"
mode on top. They could even run different OS'es.

Starting out such a project with the top, expensive model is
pure lunacy. A small, cheap proof-of-concept model would be faster
to market and could win converts in an evolutionary fashion.

Now, wait a minute; small cheap proof-of-concept that was fast to
market; am I not describing Transmeta?

-- mrr

Alex Johnson

Sep 28, 2004, 8:01:26 AM
John Savard wrote:

> You may well say this isn't much, but it's enough that the Itanium is a

> superset of the 386. <cut>


>
> True, it leaves out some of the memory model stuff

Okay. Right there you contradict yourself. You call Itanium a superset of
the 80386 that does not include everything the 80386 does. By definition a
superset must include everything; therefore it is not a true superset.

> Also, it uses sixteen registers at a time instead of eight.

How do you figure that? I count far more than 16. If you just count
general registers, I see 128 addressable at any instant. If you aren't
that restrictive, I count nearly a thousand registers total.

> But then there is one significant difference in the middle as well: the
> Itanium only has a flat memory model.

Maybe my lingo differs from yours, but Itanium has a flat memory mode
(physical) and a paged memory mode (virtual). True, it doesn't have
segmentation. Your understanding of computer architecture complexities
is still young. Keep studying. Connecting two front ends to a computer
is infinitely more complex than saying, "Let it be so!"

Alex
--
My words are my own. They represent no other; they belong to no other.
Don't read anything into them or you may be required to compensate me
for violation of copyright. (I do not speak for my employer.)

Alex Johnson

Sep 28, 2004, 8:07:32 AM
John Savard wrote:
> On Mon, 27 Sep 2004 17:12:22 +0200, Grumble <dev...@kma.eu.org> wrote,
> in part:
>
>
>>Could you please explicitly state what IPF and IA-32 share?
>
>
> They both have stacks.

I forgot to attack this. This is a bland, general statement. Yes, they
both have stacks, but neither has a single stack that performs the same
function as the other's. The x86 has a generalized stack in memory for
anything you want to store in it. There are push and pop operations for
registers. Your IP and status register are automatically pushed on
interruptions (if I remember correctly...it's been many years since I
stopped using x86 assembly...many years). On Itanium there is a backing
store for the register stack engine, which is hardware-controlled and not
guaranteed to be coherent with the actual state of the call stack
(registers need not be spilled if the RSE is not full). There is no
stack that gets used for interruptions, and you can't just say "save r107
off to the stack" since there are no push or pop directives.

Nick Maclaren

Sep 28, 2004, 8:28:13 AM

In article <cjbjqm$700$1...@news01.intel.com>,

Alex Johnson <comp...@jhu.edu> writes:
|>
|> Okay. Right there you contradict yourself. Itanium is a superset of
|> 80386 that does not include everything the 80386 does. By definition of
|> superset it must, therefore it is not a true superset.

Well, if people are allowed to talk about extended subsets, why not
restricted supersets? Now we can start to discuss the difference
between:

An extended subset
A restricted superset
A separate but overlapping set
A copy of the set with differences

If that is too easy, we could consider the effect of the failure of
Bell's inequality on the meaning of set intersection. Or, indeed,
whether the Itanium can be regarded as a superset (or subset, for
that matter) of the Manchester Baby.

|> > Also, it uses sixteen registers at a time instead of eight.
|>
|> How do you figure that? I count far more than 16. If you just count
|> general registers, I see 128 addressable at any instant. If you aren't
|> that restrictive, I count nearly a thousand registers total.

Yeah. There is about 4 KB of architected context.

|> ... Connecting two front ends to a computer

|> is infinitely more complex than saying, "Let it be so!"

You clearly aren't executive material :-)


Regards,
Nick Maclaren.

Grumble

Sep 28, 2004, 8:54:02 AM
Yousuf Khan wrote:

> Isn't it at all conceivable that the perfect upgrade path
> from x86 came from AMD and not Intel this time? AMD's 64-bit
> modifications added slightly less than 5% to the overall die
> size.

Five percent?

Newcastle (144 mm^2) is 25% larger than Barton (115 mm^2).

I imagine the integrated memory controller does take some die
space, but your figure seems very low.

--
Regards, Grumble

Nick Maclaren

Sep 28, 2004, 9:04:50 AM

In article <cjbmta$4v6$1...@news-rocq.inria.fr>,

He said 5% for the 64-bitness, not for the new design. There are
a LOT of other changes.


Regards,
Nick Maclaren.

Grumble

Sep 28, 2004, 9:07:38 AM
Morten Reistad wrote:

> The Itanium has a 486 backwards mode; but you need a whole migration
> plan to go from '386 to <whatever>. The designers of the Alpha saw this.

Are you referring to Digital's FX!32 ?

Perhaps you already know that Intel has been actively pursuing a
similar strategy, the IA-32 Execution Layer. They plan to remove
IA-32 hardware support altogether.

Wolves in CISC Clothing (Paul DeMone)
http://www.realworldtech.com/page.cfm?ArticleID=RWT122803224105&p=8

--
Regards, Grumble

Grumble

Sep 28, 2004, 9:57:06 AM
Nick Maclaren wrote:

> Grumble wrote:
>
>> Yousuf Khan wrote:
>>
>>> Isn't it at all conceivable that the perfect upgrade path
>>> from x86 came from AMD and not Intel this time? AMD's 64-bit
>>> modifications added slightly less than 5% to the overall die
>>> size.
>>
>> Five percent?
>>
>> Newcastle (144 mm^2) is 25% larger than Barton (115 mm^2).
>>
>> I imagine the integrated memory controller does take some die
>> space, but your figure seems very low.
>
> He said 5% for the 64-bitness, not for the new design. There are
> a LOT of other changes.

Are there?

Integrated memory controller.
Two additional pipeline stages.
Slightly larger integer scheduler buffers.
GPRs and media registers x2 -> update instruction decoder.
ECC-protected (??) L2 cache.
64-bit (wider registers, LS buffers, and other buffers).

What did I forget? How much would you say each accounts for?

--
Regards, Grumble

Nick Maclaren

Sep 28, 2004, 10:15:47 AM

In article <cjbqji$646$1...@news-rocq.inria.fr>,

Grumble <dev...@kma.eu.org> writes:
|> >
|> > He said 5% for the 64-bitness, not for the new design. There are
|> > a LOT of other changes.
|>
|> Are there?
|>
|> Integrated memory controller.
|> Two additional pipeline stages.
|> Slightly larger integer scheduler buffers.
|> GPRs and media registers x2 -> update instruction decoder.
|> ECC-protected (??) L2 cache.
|> 64-bit (wider registers, LS buffers, and other buffers).
|>
|> What did I forget? How much would you say each accounts for?

Hypertransport support
Improved/enlarged SMP support

Also doubling the number of general-purpose registers has quite
a lot of knock-on effects, far more than going from 32 to 64 bits
does.

I am not a hardware person, but the above list and your figure of
25% overall increase is fully compatible with the 64-bitness adding
slightly less than 5%.


Regards,
Nick Maclaren.

Mitch Alsup

Sep 28, 2004, 11:34:30 AM
jsa...@excxn.aNOSPAMb.cdn.invalid (John Savard) wrote in message news:<4158482f...@news.ecn.ab.ca>...

> You may well say this isn't much, but it's enough that the Itanium is a
> superset of the 386. There aren't any features of the Pentium that it
> leaves out that would make it awkward to use the back end of an Itanium
> chip as the back end of a Pentium.

80-bit FP anyone?

Alan Balmer

Sep 28, 2004, 11:39:16 AM
On Tue, 28 Sep 2004 11:30:36 GMT, Morten Reistad
<firs...@lastname.pr1v.n0> wrote:

>The Itanium has a 486 backwards mode; but you need a whole migration
>plan to go from '386 to <whatever>. The designers of the Alpha saw this.
>This means we need a large tool chest: compiler support, OS support,
>binary translation support etc.
>
>Without OS support it is probably hard; but you can do incremental
>steps there as well. Transmeta has built in a code morpher; such a
>beast could be a solution to run legacy code; and have a "native"
>mode on top. They could even run different OS'es.
>
>Starting out such a project with the top, expensive model is
>pure lunacy. A small, cheap proof-of-concept model would be faster
>to market and could win converts in an evolutionary fashion.
>

I'm a little confused - what are we talking about here? The HP-UX
Itanium systems indeed have a "code morpher", though it's not
hardware. Also, Itaniums are not particularly expensive - they're
considerably cheaper than the PA-RISC systems we're using now. It does
run different OS's, too. HP ships HP-UX and Linux, and Win2003 is
available for it.

If you're looking for a system to run Win XP on your desktop, that's
not what it's for.

--
Al Balmer
Balmer Consulting
removebalmerc...@att.net

Eric Gouriou

Sep 28, 2004, 12:09:03 PM

Check. Or what was the question?

(Actually the Itanium may have "one more bit" than x86 (81?), although
I am pretty ignorant of FP format details, so don't trust me on that.)

From what our libm folks are saying, 80-bit FP is a godsend for
implementing fast and accurate float and double operations (and
somewhat fast quad operations in software).

Eric

Stephen Fuld

Sep 28, 2004, 12:06:49 PM

"Maynard Handley" <nam...@name99.org> wrote in message
news:name99-F94513.01331728092004@localhost...

Note subject change.

I think it is some combination of 1 and 2, but with another twist. My
memory says that in the beginning, the fundamental idea behind Itanium was
that the OOO stuff in a CPU scales as something like n**2, where n is the
number of instructions executing simultaneously, and that this wouldn't
scale well as the degree of parallelism went up. Therefore, they did an
in-order machine to eliminate this "bottleneck" and expected the
parallelism to come from the compilers.

Now fast forward to the actual implementation. While the major premise
might be right, it appears that the compilers, at least so far, cannot
generate enough parallelism to cause the n**2 stuff to be a problem in OOO
CPUs. Furthermore, between the basic idea and the final architecture, it
appears that the design suffered "featuritis" to the point where the actual
implementation was slower than perhaps was initially projected. And lastly,
they didn't anticipate the tremendous progress in performance in the IA-32
area, probably driven by competition from AMD. This made what might
otherwise have been good performance seem worse by comparison.

As for a low-end implementation, it appears that the only way to get
reasonable performance from IA-64 is to have lots of cache. This causes a
large die size and thus higher expense. I don't think having a low-cost
"mass market" version of the chip would have helped much. It would have
been even slower, so why would a substantial number of people choose it? It
really wasn't a desktop chip for the masses. They did make development
systems available for those people doing serious development work, so they
did seed that market.

So its performance was less than originally expected, and the competing
systems turned out to have better performance than was expected. Overall,
it just wasn't a big enough improvement over what else was available to
make a case for customers to pay the price (conversion, etc.) for its
adoption.

--
- Stephen Fuld
e-mail address disguised to prevent spam


Stefan Monnier

Sep 28, 2004, 12:34:02 PM
> Now fast forward to the actual implementation. While the major premise
> might be right, it apears that the compilers, at least so far, cannot
> generate enough parallelism to cause the n**2 stuff to be a problem in OOO

Actually, it can be brought down to O(n log n), so even if we could extract
much more parallelism, we wouldn't suffer from O(n^2) behavior.


Stefan

Rupert Pigott

Sep 28, 2004, 12:35:36 PM
Alan Balmer wrote:

> On Tue, 28 Sep 2004 11:30:36 GMT, Morten Reistad
> <firs...@lastname.pr1v.n0> wrote:
>
>>The Itanium has a 486 backwards mode; but you need a whole migration
>>plan to go from '386 to <whatever>. The designers of the Alpha saw this.
>>This means we need a large tool chest: compiler support, OS support,
>>binary translation support etc.
>>
>>Without OS support it is probably hard; but you can do incremental
>>steps there as well. Transmeta has built in a code morpher; such a
>>beast could be a solution to run legacy code; and have a "native"
>>mode on top. They could even run different OS'es.
>>
>>Starting out such a project with the top, expensive model is
>>pure lunacy. A small, cheap proof-of-concept model would be faster
>>to market and could win converts in an evolutionary fashion.

[SNIP]

> hardware. Also, Itaniums are not particularly expensive - they're
> considerably cheaper than the PA-RISC systems we're using now. It does

The question that matters is: are Itaniums cheaper in terms
of bang/buck than x86s? If the answer is no, then you buy
x86 boxes as well as Itanics.

Comparing PA-RISC to Itaniums is kinda silly too, because HP
have been actively trying to move customers onto the Itanium
and they have complete control of PA-RISC pricing. A more
realistic price comparison would be against an architecture
that addresses a similar market that is not under their
control. POWER and SPARC for example.

Cheers,
Rupert

Yousuf Khan

Sep 28, 2004, 12:23:56 PM
John Savard wrote:
> On Mon, 27 Sep 2004 21:00:35 GMT, Morten Reistad
> <firs...@lastname.pr1v.n0> wrote, in part:
>> In article <4158482f...@news.ecn.ab.ca>,
>> John Savard <jsa...@excxn.aNOSPAMb.cdn.invalid> wrote:
>
>>> But then there is one significant difference in the middle as well:
>>> the Itanium only has a flat memory model.
>
>> Intel finally learned?
>
> It could be.

There is a memory management feature in Itanium that allows it to use
something like up to 85 bits of address space, using a form of segmentation.
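
As a sketch of where that figure comes from (drawn from the published
IA-64 manuals, not from this post): bits 63:61 of a 64-bit pointer select
one of eight region registers, each holding a region ID of up to 24 bits,
so 24 + 61 = 85 bits of global virtual address:

```python
# Sketch, assuming the IA-64 region-register scheme: the top 3 bits of
# a 64-bit pointer select one of 8 region registers; the selected
# (up to 24-bit) region ID replaces those 3 bits.

REGION_ID_BITS = 24
OFFSET_BITS = 61   # low 61 bits of the 64-bit pointer survive as-is

def global_virtual_address(region_registers, pointer64):
    """Map a 64-bit pointer to its 85-bit global virtual address."""
    selector = pointer64 >> OFFSET_BITS            # top 3 bits
    offset = pointer64 & ((1 << OFFSET_BITS) - 1)  # low 61 bits
    region_id = region_registers[selector]         # up to 24 bits
    return (region_id << OFFSET_BITS) | offset

assert REGION_ID_BITS + OFFSET_BITS == 85          # the "85 bits"
```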

Yousuf Khan


Yousuf Khan

unread,
Sep 28, 2004, 1:21:27 PM9/28/04
to

Yup, 25% is the overall increase from the previous generation. But that
includes memory controller, Coherent Hypertransports, slightly tweaked
pipeline (12 stages vs. 10 stages integer), *plus* the 64-bit extensions
(including instruction set and extra registers). The 5% is just that last
little bit of it.

Yousuf Khan


Yousuf Khan

unread,
Sep 28, 2004, 1:10:51 PM9/28/04
to

And you failed to mention another possibility, that:

(4) the architecture had no previous built-up base (i.e. x86) to build on.
If Itanium was a reasonable x86 processor, it could've done exactly like
what Opteron is now doing: waited for its own software base to form while it
ran the software base from a previous architecture flawlessly.

I think that's the actual reason it failed, nothing to do with compiler
complexities, whether there was enough Microsoft support or not, etc. Both
Microsoft and Intel are slaves to a much bigger god: x86 compatibility.
Neither Microsoft nor Intel with their huge market moving size have ever
been successful outside of the x86 sphere. They exist at the pleasure of the
x86 god, and Intel just learned the lesson of daring to displease the x86
god -- it bestowed success upon AMD, who pleased the x86 god.

Yousuf Khan


Yousuf Khan

unread,
Sep 28, 2004, 1:23:02 PM9/28/04
to
Grumble wrote:
> Are there?
>
> Integrated memory controller.
> Two additional pipeline stages.
> Slightly larger integer scheduler buffers.
> GPRs and media registers x2 -> update instruction decoder.
> ECC-protected (??) L2 cache.
> 64-bit (wider registers, LS buffers, and other buffers).
>
> What did I forget? How much would you say each accounts for?

All of that seems reasonable for a mere 25% increase in die size.

Yousuf Khan


glen herrmannsfeldt

unread,
Sep 28, 2004, 1:58:51 PM9/28/04
to

Yousuf Khan wrote:

> Maynard Handley wrote:

(snip)

>>One could argue that Itanium failed because
>>(1) the architecture (meaning the whole thing: ISA, OS model, idea of
>>pushing lotsa hard work into the compiler) is stupid, at least for the
>>environment of now and the foreseeable future OR

I believe it is a good idea in general, but it doesn't
scale well. If you use such an architecture with a JIT,
or, sort of similar to JIT but convert the whole thing at
load time to native code, it can work. The less you
put into the compiler the longer you can stretch the architecture
out by improving the processor.

>>(2) the architecture is stupid, for now, but makes sense for say ten
>>years from now, it was just, sadly, too soon, OR

As far as I know, the mistake was trying for x86 compatibility along
with a new, good, RISC design.

>>(3) the architecture is actually reasonable, even for now, but has
>>been horribly implemented, (eg [a] the idea of pushing this as a very
>>very expensive chip rather than shipping cheap low-end variants, or
>>[b] perhaps an argument that this is yet one more example of how, if
>>you get in bed with MS you'll wake up with AIDS; that if Intel hadn't
>>allowed MS to achieve so overwhelming a monopoly position in OSs,
>>they wouldn't have been so at the mercy of whether MS decided when
>>and how to support Itanium, that Linux support by Intel, nice as it
>>is, does just not affect a large enough segment of the market to
>>change things)

(snip)

> And you failed to mention another possibility, that:

> (4) the architecture had no previous built-up base (i.e. x86) to build on.
> If Itanium was a reasonable x86 processor, it could've done exactly like
> what Opteron is now doing: waited for its own software base to form while it
> ran the software base from a previous architecture flawlessly.

For dedicated server applications you can break from the past
with a whole new architecture. Trying to do that well AND achieve
x86 compatibility at the same time AND make an affordable chip...

> I think that's the actual reason it failed, nothing to do with compiler
> complexities, whether there was enough Microsoft support or not, etc. Both
> Microsoft and Intel are slaves to a much bigger god: x86 compatibility.
> Neither Microsoft nor Intel with their huge market moving size have ever
> been successful outside of the x86 sphere. They exist at the pleasure of the
> x86 god, and Intel just learned the lesson of daring to displease the x86
> god -- it bestowed success upon AMD, who pleased the x86 god.

SPARC, Alpha, HP-PA, MIPS, all exist, though the market share
isn't huge.

-- glen

Robert Myers

unread,
Sep 28, 2004, 3:27:56 PM9/28/04
to
Stefan Monnier wrote:

n is the issue width? The complexity of OoO scheduling circuitry was
claimed to be quadratic in the issue width in, for example,
_Itanium_Rising_. If it were to be the issue width, what would a
comparison of O(n log n) to O(n^2) mean, with n=8 (the fallacy of
asymptotic arguments for non-asymptotic n)?

RM

Stefan Monnier

unread,
Sep 28, 2004, 3:39:38 PM9/28/04
to
> n is the issue width? The complexity of OoO scheduling circuitry was
> claimed to be quadratic in the issue width in, for example,
> _Itanium_Rising_.

The Ultrascalar project claimed otherwise, with fairly compelling arguments.
Of course, the constant factor is very important, but the original claim
was O(n^2) so it's only fair to concentrate on the asymptotic behavior.


Stefan

Robert Myers

unread,
Sep 28, 2004, 4:01:59 PM9/28/04
to
Stefan Monnier wrote:

Andy Glew was dismissive when I brought up the O(n^2) argument in
discussing scheduling strategies here; I interpreted his response as
meaning that the estimate would imply naive circuit design.

I apologize if my question about asymptotic estimates came off as
smart-kid-at-the-front-of-the-class. As you imply, I might as well have
questioned the agenda at the first appearance of a big Oh for such small
n. I interpreted the O(n^2) estimate as: the complexity rapidly gets
out of hand as you increase the issue width. Comparing O(n^2) to O(n
log n) seemed like shaving things just a little too fine. :-).

If you're doing FFT's that small (and even big FFT's are made up of
little FFT's), the asymptotic ops count is useless, and in fact, overly
pessimistic (as maybe the whole world knows).
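
The small-n point can be made concrete (constant factors of 1 are assumed
for both cost functions, purely for illustration; the unknown circuit
constants are the whole argument):

```python
import math

# At the issue widths real schedulers use, n^2 and n log n are not far
# apart; at n=8 they differ by less than a factor of three, well within
# reach of the constant factors the big-O notation hides.
costs = {n: (n * math.log2(n), n ** 2) for n in (4, 8, 16, 64)}

for n, (nlogn, nsq) in costs.items():
    print(f"n={n:3d}  n log n = {nlogn:7.1f}  n^2 = {nsq:5d}")
```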

RM

Nick Maclaren

unread,
Sep 28, 2004, 4:41:00 PM9/28/04
to
In article <Xqj6d.273411$mD.159443@attbi_s02>,

Robert Myers <rmyer...@comcast.net> wrote:
>Stefan Monnier wrote:
>
>>>n is the issue width? The complexity of OoO scheduling circuitry was
>>>claimed to be quadratic in the issue width in, for example,
>>>_Itanium_Rising_.

Hmm. As built in the City of Rl'yeh ....

>> The Ultrascalar project claimed otherwise, with fairly compelling arguments.
>> Of course, the constant factor is very important, but the original claim
>> was O(n^2) so it's only fair to concentrate on the asymptotic behavior.
>
>Andy Glew was dismissive when I brought up the O(n^2) argument in
>discussing scheduling strategies here; I interpreted his response as
>meaning that the estimate would imply naive circuit design.

It is extremely common for algorithms to be O(N) best case, O(N log N)
for random input and O(N^2) worst case. The problem with HPC is that
it tends to be either best or worst case.


Regards,
Nick Maclaren.

Pleasant Thrip

unread,
Sep 28, 2004, 5:00:57 PM9/28/04
to

Yeah, also Newcastle has half the cache (512KB) of the Opteron (1MB)
from which the original "5%" figure came. So, maybe nearer 10% than 5%
for Newcastle.

I'm sure you know, Yousuf.

Brian Inglis

unread,
Sep 28, 2004, 5:12:18 PM9/28/04
to
On Mon, 27 Sep 2004 17:28:20 GMT in alt.folklore.computers,
jsa...@excxn.aNOSPAMb.cdn.invalid (John Savard) wrote:

>On Mon, 27 Sep 2004 17:12:22 +0200, Grumble <dev...@kma.eu.org> wrote,
>in part:
>
>>Could you, please, explicitely state what IPF and IA-32 share?
>
>They both have stacks.
>

>They both use the same data formats: IEEE-754 floating-point numbers and
>two's complement integers. (Of course, _this_ is shared with the Power
>PC, the Motorola 68000, and virtually all modern computers.)
>
>They both have instructions that divide a 64-bit operand into multiple
>parts for parallel arithmetic - the MMX instruction set is present on
>the native Itanium architecture.
>
>They both support little-endian operand storage, although the Itanium
>adds the ability to switch into big-endian operation.


>
>You may well say this isn't much, but it's enough that the Itanium is a
>superset of the 386. There aren't any features of the Pentium that it
>leaves out that would make it awkward to use the back end of an Itanium
>chip as the back end of a Pentium.

That would have to be true to allow the IA-32 decode/control to
execute IA-64 instructions/groups instead of IA-32 microops.

>True, it leaves out some of the memory model stuff, and it has extra
>operations such as population count.


>
>Also, it uses sixteen registers at a time instead of eight.

128

--
Thanks. Take care, Brian Inglis Calgary, Alberta, Canada

Brian....@CSi.com (Brian[dot]Inglis{at}SystematicSW[dot]ab[dot]ca)
fake address use address above to reply

Charlie Gibbs

unread,
Sep 28, 2004, 8:35:38 PM9/28/04
to
In article <n8-dneY48ai...@rogers.com>, bbb...@ezrs.com
(Yousuf Khan) writes:

>There is a memory management feature in Itanium that allows it to
>use something like upto 85-bits of address space, using a form of
>segmentation.

Dammit, it would be much easier to write my GOD (General Oracle
Dispenser) program if I didn't have to worry about that 128TB wall.

--
/~\ cgi...@kltpzyxm.invalid (Charlie Gibbs)
\ / I'm really at ac.dekanfrus if you read it the right way.
X Top-posted messages will probably be ignored. See RFC1855.
/ \ HTML will DEFINITELY be ignored. Join the ASCII ribbon campaign!

Yousuf Khan

unread,
Sep 28, 2004, 11:57:54 PM9/28/04
to
Charlie Gibbs wrote:
> In article <n8-dneY48ai...@rogers.com>, bbb...@ezrs.com
> (Yousuf Khan) writes:
>
>> There is a memory management feature in Itanium that allows it to
>> use something like upto 85-bits of address space, using a form of
>> segmentation.
>
> Dammit, it would be much easier to write my GOD (General Oracle
> Dispenser) program if I didn't have to worry about that 128TB wall.

Isn't the memory limit of a 64-bit architecture something like 16
Giga-Gigabytes (what is that? a Yotabyte?)?

Yousuf Khan


Bill Leary

unread,
Sep 29, 2004, 12:46:04 AM9/29/04
to
"Yousuf Khan" <bbb...@ezrs.com> wrote in message
news:WcGdnTFpE-m...@rogers.com...

> > Dammit, it would be much easier to write my GOD (General Oracle
> > Dispenser) program if I didn't have to worry about that 128TB wall.
>
> Isn't the memory limit of a 64-bit architecture something like 16
> Giga-Gigabytes (what is that? a Yotabyte?)?

Assuming that the address bus comes out as a full 64 bits, it looks like it's 16 exabytes.

From (a version of) the Jargon File, at
http://www.jargon.8hz.com/html/Q/quantifiers.html

prefix decimal binary
kilo- 1000^1 1024^1 = 2^10 = 1,024
mega- 1000^2 1024^2 = 2^20 = 1,048,576
giga- 1000^3 1024^3 = 2^30 = 1,073,741,824
tera- 1000^4 1024^4 = 2^40 = 1,099,511,627,776
peta- 1000^5 1024^5 = 2^50 = 1,125,899,906,842,624
exa- 1000^6 1024^6 = 2^60 = 1,152,921,504,606,846,976
zetta- 1000^7 1024^7 = 2^70 = 1,180,591,620,717,411,303,424
yotta- 1000^8 1024^8 = 2^80 = 1,208,925,819,614,629,174,706,176
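
A quick arithmetic check of both figures, using nothing beyond the table
above:

```python
# A full 64-bit byte-addressed space is 2^64 bytes = 16 * 2^60 bytes,
# i.e. 16 exabytes (binary exa-), not yottabytes.
EXA = 2 ** 60
assert 2 ** 64 == 16 * EXA

# And "16 Giga-Gigabytes" works out to the same number:
GIGA = 2 ** 30
assert 16 * GIGA * GIGA == 2 ** 64
```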

- Bill


Brian Inglis

unread,
Sep 29, 2004, 12:44:52 AM9/29/04
to
On Tue, 28 Sep 2004 12:23:56 -0400 in alt.folklore.computers, "Yousuf
Khan" <bbb...@ezrs.com> wrote:

MS will need it for the OS after Longhorn.

John Savard

unread,
Sep 29, 2004, 2:44:15 AM9/29/04
to
On Tue, 28 Sep 2004 21:12:18 GMT, Brian Inglis
<Brian....@SystematicSW.Invalid> wrote, in part:

>On Mon, 27 Sep 2004 17:28:20 GMT in alt.folklore.computers,
>jsa...@excxn.aNOSPAMb.cdn.invalid (John Savard) wrote:

>>Also, it uses sixteen registers at a time instead of eight.
> 128

Yes, I commented on that mistake in my post, which was unfortunate.

John Savard
http://home.ecn.ab.ca/~jsavard/index.html

Nick Maclaren

unread,
Sep 29, 2004, 3:59:17 AM9/29/04
to

In article <WcGdnTFpE-m...@rogers.com>,

"Yousuf Khan" <bbb...@ezrs.com> writes:
|>
|> Isn't the memory limit of a 64-bit architecture something like 16
|> Giga-Gigabytes (what is that? a Yotabyte?)?

A lottabytes, anyway.


Regards,
Nick Maclaren.

Grumble

unread,
Sep 29, 2004, 4:09:04 AM9/29/04
to
Eric Gouriou wrote:

> Mitch Alsup wrote:


>
>> John Savard wrote:
>>
>>> You may well say this isn't much, but it's enough that the Itanium is a
>>> superset of the 386. There aren't any features of the Pentium that it
>>> leaves out that would make it awkward to use the back end of an Itanium
>>> chip as the back end of a Pentium.
>>
>> 80-bit FP anyone?
>
> Check. Or what was the question ?
>
> (Actually the Itanium may have "one more bit" than x86 (81?),
> although I am pretty ignorant of FP formats details so don't
> trust me on that)
>
> From what our libm folks are saying, 80bits FP is a god-send to
> implement fast and accurate float and double operations (and
> somewhat fast quad operations in software).

Hello Eric,

Itanium Architecture Software Developer's Manual
Volume 1: Application Architecture
http://intel.com/design/itanium/manuals/245317.pdf#page=90

Floating-point Programming Model

5.1.2 Floating-point Register Format

Data contained in the floating-point registers can be either integer
or real type. The format of data in the floating-point registers is
designed to accommodate both of these types with no loss of information.

Real numbers reside in 82-bit floating-point registers in a
three-field binary format (see Figure 5-1). The three fields are:

o The 64-bit significand field, b63.b62b61..b1b0, contains the
number's significant digits. This field is composed of an explicit
integer bit (significand{63}), and 63 bits of fraction
(significand{62:0}).

o The 17-bit exponent field locates the binary point within or
beyond the significant digits (i.e., it determines the number's
magnitude). The exponent field is biased by 65535 (0xFFFF). An
exponent field of all ones is used to encode the special values for
IEEE signed infinity and NaNs. An exponent field of all zeros and a
significand field of all zeros is used to encode the special values
for IEEE signed zeros. An exponent field of all zeros and a non-zero
significand field encodes the double-extended real denormals and
double-extended real pseudo-denormals.

o The 1-bit sign field indicates whether the number is positive
(sign=0) or negative (sign=1).
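
A decoder for the three-field format just quoted can be sketched as
follows. Hedged: the significand chosen for infinity (integer bit set,
fraction zero) is an assumption beyond the quoted text, and the
(pseudo-)denormal encodings are deliberately left out:

```python
# Decoder sketch for the 82-bit register format quoted above:
# 1 sign bit + 17-bit exponent (bias 65535) + 64-bit significand with
# an explicit integer bit.

EXP_BIAS = 65535          # 0xFFFF, per the quoted manual
EXP_ALL_ONES = 0x1FFFF    # 17 bits of ones

def decode_fp82(sign, exponent, significand):
    """Value of an 82-bit FP register given its three fields."""
    s = -1.0 if sign else 1.0
    if exponent == EXP_ALL_ONES:                   # infinities and NaNs
        return s * float('inf') if significand == 1 << 63 else float('nan')
    if exponent == 0:
        if significand == 0:
            return s * 0.0                         # IEEE signed zero
        raise NotImplementedError("(pseudo-)denormal encodings omitted")
    integer_bit = significand >> 63                # explicit integer bit
    fraction = (significand & ((1 << 63) - 1)) / 2.0 ** 63
    return s * (integer_bit + fraction) * 2.0 ** (exponent - EXP_BIAS)

assert decode_fp82(0, EXP_BIAS, 1 << 63) == 1.0        # 1.0 x 2^0
assert decode_fp82(1, EXP_BIAS + 1, (1 << 63) | (1 << 62)) == -3.0
```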

By the way, Ellie and Verra say hello ;-)

--
Regards, Grumble

Alex Johnson

unread,
Sep 29, 2004, 8:21:10 AM9/29/04
to

Itanium does use floating-point-extended data types. Check the books;
there's support for *fe instructions, which are this 80 (82?) bit format.

Alex
--
My words are my own. They represent no other; they belong to no other.
Don't read anything into them or you may be required to compensate me
for violation of copyright. (I do not speak for my employer.)

Andy Freeman

unread,
Sep 29, 2004, 12:16:16 PM9/29/04
to
Stefan Monnier <mon...@iro.umontreal.ca> wrote in message news:<jwvekkmxk1t.fsf-...@gnu.org>...

The value of N is also very important. OoO processors use a small
number of PCs to coordinate activities. A single PC can't reasonably
name a huge number of operations, so N tends to be somewhat small.
(N=50 may make sense, but N=10k probably doesn't, even if the cost was
O(n).)

Yes, I know that there are sections of code where a single PC value
could initiate 1000s of operations. However, Amdahl's Law applies.

Outside of some very specialized problems, massively parallel systems
tend to have lots of concurrent PCs for some pretty good reasons.

Stephen Fuld

unread,
Sep 29, 2004, 2:46:06 PM9/29/04
to

"Stefan Monnier" <mon...@iro.umontreal.ca> wrote in message
news:jwvekkmxk1t.fsf-...@gnu.org...

Just to be clear, my statement was about what I remember Intel believing,
not what I believe, for I really know too little about it to have an
informed opinion. BTW, was the date of the Ultrascalar project before or
after the initial Itanium design started?

Del Cecchi

unread,
Sep 30, 2004, 12:59:44 PM9/30/04
to

"Maynard Handley" <nam...@name99.org> wrote in message
news:name99-F94513.01331728092004@localhost...
> In article <XridneHV29Q...@rogers.com>,
> "Yousuf Khan" <bbb...@ezrs.com> wrote:
>
> > Unfortunately, most of Itanium's future was already glaringly obvious a
few
> > years before it ever came out. The path taken was wrong.
> >
> > Yousuf Khan
>
> One could argue that Itanium failed because
> (1) the architecture (meaning the whole thing: ISA, OS model, idea of
> pushing lotsa hard work into the compiler) is stupid, at least for the
> environment of now and the foreseeable future OR
> (2) the architecture is stupid, for now, but makes sense for say ten
> years from now, it was just, sadly, too soon, OR
> (3) the architecture is actually reasonable, even for now, but has been
> horribly implemented, (eg [a] the idea of pushing this as a very very
> expensive chip rather than shipping cheap low-end variants, or [b]
> perhaps an argument that this is yet one more example of how, if you get
> in bed with MS you'll wake up with AIDS; that if Intel hadn't allowed MS
> to achieve so overwhelming a monopoly position in OSs, they wouldn't
> have been so at the mercy of whether MS decided when and how to support
> Itanium, that Linux support by Intel, nice as it is, does just not
> affect a large enough segment of the market to change things)
>
> Now Yousuf is simply asserting that the problem was situation (1),
> something I don't think is justified. My opinion, given everything I've
> read, is that I've never seen a convincing case for any particular one
> of these four argument. (By which I mean, yeah, it sure looks like
> Itanium is a failure, but WHY did it fail?)
>
> Maynard

Aren't your (1) and (2) equivalent, i.e. it was bad for now and the foreseeable
future (10 years being over the event horizon in this context)?

And I don't know that Intel "allowed MS to achieve....". Arguably IBM did,
but Intel didn't so far as I know.

I think perhaps IA64 is or was affected by the dichotomy between (large)
servers and desktop single user boxes. Intel's bread and butter is the
desktop box, while HP was more into servers. A couple quotes can cover that
situation

"no (chip) can serve two masters...." (Jesus)
"It's a floor wax AND a dessert topping" (Dan Aykroyd)

We shall see what we shall see. It seems likely that IA64 will not be the
high volume chip, but that doesn't mean it can't be successful in large
servers and HPC. Will that be good enough for Intel?

"future cloudy, ask again later" (magic 8 ball)

del cecchi

Tom Linden

unread,
Sep 30, 2004, 1:57:05 PM9/30/04
to
On Thu, 30 Sep 2004 11:59:44 -0500, Del Cecchi <cecchi...@us.ibm.com>
wrote:

If it is a failure, then it is because Intel will have lost interest due to
other competitive pressures, and HP will have to shoulder the bill for the
ongoing development required to keep it competitive, which was one of the
major reasons for the partnership between the two in the first place.

--
Using Opera's revolutionary e-mail client: http://www.opera.com/m2/

dorothy

unread,
Oct 1, 2004, 7:59:31 PM10/1/04
to
Is the Itanium failing, or is it simply a competitive sector?

Big-iron erodes bottom-up, as SGI-pricing found:
o On price - Opteron & even Dual-Xeon are cheaper, capable solutions
o On performance - Multi-Opteron in cheap boxes from IBM etc

Big-iron is lower volume, economic buyers have specific usage:
o Generic applications can be done cheaply with generic iron
---- Dual-Xeon/Opteron off the shelf offer a lot, for a lot less
---- 64-bit in business tends to be for dbase - small %age of servers
---- 64-bit also has cheaper multi-vendor solutions using Opteron
o Specific applications which need Itanium have other options
---- 64-bit technical will ITB from the broader 64-bit mkt - SUN

Big-iron fulfills a requirement spec, and how much the economic buyer
can make the vendor send round a kitchen sink in the right colour.

More specialised h/w solutions involve single-point failure (so a
more involved SLA/service-contract for it) and specialisation requires
a longer-term TCO study re how viable that solution (v Dual-Xeon or
Dual/Quad-Opteron) is long-term for the same bang for the buck.

Itanium on the workstation mkt was always a "service leader":
o Post Y2k the spend is "what cost do you save or revenue create?"
o Selling Itanium must have been fun re airmails per unit sold
o Selling against Opteron boxes probably comes down to whether
the application really justifies it, or whether it is a small
part of a much larger IT package and so "amortised" within that

So did Dell enjoy trying to fill the space filled by Opteron with
the Itanium2 offering, being unable (perhaps) to sell AMD v Intel?

Business faces 2 costs:
o Cost of a body
---- human as input/process/output workflow at the end of an RJ45
---- soln - redefine the process around an RJ45 & outsource it
o Cost of enabling that body
---- IT costs look even uglier against outsourced labour rates
---- soln - probably why Dual Xeon whilst "old" still does well

I suspect Itanium2 buyers come into 3 groups:
o Server - Financial Services re dbase & less price sensitive
---- product is sell-side redistributing money from buy-side
o Workstation - Academic, Research, Engineering, Gov't
---- here Itanium2 as a solution is a small part of the budget
o Bundled with IT contract - ie, buy HP services & get some Itanium2s
---- someone is buying them as Intel boasts about figs
---- is it like inkjet/laser consumables re the back-end support revenue?

Is Opteron a failure because it's not on every desktop?
o Even for generic PCs it is expensive & boards aren't cheap
How many buyers know how much better it is over say Athlon64?
o Branded IT service contact tie-in still sells a lot of Dual-Athlons

No-one got fired for buying the cheapest bid (but should have, have
you seen the state of the avg gov't or defence project, oh never mind).

If it's science/research/eng, then it is possible the task will scale
over a cluster of machines - dual-Xeon or whatever comes cheaper. If
you argue "it's all in the interconnect", then does switching fabric
displace the raw benefit of Itanium2 - particularly cost of several?

I can't help but think that Itanium2 was plain late.
o It was designed for economic buyers & usage projections years earlier
o So when it arrived, it hadn't outpaced the "big-iron-erosion curve"
---- a case of time to mkt for a window that had moved away from cash-cow
---- instead it became as much a "brand-cow" re who-is-fastest-PR/title

Never seen a breakdown of buyer usage of Itanium2.
It ain't generic servers that's for sure re price - when you get into
serious $s you also get into a lot of justification and then if the prj
changes and you want to re-use that pile of sand, what economics then?
Sure, business "should" write-off the sunk cost, but business at the
moment despite low-interest-rate environment is setting hurdle rates
for projects too high and simply playing "show me the cost-saving".

Follow the money - if the money is flowing to outsourcing, it comes down
to whether the outsourcing company is going to choose a cheap as chips
generic solution or just use Itanium2 where it will shine economically.
Itanium2 usage does make good ad-copy/PR - so a factor against its cost.

Before Itanium2 clustering of e-commerce to dbase (SQL) to email (Exchange)
was weak - and so scaling to big-iron had buyers. ISPs routinely plumbed
up for ever bigger SUN boxes. As MSFT went clustering, Linux came along so
perhaps Itanium2 got left with the research/eng/academic/financial end.

Beta was a better architecture than VHS, we still got (and have) VHS.
AMD kept Intel speeding up the low-end, perhaps faster than Intel expected
and particularly with the move to AMD64 & Opteron, so Itanium2 got 'caught
by Moore's law coming up from below'. Economic buyers tend to have momentum,
some would call it ostrichism, but change is another thing to manage.

Intel sold enough other chips on the basis of no change, it just runs faster,
and will no doubt continue doing so for a lot longer. When a buyer is either
academic, research or eng, s/w changes or optimisation are costed differently.
When you want N revenue due to M performance it's just a box to realise that.

Will be interesting to see how long it takes before "lower" CPUs to match it,
at least on performance. A server is more than performance, re sum of the parts
so Itanium 2 still has a niche. If Intel has to 'go backwards' to go forwards
I don't see much wrong with that - it's planning that on Post-Prescott re P-M.
For intel it's generating a revenue-creation portfolio, and reusing knowledge.

Not sure how the Itanium2 architecture pans out LT tho, and that may have been
the bigger barrier since it was late & the goalposts had moved. AMD did the
cheap mass-market chip - Athlon64, with Intel now playing "AMD compatibility".

A lot of the projections pre/around Y2k were nonsense anyway, projections were
the tool W/S sell-side manipulated the stock prices by, and so sold the paper.
Itanium2 needed uptake critical mass faster than lower-chips came up to it, and
it simply didn't achieve that fast/broad enough to justify the sticker price.

Itanium2 servers are a different market to Itanium2 workstations tho, and
that is a market outside the perhaps limited academic/research buyers. A lot
of post Sep-11 has been Distributed, Redundancy, Real-Estate as the price of
coloco per 1U has climbed (Tokyo for example, but many other places too). It
at least allowed IBM/others to sell a lot of various blade incarnations.

John Savard

unread,
Oct 2, 2004, 11:45:12 AM10/2/04
to
On Sun, 26 Sep 2004 07:17:14 GMT, jsa...@excxn.aNOSPAMb.cdn.invalid
(John Savard) wrote, in part:
>On Sat, 25 Sep 2004 08:55:07 GMT, Brian Inglis
><Brian....@SystematicSW.Invalid> wrote, in part:

>>there are also 6 integer units, 4 FMAC units (2 DP + 2 SP),
>>4 MMX units, 3 branch units, 2 load + 2 store units, IA-32 decode/
>>control unit, Advanced Load Address Table, 128 integer + 128 FP
>>registers, and 8 branch registers. And they're adding OoO execution.

>Ah. So the Itanium is really, really superscalar.

And my next obvious question:

Does the Opteron have anything like that?

How about the population count instructions of the Itanium?
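
(Population count just tallies the 1-bits in a word; Itanium provides it
as a single instruction, where the x86 of that era had to synthesise it
in software. A classic portable sketch of the operation, illustrative
only and not Intel's implementation:)

```python
# Parallel bit-counting: fold the word into 2-bit, 4-bit, then 8-bit
# partial counts, and sum the byte counts with one multiply.

def popcount64(x):
    """Count the 1-bits in a 64-bit word."""
    x = x - ((x >> 1) & 0x5555555555555555)                       # 2-bit counts
    x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333)  # 4-bit
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0F                       # 8-bit
    return ((x * 0x0101010101010101) & 0xFFFFFFFFFFFFFFFF) >> 56  # sum bytes

assert popcount64(0) == 0
assert popcount64(0xFFFFFFFFFFFFFFFF) == 64
assert popcount64(0b1011) == 3
```
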

The Itanium lacks SSE and SSE2, of course. But the Opteron is sold as
competing against the Xeon, not the Itanium.

Right now, the Itanium is a bit ahead of the curve; for throughput, at
least, Opterons are more cost-effective.

But when you have a sequential problem, and want to get the answer to
that one problem as soon as possible, other processors in parallel can't
help: since real-world applications are often only imperfectly
parallelizable, it has always seemed to me that starting with the
fastest uniprocessor one can obtain is usually a very good idea.

Of course, megapipelining is simply an inexpensive way to get
parallelism; it doesn't allow one to perform logically dependent
instructions one per cycle. Instead of an Itanium, perhaps what is
really needed is a really fast 486. But the technology to make a 486 out
of ECL instead of CMOS, even with a 90nm BiCMOS process, probably isn't
available at this time.

Thus, it seems to me that the problems with the Itanium are just another
symptom of the old truism - you will know the pioneers by the arrows in
their backs. The commercial success of the Itanium at present may be
limited, but for Intel to abandon that project would be to abandon an
important contribution to R&D. To *stay* in business, they need to stay
ahead, and not rest on their laurels - or their cash cows.

But when they can come out with an Itanium 3 that applies the full
processing power of the chip to executing 386/Pentium instructions, with
only the minimum performance degradation resulting from not having
optimized EPIC code to give the chip... then the investment in Itanium
might start to pay off.

A new architecture is more easily insinuated than imposed.

John Savard
http://home.ecn.ab.ca/~jsavard/index.html

Tom Linden

unread,
Oct 2, 2004, 11:56:55 AM10/2/04
to

Pipelining was always intended to avoid stalls. Are you aware of anyone
using active logic?


>
> Thus, it seems to me that the problems with the Itanium are just another
> symptom of the old truism - you will know the pioneers by the arrows in
> their backs. The commercial success of the Itanium at present may be
> limited, but for Intel to abandon that project would be to abandon an
> important contribution to R&D. To *stay* in business, they need to stay
> ahead, and not rest on their laurels - or their cash cows.

Remains to be seen whether it is an important contribution.


>
> But when they can come out with an Itanium 3 that applies the full
> processing power of the chip to executing 386/Pentium instructions, with
> only the minimum performance degradation resulting from not having
> optimized EPIC code to give the chip... then the investment in Itanium
> might start to pay off.

Now, it would be more interesting (for me anyway) if they supported the VAX
instruction set.


>
> A new architecture is more easily insinuated than imposed.
>
> John Savard
> http://home.ecn.ab.ca/~jsavard/index.html

--

Yousuf Khan

unread,
Oct 2, 2004, 1:52:21 PM10/2/04
to
John Savard wrote:
> But when they can come out with an Itanium 3 that applies the full
> processing power of the chip to executing 386/Pentium instructions,
> with only the minimum performance degradation resulting from not
> having optimized EPIC code to give the chip... then the investment in
> Itanium might start to pay off.
>
> A new architecture is more easily insinuated than imposed.

By the time such an Itanium is brought out, it will not only need to emulate
386/Pentium 32-bit instructions, but also AMD64/EM64T 64-bit instructions.
As well as all of the intermediate x86 language instructions in between,
such as SSEx.

Yousuf Khan


Kelli Halliburton

unread,
Oct 2, 2004, 2:14:46 PM10/2/04
to
On Sat, 02 Oct 2004 13:52:21 -0400, Yousuf Khan wrote:
> By the time such an Itanium is brought out, it will not only need to emulate
> 386/Pentium 32-bit instructions, but also AMD64/EM64T 64-bit instructions.
> As well as all of the intermediate x86 language instructions in between,
> such as SSEx.

It turns out that Intel is already building an Opteron clone. However, it
does not feature IA-64 compatibility; it is almost, for practical
purposes at least, a concession that the single-processor 64-bit upgrade
path for x86 is the Opteron.

Yousuf Khan

unread,
Oct 2, 2004, 5:49:51 PM10/2/04
to

Yes, that's already known, it's not just being built, it is already here.
However, this particular thread is about how to save Itanium from the
scrapheap.

Yousuf Khan


Peter Grandi

unread,
Oct 2, 2004, 6:05:05 PM10/2/04
to
>>> On 1 Oct 2004 16:59:31 -0700, dorothy....@ntlworld.com
>>> (dorothy) said:

dorothy.bradbury> Is the Itanium failing, or is it simply a competive
dorothy.bradbury> sector? Big-iron erodes bottom-up, as SGI-pricing
dorothy.bradbury> found:

dorothy.bradbury> o On price - Opteron & even Dual-Xeon are cheaper,
dorothy.bradbury> capable solutions
dorothy.bradbury> o On performance - Multi-Opteron in cheap boxes from
dorothy.bradbury> IBM etc [ ... ]

Ah, interesting and agreeable platform marketing reflections...

But then I just read somewhere an interview with some big Intel cheese
and his line on Itanium seems amazing to me: my understanding of what
he was saying is that they are going to drop prices on Itanium so much
that in a couple of years one will be able to buy at the same price
point much the same kit with either Itanium or Xeon (EM64T I guess, I
think that IA32 is basically dead), and since the Itanium based kit at
the same price point will be about twice as fast, everybody will stop
using Xeon.

This sounds like a truly major shift in pricing strategy for Intel, and
too bad I can't remember where I saw the interview.

John Savard

Oct 2, 2004, 8:51:11 PM
On Sat, 02 Oct 2004 08:56:55 -0700, "Tom Linden" <t...@kednos.com> wrote,
in part:

>Pipelining was always intended to avoid stalls.

Huh?

Stalls are what you *get* when you pipeline. So you try to avoid stalls
when pipelining through various elaborate techniques.

But the purpose of pipelining itself is just to allow the ALU to work on
multiple instructions at once, at various stages of completion, instead
of one instruction at a time. (Which would totally avoid stalls, of
course.)
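To make the overlap (and the stalls) concrete, here is a toy model of an
in-order pipeline; the five-stage depth, two-cycle result latency, and the
(dest, sources) instruction format are made-up illustration parameters, not
numbers from any real CPU:

```python
# Toy 5-stage in-order pipeline: one instruction issues per cycle,
# but issue stalls when an instruction reads a register written by a
# recent instruction whose result is not yet available (RAW hazard).

def pipeline_cycles(instrs, depth=5, result_latency=2):
    """Each instr is (dest_reg, [src_regs]). Returns total cycles."""
    ready_at = {}          # register -> cycle its value is available
    cycle = 0
    for dest, srcs in instrs:
        # stall until all source operands are ready
        start = max([cycle] + [ready_at.get(r, 0) for r in srcs])
        cycle = start + 1                  # one issue slot per cycle
        ready_at[dest] = cycle + result_latency - 1
    return cycle + depth - 1               # drain the pipeline

independent = [("r1", []), ("r2", []), ("r3", [])]
dependent   = [("r1", []), ("r2", ["r1"]), ("r3", ["r2"])]

print(pipeline_cycles(independent))  # fully overlapped, no stalls
print(pipeline_cycles(dependent))    # RAW hazards insert stall cycles
```

Running both cases shows the dependent chain taking more cycles than the
independent one on the same hardware, which is exactly the stall penalty
the elaborate techniques try to hide.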

John Savard
http://home.ecn.ab.ca/~jsavard/index.html

Tom Linden

Oct 2, 2004, 8:54:55 PM

You are right, I didn't mean stalls.

John Dallman

Oct 3, 2004, 5:34:00 AM
In article <415eca51...@news.ecn.ab.ca>,
jsa...@excxn.aNOSPAMb.cdn.invalid (John Savard) wrote:

> But when they can come out with an Itanium 3 that applies the full
> processing power of the chip to executing 386/Pentium instructions, with
> only the minimum performance degradation resulting from not having
> optimized EPIC code to give the chip... then the investment in Itanium
> might start to pay off.

I think you may have missed some of the point. Said point being that
because of the explicit parallelism, the Itanium didn't need the complex
logic for observing and tracking dependencies and doing speculative
execution that seems to be required for high-throughput x86. And the lack
of need for such logic was why it could have all these execution units.

This tends to make the "minimum performance degradation" you specify
something like the two-thirds that one observes with current
implementations.

Now that they can have a whole heap more transistors, Intel might, of
course, try designing a processor with all that dependency logic for x86,
and maybe even try applying it to IA-64 code, abandoning the premise that
the compilers can do that job better. Really big design job though, and
currently they seem to be spending the transistors on dual-core, to
provide something they can call HyperThreading, plus bigger caches. Much
easier to design, but unlikely to solve the throughput problems in the
general case.

---
John Dallman j...@cix.co.uk
"Any sufficiently advanced technology is indistinguishable from a
well-rigged demo"

Peter Grandi

Oct 3, 2004, 1:18:32 PM
>>> On Sat, 02 Oct 2004 23:05:05 +0100, pg...@0409.exp.sabi.co.UK (Peter
>>> Grandi) said:

[ ... Itanium price, performance, positioning etc. ... ]

pg_nh> But then I just read somewhere an interview with some big Intel
pg_nh> cheese and his line on Itanium seems amazing to me : my
pg_nh> understanding of what he was saying is that they are going to
pg_nh> drop prices on Itanium so much that in a couple of years one will
pg_nh> be able to buy at the same price point much the same kit with
pg_nh> either Itanium or Xeon [ ... ] This sounds like a truly major
pg_nh> shift in pricing strategy for Intel, and too bad I can't remember
pg_nh> where I saw the interview.

Bah! I have found the original reference, and it is an Intel PR, and it is
from six months ago, and I noticed it only recently... But it is still
quite remarkable, even if the price parity will be three, not two, years
from now, contrary to what I remembered.

http://WWW.Intel.com/pressroom/archive/releases/20040413comp.htm

<<The price/performance improvements of these new processors are the
next step toward achieving Intel's goal of delivering Itanium 2
based systems with up to twice the performance as Intel(R) Xeon*
processor based systems for the same system cost in 2007.

Intel has two server architectures, which makes up approximately 85
percent of the server market segment share.****

The Itanium 2 processor family is targeted at business critical
enterprise servers and technical computing clusters while the Intel
Xeon processor family is broadly used for general purpose IT
infrastructure.

"In the next few years, system manufacturers will be able to design
an Itanium 2 processor and Intel Xeon processor-based system using
the same low cost components," Dracott said. "Every product and
technology we roll out moves us one step closer to a common system
with common infrastructure costs.">>

Grumble

Oct 3, 2004, 1:52:41 PM
John Dallman wrote:

> Now that they can have a whole heap more transistors, Intel
> might, of course, try designing a processor with all that
> dependency logic for x86, and maybe even try applying it to IA-64
> code, abandoning the premise that the compilers can do that job
> better. Really big design job though, and currently they seem to
> be spending the transistors on dual-core, to provide something
> they can call HyperThreading, plus bigger caches. Much easier to
> design, but unlikely to solve the throughput problems in the
> general case.

HyperThreading is the name given by Intel's marketing department
to the implementation of SMT (or plain MT, I haven't read any
whitepapers on it) on NetBurst. Calling the implementation of SMT
(or, again, plain MT, I don't know) on Itanium "HyperThreading" is
somewhat misleading, in my humble opinion, since the implementation
will likely be quite different :-)

--
Regards, Grumble

John Dallman

Oct 3, 2004, 2:27:00 PM
In article <41603c47$0$8333$626a...@news.free.fr>, dev...@kma.eu.org
(Grumble) wrote:

> HyperThreading is the name given by Intel's marketing department
> to the implementation of SMT (or plain MT, I haven't read any
> whitepapers on it) on NetBurst. Calling the implementation of SMT
> (or, again, plain MT, I don't know) on Itanium "HyperThreading" is
> somewhat misleading, in my humble opinion, since the implementation
> will likely be quite different :-)

The great thing about using marketing terms to describe your products is
that they can have whatever meaning you currently want.

Mark Hahn

Oct 3, 2004, 4:56:10 PM
> http://WWW.Intel.com/pressroom/archive/releases/20040413comp.htm

> <<The price/performance improvements of these new processors are the
> next step toward achieving Intel's goal of delivering Itanium 2
> based systems with up to twice the performance as Intel(R) Xeon*
> processor based systems for the same system cost in 2007.

the claim is really "we'll have a unified xeon/it2 chipset some day".
they don't claim CPU-price parity, and they don't really claim parity
of cpu-included-systems, or of the optimal system for each processor.

this is actually a pretty weak claim.

> The Itanium 2 processor family is targeted at business critical
> enterprise servers and technical computing clusters while the Intel
> Xeon processor family is broadly used for general purpose IT
> infrastructure.

the "business critical" thing is pure marketing dung. the "technical
computing" part is pretty amusing if you notice that, for instance,
all of ia64's impressive specFP results from a handful of specFP
components that run entirely in-cache. a great deal of technical
computing is *not* cache-friendly.

> "In the next few years, system manufacturers will be able to design
> an Itanium 2 processor and Intel Xeon processor-based system using
> the same low cost components," Dracott said. "Every product and

right. same chipset to support both, and possibly a very low-end it2
which costs the same as a high-end xeon.

Bernd Paysan

Oct 3, 2004, 4:52:59 PM
Grumble wrote:
> HyperThreading is the name given by Intel's marketing department
> to the implementation of SMT (or plain MT, I haven't read any
> whitepapers on it) on NetBurst. Calling the implementation of SMT
> (or, again, plain MT, I don't know) on Itanium "HyperThreading" is
> somewhat misleading, in my humble opinion, since the implementation
> will likely be quite different :-)

A short definition of terms:

MT, multithreading, is the execution of several programs (threads)
interleaved on one CPU core. Interleaved means that the CPU core
executes one instruction (or one bundle of instruction in a VLIW case),
and then switches to another thread in a round-robin fashion. Tera's
machine works that way, as well as Cray's CDC6600 IO processor.

SMT, simultaneous multithreading, is the execution of several programs
(threads) simultaneous on one CPU core. Simultaneous means that the CPU
core queues instructions of several programs (quite likely interleaved,
but several instructions per cycle), but executes them when the
resources become ready, with normal OoO execution, so that several
instructions of different threads are executed in the same cycle. Alpha
EV8 was supposed to do that, but since this project was axed, we have
Pentium 4 and POWER5 which implement SMT (both just 2-way).

Itanic's "HyperThreading" fits more with the category of MT (not
simultaneous), with the exception that a context switch is only done
when there's a long-latency operation stalling the execution of one
task. I.e. not one-per-cycle interleaving like Tera or CDC6600 IO
processor, but one-per-stall interleaving.
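The one-per-stall scheme is easy to model. Below is a toy cycle-count
sketch of it (purely illustrative: the 10-cycle miss penalty, the
single-issue core, and the 'i'/'m' thread encoding are assumptions, not
numbers from any real Itanium):

```python
# Toy model of switch-on-stall multithreading (the one-per-stall
# interleaving described above) versus running threads back-to-back.

MISS = 10  # extra cycles a thread sleeps after a long-latency op

def run_sequentially(threads):
    """No multithreading: each miss stalls the whole machine."""
    return sum(1 + MISS * (op == 'm') for th in threads for op in th)

def run_soemt(threads):
    """Switch on stall: on a miss, put the thread to sleep for MISS
    cycles and hand the core to another runnable thread."""
    pcs = [0] * len(threads)          # next op per thread
    wake = [0] * len(threads)         # cycle the thread may run again
    cycle, cur = 0, 0
    while any(pc < len(th) for pc, th in zip(pcs, threads)):
        runnable = [t for t in range(len(threads))
                    if pcs[t] < len(threads[t]) and wake[t] <= cycle]
        if not runnable:              # every thread stalled: burn a cycle
            cycle += 1
            continue
        cur = cur if cur in runnable else runnable[0]
        op = threads[cur][pcs[cur]]
        pcs[cur] += 1
        cycle += 1
        if op == 'm':                 # long-latency op: stall this thread
            wake[cur] = cycle + MISS
    return cycle

threads = ["iimii", "iiiii"]          # 'i' = 1-cycle op, 'm' = miss
print(run_sequentially(threads))      # misses serialize everything
print(run_soemt(threads))             # the other thread runs under a miss
```

The second thread's work is hidden under the first thread's miss latency,
which is the whole payoff of this style of MT; per-cycle interleaving
(Tera-style) and true SMT differ only in when and how finely the switch
happens.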

How much will this increase the die size? Pentium 4's HyperThreading
uses only a few %, because if enabled, the register file is split into
two, one half for each thread (that's easy to do, because there are
still 128 renamed registers for 8 architectural ones). The decoder is
still the same limiting 3-way decoder. In POWER5's case, the decoder is
doubled (5 to 2*5), and probably also the number of reservation
stations, so IBM claims a 24% area increase of the CPU core due to SMT.
Itanic still needs to double the size of the (already quite large)
register file, but that's it.

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/

Stephen Fuld

Oct 4, 2004, 12:03:05 AM

"Bernd Paysan" <bernd....@gmx.de> wrote in message
news:be5632-...@vimes.paysan.nom...

snip

> A short definition of terms:
>
> MT, multithreading, is the execution of several programs (threads)
> interleaved on one CPU core. Interleaved means that the CPU core
> executes one instruction (or one bundle of instruction in a VLIW case),
> and then switches to another thread in a round-robin fashion. Tera's
> machine works that way, as well as Cray's CDC6600 IO processor.
>
> SMT, simultaneous multithreading, is the execution of several programs
> (threads) simultaneous on one CPU core. Simultaneous means that the CPU
> core queues instructions of several programs (quite likely interleaved,
> but several instructions per cycle), but executes them when the
> resources become ready, with normal OoO execution, so that several
> instructions of different threads are executed in the same cycle. Alpha
> EV8 was supposed to do that, but since this project was axed, we have
> Pentium 4 and POWER5 which implement SMT (both just 2-way).
>
> Itanic's "HyperThreading" fits more with the category of MT (not
> simultaneous), with the exception that a context switch is only done
> when there's a long-latency operation stalling the execution of one
> task. I.e. not one-per-cycle interleaving like Tera or CDC6600 IO
> processor, but one-per-stall interleaving.

Isn't this generally called "Switch on Event Multi-Threading", or SOEMT? You
have then usefully defined the basic concept (MT) and the two divisions into
which it can be divided. SOEMT can be further divided into categories
depending on what the "event" is that causes a switch, i.e. on Tera it is
an instruction, on IBM's Northstar, it was an L2 (?) cache miss, etc.

I think you have performed a service in stating these definitions (assuming
others agree to use them), as it will clarify communication.

Randy Howard

Oct 4, 2004, 5:37:56 AM
In article <ktidnahuark...@rogers.com>, bbb...@ezrs.com says...

> However, this particular thread is about how to save Itanium from the
> scrapheap.

Not going to happen. It's already rusting in the back 40.

--
Randy Howard (2reply remove FOOBAR)

Del Cecchi

Oct 4, 2004, 11:33:48 AM

"Bernd Paysan" <bernd....@gmx.de> wrote in message
news:be5632-...@vimes.paysan.nom...
>
snip
> A short definition of terms:
>
> MT, multithreading, is the execution of several programs (threads)
> interleaved on one CPU core. Interleaved means that the CPU core
> executes one instruction (or one bundle of instruction in a VLIW case),
> and then switches to another thread in a round-robin fashion. Tera's
> machine works that way, as well as Cray's CDC6600 IO processor.
>
> SMT, simultaneous multithreading, is the execution of several programs
> (threads) simultaneous on one CPU core. Simultaneous means that the CPU
> core queues instructions of several programs (quite likely interleaved,
> but several instructions per cycle), but executes them when the
> resources become ready, with normal OoO execution, so that several
> instructions of different threads are executed in the same cycle. Alpha
> EV8 was supposed to do that, but since this project was axed, we have
> Pentium 4 and POWER5 which implement SMT (both just 2-way).
>
> Itanic's "HyperThreading" fits more with the category of MT (not
> simultaneous), with the exception that a context switch is only done
> when there's a long-latency operation stalling the execution of one
> task. I.e. not one-per-cycle interleaving like Tera or CDC6600 IO
> processor, but one-per-stall interleaving.

You're kidding! Isn't this the same thing that this group refused to
classify as SMT when the Star series processors did it lo these many years
ago?

And they call it HyperThreading?

del cecchi


Alan Balmer

Oct 4, 2004, 11:31:42 AM
On Sat, 02 Oct 2004 23:05:05 +0100, pg...@0409.exp.sabi.co.UK (Peter
Grandi) wrote:

Price parity in 2007, is what I remember from a recent development
workshop.

--
Al Balmer
Balmer Consulting
removebalmerc...@att.net

Alan Balmer

Oct 4, 2004, 11:34:18 AM
On 3 Oct 2004 20:56:10 GMT, Mark Hahn
<ha...@coffee.psychology.mcmaster.ca> wrote:

>> http://WWW.Intel.com/pressroom/archive/releases/20040413comp.htm
>
>> <<The price/performance improvements of these new processors are the
>> next step toward achieving Intel's goal of delivering Itanium 2
>> based systems with up to twice the performance as Intel(R) Xeon*
>> processor based systems for the same system cost in 2007.
>
>the claim is really "we'll have a unified xeon/it2 chipset some day".
>they don't claim CPU-price parity, and they don't really claim parity
>of cpu-included-systems, or of the optimal system for each processor.
>
>this is actually a pretty weak claim.
>
>> The Itanium 2 processor family is targeted at business critical
>> enterprise servers and technical computing clusters while the Intel
>> Xeon processor family is broadly used for general purpose IT
>> infrastructure.
>
>the "business critical" thing is pure marketing dung.

Not entirely. The Itanium has reliability and integrity features not
found on the Xeon.

> the "technical
>computing" part is pretty amusing if you notice that, for instance,
>all of ia64's impressive specFP results from a handful of specFP
>components that run entirely in-cache. a great deal of technical
>computing is *not* cache-friendly.
>
>> "In the next few years, system manufacturers will be able to design
>> an Itanium 2 processor and Intel Xeon processor-based system using
>> the same low cost components," Dracott said. "Every product and
>
>right. same chipset to support both, and possibly a very low-end it2
>which costs the same as a high-end xeon.


Peter Dickerson

unread,
Oct 4, 2004, 11:38:08 AM10/4/04
to
"Del Cecchi" <cecchi...@us.ibm.com> wrote in message
news:2sd8qtF...@uni-berlin.de...

Yep, I think so. I think the conclusion was that comp.arch-consensus SMT
required that instructions from different theads can be in the same
pipestage at the same time.

Peter


Nick Maclaren

Oct 4, 2004, 11:59:24 AM

In article <2sd8qtF...@uni-berlin.de>,

"Del Cecchi" <cecchi...@us.ibm.com> writes:
|>
|> > Itanic's "HyperThreading" fits more with the category of MT (not
|> > simultaneous), with the exception that a context switch is only done
|> > when there's a long-latency operation stalling the execution of one
|> > task. I.e. not one-per-cycle interleaving like Tera or CDC6600 IO
|> > processor, but one-per-stall interleaving.
|>
|> You're kidding! Isn't this the same thing that this group refused to
|> classify as SMT when the Star series processors did it lo these many years
|> ago?
|>
|> And they call it HyperThreading?

I told you that would happen :-)


Regards,
Nick Maclaren.

Bernd Paysan

Oct 4, 2004, 12:05:32 PM
Stephen Fuld wrote:
> Isn't this generally called "Switch on Event Multi-Threading, or SOEMT?

Yes, I've heard that name.

> You have then usefully defined the basic concept (MT) and the two
> divisions into
> which it can be divided. SOEMT can be further divided into catagories
> depending on what the "event" is that causes a switch. i.e. on Tera it is
> an instruction, on IBM's Northstar, it was an L2 (?) cache miss, etc.

I wouldn't call an instruction an "event". If the word "event" makes sense,
it's something that's not as regular as the next instruction (which is
bound to come, anyway).

Peter Dickerson

Oct 4, 2004, 12:22:52 PM
"Nick Maclaren" <nm...@cus.cam.ac.uk> wrote in message
news:cjrs0s$27m$1...@gemini.csx.cam.ac.uk...

And now I have to agree with Nick! I think he did.

Peter


Stephen Fuld

Oct 4, 2004, 1:05:36 PM

"Del Cecchi" <cecchi...@us.ibm.com> wrote in message
news:2sd8qtF...@uni-berlin.de...
>

Because the X86 takes CISC instructions and converts them into RISC micro
ops which are then cached and executed, it has an extra complication. The
way I understand it, only one CISC instruction can be in the decode stage at
a time, but multiple RISC micro-ops, from different threads can be executing
on the functional units simultaneously. So it is sort of a hybrid. But
given that at least "pieces" of instructions from different threads can be
executing on different functional units simultaneously, I would still call
it a form of SMT. But of course YMMV.

Stephen Fuld

Oct 4, 2004, 1:05:37 PM

"Bernd Paysan" <bernd....@gmx.de> wrote in message
news:cv8832-...@miriam.mikron.de...

> Stephen Fuld wrote:
>> Isn't this generally called "Switch on Event Multi-Threading, or SOEMT?
>
> Yes, I've heard that name.
>
>> You have then usefully defined the basic concept (MT) and the two
>> divisions into
>> which it can be divided. SOEMT can be further divided into catagories
>> depending on what the "event" is that causes a switch. i.e. on Tera it
>> is
>> an instruction, on IBM's Northstar, it was an L2 (?) cache miss, etc.
>
> I wouldn't call an instruction an "event". If the word "event" makes
> sense,
> it's something that's not as regular as the next instruction (which is
> bound to come, anyway).

OK, then your taxonomy would have three divisions of MT: SMT, SOEMT, and
Tera-style, or instruction-interleaved MT. The key thing here is that in
both Tera and SOEMT, instructions from only one thread are executing on the
FUs at a time, whereas in SMT, more than one thread is on the FUs
simultaneously.

Andrew Gabriel

Oct 4, 2004, 1:04:30 PM
In article <25r2m0tvifvlfemf6...@4ax.com>,

Alan Balmer <alba...@att.net> writes:
>
> Price parity in 2007, is what I remember from a recent development
> workshop.

This has to depend on volumes being vaguely in parity (at least, not
orders of magnitude difference as now), or cross-subsidising it from
somewhere else, such as Xeon, which means a competitor who is not
using their revenue to subsidise some other product can undercut you
(like AMD, for instance).

Intel has to get Itanium's volume way up for the chip to become
viable. The volumes involved in HP's servers alone don't come close
which is why HP had to stop pursuing HP-PA, and with Intel only so
far managing to ship to that market (all other Itanium consumers are
currently insignificant), Itanium is now at best in same state as
HP-PA was, with sales way too low to fund development.

It's not clear how it can recover from this point. I can't see that
any high volume vendor is going to touch it as there are no
applications for it. Without high volume, there's no funding for
developing it (or even keeping it up with other processors).

--
Andrew Gabriel

Nick Maclaren

Oct 4, 2004, 2:05:32 PM
In article <Bpf8d.485851$OB3.4...@bgtnsc05-news.ops.worldnet.att.net>,

Stephen Fuld <s.f...@PleaseRemove.att.net> wrote:
>
>
>OK, then your taxonomy would have three divisions of MT: SMT, SOEMT, and
>Tera-style, or instruction-interleaved MT. The key thing here is that in
>both Tera and SOEMT, instructions from only one thread are executing on the
>FUs at a time, whereas in SMT, more than one thread is on the FUs
>simultaneously.

How would you classify my design :-)


Regards,
Nick Maclaren.

Iain McClatchie

Oct 4, 2004, 2:26:57 PM
John> I think you may have missed some of the point. Said point being that
John> because of the explicit parallelism, the Itanium didn't need the complex
John> logic for observing and tracking dependencies and doing speculative
John> execution that seems to be required for high-throughput x86. And the lack
John> of need for such logic was why it could have all these execution units.

I would be really interested if any of the architecture folks working on
the Itanium actually believed this, and if so, in what time frame. Jim
Hull, care to comment?

Register dependencies are just not very hard to deduce from the instruction
encoding, especially among instructions within the same basic block. It
takes power, but so does moving bits that explicitly encode dependencies
through a cache system.
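Within a basic block the deduction really is just bookkeeping: each source
operand's producer is the most recent earlier write to that register. A
sketch of that logic (the 3-operand (dest, src1, src2) format is a made-up
illustration, not the IA-64 or x86 encoding):

```python
# Deducing RAW register dependencies within a basic block: for each
# source operand, the producer is the latest earlier instruction
# that wrote the same register.

def raw_deps(block):
    """block: list of (dest, src1, src2) register names.
    Returns {instruction index: set of producer indices}."""
    last_writer = {}                  # register -> index of latest writer
    deps = {}
    for i, (dest, *srcs) in enumerate(block):
        deps[i] = {last_writer[r] for r in srcs if r in last_writer}
        last_writer[dest] = i         # this write shadows earlier ones
    return deps

block = [("r1", "r9", "r9"),          # 0: r1 = f(r9, r9)
         ("r2", "r1", "r9"),          # 1: reads r1 -> depends on 0
         ("r1", "r9", "r9"),          # 2: rewrites r1, no RAW deps
         ("r3", "r1", "r2")]          # 3: depends on 2 (r1) and 1 (r2)
print(raw_deps(block))
```

One linear pass with a small map per register; the hardware equivalent is
a rename/scoreboard lookup, which is why deducing these dependencies is
not the expensive part of an OoO x86.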

Does the Itanium encode any kind of explicit dependency information about
memory operands? This would be a wonderful bit of information for the
hardware, but I suspect that the hardware is better at collecting it than
the compiler is.

Anton Ertl

Oct 4, 2004, 2:12:09 PM
"Stephen Fuld" <s.f...@PleaseRemove.att.net> writes:
>Because the X86 takes CISC instructions and converts them into RISC micro
>ops which are then cached and executed, it has an extra complication. The
>way I understand it, only one CISC instruction can be in the decode stage at
>a time, but multiple RISC micro-ops, from different threads can be executing
>on the functional units simultaneously.

The Pentium 4 can decode only one instruction per cycle. However, in
the better cases the microcode it runs comes out of the trace cache in
decoded form (potentially from more than one instruction per cycle).
Other IA32 implementations can decode more than one instruction per
cycle (with some restrictions).

Concerning the multithreading, the Pentium 4 supplies 6
microinstructions from one thread every two cycles. On the next beat,
they might come from the other thread. The ooo execution engine can
process microinstructions from different threads in the same stage (of
parallel pipelines) in the same cycle, though. Is this SMT? Mostly,
I guess.

It's unclear to me though why you discuss the Pentium 4
multithreading, when Del Cecchi asked about the future Itanium
multithreading.

Back to IA-64 and other machines. How hard would it be to let the
machines execute instructions from two different groups
simultaneously, and how much would it buy? My guess is that for a
6-issue engine like we have seen until now, it would buy quite a bit
of performance. Even partitioning the machine into two three-wide
threads should not cost too much per-thread performance, and increase
throughput quite a bit (when one thread stalls, make the working
thread six-wide, or switch in another thread).
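That guess can be sanity-checked with a back-of-envelope issue-slot model.
The per-cycle ILP distribution below (uniform 1..4 ready instructions) is
an assumption chosen only for illustration, not a measurement of any real
workload:

```python
# Back-of-envelope issue-throughput model: one thread on a 6-wide
# machine versus two threads on two 3-wide partitions.

import random

def issue_rate(width, n_threads, cycles=20000, seed=42):
    """Average total instructions issued per cycle."""
    rng = random.Random(seed)
    total = 0
    for _ in range(cycles):
        for _ in range(n_threads):
            avail = rng.randint(1, 4)      # instructions ready this cycle
            total += min(avail, width)     # can't issue past the width
    return total / cycles

print(issue_rate(6, 1))   # one thread rarely fills six slots
print(issue_rate(3, 2))   # two partitions issue more in total
```

With a per-thread ILP averaging around 2.5, a lone thread leaves most of a
6-wide machine idle, while two 3-wide partitions nearly double total issue
at a modest per-thread cost, consistent with the guess above.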

As for cost, they already have to match up the requested with the
available resources; doing that for two threads should not be that
much harder. Supplying data from two huge register files
simultaneously might be a problem, though (with per-cycle switching I
can imagine some tricks that make the alternative context relatively
cheap to keep around). Any other issues?

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Greg Lindahl

unread,
Oct 4, 2004, 3:45:32 PM10/4/04
to
In article <Apf8d.485850$OB3.2...@bgtnsc05-news.ops.worldnet.att.net>,
Stephen Fuld <s.f...@PleaseRemove.att.net> wrote:

>Because the X86 takes CISC instructions and converts them into RISC micro
>ops which are then cached and executed, it has an extra complication. The
>way I understand it, only one CISC instruction can be in the decode stage at
>a time, but multiple RISC micro-ops, from different threads can be executing
>on the functional units simultaneously.

By X86 you mean the Pentium4, yes? The Athlon and Opteron can decode
up to 3 CISC instructions at once, but don't have any kind of
hyperthreading.

Most people use X86 to refer to the entire class of cpus, not just the
Pentium4 implementation.

-- greg


Bernd Paysan

Oct 4, 2004, 4:36:34 PM
Anton Ertl wrote:
> Back to IA-64 and other machines. How hard would it be to let the
> machines execute instructions from two different groups
> simultaneously, and how much would it buy? My guess is that for a
> 6-issue engine like we have seen until now, it would buy quite a bit
> of performance. Even partitioning the machine into two three-wide
> threads should not cost too much per-thread performance, and increase
> throughput quite a bit (when one thread stalls, make the working
> thread six-wide, or switch in another thread).

Yes, even that simple partitioning would buy something. Itanic, after
all, is defined in a way that you can break down instruction bundles
and execute them serially, while "real" VLIWs have swap-type
parallelism, which would make that difficult. So having two fetch units
issuing one bundle per cycle to half of the execution logic would make
sense. You still could react to longer stalls (at least L2 misses) and
during that time allow the other thread to fetch two bundles per cycle.

It's probably a motivation factor. If my promising project (EV8) was
axed by a clueless manager, and I'd been sold to another company to
"rescue" a disgusting project, while there are a lot of people there
who want to keep as much as possible of the original, broken approach,
I wouldn't be able to come up with something useful, either.

Bernd Paysan

Oct 4, 2004, 4:44:25 PM
Peter Dickerson wrote:
> Yep, I think so. I think the conclusion was that comp.arch-consensus
> SMT required that instructions from different theads can be in the
> same pipestage at the same time.

It should be execution stage. Other stages don't really matter that
much, especially if you think of modern architectures the way Andy Glew
does: as several decoupled pipelines, connected by buffers. We feel
comfortable with the SMT term even if the decode pipeline is interleaved
(as on the Pentium 4); I also would not worry if the retire stage were
interleaved as a consequence, since retired instructions only matter to
the register renaming, and that's part of the decode stage.

Yousuf Khan

Oct 4, 2004, 4:51:15 PM
Andrew Gabriel wrote:
> In article <25r2m0tvifvlfemf6...@4ax.com>,
> Alan Balmer <alba...@att.net> writes:
>>
>> Price parity in 2007, is what I remember from a recent development
>> workshop.
>
> This has to depend on volumes being vaguely in parity (at least, not
> orders of magnitude difference as now), or cross-subsidising it from
> somewhere else, such as Xeon, which means a competitor who is not
> using their revenue to subsidise some other product can undercut you
> (like AMD, for instance).

Now the only choice left for Intel is to accept lower margins. The
cross-subsidization route is now blocked by its competitors. It can still
make good coin on Xeon, but if it needs to fund Itanium, the books are going
to have to show some revenue coming out of the Xeons for it.

Yousuf Khan


Del Cecchi

Oct 4, 2004, 5:21:58 PM

"Andrew Gabriel" <and...@cucumber.demon.co.uk> wrote in message
news:cjrvqu$dj3$1...@new-usenet.uk.sun.com...

> In article <25r2m0tvifvlfemf6...@4ax.com>,
> Alan Balmer <alba...@att.net> writes:
> >
> > Price parity in 2007, is what I remember from a recent development
> > workshop.
>
> This has to depend on volumes being vaguely in parity (at least, not
> orders of magnitude difference as now), or cross-subsidising it from
> somewhere else, such as Xeon, which means a competitor who is not
> using their revenue to subsidise some other product can undercut you
> (like AMD, for instance).

I think it would most likely just mean living with a lower profit margin on
that chip than the normal profit that Intel gets. 65% Gross margins or
something like that. Check the financial statements to find the number.

snip, followups trimmed.

del cecchi


Stephen Fuld

Oct 4, 2004, 6:21:22 PM

"Nick Maclaren" <nm...@cus.cam.ac.uk> wrote in message
news:cjs3dc$gqk$1...@gemini.csx.cam.ac.uk...

I could be snarky and say that you haven't explained your design enough to
say :-). But an early view is that what you have proposed is orthogonal to
multi-threading in that it could just as easily be implemented one thread
per CPU chip and is thus not a multi-threading hardware design point. But I
am not committed to that.

Stephen Fuld

unread,
Oct 4, 2004, 6:24:31 PM10/4/04
to

"Greg Lindahl" <lin...@pbm.com> wrote in message
news:4161a85c$1...@news.meer.net...

> In article <Apf8d.485850$OB3.2...@bgtnsc05-news.ops.worldnet.att.net>,
> Stephen Fuld <s.f...@PleaseRemove.att.net> wrote:
>
>>Because the X86 takes CISC instructions and converts them into RISC micro
>>ops which are then cached and executed, it has an extra complication. The
>>way I understand it, only one CISC instruction can be in the decode stage
>>at
>>a time, but multiple RISC micro-ops, from different threads can be
>>executing
>>on the functional units simultaneously.
>
> By X86 you mean the Pentium4, yes? The Athlon and Opteron can decode
> up to 3 CISC instructions at once, but don't have any kind of
> hyperthreading.

Yes. Since this was in the context of the term "Hyperthreading", I thought
that was clear. Sorry if I confused anyone.

Yousuf Khan

Oct 4, 2004, 8:40:52 PM
Greg Lindahl wrote:
> By X86 you mean the Pentium4, yes? The Athlon and Opteron can decode
> up to 3 CISC instructions at once, but don't have any kind of
> hyperthreading.
>
> Most people use X86 to refer to the entire class of cpus, not just the
> Pentium4 implementation.

Even the AMD version of x86 breaks the instructions down into more atomic
instructions, which could be considered a form of RISC instructions. Maybe
not the same type of RISC instructions that run through the Intel
processors, but a similar concept. Regardless neither of these "RISC"
instruction sets are directly accessible by the programs, so everything
simply sees the x86 CISC only.

Yousuf Khan


Peter Grandi

Oct 4, 2004, 5:27:21 PM
[ ... on Intel's market/pricing strategies for Itanium ... ]

>>> On 4 Oct 2004 17:04:30 GMT, and...@cucumber.demon.co.uk (Andrew
>>> Gabriel) said:

andrew> In article <25r2m0tvifvlfemf6...@4ax.com>, Alan
andrew> Balmer <alba...@att.net> writes:

albalmer> Price parity in 2007, is what I remember from a recent
albalmer> development workshop.

andrew> This has to depend on volumes being vaguely in parity (at least,
andrew> not orders of magnitude difference as now), or cross-subsidising
andrew> it from somewhere else, such as Xeon, which means a competitor
andrew> who is not using their revenue to subsidise some other product
andrew> can undercut you (like AMD, for instance).

That competitor is still tiny compared to Intel and its cash resources.
As Wolff wrote in «Burn rate»:

«West Coast capital is a technology play. Technology investors can
rationalize losses in a way that Time Warner's investors can't.
Technology money follows different assumptions than content money.

Technology money believes that for a more or less extended period a
lot of different entities and approaches duke it out for market share,
which leads to market dominance.

The dominant player then provides a historic return on investment to
its investors (there are often positive outcomes for the losers too,
as the industry and products consolidate).»

The big news is that it seems that Intel is now pretty unsure whether it
will be AMD64/EM64T or IA64 that will deliver the «historic return» in the
64 bit market, so they will do whatever it takes to make sure that the
return goes to them either way.

andrew> Intel has to get Itanium's volume way up for the chip to become
andrew> viable.

Eventually, to get the «historic return». My understanding of Intel
is that they are radically changing their marketing strategy to make
Itanium->=2 systems have ``up to twice better price/performance'' than
Xeon systems, as that's what they think is needed to get «market
dominance».

In my delirious imagination I see a marketing meeting at Intel, where
they brainstorm for hours on "Why is Itanium failing?" (name wrong?
should the box have been in a more fashionable colour? not enough shelf
space in Wal*Mart? lack of endorsement from Monica Lewinsky? :->), until
someone has a true moment of genius: perhaps it is because it costs much
more than alternatives that have roughly the same performance or better!
The marketing guys then sit in stunned surprise for a while, and then
decide to fix this small inconvenience. :-)

andrew> [ ... ] Itanium is now at best in same state as HP-PA was, with
andrew> sales way too low to fund development.

Not for Intel -- they've got lots of cash. They know about learning curves,
and similar ideas like the above. To them, IA64 is not what HP-PA was to the
HP UNIX workstation/server division. I greatly admire the latter for
suckering Intel into funding development of a new line of CPUs for them,
to HP's spec, and then getting stuck with it.

andrew> It's not clear how it can recover from this point. I can't see
andrew> that any high volume vendor is going to touch it as there are no
andrew> applications for it. Without high volume, there's no funding for
andrew> developing it (or even keeping it up with other processors).

If you get Itanium->=2 boxes that other things being equal cost the same
as a Xeon box but are ``up to twice as fast'', it's a no brainer:
applications will be developed for it.

But to me there are a couple of reflections on the Intel strategy change:

* Itanium->=2 performance seems to be not great unless it has enormous
amounts of cache. Which means that the chip's cost to Intel, instead
of being a very small fraction of the sale price as usual, is probably
rather higher.

Considering also power consumption etc., to get two equivalently
priced systems, one Xeon and one Itanium->=2, the latter needs to
cost the same or less... Uhm. Not good news. The target date of 2007
probably means ``whenever our next process gen allows us to do huge
cache chips without too much pain''.

* I reckon that the Itanium architecture is basically that of an exposed
vector processor more than an exposed superscalar; in other words,
its relative performance is a lot better on SpecFP than on SpecInt.
That is, it seems to have been designed around the needs of HP's
workstations rather than Intel based boxes, which are usually deployed
to run business applications.

A low price, low margin Itanium->=2 strategy may end up subsidizing
HP's scientific workstation/server business a lot more than Intel's
traditional business application target market. But things seem rather
stranger than that.

What about Windows Server 2003 for Itanium and business applications?
Well, that seems really positioned to run business applications. In a
couple of relevant Microsoft/Intel pages:

http://www.Microsoft.com/windowsserver2003/64bit/ipf/
http://www.Intel.com/business/bss/swapps/server2003/

I find the following engaging words:

«Intel Itanium 2 processor-based platforms (2-way to 64-way) can
power today's most demanding enterprise and technical computing
applications. Explicitly Parallel Instruction Computing (EPIC)
architecture supports highly parallel processing, and innovative
compiler-based optimization drives world-class performance and
availability. Itanium architecture is now the most widely supported
64-bit architecture in the world with the largest ecosystem of
solutions. Learn how Itanium 2-based servers running Microsoft
Windows Server 2003 enable an ideal platform for the most
data-intensive, business-critical applications.»

«Windows Server 2003 for 64-Bit Itanium-based Systems is well-suited
for native 64-bit applications requiring the highest levels of
scalability.

Typical workloads include database, business applications, and
technical computing.

With support for up to 64-Way Symmetric Multi-Processing (SMP)
servers and 512 GB of memory, Windows Server 2003 for 64-Bit
Itanium-based Systems is the most scalable Windows platform ever
created.»

Note Intel's emphasis on "scalability" as the Itanium-2/Xeon
differentiator... Bah!

Now, now, my impression, and this as usual is just speculation, is that
if someone were to say ``up to twice as fast at the same system price'',
that ``up'' means ``in floating point oriented vector like scientific
applications'', and I would expect performance then to be at best the
same between Xeon and Itanium->=2 on business applications, without good
backwards compatibility. A much less compelling price/performance story?

But then the big news of the year of course is EM64T, despite Intel's
attitude having been ``never'', and what looks to me like the
disappearance of pure IA32 processors, except perhaps Celerons, from
Intel's lineup...

Rupert Pigott

unread,
Oct 4, 2004, 6:09:02 PM10/4/04
to
Peter Grandi wrote:

[SNIP]

> If you get Itanium->=2 boxes that other things being equal cost the same

--^^

That's a bloody big IF.

> as a Xeon box but are ``up to twice as fast'', it's a no brainer:
> applications will be developed for it.

Other vendors can play that game too, and some of them are already
playing that game with a large and well established software base.


Cheers,
Rupert

--
Threading sequential code through the eye of a parallel needle
makes little sense. ;)

Peter Dickerson

unread,
Oct 5, 2004, 7:50:42 AM10/5/04
to
"Nick Maclaren" <nm...@cus.cam.ac.uk> wrote in message
news:cjs3dc$gqk$1...@gemini.csx.cam.ac.uk...

Origami - much easier with paper than silicon :-)

Peter


Mitch Alsup

unread,
Oct 5, 2004, 11:41:18 AM10/5/04
to
"Yousuf Khan" <bbb...@ezrs.com> wrote in message news:<nqWdnTVE5tu...@rogers.com>...

> Greg Lindahl wrote:
> > By X86 you mean the Pentium4, yes? The Athlon and Opteron can decode
> > up to 3 CISC instructions at once, but don't have any kind of
> > hyperthreading.
> >
> > Most people use X86 to refer to the entire class of cpus, not just the
> > Pentium4 implementation.
>
> Even the AMD version of x86 breaks the instructions down into more atomic
> instructions, which could be considered a form of RISC instructions.

For example, consider the x86 instruction:

ADD [EAX+ESI*4+0x3BF8],ECX

This gets converted (in Athlon or Opty) into: <drum roll>

ADD.W WORD[EAX+ESI*4+0x3BF8],ECX

Which gets inserted into a single reservation station. Not very RISCy
at this point!

However, it does get launched into the AGEN pipe for the load independently
of launching the ADD.W part into the ALU, but collects the store result
directly into LS2 without another launching from the reservation station.

All in all: not RISCy at all. The AMD guys call this a MACRO-operation.
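To make the ``sequential launchings'' idea concrete, here is a toy Python
model. This is entirely my own sketch, not anything from AMD's documentation:
one object stands in for the single reservation-station entry, and its
sub-operations fire in order (AGEN, then load, then ALU, with the store
result collected without another launch). All names are made up for
illustration.

```python
# Toy model of a K7/K8-style macro-op for a read-modify-write x86
# instruction such as  ADD [EAX+ESI*4+0x3BF8],ECX.  One reservation-
# station entry holds the whole operation; its parts launch in sequence.

class MacroOp:
    """One reservation-station entry for a load-op-store macro-op."""
    def __init__(self, base, index, scale, disp, src):
        self.base, self.index, self.scale, self.disp = base, index, scale, disp
        self.src = src          # register operand (here: ECX)
        self.addr = None        # produced by the AGEN launch
        self.loaded = None      # produced by the load
        self.result = None      # produced by the ALU launch

    def launch_agen(self, regs):
        # First launch: address generation in the AGU pipe.
        self.addr = regs[self.base] + regs[self.index] * self.scale + self.disp

    def launch_load(self, mem):
        # The load fetches the memory operand; the entry snoops the result.
        self.loaded = mem[self.addr]

    def launch_alu(self, regs):
        # Second launch: the ADD itself, once the load result is available.
        self.result = self.loaded + regs[self.src]

    def collect_store(self, mem):
        # The store result is collected without another RS launch.
        mem[self.addr] = self.result

regs = {"EAX": 0x1000, "ESI": 2, "ECX": 5}
mem = {0x1000 + 2 * 4 + 0x3BF8: 10}

op = MacroOp("EAX", "ESI", 4, 0x3BF8, "ECX")
op.launch_agen(regs)
op.launch_load(mem)
op.launch_alu(regs)
op.collect_store(mem)
assert mem[0x1000 + 2 * 4 + 0x3BF8] == 15   # 10 + ECX
```

The point of the model is that the whole thing is one entry; only the
*launchings* out of it are RISC-sized.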

> Maybe
> not the same type of RISC instructions that run through the Intel
> processors, but a similar concept.

I disagree. The AMD approach is to pack as much work as makes sense for
CISC macro-ops in a reservation station entry, and break down the RISCy
units of work into sequential 'launchings' out of the reservation station.

The Intel approach is to do the decomposition in the decoder not in
the schedulers.

> Regardless neither of these "RISC"
> instruction sets are directly accessible by the programs, so everything
> simply sees the x86 CISC only.
>
> Yousuf Khan

Mitch

Stefan Monnier

unread,
Oct 5, 2004, 12:19:20 PM10/5/04
to
> I disagree. The AMD approach is to pack as much work as makes sense for
> CISC macro-ops in a reservation station entry, and break down the RISCy
> units of work into sequential 'launchings' out of the reservation station.

That makes a lot of sense: it takes advantage of the fact that those
macro-ops have no parallelism and that there's thus no reason to check every
sub-part for "ready to execute": there's only ever one sub-part that can
be ready.


Stefan

pr...@prep.synonet.com

unread,
Oct 6, 2004, 5:53:50 AM10/6/04
to
nm...@cus.cam.ac.uk (Nick Maclaren) writes:

Ashes? :)

--
Paul Repacholi 1 Crescent Rd.,
+61 (08) 9257-1001 Kalamunda.
West Australia 6076
comp.os.vms,- The Older, Grumpier Slashdot
Raw, Cooked or Well-done, it's all half baked.
EPIC, The Architecture of the future, always has been, always will be.

Sander Vesik

unread,
Oct 6, 2004, 10:45:13 AM10/6/04
to
Bernd Paysan <bernd....@gmx.de> wrote:
> Grumble wrote:
> > HyperThreading is the name given by Intel's marketing department
> > to the implementation of SMT (or plain MT, I haven't read any
> > whitepapers on it) on NetBurst. Calling the implementation of SMT
> > (or, again, plain MT, I don't know) on Itanium "HyperThreading" is
> > somewhat misleading, in my humble opinion, since the implementation
> > will likely be quite different :-)

>
> A short definition of terms:
>
> MT, multithreading, is the execution of several programs (threads)
> interleaved on one CPU core. Interleaved means that the CPU core
> executes one instruction (or one bundle of instruction in a VLIW case),
> and then switches to another thread in a round-robin fashion. Tera's
> machine works that way, as well as Cray's CDC6600 IO processor.

Wouldn't it be better to increase the scope of it so that SMT would
be a subset of MT?

--
Sander

+++ Out of cheese error +++

Sander Vesik

unread,
Oct 6, 2004, 10:47:15 AM10/6/04
to
Del Cecchi <cecchi...@us.ibm.com> wrote:
>
> You're kidding! Isn't this the same thing that this group refused to
> classify as SMT when the Star series processors did it lo these many years
> ago?

Nope, he isn't saying that Itanic Hyperthreading is SMT - see, due to
somebody clever in Intel marketing we now seem to have:
* P4 hyperthreading which is SMT
* Itanic hyperthreading which is not SMT

>
> And they call it HyperThreading?
>

Yeah, it's just another meaningless marketing term.

> del cecchi

Patrick Schaaf

unread,
Oct 6, 2004, 10:57:19 AM10/6/04
to
Sander Vesik <san...@haldjas.folklore.ee> writes:

>> And they call it HyperThreading?

>Yeah, its just another meaningless marketing term.

I think it has now acquired a very precise meaning: it's HyperThreading
for as long as Windows and Oracle still count it as one CPU.

best regards
Patrick

Anton Ertl

unread,
Oct 6, 2004, 10:59:17 AM10/6/04
to
Mitch...@aol.com (Mitch Alsup) writes:
>"Yousuf Khan" <bbb...@ezrs.com> wrote in message news:<nqWdnTVE5tu...@rogers.com>...
>> Greg Lindahl wrote:
>> > By X86 you mean the Pentium4, yes? The Athlon and Opteron can decode
>> > up to 3 CISC instructions at once, but don't have any kind of
>> > hyperthreading.
>> >
>> > Most people use X86 to refer to the entire class of cpus, not just the
>> > Pentium4 implementation.
>>
>> Even the AMD version of x86 breaks the instructions down into more atomic
>> instructions, which could be considered a form of RISC instructions.
>
>For example, consider the x86 instruction:
>
> ADD [EAX+ESI*4+0x3BF8],ECX
>
>This gets converted (in Athlon or Opty) into: <drum roll>
>
> ADD.W WORD[EAX+ESI*4+0x3BF8],ECX

Not ADD.L (from the ECX)? And, actually, the assembler has already
resolved the overloading of ADD when encoding the instruction.

>However, it does get launched into the AGEN pipe for the load independently
>of launching the ADD.W part into the ALU, but collects the store result
>directly into LS2 without another launching from the reservation station.
>
>All in all; Not RISCy at all. The AMD guys call this a MACRO-operation.

Some optimization manual also talks about micro-ops (which is probably
what you call launchings). And from some of it I got the impression
that such an instruction is broken into two micro-ops: one load+store
(not very RISCy at all), and one add.

Somehow these manuals have got me confused; in particular, some of the
manual seemed to say that an instruction could be broken into
up-to-two macro-ops in the fast decoder (or more in the vector
decoder), and a macro-op could be broken into two micro-ops later.

But I am not sure that I understood this correctly. Can you
confirm/elaborate on this?

Chris Morgan

unread,
Oct 6, 2004, 12:18:45 PM10/6/04
to
j...@cix.co.uk (John Dallman) writes:

> In article <41603c47$0$8333$626a...@news.free.fr>, dev...@kma.eu.org

> (Grumble) wrote:
>
> > HyperThreading is the name given by Intel's marketing department
> > to the implementation of SMT (or plain MT, I haven't read any
> > whitepapers on it) on NetBurst. Calling the implementation of SMT
> > (or, again, plain MT, I don't know) on Itanium "HyperThreading" is
> > somewhat misleading, in my humble opinion, since the implementation
> > will likely be quite different :-)
>

> The great thing about using marketing terms to describe your products is
> that they can have whatever meaning you currently want.

Lewis Carroll gets more prophetic all the time :-

"When I use a word," Humpty Dumpty said, in rather a scornful tone,
"it means just what I choose it to mean -- neither more nor less."

Humpty Dumpty words include :-

open
proprietary
(industry) standard


Any more?

Chris
--
Chris Morgan
"Post posting of policy changes by the boss will result in
real rule revisions that are irreversible"

- anonymous correspondent

Greg Lindahl

unread,
Oct 6, 2004, 4:39:57 PM10/6/04
to
In article <2004Oct...@mips.complang.tuwien.ac.at>,
Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:

>Some optimization manual also talks about micro-ops (which is probably
>what you call launchings). And from some of it I got the impression
>that such an instruction is broken into two micro-ops: one load+store
>(not very RISCy at all), and one add.

That's only because it's going to multiple functional units: the add
blocks until the information from the load/store unit arrives.

BTW, the Athlon/Opteron essentially has only one addressing mode: it
happens to be complicated, but I wouldn't go so far as to call it
CISC. It has dedicated hardware to do the address computation, which is
useful for accelerating real programs, not just needed for legacy x86
code that uses the complicated instructions.

>Somehow these manuals have got me confused; in particular, some of the
>manual seemed to say that an instruction could be broken into
>up-to-two macro-ops in the fast decoder (or more in the vector
>decoder), and a macro-op could be broken into two micro-ops later.

Yes. It's a fairly complicated subject and the manuals don't go into
enough detail for full understanding. So don't adjust that TV set,
you're supposed to be confused.

-- greg

Andreas Kaiser

unread,
Oct 6, 2004, 5:57:07 PM10/6/04
to
On Wed, 06 Oct 2004 14:59:17 GMT, an...@mips.complang.tuwien.ac.at
(Anton Ertl) wrote:

>Somehow these manuals have got me confused; in particular, some of the
>manual seemed to say that an instruction could be broken into
>up-to-two macro-ops in the fast decoder (or more in the vector
>decoder), and a macro-op could be broken into two micro-ops later.

The K7 actually is rather simple - a single x86 instruction translates
into either a single macroop or into microcode.

One macroop is a combination of an address op (for the AGUs), a
load/store/load+store op for the LSU and either an integer op for the
ALU or a operation for the FPU complex. Interim results don't need any
scratch registers but are caught from result buses when available.
Some reservation station resources may be shared because the 2nd input
of an ALU op cannot be available until the address op has been issued,
so the same logic may be used to snoop both one of the address op
input operands and the interim load result.

Since the output of the decoders is a line of 3 macroop slots kept
together until retired, you get either up to 3 macroops from the fast
decoders or a single microcode line. If there are not enough macroops
to fill the line, slots are left empty - e.g. beyond a taken branch or
as sync loss when the next instruction is done by microcode.

There wasn't any SSE at design time though. Adding 128bit SSE
instructions to this design caused a problem. Without major changes
they had to be done 64bit at a time as in the P3. Therefore the Athlon
XP implements them by microcode, with 1 of those 3 slots left empty in
the microcode line and up to 2 slots left empty in a preceding
non-microcode line. Wastes both decode bandwidth and reorder buffer
space.
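The line-packing behaviour described above can be sketched in a few lines
of Python. This is my own illustration, not AMD documentation; the
instruction names and the `pack_lines` helper are invented for the example,
which just shows how a microcoded instruction forces empty slots both in
the preceding line and in its own line.

```python
# Toy sketch of packing decoded macro-ops into K7-style lines of
# 3 slots.  Fast-path macro-ops fill up to 3 slots; a microcoded
# instruction gets a line of its own, and any partially filled line
# is padded with empty slots.

EMPTY = None

def pack_lines(instructions):
    """instructions: list of ('fast', name) or ('micro', name) tuples."""
    lines, current = [], []
    for kind, name in instructions:
        if kind == "micro":
            if current:                          # flush the partial line
                current += [EMPTY] * (3 - len(current))
                lines.append(current)
                current = []
            lines.append([name, EMPTY, EMPTY])   # microcode line
        else:
            current.append(name)
            if len(current) == 3:
                lines.append(current)
                current = []
    if current:                                  # pad the trailing line
        current += [EMPTY] * (3 - len(current))
        lines.append(current)
    return lines

stream = [("fast", "add"), ("fast", "mov"),
          ("micro", "movaps"),                   # 128bit SSE via microcode
          ("fast", "sub")]
print(pack_lines(stream))
# [['add', 'mov', None], ['movaps', None, None], ['sub', None, None]]
```

Note the wasted slots: one in the line before the microcoded instruction
and two in the microcode line itself, matching the decode-bandwidth and
reorder-buffer waste described above.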

For the K8 there were two alternatives.

(a) Stick to the 1:1 rule. Efficient execution of 128bit SSE then
requires 128bit data paths in LSU and FPU and a redesigned FPU capable
of splitting those ops into two internally (unless you want two 64bit
multipliers). That's what Intel appears to have done in the P4 (a
nasty side effect being a wasted clock cycle in the latency of scalar
SSE ops).

(b) Rework the fast decoders to be able to translate a single x86
instruction into 2 macroops if necessary. While the complexity of the
decode engine increases, all other parts are unaffected. Well known
technology for AMD, because both the K5 and a never released missing
link between K5 and K7 already did that. Drawbacks: packed SSE ops
still need 2 decoders and 2 macroop slots and a 2-macroop instruction
cannot hang over, so there still is an empty slot if there is only a
single slot available (hint: interleave 2-macroop instructions with
1-macroop instructions).
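Option (b) can be sketched as a decoder that emits one or two macroops per
x86 instruction. Again, this is only my own toy illustration of the idea;
the `sse128:` tag and the `.lo64`/`.hi64` names are invented, and the real
decoders of course work on encodings, not strings.

```python
# Toy sketch of a fast decoder that may emit two macro-ops for one
# instruction: a 128bit packed SSE op is split into two 64bit halves,
# so everything downstream of decode can stay 64 bits wide.

def decode(insn):
    """Return the macro-op(s) for one instruction (hypothetical names)."""
    if insn.startswith("sse128:"):
        op = insn.split(":", 1)[1]
        return [f"{op}.lo64", f"{op}.hi64"]   # needs 2 decoders, 2 slots
    return [insn]                             # ordinary 1:1 case

stream = ["add", "sse128:addps", "mov"]
macroops = [m for insn in stream for m in decode(insn)]
print(macroops)
# ['add', 'addps.lo64', 'addps.hi64', 'mov']
```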

Gruss, Andreas

Andreas Kaiser

unread,
Oct 6, 2004, 5:57:10 PM10/6/04
to
On 5 Oct 2004 08:41:18 -0700, Mitch...@aol.com (Mitch Alsup) wrote:

> ADD.W WORD[EAX+ESI*4+0x3BF8],ECX
>
>Which gets inserted into a single reservation station.

You don't seem to regard LS1 as a reservation station. Why?

Replace the integer instruction by some MMX/SSE/x87 instruction and
you may find it placed into 3 reservation stations, AGU,LS1,FPU.

> Not very RISCy at this point!

IMHO AMD's macroops have more similarity to VLIW instruction words,
consisting of an address op, a load/store/load+store op, and either an
integer or floating point op, with some implicit dependencies between
the ops though.

>The Intel approach is to do the decomposition in the decoder not in
>the schedulers.

That's right. However, there are implicit dependencies in
instructions like the one above. Decomposition means the knowledge of
these dependencies is lost in the decoder, so the scheduler later has
to re-learn that the microops the instruction got translated to form
a dependency chain.

Gruss, Andreas

Stephen Sprunk

unread,
Oct 7, 2004, 12:48:49 AM10/7/04
to
"Andreas Kaiser" <A.Ka...@gmx.net> wrote in message
news:4up8m0hce9f7jhsjd...@4ax.com...

> Since the output of the decoders is a line of 3 macroop slots kept
> together until retired, you get either up to 3 macroops from the fast
> decoders or a single microcode line. If there are not enough macroops
> to fill the line, slots are left empty - e.g. beyond a taken branch or
> as sync loss when the next instruction is done by microcode.

Are you sure that's correct?

The K8 added a pipeline stage called Pack that reportedly can move macro-ops
between "lanes" to even them out in case of microcoded instructions or other
bubbles. I haven't been able to verify this since I only found it in one
article; the rest of the articles I've read completely gloss over the new
stage.

S

--
Stephen Sprunk "God does not play dice." --Albert Einstein
CCIE #3723 "God is an inveterate gambler, and He throws the
K5SSS dice at every possible opportunity." --Stephen Hawking
