IA64 and Alpha architecture comparison


Jan Vorbrueggen

Dec 29, 1999
If you haven't heard yet, DEC^H^H^HCompaQ have posted a nice little paper
(http://www.digital.com/hpc/ref/ref_alpha_ia64.pdf) comparing IA64 with
the Alpha architecture. Among other things, it contains very good (IMHO)
explanations of many of the ILP-enhancing techniques. Also, for the first
time I've understood the real advantage of simultaneous multithreading,
which the EV8/21464 will implement (as per the CompaQ roadmap). I wonder
why that wasn't made clearer in all those papers the ex-DEC group has
published - at least one of them, in IEEE Computer, is supposedly directed
at a general audience!

Jan

Bernd Paysan

Dec 29, 1999
Jan Vorbrueggen wrote:
> If you haven't heard yet, DEC^H^H^HCompaQ have posted a nice little paper
> (http://www.digital.com/hpc/ref/ref_alpha_ia64.pdf) comparing IA64 with
> the Alpha architecture.

The article is certainly quite biased in some respects. I want to pick
the memory latency hidden by OOO execution as an example:

Typical workloads either have streaming memory access (easily vectorizable
code, where memory latency can be hidden by prefetch hints) or overall very
limited parallelism. Look at the pointer-chasing parts of SPECint, e.g.
GCC. You load a pointer from a structure, and then use that pointer to
obtain a value from elsewhere to compare it with another value in memory
(DAG-subexpression comparison). Now it's clear that this sort of program
will stall after two outstanding memory requests in the queue, no matter
how OOO your architecture is or how many speculative load hints your
compiler inserted.
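
In C, the chain looks something like this (a hypothetical fragment,
loosely modelled on the DAG-comparison case above; the struct and names
are made up for illustration):

    /* Each load's address depends on the result of the previous load,
       so the requests serialize: pointer load, then dependent value
       load, then the compare against another value from memory. */
    struct expr {
        int code;           /* operation code */
        struct expr *op;    /* pointer to subexpression */
    };

    int same_op_code(struct expr *a, struct expr *b)
    {
        return a->op->code == b->op->code;  /* load -> dependent load */
    }

No amount of reordering hardware can start the second load before the
first one returns.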

The only way around that is SMT, because then you at least have other
threads to run. I'm not sure if this works well with GCC, because
although you can do a make -j to max out the available threads, the
working set of GCC is quite large, resulting in cache thrashing (at
least down to the L2).

Replace GCC with your favourite database, and the picture doesn't change
that much. Still a large working set, still lots of pointers, still poor
data locality.

Does anyone have data for how many outstanding misses accumulate in
typical "commercial" workloads - min, on average, max - or in scientific
workloads?

The same problem arises with branch prediction: it works wonderfully on
loops or for exception handling, where branches are either almost always
taken or almost always untaken. However, the branches in data-driven
applications are *the* data-driven element of that application, and if the
programmer did a good job of balancing the search tree, hit/miss is 50/50,
totally unpredictable. That's why conditional moves (or other conditional
operations) help so much here - you remove the data-dependent branch and
only predict the looping branch that just checks for the end of the tree.
And the Alpha certainly *has* conditional moves for this sort of code.
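
As a sketch (hypothetical code, just to make the point concrete), here is
a balanced-tree lookup where a compiler with conditional moves can make
the direction choice branch-free:

    struct node {
        int key;
        struct node *left, *right;
    };

    struct node *find(struct node *t, int key)
    {
        while (t && t->key != key)
            t = (key < t->key) ? t->left : t->right;  /* cmov candidate */
        return t;                  /* only the loop branch is predicted */
    }

The 50/50 left/right decision becomes a select; the only predicted branch
is the well-behaved end-of-tree test.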

Bottom line: IMHO IA64 shows clearly that you lose all of the VLIW
advantages (simpler decode, shorter pipeline, easier implementation) if
you take a design-by-committee approach and try to eliminate the
downsides of VLIW by introducing every bell and whistle you've heard
about. The net result is that the implementation is significantly more
difficult than a complex OOO execution core, and the success of all the
features isn't guaranteed. If you have a 20% longer cycle time, an
architectural speedup of 20% is gone. If you have a long pipeline, code
like object-oriented programs with many indirect calls, or other
threaded-code systems, falls flat (Gforth, which uses threaded code,
gains absolutely nothing overall from the 21164 to 21264 transition; the
fastest processor in our Benchres file now is the 450 MHz overclocked
Mendocino, with the 21264 trailing the 21164As except on sieve).

The stated "low" instruction memory profile of the Alpha is also only in
comparison with IA-64, for the programs I know comparisons, Alpha
binaries are the largest (no IA-64 data available yet); compact ones
like PA-RISC is less than half the size, undercutting even CISCs like
IA-32.

Whether SMT is a silver bullet is yet unclear to me, too. You need a
larger register file, and due to the burst-like access pattern (stall -
high ILP - stall - high ILP), dividing the register file isn't that
successful. You also need larger caches to reduce thrashing. We can start
taking bets on whether the 21264 core or the 21464 core will deliver more
performance/mm².

The only feature I'm really convinced is a good one is DSM, Compaq's new
TLA for (CC)NUMA. But the fact that NUMA scales better than system busses
isn't new.

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/

Zalman Stern

Dec 30, 1999
It contains some illustrative examples, but no one should confuse this
with quantitative analysis. It's basically a marketing piece.

Which is not to say it's necessarily wrong, but there certainly isn't any
evidence provided other than trivially small examples, many of which
aren't scheduled well for IA64 anyway. The whole 33% code bloat thing, for
example, seems about as well thought out as the early code density
arguments the CISC folks threw at RISC. What goes around comes around, I
guess.

The points about SMT do not consider the cost of implementation. Which,
in a comparison to IA64, is fine, because it's hard to imagine an SMT
Alpha being much harder or more resource-intensive than a
comparable-technology IA64 implementation.

-Z-

Toon Moene

Dec 30, 1999
Zalman Stern <zal...@netcom9.netcom.com> wrote:

> It contains some illustrative examples, but noone should confuse this for
> quantitative analysis. Its basically a marketing piece.

Yeah - now that's a good one! Digital^H^H^H^H^H^H^HCompaq engineers try to
offset the marketing hype behind a non-existent chip with some detailed
analysis, and this is what you can come up with?!?!?!?

Note the date on the document: 11th of October 1999. Certainly no-one
outside Intel would have had his eyes - let alone his hands - on an Itanium
processor chip, no ?

--
Toon Moene (mailto:to...@moene.indiv.nluug.nl)
Saturnushof 14, 3738 XG Maartensdijk, The Netherlands
Phone: +31 346 214290; Fax: +31 346 214286
GNU Fortran: http://gcc.gnu.org/onlinedocs/g77_news.html

Rob Young

Dec 30, 1999
In article <84ga7c$mih$1...@nntp3.atl.mindspring.net>, Zalman Stern <zal...@netcom9.netcom.com> writes:
> It contains some illustrative examples, but noone should confuse this for
> quantitative analysis. Its basically a marketing piece.
>

How about "Executive Summary"? Says that at the top.

To presume there isn't quantitative analysis behind this piece would be
rather presumptuous, wouldn't it? If this is marketing fluff, it seems to
be rather high-level fluff.

Come to think of it... that paper is *MUCH* better than the marketing
pieces in MDRONLINE regarding Merced and IA64 hype, wouldn't you
say?

>
> Which is not to say its necessarily wrong, but there certainly isn't any
> evidence provided other than trivially small examples, many of which
> aren't scheduled well for IA64 anyway. The whole 33% code bloat thing for
> example seems about as well thought out as early code density arguments the
> CISC folks threw at RISC. What goes around comes around I guess.
>

So, you are saying that is not accurate? What would be a more
accurate code bloat number, 25%?

>
> The points about SMT do not consider the cost of implementation. Which in a
> comparison to IA64 is fine 'cause its hard to imagine an SMT Alpha being
> much harder or more resource intensive than a comparable technology IA64
> implementation.
>

This page is a good resource:

http://www.cs.washington.edu/research/smt/

Overview:

http://www.cs.washington.edu/research/smt/papers/ieee_micro.pdf

Database analysis:

http://www.cs.washington.edu/research/smt/papers/smtdatabase.pdf
http://www.cs.washington.edu/research/smt/papers/smtdatabaseabstract.html


"For example, using a memory-intensive OLTP workload, an 8-context SMT
processor achieves a 3-fold increase in instruction throughput over a
single-threaded superscalar with similar resources."

------

Regarding cost of implementation... if the payback, according to that
marketing paper and smtdatabase.pdf, is a factor of 3 in database
performance, it seems like something worth shooting for.

I can't find it now, but I seem to recall SMT adds 3% to chip real
estate... could be off there.

Rob


Zalman Stern

Dec 30, 1999
Toon Moene <to...@moene.indiv.nluug.nl> wrote:
: Yeah - now that's a good one! Digital^H^H^H^H^H^H^HCompaq engineers try to
: offset the marketing hype behind a non-existent chip with some detailed
: analysis, and this is what you can come up with?!?!?!?

Sorry, but I call 'em as I see 'em. The document does not contain enough
information to judge whether there was a (correct) quantitative analysis or
not. I allowed that they may be right, but some of the stuff in the
document is so oversimplified (or just wrong) that I consider it somewhere
between misleading and worthless.

I've said the same thing about Intel's 8 Queens examples and such: they're
worthless for drawing architectural conclusions. You need to measure what
happens over large programs. More and more true with the increased
complexity of huge OOO windows, massive ILP, and SMT. Small examples didn't
cut it then and they certainly don't cut it now.

IA64 may or may not suck, but we can't tell from this noise now can we? And
more noise doesn't help.

: Note the date on the document: 11th of October 1999. Certainly no-one
: outside Intel would have had his eyes - let alone his hands - on an Itanium
: processor chip, no?

Yeah so? The document in question rarely mentions Itanium (Merced) but is
an architectural comparison. The implication is that DEC has a lot of
experience simulating architectures with features similar to IA64. I expect
this is definitely true, but if so, I'd like to have seen more data.

There's also this detail that Itanium is scheduled to ship well before an
SMT Alpha. It'll be interesting to see what happens in the interim... It's
very possible that Intel and HP have specified a standard runtime interface
for low-level thread creation and synchronization. If so, they could add MT
or SMT in a future implementation. I doubt it, but since they aren't saying
anything about the systems programming aspects, you can't tell what they're
thinking. Assuming they are idiots is probably a bad idea. The Compaq paper
also mentions more than once that they think OOO IA64 is so difficult that
it is unlikely to happen. People used to say the same thing about
superscalar x86...

Of course it's all going to be moot after Transmeta announces, now isn't it :-)

-Z-

Zalman Stern

Dec 30, 1999
Rob Young <you...@eisner.decus.org> wrote:
: To presume there isn't quantitative analysis behind this piece would be
: rather presumptuous, wouldn't it? If this is marketing fluff, it seems to
: be rather high-level fluff.

Who presumed any such thing? I said it isn't in the paper.

: Come to think of it... that paper is *MUCH* better than the marketing
: pieces in MDRONLINE regarding Merced and IA64 hype, wouldn't you
: say?

I don't think I read the whole thing. MDR runs a lot of fluff too. Maybe
things will improve with recent editorial changes. (Or maybe not.)

[...]
: So, you are saying that is not accurate? What would be a more
: accurate code bloat number, 25%?

No, I didn't say it's "not accurate." I said the logic they are using to
support "it must be at least 33% worse" is wrong. (E.g., it assumes that
predication bits are completely worthless. Were this so, ARM probably
wouldn't be known for good code density.) I expect it will be worse than
Alpha (and the Intel/HP folks seem to have admitted as much).

Code density will have a lot to do with the compilers, in particular
whether the optimizer chooses to aggressively pursue straight-line
execution speed or chooses tighter code in a particular area. Put another
way, I wouldn't be surprised if you need a completely different set of
compiler switches on IA64 for number-crunching kernels and for compiling a
database. (No, I don't consider that a good thing.)

-Z-

Bernd Paysan

Dec 30, 1999
Rob Young wrote:
> How about "Executive Summary"? Says that at the top.
>
> To presume there isn't quantative analysis behind this piece
> would be rather presumptuous, wouldn't it? If this
> is marketing fluff, seems to be rather high-level fluff.
>
> Come to think of it... that paper is *MUCH* better than the marketing
> pieces in MDRONLINE regarding Merced and IA64 hype, wouldn't you
> say?

I would rather describe it as a biased study. There are several pieces
which are plainly honest, and some pieces are just obviously biased,
because the author knows and likes the Alpha architecture better than
IA64 - which is no surprise at all. Given Intel's secrecy, it's no wonder
that better-founded analyses are hard to obtain. E.g., considering code
bloat, I'd just compile a few programs and post the size report here
(especially the text segment). But there is no compiler available now
that would allow you to do that (to get a compiler, you have to sign an
NDA - although it's not binding for GCC, since the GPL doesn't allow you
to give up redistribution rights).

Toon Moene

Dec 31, 1999
Zalman Stern <zal...@netcom17.netcom.com> wrote:

> I wrote:

> : Yeah - now that's a good one ! Digital^H^H^H^H^H^H^HCompaq engineers try
to
> : offset the marketing hype behind a non-existing chip with some detailed
> : analysis and this is what you can come up with ?!?!?!?

> Sorry, but I call 'em as I see 'em. The document does not contain enough
> information to judge whether there was a (correct) quantitative analysis or
> not. I allowed that they may be right, but some of the stuff in the
> document is so oversimplified (or just wrong) that I consider it somewhere
> between misleading and worthless.

> I've said the same thing about Intel's 8 Queens examples and such: they're
> worthless for drawing architectural conclusions.

Ah yes, but that's my whole point - the Alpha world has to offset the Intel
hype - but without the detailed knowledge Intel has about the choices that
went into their architecture.

So they produce a document that contains the same fallacies as the original
public Intel information. Tough luck.

What I would like to see is some information from Intel on how they're going
to beat the Alpha architecturally (or otherwise) with their "winning horse".
I simply don't see how they could do that.

See you in the next millennium ...

Paul DeMone

Dec 31, 1999

Bernd Paysan wrote:
[snip]

> The stated "low" instruction memory profile of the Alpha is also only in
> comparison with IA-64, for the programs I know comparisons, Alpha
> binaries are the largest (no IA-64 data available yet); compact ones
> like PA-RISC is less than half the size, undercutting even CISCs like
> IA-32.

Are you comparing 32-bit PA-RISC binaries to 64-bit LP64 Alpha binaries
(which include 64-bit linkage tables)? Are you including run-time
libraries? What about compiler directives (the Alpha compiler loves to
octaword-align branch targets)? Compare apples to apples, please.

Your comment is far less useful than this Compaq white paper.

>
> Whether SMT is a silver bullet is yet unclear to me, too. You need a
> larger register file, and due to the burst-like access (stall - high ILP
> - stall - high ILP), dividing the register file isn't that sucessful.
> You also need larger caches, to reduce thrashing. We can start bets
> whether the 21264 core or the 21464 core will deliver more
> performance/mm².

And the 21264 probably sucks on this metric compared to StrongARM. So
what's your point? High-end processors always pay the price for
bleeding-edge performance in terms of performance normalized to some
metric like cost, chip area, or power.

What interests me is how a ~300 mm², ~140 Watt Merced compares in
performance to a ~150 mm², 80 Watt EV68.


--
Paul W. DeMone The 801 experiment SPARCed an ARMs race of EPIC
Kanata, Ontario proportions to put more PRECISION and POWER into
dem...@mosaid.com architectures with MIPSed results but ALPHA's well
pde...@igs.net that ends well.

John Dallman

Dec 31, 1999
In article <386BD144...@gmx.de>, bernd....@gmx.de (Bernd Paysan)
wrote:

> ... considering code bloat, I'd just compile a few programs and print

> the report of size here (especially text segment). But there is no
> compiler available now that would allow you to do that (to get a
> compiler, you have to sign a NDA - although it's not binding for GCC,
> since the GPL doesn't allow you to give up redistribution rights).

After some time staring at the architecture descriptions, I think you can
expect code size figures to oscillate quite a bit as the compiler
technology matures. Given the tendency of some of the more pointy-haired
people involved with the computer industry to stick to arbitrary decisions
they made years ago[1], I'm not surprised that Intel wants to keep size
figures under wraps for a while. When you operate with a major eye on your
stock price, paranoia sets in.

I also suspect that the dodge common among less skillful Windows
programmers of "the release compile crashes - we'll ship the debug" will
hit a serious problem with unacceptable code sizes.

[1] "Those code sizes were too big on the original compiler, so they'll be
too big now!"

---
John Dallman j...@cix.co.uk


Frank Ch. Eigler

Dec 31, 1999

Bernd Paysan <bernd....@gmx.de> writes:

> [...] But there is no compiler available now that would allow you to


> do that (to get a compiler, you have to sign a NDA - although it's
> not binding for GCC, since the GPL doesn't allow you to give up
> redistribution rights).

This is an incorrect understanding of the interaction between NDAs and
the GPL. It is not the case that "the GPL doesn't allow you to give up
redistribution rights". Rather, the GPL prevents a distributor from
forcing a recipient to give up her redistribution privileges.

When the recipient undertakes the NDA, she waives her own GPL privilege
to share with non-NDA third parties -- it is not the distributor forcing
her to waive those privileges. Refusing to distribute software to non-NDA
third parties is fine by the GPL too.

The bottom line is that a non-NDA individual is unlikely to get hold
of an ia64 gcc port before the NDAs are dissolved.

- FChE

TTK Ciar

Dec 31, 1999
In article <386A9263...@gmx.de>,

Bernd Paysan <bernd....@gmx.de> wrote:
>Jan Vorbrueggen wrote:
>> If you haven't heard yet, DEC^H^H^HCompaQ have posted a nice little paper
>> (http://www.digital.com/hpc/ref/ref_alpha_ia64.pdf) comparing IA64 with
>> the Alpha architecture.
>
>The article is certainly quite biased in some respects. I want to pick
>the memory latency hidden by OOO execution as example:
>
>Typical workloads either have streaming memory access (easy vectorizable
>code, memory latency can be hidden by prefetch hints), or overall very
>limited parallelism. Look at the pointer chasing parts of SPECint, e.g.
>GCC. You load a pointer from a structure, and then use that pointer to
>obtain a value from elsewhere to compare it with another value in memory
>(DAG-subexpression comparison). Now it's clear that this sort of program
>will stall after two outstanding memory requests in the queue, no matter
>how OOO your architecture is, or how many speculative load hints your
>compiler inserted.

You're right. I think OOO does hide some memory latency, but not a
whole lot when it comes to pointer-chasing apps like GCC (and many real-
life workstation applications are primarily pointer-chasers).

>The only way around that is SMT, because then you have at least other
>threads to run. I'm not sure if this works good with GCC, because
>although you can do a make -j to max out the threads available, the
>working set of GCC is quite large, resulting in cache thrashing (at
>least to L2).

I think you're thinking of OS-level multithreading. Using "make -j"
will create a lot of OS-level threads. Processor-level SMT is implemented
one abstraction layer below that, in the hardware, where the CPU will let
independent instruction sequences issue while a very large number of
dependent instruction sequences block waiting on resources. It does not
require explicit application or OS support -- i.e., even single-threaded,
single-process sequential code can take advantage of SMT.

>Replace GCC with your favourite data base, and the picture doesn't
>change that much. Still large working set, still lots of pointers, still
>poor data locality.

Yep.

IMO the only practical way to overcome the "memory wall" is to integrate
main memory and processor logic into the same silicon, so you can reduce
access latency somewhat and widen the data path immensely (1Kb, 4Kb, etc)
to maximize cache utilization per main memory hit. A large cache would
also help (my gcc working sets usually max out around 5MB). Doing this
is equally applicable to x86, IA64, Alpha, or whatever. IBM currently
has a head start on this technology, but I can't imagine it taking long
for Intel or Compaq/Samsung to emulate them.

>The only thing I'm really convinced to be a good one is DSM, Compaq's
>new TLA for (CC)NUMA. But that NUMA scales better than system busses
>isn't new.

Yes, NUMA is the right approach for making multiprocessor systems scale
well, but Amdahl's law still applies. Whether we have one ALU or a few
dozen, we need to reduce random memory access latencies!

-- TTK


Rob Young

Jan 1, 2000
In article <386D35CF...@igs.net>, Paul DeMone <pde...@igs.net> writes:
>
> What interests me is how a ~300 mm2, ~140 Watt Merced compares
> in performance to a ~150 mm2, 80 Watt EV68
>

Ouch...

Speaking of Merced/Itanium again... did you catch Linley's farewell?

Good ol' Linley. Linley pushed Merced so hard you woulda thunk he was
working for Intel marketing.

Linley does a bit of a climb down:

http://www.mdronline.com/q/@20580451sllhpf/mpr/editorials/edit13_17.html

"One of Merced's problems is that its clock speed will not be
competitive with that of other 0.18-micron processors in 2000,
hampering performance. Wisely, Intel has targeted the 0.18-micron
McKinley at more than 1 GHz, making it a true Speed Demon. As long
as IA-64 compilers can match or beat RISC compilers, McKinley will
be an excellent performer."

-----

But wait... according to Intel spokespeople, Merced will indeed get
better clock. We know Craig is very angry about the low clock speed, and
that always makes them get to it.

I wonder if we will see more talk about Merced being a server
part. After all, if they crank the clock to 900 MHz wouldn't it
be running a bit hotter, hampering the whole workstation thing?

If at 700 MHz you can stuff it in a workstation, it may not
be competitive. Catch-22.

Rob


Rob Young

Jan 1, 2000
In article <386BD144...@gmx.de>, Bernd Paysan <bernd....@gmx.de> writes:
> Rob Young wrote:
>> How about "Executive Summary"? Says that at the top.
>>
>> To presume there isn't quantative analysis behind this piece
>> would be rather presumptuous, wouldn't it? If this
>> is marketing fluff, seems to be rather high-level fluff.
>>
>> Come to think of it... that paper is *MUCH* better than the marketing
>> pieces in MDRONLINE regarding Merced and IA64 hype, wouldn't you
>> say?
>
> I would rather describe it as a biased study. There are several pieces
> which are plain honest, and some pieces are just obviously biased,
> because the author knows and likes the Alpha architecture better than
> IA64; which is no surprise at all. Given the secrecy of Intel, we don't
> have to wonder that more founded analysis are hard to obtain. E.g.
> considering code bloat, I'd just compile a few programs and print the
> report of size here (especially text segment). But there is no compiler

> available now that would allow you to do that (to get a compiler, you
> have to sign a NDA - although it's not binding for GCC, since the GPL
> doesn't allow you to give up redistribution rights).
>

Which, if you look at the date on it... October 11th. What, a week after
Intel presented Merced again at the MPR Forum in Oct. 1999?

This thing showed up a week ago. In the interim (Oct-Dec), SGI has demoed
a Merced box, as have others. Maybe in that timeframe Compaq has seen
silicon and has early compilers, etc. By posting a paper from October
knowing what they know in December (being careful not to reference NDA
material, even using good ol' Linley as a reference), they didn't change
anything.

So maybe the left hand doesn't know what the right hand is doing? Or,
maybe more likely, the left hand does know what the right hand is doing,
and by posting in December what they created in October they are
essentially saying this is pretty accurate. OR they lose face when Merced
surprises us all and their little summary appears silly in hindsight.
Maybe, maybe not.

Rob


Bernd Paysan

Jan 1, 2000
Paul DeMone wrote:
> Are you comparing 32 bit PA-RISC binaries to 64 bit LP64 Alpha binaries
> (which include 64 bit linkage tables)? Are you including run time
> libraries?
> What about compiler directives (Alpha compiler loves to octaword align
> branch targets). Compare apples to apples please.

I'm certainly not comparing run-time environments of different OSes,
just the text size of a dynamically linked program. Alpha compilers love
large branch-target alignments because Alphas bitterly need them; without
them you get a significant performance degradation. And to be fair: GCC
doesn't align as much as necessary, so the comparison is even
"optimistic" towards Alpha.

I can also compare a simple code snippet between PA-RISC and Alpha and
just count instructions; it may be a bit better (because the alignment
doesn't come into the picture), but it's still at least 50% more.
PA-RISC's "complicated" addressing modes and conditional execution help
to shrink the code quite a bit.

Since the performance gap between PA-RISC and Alpha isn't that large, I
don't think it's an unfair comparison: I'm comparing the fastest RISC
with the second-fastest RISC.

Bernd Paysan

Jan 1, 2000
Paul DeMone wrote:
> And the 21264 probably sucks on this metric compared to StrongARM
> So what's your point? High end processors always pay the price
> for bleeding edge performance in terms of performance normalized
> to some cost metric like cost, chip area, or power.

The entire point of SMT is that you *need* more than one thread to
utilize the processor's resources. The inherent ILP may be almost maxed
out on the 21264, so running just one thread on a 21464 won't give you
much more performance; the small increase in parallelism might be eaten
up by other tradeoffs, like keeping more function units in sync.

So you have the choice either to put an SMT core on a die *or* to put
several single-threaded cores on the same die. Performance of a single
thread won't differ that much, so your metric is performance/mm².

Bernd Paysan

Jan 1, 2000
John Dallman wrote:
> I also suspect that the dodge common among less skillful Windows
> programmers of "the release compile crashes - we'll ship the debug" will
> hit a serious problem with unacceptable code sizes.

Yeah, especially with the resulting "the IA-64 executable is too big for
one CD, let's compile to x86 instead".

It should be obvious to all compiler writers that debugging ought to
work with fully optimized code, so that optimization is tested during
the entire development. The problem is that with optimizations, the
compiler tells you "I just scheduled these 30 lines completely
differently, and removed half of your variables - but worse, your
control flow is gone", and the debugger shows that instructions can
execute backwards in time ;-).

Since several bugs go away if you insert a printf here or there to find
out what's happening, I wonder if those printfs make it into the final
code, too?

Bernd Paysan

Jan 2, 2000
[discussion came up on comp.arch]

I'd really like to have this cleared up, so I cc'd it to Richard Stallman.

Frank Ch. Eigler wrote:
> When the recepient undertakes the NDA, she waives her own GPL
> privilege to share with non-NDA third parties -- it is not the
> distributor forcing her to waive those privileges. Refusing to
> distribute software to non-NDA third parties is fine by the GPL too.

Sorry, no, IMHO it isn't. The entire point of free software is that you
have redistribution rights as *natural rights* - no one, neither the
distributor nor any third party, can sign them away, nor can you yourself
(can you sell yourself into slavery? No!). Otherwise, NDAs would be an
easy way to get around the GPL. Let's assume that I want to "kidnap" an
extended GPL'd program, say GCC with an IA-64 backend. I don't sell the
enhanced GCC, I sell NDAs. You have to sign an NDA to get my GCC version -
and certainly I'm using a "straw company", so you don't have to sign the
NDA with the GCC redistributor.

This way, I could turn any GPL'd program into a proprietary program,
distributed only among people who signed that NDA. "Helping your
neighbour" would be prohibited, because the neighbour didn't sign the
NDA, and might not even meet the conditions under which the NDA can be
obtained; NDAs can discriminate against anyone the offering party wants
them to. Read the GNU manifesto again, and tell me if RMS really intended
that. Note that an NDA-signing community is in no way limited (other than
by the conditions to obtain the NDA), so you could let half the world
sign (and pay for) NDAs to get your derivative of a GPL'd software.

Currently, I can't get an IA-64 GCC without signing an NDA. If I had
signed an NDA, and obtained GCC, item 6 would apply (relevant phrase
underlined):

"6. Each time you redistribute the Program (or any work based on the
Program), the recipient automatically receives a license from the
original licensor to copy, distribute or modify the Program subject to
these terms and conditions. You may not impose any further
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
restrictions on the recipients' exercise of the rights granted herein.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
You are not responsible for enforcing compliance by third parties to
this License."

The NDA imposes a further restriction on the recipient's right to
exercise redistribution under the GPL, and therefore either is null and
void, or Intel may no longer use/copy/modify any other program under the
GPL. It doesn't matter in which order you accept the two legal documents;
they are just plainly contradictory. Since Intel isn't the original
author of GCC, they don't even license you GCC; that's RMS and the other
GCC contributors. In other words: it's not their business.

The risk of violating the NDA WRT GCC is that Intel might not give you
further information under NDA (given the quasi-monopoly power Cygnus
Solutions has as experts in GCC, you are in a strong position ;-). If you
redistribute GCC or Trillian, all they can do is stand on their heads and
wag their feet, no more. GPL'd software distributed under NDA is a pure
goodwill arrangement, not a legally binding one. I have nothing against
developers agreeing to ship only when it's ready, but if one of them
decides otherwise, I can't and won't stop him - the freedom of free
software allows work to continue even when consensus can't be reached.
I'm the coordinator of a GNU project myself, and it's more difficult to
get the co-developers to check in their modifications in time than to
hinder uncontrolled escapes ;-).

The only "open issue" here is what a "redistribution" legally is. The
GPL doesn't define that term, so I just take common copyright practice:
a redistribution is when you make the program in whole or in relevant
parts (more than "fair use") accessible to another person. In other
words: when Microsoft would shout "piracy", you redistributed a program.
Heck, they shout "piracy" even if you copy the program just to another
computer of your own, so that would be a redistribution, too.

Rob Barris

Jan 2, 2000
In article <1999Dec31.220800.1@eisner>, you...@eisner.decus.org (Rob
Young) wrote:

> http://www.mdronline.com/q/@20580451sllhpf/mpr/editorials/edit13_17.html

Linley writes:

"In the subsequent seven years, better IC processes have greatly
improved both the IPC and the cycle time of microprocessors, leading
some vendors to claim to deliver the best of both worlds. But a chip
becomes a Speed Demon through microarchitecture design philosophy, not
IC process gains."

"With its focus on instruction-level parallelism (ILP), IA-64 seems
to be the ultimate Brainiac."

"One of Merced's problems is that its clock speed will not be
competitive with that of other 0.18-micron processors in 2000, hampering
performance. Wisely, Intel has targeted the 0.18-micron McKinley at more
than 1 GHz, making it a true Speed Demon."


What was that again, about speed demons being based on
"microarchitecture design philosophy, not IC process gains"? My head
spins. Which is it, man?

Rob

John

Jan 2, 2000

Bernd Paysan wrote:
>
> Jan Vorbrueggen wrote:
> > If you haven't heard yet, DEC^H^H^HCompaQ have posted a nice little paper
> > (http://www.digital.com/hpc/ref/ref_alpha_ia64.pdf) comparing IA64 with
> > the Alpha architecture.
>
> The article is certainly quite biased in some respects. I want to pick
> the memory latency hidden by OOO execution as example:

I think that it is fairly widely believed that the only latency that OOO
is good for tolerating is L2 cache (hit) latency.

By the time we get to >1 GHZ core speeds with the kinds of issue widths
being proposed for systems like EV8 or Power4 or McKinley, it will be
difficult to build large enough reorder buffers to tolerate the latency
of off-chip caches.

Consider some hypothetical processor for the 2003 era, with 4-way to 6-way
issue at 1.5 GHz and a very aggressively designed memory system. On-chip
caches will still be limited in size to ~2 MB (maybe 4?), so you have a choice
of either a large off-chip DRAM cache (~60 ns latency) or just going straight
to local memory. A really aggressive design might target ~100 ns for local
accesses. So L3 accesses are at 90 clocks of latency and local memory accesses
are at 150 clocks of latency. This suggests that if you expect any sort of
superscalar performance (>1 IPC sustained), then you would need reorder buffers
capable of holding ~200 instructions. I doubt that this could be done without
significantly impacting cycle time.
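
As a back-of-envelope check of that reorder-buffer estimate (a sketch;
the IPC and latency figures are just the assumptions from the paragraph
above, and Little's law gives instructions in flight = IPC x latency):

    #include <stdio.h>

    int main(void)
    {
        double ipc = 1.5;        /* ">1 IPC sustained", a bit above 1 */
        double l3_clocks = 90;   /* off-chip DRAM cache, 60 ns at 1.5 GHz */
        double mem_clocks = 150; /* local memory, 100 ns at 1.5 GHz */

        printf("in flight over an L3 hit: %.0f insns\n", ipc * l3_clocks);
        printf("in flight over local mem: %.0f insns\n", ipc * mem_clocks);
        return 0;
    }

which prints ~135 and ~225 instructions - consistent with the ~200-entry
figure above.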

On the bright side, tolerating L2 latency effectively makes code generation
a whole lot easier.

Anton Ertl

Jan 2, 2000
In article <84ga7c$mih$1...@nntp3.atl.mindspring.net>,
Zalman Stern <zal...@netcom9.netcom.com> writes:
> The whole 33% code bloat thing for
>example seems about as well thought out as early code density arguments the
>CISC folks threw at RISC.

It reminds me more of an argument made in a paper (IIRC Dileep
Bhandarkar was one of the authors) around 1996 that compared the
Pentium Pro to some Alpha on an OLTP workload and argued that the
Pentium Pro had an advantage due to better code density.

> What goes around comes around I guess.

In this case quite literally.

Since there have been some arguments about code size in this thread,
here's some data. Everything was compiled with gcc (various versions),
and I report the text sizes of .o files (this should eliminate library
differences and reduce the influence of pointer and long sizes); the
files were built when testing the latest snapshot of Gforth:

main.o engine.o gcc version
IA32 Linux 4692 22192 2.7.2.1
IA32 Linux 6562 19836 2.7.2.3
MIPS Ultrix 4656 22928 2.4.5
MIPS Irix 7504 25360 egcs-1.1.2?
Alpha DU 7296 24368 2.7.2.2
Alpha Linux 9432 24856 egcs-1.0.3
Sparc Solaris 7486 21116 2.8.1
HPPA HP/UX 8264 19168 2.8.1

Looks like differences between different versions of the compiler have
more influence on the code size than the architecture. I am somewhat
mystified by the large differences between the very similar compiler
versions on the Linux-Intel boxes.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Chris Morgan

Jan 2, 2000
ttk@for_mail_remove_this_and_fruit.larva.apple.shinma.org (TTK Ciar) writes:

> I think you're thinking of OS-level multithreading. Using "make -j"
> will create a lot of OS-level threads. Processor-level SMT is implemented
> an abstraction layer below that, in the hardware, where the CPU will let
> nondependent instruction sequences issue while a very large number of
> dependent instruction sequences can block while waiting on
> resources.

I think this is correct, however

>It does not require explicit application or OS support -- ie, even
>single- threaded, single-process sequential code can take advantage
>of SMT.

I'm not sure this is true. I agree that make -j4 creates four processes,
and that SMT is one level down, but I thought that meant it would
accelerate threaded programs (e.g. pthreads used explicitly - good
webservers, for example). If I compile a ten-thousand-line FORTRAN
function, I really doubt an SMT processor can "discover" threads within
it and execute them simultaneously. However, if I use Ada or Java (or
...) and express a threaded program, or just use the current OS's
threading interface directly via explicit creation of threads, then the
SMT machine is effectively handed a list of threads and told: run these
whenever you like, apart from when I tell you not to - an easier problem.
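
(For concreteness, "explicit creation of threads" means something like
this minimal pthreads sketch - hypothetical code, nothing SMT-specific
in it; the point is that the OS, and through it an SMT core, is simply
handed several ready-to-run instruction streams:)

    #include <pthread.h>
    #include <stdio.h>

    static void *work(void *arg)
    {
        printf("thread %ld running\n", (long)arg);
        return NULL;  /* each thread is an independent instruction stream */
    }

    int main(void)
    {
        pthread_t t[4];
        for (long i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, work, (void *)i);
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        return 0;
    }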

Of course if the SMT machine can "discover" threads in previously
single-threaded code I'm all for it!

Chris

--
Chris Morgan <cm at mihalis.net> http://mihalis.net
mihalis.net now cruising at Warp Slackware 7.0 at an altitude of
48MB. Whoo-hoo!

Alan Lehotsky

Jan 2, 2000
In article <84nghr$h7h$1...@news.tuwien.ac.at>,
an...@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

>>>In article <84ga7c$mih$1...@nntp3.atl.mindspring.net>,
>>> Zalman Stern <zal...@netcom9.netcom.com> writes:
>>>> The whole 33% code bloat thing for
>>>>example seems about as well thought out as early code density arguments
>>>>the
>>>>CISC folks threw at RISC.
>>>

...


>>>
>>> main.o engine.o gcc version
>>>IA32 Linux 4692 22192 2.7.2.1
>>>IA32 Linux 6562 19836 2.7.2.3
>>>MIPS Ultrix 4656 22928 2.4.5
>>>MIPS Irix 7504 25360 egcs-1.1.2?
>>>Alpha DU 7296 24368 2.7.2.2
>>>Alpha Linux 9432 24856 egcs-1.0.3
>>>Sparc Solaris 7486 21116 2.8.1
>>>HPPA HP/UX 8264 19168 2.8.1
>>>
>>>Looks like differences between different versions of the compiler have
>>>more influence on the code size than the architecture. I am somewhat
>>>mystified by the large differences between the very similar compiler
>>>versions on the Linux-Intel boxes.

If you're comparing .o's, code size may be swamped by artifacts of the
object format. Or are you looking at the output of the 'size' command
and extracting the .text sizes? That in turn is affected by things like
constant strings being made read-only vs. stored in .data.

One of the things I've done to the gcc backend I support is add
code that emits the size of each function as a comment into the
.s file. That makes it easier to do code size analysis.

-- Al Lehotsky

--
Quality Software Management
http://www.tiac.net/users/lehotsky

Process Improvement | Management Consulting | Compiler Implementation

Andy Glew

Jan 2, 2000
> Of course if the SMT machine can "discover" threads in previously
> single-threaded code I'm all for it!

SMT doesn't discover threads in previously single-threaded code.

DMT discovers threads in monolithic single-threaded code.


Toon Moene

Jan 2, 2000
Chris Morgan <c...@mihalis.net> wrote:

> Of course if the SMT machine can "discover" threads in previously
> single-threaded code I'm all for it!

Hmm, that's not how I read the Compaq paper. In fact, I do not need it.

If SMT can speed up my (compiler-enabled) multithreaded code

(see http://www.openmp.org, OK, it seems to be down at the moment (can you
say Y2K :-) - here's the alternative:
http://www-europe.sgi.com/software/openmp/)

by a factor of 1.5 - 3 on a *single* processor, I'm all for it :-)
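
(To be concrete about "compiler-enabled": a minimal OpenMP sketch in C -
hypothetical code; the pragma is all the compiler needs to spawn the
threads an SMT core would feed on:)

    #include <stdio.h>

    int main(void)
    {
        int i;
        double sum = 0.0;

        /* the compiler turns this loop into multiple threads */
        #pragma omp parallel for reduction(+:sum)
        for (i = 0; i < 1000000; i++)
            sum += (double)i;

        printf("%.0f\n", sum);
        return 0;
    }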

Anton Ertl

Jan 2, 2000
In article <lehotsky-686552...@news.tiac.net>,

Alan Lehotsky <leho...@tiac.net> writes:
>In article <84nghr$h7h$1...@news.tuwien.ac.at>,
>an...@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>>>> main.o engine.o gcc version
>>>>IA32 Linux 4692 22192 2.7.2.1
>>>>IA32 Linux 6562 19836 2.7.2.3
>>>>MIPS Ultrix 4656 22928 2.4.5
>>>>MIPS Irix 7504 25360 egcs-1.1.2?
>>>>Alpha DU 7296 24368 2.7.2.2
>>>>Alpha Linux 9432 24856 egcs-1.0.3
>>>>Sparc Solaris 7486 21116 2.8.1
>>>>HPPA HP/UX 8264 19168 2.8.1
>>>>
>>>>Looks like differences between different versions of the compiler have
>>>>more influence on the code size than the architecture. I am somewhat
>>>>mystified by the large differences between the very similar compiler
>>>>versions on the Linux-Intel boxes.
...

> Or are you looking at
> the output of the 'size' command and extracting the .text
> sizes?

Right.

> That in turn is affected by things like constant
> strings being made read-only vs. storing in .data

I had thought that gcc stores strings in the text section. However, it
looks like it stores strings in a section called .rodata with ELF on
Linux.
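
(A tiny illustration - a hypothetical file, but the section placement is
standard gcc/ELF behaviour:)

    #include <stdio.h>

    const char greeting[] = "hello, world";   /* ends up in .rodata */

    int main(void)
    {
        puts(greeting);                        /* code goes in .text */
        return 0;
    }

Whether 'size' counts the string depends on whether it groups .rodata
with the text, which is exactly the version difference below.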

The differences in the main.o text sizes for the Linux-Intel boxes come
mainly from different versions of the size command: for the default
output format, GNU size 2.7 reports just the text section as text size,
whereas GNU size 2.9.1 reports the sum of .text and .rodata as text size.

The differences in the engine.o text sizes are mainly due to using
-m486 on one system, and not using it on the other system (Gforth's
configure automatically inserts this flag when it is run on a 486). I
think the main reason for the size difference from this flag is the
code alignment to 16-byte boundaries.

Here's a table of sizes of just the .text segment (strings seem to
reside in .rdata, .rodata on all machines except HP/UX; there I
subtracted the $LIT$ size from the text size):

main.o engine.o gcc version
IA32 Linux 4692 22192 2.7.2.1 -m486
IA32 Linux 4321 19780 2.7.2.3
MIPS Ultrix 4656 22928 2.4.5
MIPS Irix 7504 25360 egcs-1.1.2?
Alpha DU 7296 24368 2.7.2.2
Alpha Linux 6976 24736 egcs-1.0.3
SPARC Solaris 4908 21012 2.8.1
HPPA HP/UX 5656 19000 2.8.1

The conclusions don't change much.

Paul DeMone

Jan 2, 2000

Rob Barris wrote:
[snip]


> What was that again, about speed demons being based on
> "microarchitecture design philosophy, not IC process gains"? My head
> spins. Which is it, man?

A speed demon is a microarchitecture design choice.

IC process gains are a tide that raises all boats, brainiacs and speed
demons alike.

Anton Ertl

Jan 2, 2000
In article <1999Dec31.220800.1@eisner>,
> "One of Merced's problems is that its clock speed will not be
> competitive with that of other 0.18-micron processors in 2000,
> hampering performance. Wisely, Intel has targeted the 0.18-micron
> McKinley at more than 1 GHz, making it a true Speed Demon.

So he's telling us Merced won't even get 1GHz in 0.18um? Given the
length of the pipeline, I would have expected it to have at least the
clockrate of a 21164 in the same technology (unfortunately the fastest
21164s are in 0.35um, but I would expect that 1GHz should be possible
for a 0.25um 21164, and a 0.18um version should be quite a bit faster
still).

Next question: is a 0.18um 1GHz chip a speed demon? How many chips in
that technology will miss that mark? Well, Athlon will hit it, and
EV68 should do it, too. I also expect some IA-32-only implementation
from Intel to hit it. PPC, HPPA, MIPS, and SPARC will probably miss
it, so 1GHz could be seen as dividing line between speed demons and
other chips; it's funny to see IA32 implementations as speed demons.

John Dallman

Jan 2, 2000
In article <386E7B24...@gmx.de>, bernd....@gmx.de (Bernd Paysan)
wrote:

> It should be obvious to all compiler writers that debugging ought to


> work with fully optimized code, so that during the entire development
> optimization is tested. The problem is that with optimizations, the
> compiler tells you "I just scheduled this 30 lines completely different,
> and removed half of your variables - but worse, your control flow is
> gone", and the debugger shows that instructions can execute backwards in
> time ;-).

You know that. I know that. Most people on comp.arch know that. Sadly,
this is one that compiler writers (or their managers) tend to rationalise
with "well, we won't add any bugs in the optimiser..." and "it's really
hard to express some of these concepts in the debug info".

I think I may try to see if I can persuade work to let me write about some
recent experiences for JILP.

---
John Dallman j...@cix.co.uk

Rob Young

Jan 3, 2000

What about EV7's Network Interconnect?

http://www.digital.com/alphaoem/present/sld018.htm

With 32 EV7s, you wonder if an L3 would even be necessary. Maybe I'm not
"on target", but with routing at that speed and bandwidth, and
asynchronously... it looks like a fairly large L2. Some are more remote
than others, but not that remote. Wouldn't a hypercube ensure CPUs are at
most four hops from each other? If so, then at most (acknowledging hidden
overhead may add a bit) a remote L2 is 60 ns away, but an "average" L2
access is a little more than half that (35-37 ns). But maybe more
importantly, this L2 would be 48 MByte or more aggregate.
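
(A quick sanity check of the hop counts - a sketch, assuming the 32 CPUs
really are wired as a 5-dimensional hypercube, where the hop count
between two nodes is the Hamming distance of their 5-bit IDs; the slide
doesn't actually say what the topology is:)

    #include <stdio.h>

    int main(void)
    {
        int max = 0;
        long total = 0, pairs = 0;

        for (int a = 0; a < 32; a++)
            for (int b = 0; b < 32; b++) {
                if (a == b) continue;
                int d = 0;                    /* Hamming distance */
                for (int x = a ^ b; x; x >>= 1)
                    d += x & 1;
                if (d > max) max = d;
                total += d;
                pairs++;
            }
        /* prints: max 5 hops, average 2.58 */
        printf("max %d hops, average %.2f\n", max, (double)total / pairs);
        return 0;
    }

So a true 5-cube would be at most five hops, not four, with the average
around 2.6.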

Also, as an aside... earlier you were wondering what Wildfire was going
to do with 64 PCI busses. From that slide it appears each processor can
handle 3 GByte of I/O per second. 32 EV7s in a box could pretty much
saturate 64 PCI busses with I/O bandwidth needs/wants. Granted, it will
require a "fork-lift" swap-out to stick EV7s in there. But it seems the
box would have a good deal of I/O to ensure usefulness post-EV6. In other
words, you're not stuck with a boat anchor. The box would support CPU
upgrades one or two generations out.

Rob


John McCalpin

Jan 3, 2000

Rob Young wrote:
>
> In article <386F327F...@austin.rr.com>, John <jmcc...@austin.rr.com> writes:
> >
> > Consider some hypothetical processor for the 2003 era, with 4-way to 6-way
> > issue at 1.5 GHz and a very aggressively designed memory system. On-chip
> > caches will still be limited in size to ~2 MB (maybe 4?), so you have a choice
> > of either a large off-chip DRAM cache (~60 ns latency) or just going straight
> > to local memory. A really aggressive design might target ~100 ns for local
> > accesses. So L3 accesses are at 90 clocks of latency and local memory accesses
> > are at 150 clocks of latency. This suggests that if you expect any sort of
> > superscalar performance (>1 IPC sustained), then you would need reorder buffers
> > capable of holding ~200 instructions. I doubt that this could be done without
> > significantly impacting cycle time.
> >
>
> What about EV7's Network Interconnect?
>
> http://www.digital.com/alphaoem/present/sld018.htm
>
> With 32 EV7s you wonder if L3 would even be necessary. Maybe
> I'm not "on target" but it seems routing at that speed and bandwidth
> and asynchrounously .. it looks like a fairly large L2. Some
> are more remote than others.. but not that remote... Wouldn't
> a hypercube ensure CPUs are at most four hops from each other?
> If so, then at most (acknowledging hidden overhead may add
> a bit) remote L2 is 60 ns away, but "average" L2 access a little more
> than half that (35-37 ns). But maybe more importantly, this
> L2 would be 48 MByte or more aggregate.

There is very little detail in that EV7 presentation -- at least to someone
who has designed NUMA systems. The devil ends up being in the details, and
there are a lot of them.

A couple of points:

(1) Bandwidth Hierarchy:
L1 <-> registers >16 GB/s (2 loads at >1 GHz)
L2 <-> L1 16 GB/s peak (slide 16)
RDRAM <-> memory controller 6 GB/s peak (slide 17)
chip-to-chip 2 GB/s peak per channel
8 GB/s peak per chip

So access to a single remote L2 cache is at least 8x slower in
bandwidth than access to your own L2 cache, and 3x slower in
bandwidth than access to your own DRAM.

(2) Cache Coherence
Slide 17 says that the memory controller supports "directory-based
cache coherence". Although I like directory-based cache coherence,
there are lots of details that are hidden from us. For example:
(a) Is there any snooping done in the system, or is it all
directory-based?
If there is some snooping, what is the domain over which
snooping occurs? The tradeoff is that snooping provides
faster access to data in other processor's caches, but
typically increases local memory latency.

If there is no snooping, then each CPU must have a directory
to map the status of each line in its RDRAM with respect to
the other caches.

(b) Where are the directories?
The block diagrams don't show any extra connections for
directories, so it is possible that (like the
SGI Origin2000) the directories are held in the DRAM. This
can be done either with extra-wide RAM (like the Origin2000),
or by allocating a part of each RIMM to holding directory data.
The latter approach allows the use of standard RDRAM parts,
but also introduces a lot of complexity into the memory
controller.

The key issue here is that if the directory is in the RDRAM,
then accesses to data in other processor's caches has an
RDRAM latency in *series* with the actual intervention latency.
This is because you must look up the cache line in the directory
in order to know which processor to ask for the data.

(3) Latency Hierarchy
    L1: a couple of clocks (like the EV6)
    L2: "12 ns load-to-use" (slide 16)
    RDRAM: depends on the cache coherency protocol used.
        A fully directory-based scheme might be as low as ~100 ns for
        local accesses (note that for RDRAM accesses, the directory
        lookup can be done in parallel with the data access).
        A hybrid snooping + directory scheme will have higher local
        access latency.
    Remote RDRAM: You need to look at what they are claiming very
        carefully. On slide 18, they claim "15 ns processor to
        processor" latency. It is not clear to me what this means, but
        I am certain that it does not mean what it looks like it means!

        What sort of processor-to-processor transactions are you
        interested in? I do not think that the Alpha ISA supports any!

        What it supports is a cache-coherent memory space, so all that
        you can do are processor-to-memory transactions, and that
        memory may be cached on another processor's die. If the
        coherence is directory-based, then there is an RDRAM lookup to
        find out which remote node has the data, followed by an
        intervention request, followed by the L2 lookup on the remote
        node, followed by the return of the data, followed by the
        processing of the returned data. This is not a "15 ns"
        sequence, since it includes a 30 ns Rambus CAS latency plus a
        12 ns L2 access. I would be really, really surprised if they
        can do this operation in less than:

            100 ns directory lookup
             30 ns average chip-to-chip latency for a 16p system
             12 ns remote L2 access
             30 ns average chip-to-chip latency for a 16p system
             12 ns network interface to CPU core
            ------
           ~200 ns load-to-use for remotely cached (dirty) data

        Getting to this level would be an amazing feat, since the
        fastest NUMA machine for this operation right now is the SGI
        Origin2000, and it takes about 1100 ns. The HP NUMA machines
        are considerably slower than this. Actually, I would be
        slightly surprised if the system were able to do these remote
        interventions in 400 ns.

        The numbers are different if the system supports some degree
        of snooping, or if the directory is not stored in the RDRAM,
        but the principle is the same. They carefully provided a
        "load-to-use" latency on slide 16, and then let you make the
        mistake of thinking that the latency on slide 18 meant
        something equally useful.


> Also, as an aside .. earlier you were wondering what was
> Wildfire going to do with 64 PCI busses. From that slide
> it appears each processor can handle 3 GByte I/O per second.
> 32 EV7's in a box could pretty much saturate 64 PCI busses with
> I/O bandwidth needs/wants. Granted, it will require a "fork-lift"
> swap-out to stick EV7s in there. But it seems the box would
> have a good deal of I/O to ensure usefulness post-EV6. In other words,
> your not stuck with a boat anchor. The box would support CPU upgrades
> one or two generations out.

The issues have to do with how much system cost is incurred in order to
support the scalable I/O capability that very few customers need. I agree
that "overkill" on I/O capability is appropriate for the first version of
a server line, but if it costs too much it could hurt the sales in areas
where immense I/O is not a priority. I will have to wait until the real
product details emerge before drawing any conclusions... Maybe February?

Larry Kilgallen

Jan 3, 2000
In article <3870B653...@austin.rr.com>, John McCalpin <jmcc...@austin.rr.com> writes:

> The issues have to do with how much system cost is incurred in order to support
> the scalable I/O capability that very few customers need. I agree that
> "overkill" on I/O capability is appropriate for the first version of a server
> line, but if it costs too much it could hurt the sales in areas where immense
> I/O is not a priority. I will have to wait until the real product
> details
> emerge before drawing any conclusions..... Maybe February?

Is there an industry-wide venue for such details in February, or did the
vendor promise something then, or is that just an expert estimate of when
they would say more?

Rob Young

Jan 3, 2000
In article <3870B653...@austin.rr.com>, John McCalpin <jmcc...@austin.rr.com> writes:
>

[snip all the hard questions you pose. Maybe pick at
one or two of them later. My brain hurts (me of
BS in comp sci background).]


>
>
>
>> Also, as an aside .. earlier you were wondering what was
>> Wildfire going to do with 64 PCI busses. From that slide
>> it appears each processor can handle 3 GByte I/O per second.
>> 32 EV7's in a box could pretty much saturate 64 PCI busses with
>> I/O bandwidth needs/wants. Granted, it will require a "fork-lift"
>> swap-out to stick EV7s in there. But it seems the box would
>> have a good deal of I/O to ensure usefulness post-EV6. In other words,
>> your not stuck with a boat anchor. The box would support CPU upgrades
>> one or two generations out.
>
> The issues have to do with how much system cost is incurred in order to support
> the scalable I/O capability that very few customers need. I agree that
> "overkill" on I/O capability is appropriate for the first version of a server
> line, but if it costs too much it could hurt the sales in areas where immense
> I/O is not a priority. I will have to wait until the real product
> details
> emerge before drawing any conclusions..... Maybe February?


Maybe not February. But regarding cost... sure. Today Compaq is shipping
multi-path fiber controllers (HSG80) with 100 MByte/sec sustained
bandwidth. Maybe in 2 years they ship an HSG100 with 500 MByte/sec
sustained, going over Fibre to the busses. Currently, they are UltraSCSI
busses. At that point, maybe it helps to have more PCI busses.

Also (a more likely use for 64 busses): in a Galaxy or multiple hard
partitions of Tru64, it may make sense (especially for Tru64, lacking
QIO Server in shared memory ;-) to have a direct connect to each
controller. 64 Tru64 instances, each instance using a PCI bus, each bus
with several cards out to a bunch of switches off of which are found a
dozen or so HSG100s (if there is such a thing in a couple of years, I am
positing). 64 instances of Tru64 in the near future may be accurate given
MARVEL may have 256 CPUs (or so). Also, each instance must be able to see
more than one PCI bus... I think I saw that in a slide somewhere.

Regarding package cost... maybe later, if EV7 Wildfire ships with no L3,
that helps a great deal to keep server costs down. Much more powerful
controllers mean fewer controllers.

Rob


rub...@my-deja.com
Jan 3, 2000
In article <84nrda$8...@spool.cs.wisc.edu>,

"Andy Glew" <gl...@cs.wisc.edu> wrote:
> > Of course if the SMT machine can "discover" threads in previously
> > single-threaded code I'm all for it!
>
> SMT doesn't discover threads in previously single threaded code.
>
> DMT discovers threads in monolithic single threaded code.
>
>

What is a DMT?



John McCalpin
Jan 3, 2000

Larry Kilgallen wrote:
>
> In article <3870B653...@austin.rr.com>, John McCalpin <jmcc...@austin.rr.com> writes:
>

> > I will have to wait until the real product
> > details
> > emerge before drawing any conclusions..... Maybe February?
>

> Is there an industry-wide venue for such details in February,
> or did the vendor promise something then, or is that just an
> expert estimate of when they would say more ?

"February" was my recollection of when the industry analysts
thought that Compaq was going to publically reveal the system.
Then again, I am running a high fever, so this could be the
result of an hallucination. :-)

Frank Ch. Eigler
Jan 3, 2000

Bernd Paysan <bernd....@gmx.de> writes:

> I'd really like to have this cleared up, so I cc'd it to Richard Stallman.

Good, perhaps he will respond too. By the way, this is an old issue:
look up the thread `Re: GPL and "free" software' in dejanews, for
example.


> [...]

> This way, I could turn any GPL'd program into a proprietary program,
> which is distributed only among people that signed that NDA. [...]

Perhaps, though even if this practice goes against an interpretation
of the intent of the GNU Manifesto, the actual license text would
supersede that.


> Currently, I can't get an IA-64 GCC without signing an NDA. If I had
> signed an NDA, and obtained GCC, item 6 would apply [...]
>> "6. [...] You may not impose any further
>> restrictions on the recipients' exercise of the rights granted herein."

Right. Since you are hypothetically under NDA, there is a
conflict, and therefore you cannot redistribute at all. This
situation is spelled out in the GPL.


- FChE

Jonathan Thornburg
Jan 3, 2000
In article <386F327F...@austin.rr.com>,
John McCalpin <jmcc...@austin.rr.com> writes [[lines rewrapped]]

> Consider some hypothetical processor for the 2003 era, with 4-way to
> 6-way issue at 1.5 GHz and a very aggressively designed memory system.
> On-chip caches will still be limited in size to ~2 MB (maybe 4?), so you
> have a choice of either a large off-chip DRAM cache (~60 ns latency) or
> just going straight to local memory.

For just these reasons, I would expect such a system to use at least
one level of off-chip SRAM cache in front of the DRAM. At current SRAM
speeds (which it will be hard to speed up _too_ much, just due to power
and PCB-layout problems) that would still give 10-20 clock accesses.


> This suggests that if you expect any sort of superscalar performance
> (>1 IPC sustained), then you would need reorder buffers capable of
> holding ~200 instructions. I doubt that this could be done without
> significantly impacting cycle time.

Certainly the "memory wall" is there. But I think we can fudge things
a bit more with lots of cache levels...
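For what it's worth, the ~200 figure falls straight out of Little's law
(instructions in flight = latency x issue rate). A minimal check, using
the 1.5 GHz and 60 ns figures quoted above and a 2 IPC target (my choice
of target, not McCalpin's):

    #include <stdio.h>

    int main(void)
    {
        /* Figures from the hypothetical 2003-era processor above */
        double freq_ghz   = 1.5;   /* cycles per ns               */
        double latency_ns = 60.0;  /* off-chip DRAM-cache latency */
        double target_ipc = 2.0;   /* ">1 IPC sustained"          */

        double latency_cycles = latency_ns * freq_ghz;        /*  90 */
        double window         = latency_cycles * target_ipc;  /* 180 */

        /* Little's law: to keep 2 IPC flowing across a 90-cycle
           miss, ~180 instructions must be in flight -- hence the
           "~200 instruction" reorder buffer. */
        printf("miss latency: %.0f cycles, window: ~%.0f instructions\n",
               latency_cycles, window);
        return 0;
    }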

--
-- Jonathan Thornburg <jth...@galileo.thp.univie.ac.at>
http://www.thp.univie.ac.at/~jthorn/home.html
Universitaet Wien (Vienna, Austria) / Institut fuer Theoretische Physik
"Stocks are now at what looks like a permanent high plateau" -- noted
economist Irving Fisher, 2 weeks before the 1929 stock market crash

Larry Kilgallen
Jan 3, 2000
In article <3870D7EC...@austin.rr.com>, John McCalpin <jmcc...@austin.rr.com> writes:
>
>
> Larry Kilgallen wrote:
>>
>> In article <3870B653...@austin.rr.com>, John McCalpin <jmcc...@austin.rr.com> writes:
>>
>> > I will have to wait until the real product
>> > details
>> > emerge before drawing any conclusions..... Maybe February?
>>
>> Is there an industry-wide venue for such details in February,
>> or did the vendor promise something then, or is that just an
>> expert estimate of when they would say more ?
>
> "February" was my recollection of when the industry analysts
> thought that Compaq was going to publically reveal the system.
> Then again, I am running a high fever, so this could be the
> result of an hallucination. :-)

Isn't banishment from the newsgroup the penalty for believing
industry analysts ? :-)

Terje Mathisen
Jan 3, 2000
rub...@my-deja.com wrote:
>
> In article <84nrda$8...@spool.cs.wisc.edu>,
> "Andy Glew" <gl...@cs.wisc.edu> wrote:
> > > Of course if the SMT machine can "discover" threads in previously
> > > single-threaded code I'm all for it!
> >
> > SMT doesn't discover threads in previously single threaded code.
> >
> > DMT discovers threads in monolithic single threaded code.
> >
> >
>
> What is a DMT?

I don't _know_, but I assumed he meant Dynamic Multi Threading, i.e. the
hw does instruction stream lookahead to discover paths that are
independent of the currently executing code, and can therefore be
started in parallel.

Terje

--
- <Terje.M...@hda.hydro.com>
Using self-discipline, see http://www.eiffel.com/discipline
"almost all programming can be viewed as an exercise in caching"

rub...@my-deja.com
Jan 3, 2000
In article <3870FF...@hda.hydro.com>,

Terje Mathisen <Terje.M...@hda.hydro.com> wrote:
> rub...@my-deja.com wrote:
> >
> > In article <84nrda$8...@spool.cs.wisc.edu>,
> > "Andy Glew" <gl...@cs.wisc.edu> wrote:
> > > > Of course if the SMT machine can "discover" threads in previously
> > > > single-threaded code I'm all for it!
> > >
> > > SMT doesn't discover threads in previously single threaded code.
> > >
> > > DMT discovers threads in monolithic single threaded code.
> > >
> > >
> >
> > What is a DMT?
>
> I don't _know_, but I assumed he meant Dynamic Multi Threading, i.e. the
> hw does instruction stream lookahead to discover paths that are
> independent of the currently executing code, and can therefore be
> started in parallel.
>
OK, then what is the difference between this so-called DMT and OOO hw?

John McCalpin
Jan 3, 2000

Jonathan Thornburg wrote:
>
> In article <386F327F...@austin.rr.com>,
> John McCalpin <jmcc...@austin.rr.com> writes [[lines rewrapped]]
> > Consider some hypothetical processor for the 2003 era, with 4-way to
> > 6-way issue at 1.5 GHz and a very aggressively designed memory system.
> > On-chip caches will still be limited in size to ~2 MB (maybe 4?), so you
> > have a choice of either a large off-chip DRAM cache (~60 ns latency) or
> > just going straight to local memory.
>
> For just these reasons, I would expect such a system to use at least
> one level of off-chip SRAM cache in front of the DRAM. At current SRAM
> speeds (which it will be hard to speed up _too_ much, just due to power
> and PCB-layout problems) that would still give 10-20 clock accesses.

The first problem is that the SRAM costs too much, so you can't really
make the L3 as much larger than the L2 as you would like. With wintel
boxes running very, very well at very low prices, other vendors can
scarcely afford to start adding ~$1600 for 16 MB of fast SRAM L3.

You can build an L3 with fast DRAM (and IBM has a very nice DDR SRAM
interface for their DRAM-in-ASIC technology), but then you run into
the second problem -- where to put the tags.

If you put the tags on chip, you get the best memory latency (i.e. fastest
time to identify a last-level cache miss), but you are stuck with burning
a fair amount of your expensive on-chip cache as external tags, and when
you choose a max size for the external cache, it is *really* fixed.
On the machines I am most familiar with (the SGI Origin2000 with MIPS
R10000/R12000 CPUs), there is one (half-width?) tag RAM for each
four data RAMs, so roughly 10% of the SRAM is tag. This is really bad if you
want to use mega-sized DRAM based caches, since your on-chip space
is limited to about 2 MB for the next few years and you would really
prefer to use only a small part for tags. You can sector the cache,
of course, but this only works well for data access patterns with large
contiguous blocks, and provides effectively a rather small cache when
the data access patterns are small, independent blocks.
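A rough sketch of why on-chip tags eat the budget. All the parameters
here are my illustrative assumptions (a 64 MB direct-mapped L3 with
128-byte lines, a 40-bit physical address, 3 state bits), not numbers
from any real machine:

    #include <stdio.h>

    int main(void)
    {
        /* Assumed parameters -- purely illustrative */
        long long cache_bytes = 64LL << 20;  /* 64 MB DRAM L3 */
        int       line_bytes  = 128;
        int       paddr_bits  = 40;
        int       state_bits  = 3;           /* MESI-ish state */

        long long lines = cache_bytes / line_bytes;   /* 512K lines */
        int offset_bits = 7;                  /* log2(128)          */
        int index_bits  = 19;                 /* log2(512K), direct-mapped */
        int tag_bits    = paddr_bits - index_bits - offset_bits
                          + state_bits;       /* 14 + 3 = 17 bits   */

        double tag_mb = (double)lines * tag_bits / 8.0 / (1 << 20);
        /* ~1.1 MB of tags -- over half of a ~2 MB on-chip budget,
           just to back a 64 MB off-chip cache. */
        printf("tag storage: %.2f MB for a %lld MB cache\n",
               tag_mb, cache_bytes >> 20);
        return 0;
    }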

If you put the tags off-chip, then you can build them out of the same
cheap DRAM that you are using for the data, but you have just added an
additional DRAM latency to the critical path for L3 cache misses.

I am not happy with either alternative.


> > This suggests that if you expect any sort of superscalar performance
> > (>1 IPC sustained), then you would need reorder buffers capable of
> > holding ~200 instructions. I doubt that this could be done without
> > significantly impacting cycle time.
>
> Certainly the "memory wall" is there. But I think we can fudge things
> a bit more with lots of cache levels...

Unfortunately, it is a bit of a fudge, and each time you add this sort
of complexity you get less and less performance benefit from it.

So far, people have generally been able to ignore L1 sizes and block
for L2 when doing performance tuning. I wonder how much more difficult
this will get when everyone has 3 levels of cache, some SRAM, some DRAM,
some as small as 4 MB, some as large as 64 MB (or more?). Is there a
reasonable way to code any more?


Andy Glew
Jan 3, 2000
> > SMT doesn't discover threads in previously single threaded code.
> >
> > DMT discovers threads in monolithic single threaded code.
>
> What is a DMT?

DMT = Dynamic Multi-Threading.

Haitham Akkary's Ph.D. thesis at Portland State University, 1998.

Published "A Dynamic Multithreading Processor", Akkary & Driscoll,
31st Annual ACM/IEEE International Symposium on Microarchitecture,
available online to IEEE Computer Society members with an e-account.
http://computer.org/proceedings/micro/8609/8609toc.htm

It may be considered an OOO/SMT hybrid, with the ability to fork "threads"
out of uniprocessor code by copying register state and predicting
memory and register dependencies, at several simple places:
+ at subroutine calls, fork a new thread after the return
+ fork after loop
In addition, Haitham and I discussed, and he planned to do
+ fork loop bodies
although other researchers may have beaten him to it here.

Extra logic added to ensure that these speculative threads execute
correctly in the presence of dependencies between earlier threads
and later threads.

All of this is done totally transparently, without compiler support
(although that would be nice).
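To make the fork points concrete, here is ordinary C annotated (in
comments) with where a DMT machine might spawn speculative threads.
The comments are my gloss on the list above, not anything from Akkary's
thesis, and of course nothing here is visible to the compiler:

    struct node { struct node *next; int val; };

    static int work(int x) { return x * 2 + 1; }

    int walk(struct node *head)
    {
        int total = 0;
        struct node *p;

        for (p = head; p; p = p->next) {
            /* "fork loop bodies": the hw could speculatively start the
               next iteration's body on another sequencer, predicting
               the p->next dependency. */
            total += work(p->val);
            /* "fork after the return": at the call to work(), the hw
               could fork a thread at the instruction after the call,
               predicting the register state at return. */
        }
        /* "fork after loop": code after the loop exit is guaranteed to
           execute, so a thread could be forked here while the loop is
           still running. */
        return total;
    }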

Performance summary: 1.15x - 1.30x over an OOO; pretty much the
same improvement that an OOO gets over an In-Order machine.
I suspect Haitham's evaluation is conservative, and that there is more
upside.

===

Anecdote: I had planned to do my Ph.D. on my own version of this
(which I called IMT, Implicitly Multi-Threaded), when I learned that
Haitham had already started. I therefore tried to skip a generation.
IMHO DMT or the like is the next generation of microarchitecture
for single threaded applications.

Andy Glew
Jan 3, 2000
> OK, then what is the difference between this so-called DMT and OOO hw?

Present OOO machines (Intel P6, Alpha 21264, etc.) have a single instruction
sequencer. If anything interrupts that sequencing - an I-cache miss or a
branch misprediction - the entire machine clogs up: you can only execute
instructions already in the window. Similarly, if there are many long-latency
instructions in the instruction window, it clogs up.

Present OOO machines are characterized by a single instruction window
that is contiguous in the dynamic instruction stream.

DMT and related microarchitectures have multiple instruction sequencers.
If one instruction sequencer clogs up, hopefully the other instruction sequencers
can keep executing.

SMT has multiple instruction sequencers, but they are explicit, programmer
visible.

DMT has multiple implicit instruction sequencers. Programmer invisible.

DMT is an example of a "skip-ahead" microarchitecture, one that attempts to skip
past a performance problem to more independent code. DMT takes advantage
of "obvious" places to skip ahead, obvious pieces of code that are guaranteed
to be executed: fork after procedure return, fork after loop exit, fork loop bodies.

Other "skip-ahead" places include "control independence" points, such as places
where control flow converges; DMT does not attack those.

===

Here is my rough sketch of microarchitecture history:

Simple In-Order
    Blocks on any cache miss.

Scoreboarded In-Order
    Does not block on cache miss,
    but blocks on first use of cache miss data.

OOO
    Does not block on use of cache miss data
    if there is independent code.
    Blocks on pointer chasing.
    Blocks on I-cache misses and branch mispredictions.

DMT -- the next generation
    Does not block on I-cache misses and branch mispredictions
    if a speculative thread is in execution.
    Does not block on pointer chasing cache misses
    if a speculative thread is executing independent code.

MLP???? -- the next+1 generation
    Parallelizes pointer chasing cache misses, with
    + skip lists
    + data traces
    Unfortunately, these MLP techniques only work for low-valency
    data structures.
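The pointer-chasing entries above are easy to see in C. In sum_serial
every load depends on the previous one, so misses serialize no matter
how large the OOO window is. The skip pointer is my toy illustration of
the "skip lists" item (not a published design): it splits the list into
two independent miss chains whose cache misses can overlap:

    struct node {
        struct node *next;  /* serial chain: each load waits for the last */
        struct node *skip;  /* points two nodes ahead                     */
        int val;
    };

    /* Valency 1: one miss outstanding at a time. */
    int sum_serial(struct node *p)
    {
        int s = 0;
        while (p) {
            s += p->val;
            p = p->next;
        }
        return s;
    }

    /* Caller passes head and head->next; the even and odd chains are
       independent, so two misses can be outstanding (valency 2). */
    int sum_two_chains(struct node *a, struct node *b)
    {
        int s = 0;
        while (a && b) {
            s += a->val + b->val;
            a = a->skip;        /* independent of b's chain */
            b = b->skip;
        }
        while (a) { s += a->val; a = a->skip; }
        while (b) { s += b->val; b = b->skip; }
        return s;
    }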

Bernd Paysan
Jan 4, 2000
John McCalpin wrote:
> The key issue here is that if the directory is in the RDRAM,
> then accesses to data in other processor's caches has an
> RDRAM latency in *series* with the actual intervention latency.
> This is because you must look up the cache line in the directory
> in order to know which processor to ask for the data.

If you put your directory into RDRAM, you can also use the cache
hierarchy of the processor to get it closer to the CPU if it is needed
often - well, at least up into L2 (L1 would be a waste). This would give
remotely cached dirty data about the same latency as local RDRAM access -
for frequent accesses (infrequent ones have to go through TLB remapping,
which is costly enough by itself).

I'm quite surprised by the low latency of remote cached lines. Usually,
when you program NUMAs, you'd rather jump through all necessary hoops to
remove remote reads from your program (in favour of remote writes, which
can be pipelined easily, and allow you to exploit the available bandwidth). From
an architectural point of view, this is also easier to implement, so if
the software guys already know how to do it that way, why not completely
eliminate the expensive snooping or directories for cache consistency?

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/

Bernd Paysan
Jan 4, 2000
Frank Ch. Eigler wrote:
> > I'd really like to have this cleared up, so I cc'd it to Richard Stallman.
>
> Good, perhaps he will respond too.

He did. Since he replied in private mail, I won't quote him in full, but
he wrote that the people who work on Trillian and GCC

RMS> have a non-binding "gentleman's agreement" not to release them;

and therefore don't violate the GPL. He also concluded that in-house copies
are not distributions

RMS> but transfer outside the company is distribution.

Thus it would be possible to have under-NDA inhouse development (you
can't ship, but copying to another cubicle isn't shipping), but as soon
as two companies or persons of different companies are engaged in the
same project (and need copies to proceed), the NDA degrades to a
non-binding gentleman's agreement. It may work so well in case of IA-64,
because the gentlemen in question don't understand that it isn't binding
;-).

After all, the GPL is not an end but a means to achieve the goals of the
GNU Manifesto, and if the GPL fails somehow, a new, revised version will
come out. That's the reason why GPL'd programs typically say that you
may use any later revision of the GPL if you like.

Greg Lindahl
Jan 4, 2000
Bernd Paysan <bernd....@gmx.de> writes:

> The entire point of SMT is that you *need* more than one thread to
> utilize the processor's resources. The inherent ILP may be almost maxed
> out on 21264, so running just one thread on a 21464 won't give you much
> more performance; the small increased parallelism might be eaten up by
> other tradeoffs, like keeping more function units in sync.

I'm amazed that in the year 2000, people still make generalizations
like this.

It's well known that different codes have different amounts of ILP
available on a given processor. For example, a single thread of a
dense, regular matrix computation might be able to max out an entire
SMT chip, while it may take 2 or 3 threads of a compiler workload (lots
of branches, hard to make wide without heroic efforts) to do the same.

> So you have the choice to either put a SMT core on a die, *or* to put
> several single threading cores on the same die. Performance of a single
> thread won't differ that much, so your metric is performance/mm².

Given the above, it should be clear to you that the point of SMT is
flexibility -- threads which can use the entire core will, and they'll
see a speedup over 1 thread on a chip with multiple single-threaded
cores.

-- g


TTK Ciar
Jan 4, 2000
In article <3870FF...@hda.hydro.com>,
Terje Mathisen <Terje.M...@hda.hydro.com> wrote:
>> >
>> > SMT doesn't discover threads in previously single threaded code.
>> >
>> > DMT discovers threads in monolithic single threaded code.
>>
>> What is a DMT?
>
>I don't _know_, but I assumed he meant Dynamic Multi Threading, i.e. the
>hw does instruction stream lookahead to discover paths that are
>independent of the currently executing code, and can therefore be
>started in parallel.

I apologize profusely; in my earlier post I was thinking that SMT
referred to DMT (the technology IBM has been aggressively developing
to such great effect).

-- TTK


Zalman Stern
Jan 4, 2000
Bernd Paysan <bernd....@gmx.de> wrote:
: Yeah, especially with the resulting "the IA-64 executable is too big for
: one CD, let's compile to x86 instead".

a) Hard to imagine even doubling code size causing this. (Stuff that's CD
sized typically has lots of data relative to the amount of compiled code.)
b) IA64 code should compress beautifully given the information redundancy
therein. Compression on distribution media is an existing technology. (And
I wouldn't be surprised to learn compression vendors are already working on
special purpose compressors for IA64 code.)
c) The initial IA64 apps will be targeted at customers who think having two
CDs rather than one demonstrates that the product is "serious" :-)

-Z-

Zalman Stern
Jan 4, 2000
Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
: It reminds me more of an argument made in a paper (IIRC Dileep
: Bhandarkar was one of the authors) around 1996 that compared the
: Pentium Pro to some Alpha on an OLTP workload and argued that the
: Pentium Pro had an advantage due to better code density.

Were they using code sizes measured from real programs or systems? If so,
they're already much better than the stuff we're talking about here.

-Z-

Jan Vorbrueggen
Jan 4, 2000
John McCalpin <jmcc...@austin.rr.com> writes:

> The first problem is that the SRAM costs too much, so you can't really
> make the L3 as much larger than the L2 as you would like. With wintel
> boxes running very, very well at very low prices, other vendors can
> scarcely afford to start adding ~$1600 for 16 MB of fast SRAM L3.

If it saves me the money and especially the (programming, administrative,
etc.) hassle of buying a second computer, I'd say a fast SRAM L3
becomes attractive.

> So far, people have generally been able to ignore L1 sizes and block
> for L2 when doing performance tuning. I wonder how much more difficult
> this will get when everyone has 3-levels of cache, some SRAM, some DRAM,
> some as small as 4 MB, some as large as 64 MB (or more?). Is there a
> reasonable way to code any more?

I've heard there's some linear algebra code that systematically tries out
all reasonable blocking variants, running for a week or so, and has then
learnt the proper blocking for that particular machine - something like
the sketch below. All you have to do is extend that, possibly employing
some "modern" optimization techniques such as a genetic algorithm on a
little network of machines (0.5 8-)).
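A toy version of that empirical search (an ATLAS-style autotuner may be
what Jan is thinking of; I'm not sure). The kernel, the matrix size,
and the candidate block sizes are all placeholders of my own choosing:

    #include <stdio.h>
    #include <time.h>

    #define N 512
    static double A[N][N], B[N][N], C[N][N];

    /* One candidate kernel: blocked matrix multiply, block size nb. */
    static void mm_blocked(int nb)
    {
        int i, j, k, ii, jj, kk;
        for (ii = 0; ii < N; ii += nb)
          for (jj = 0; jj < N; jj += nb)
            for (kk = 0; kk < N; kk += nb)
              for (i = ii; i < ii + nb && i < N; i++)
                for (j = jj; j < jj + nb && j < N; j++)
                  for (k = kk; k < kk + nb && k < N; k++)
                    C[i][j] += A[i][k] * B[k][j];
    }

    int main(void)
    {
        int nbs[] = { 8, 16, 32, 64, 128 };  /* placeholder search space */
        int v, best = nbs[0];
        double t, best_t = 1e30;
        clock_t t0;

        /* Time every variant; keep the winner for this machine. */
        for (v = 0; v < (int)(sizeof nbs / sizeof nbs[0]); v++) {
            t0 = clock();
            mm_blocked(nbs[v]);
            t = (double)(clock() - t0) / CLOCKS_PER_SEC;
            printf("nb = %3d: %.2f s\n", nbs[v], t);
            if (t < best_t) { best_t = t; best = nbs[v]; }
        }
        printf("best block size on this machine: %d\n", best);
        return 0;
    }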

Jan

Terje Mathisen
Jan 4, 2000
rub...@my-deja.com wrote:
>
> In article <3870FF...@hda.hydro.com>,
> Terje Mathisen <Terje.M...@hda.hydro.com> wrote:
> > rub...@my-deja.com wrote:
> > > What is a DMT?
> >
> > I don't _know_, but I assumed he meant Dynamic Multi Threading, i.e. the
> > hw does instruction stream lookahead to discover paths that are
> > independent of the currently executing code, and can therefore be
> > started in parallel.
> >
> OK, then what is the difference between this so-called DMT and OOO hw?

Just that DMT would do a _far_ wider scan of upcoming instructions,
afaik.

It would probably need quite a bit of rollback (speculation) capability
as well.

Andy, where are you? Please tell us exactly what you're looking for in a
DMT cpu!

Jan Vorbrueggen
Jan 4, 2000
lin...@pbm.com (Greg Lindahl) writes:

> Given the above, it should be clear to you that the point of SMT is
> flexablity -- threads which can use the entire core will, and they'll
> see speedup over 1 thread on a chip with multiple single threaded cores.

Quite. That's what I meant in my original post: an SMT presents a single
queue to multiple consumers (here, threads) of resources (here, functional
units). And queueing theory tells us that that is the best configuration.
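A quick numerical check of that claim, using the standard M/M/c formulas
(Erlang C): one shared 2-server queue versus two private M/M/1 queues at
the same total load. The load numbers are arbitrary:

    #include <stdio.h>

    int main(void)
    {
        double lambda = 1.5, mu = 1.0;  /* arbitrary: 75% utilization */

        /* Two private M/M/1 queues, each fed lambda/2:
           mean response time W = 1/(mu - lambda/2). */
        double w_split = 1.0 / (mu - lambda / 2.0);

        /* One shared M/M/2 queue, via the Erlang C formula:
           a = lambda/mu, rho = lambda/(2 mu),
           P(wait) = (a^2/(2(1-rho))) / (1 + a + a^2/(2(1-rho))). */
        double a   = lambda / mu;
        double rho = lambda / (2.0 * mu);
        double erl = (a * a) / (2.0 * (1.0 - rho));
        double pw  = erl / (1.0 + a + erl);
        double w_shared = pw / (2.0 * mu - lambda) + 1.0 / mu;

        printf("two private queues: W = %.2f\n", w_split);   /* 4.00 */
        printf("one shared queue:   W = %.2f\n", w_shared);  /* 2.29 */
        return 0;
    }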

Jan

Larry Kilgallen
Jan 4, 2000
In article <84rqb9$k...@spool.cs.wisc.edu>, "Andy Glew" <gl...@cs.wisc.edu> writes:

> DMT has multiple implicit instruction sequencers. Programmer invisible.
>
> DMT is an example of a "skip-ahead" microarchitecture, one that attempts to skip
> past a performance problem to more independent code. DMT takes advantage
> of "obvious" places to skip ahead, obvious pieces of code that are guaranteed
> to be executed: fork after procedure return, fork after loop exit, fork loop bodies.

I think of the word "fork" when applied to computers as being some
Unix operating system concept, whereas I understand jump, branch
and others as referring to machine instructions.

I don't see how something can be "Programmer invisible" and be based
on constructs of a single operating system, so I presume you must
mean something else by the word "fork". Please elaborate.

Larry Kilgallen

Andy Glew
Jan 4, 2000
> I apologize profusely; in my earlier post I was thinking that SMT
> referred to DMT (the technology IBM has been aggressively developing
> to such great effect).

I am not aware that IBM has done any work in this area.

The multithreaded machine that IBM has published on is
an explicitly parallel machine - I would call it not even SMT,
but Interleaved Multi-Threading.

Andy Glew
Jan 4, 2000
Larry Kilgallen:
> "Andy Glew":

>
> > DMT has multiple implicit instruction sequencers. Programmer invisible.
> >
> > DMT is an example of a "skip-ahead" microarchitecture, one that attempts to skip
> > past a performance problem to more independent code. DMT takes advantage
> > of "obvious" places to skip ahead, obvious pieces of code that are guaranteed
> > to be executed: fork after procedure return, fork after loop exit, fork loop bodies.
>
> I think of the word "fork" when applied to computers as being some
> Unix operating system concept, whereas I understand jump, branch
> and others as referring to machine instructions.
>
> I don't see how something can be "Programmer invisible" and be based
> on constructs of a single operating system, so I presume you must
> mean something else by the word "fork". Please elaborate.

By "fork" I mean "create a new thread".

The "fork/join" model of parallelism, where fork and join are hardware
concepts, is certainly quite old, and may even predate UNIX.

However, there are some similarities: "forking" a new hardware thread
on a DMT can be considered as making a copy-on-write copy of the parent
thread's register and memory state, modulo predictions of data dependency,
and with the difference that updates to an older thread will be propagated
to speculative threads.

Explicitly parallel, non-DMT, machines have also done this. Perhaps the most
explicit synthesis of hardware fork/join and UNIX fork was the Myrias machine
- a 68000 based Canadian supercomputer wannabe of the 1980s. Myrias's
compiler created parallel threads, e.g. to parallelize loops, using COW on
the virtual memory system; when it came time to join, non-overlapping
modified pages would be merged back together. There may be some Myrias
people here who can tell us what happened if there were overlaps - I conjecture
that they may have declared a misspeculation, and discarded the speculative thread.
I conjecture that Myrias's mechanism may have been totally based on UNIX COW
fork, with a corresponding join primitive added to the OS.
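A minimal sketch of the join-time merge Andy conjectures, at page
granularity: each speculative thread records which pages it modified,
and the join either merges disjoint modifications or declares a
misspeculation. All names and details here are mine, not Myrias's:

    #include <stdio.h>

    #define NPAGES 16

    struct thread_state {
        unsigned char dirty[NPAGES];  /* 1 = page modified by thread */
    };

    /* Returns 1 on success (disjoint writes merged), 0 on conflict. */
    static int join(const struct thread_state *t1,
                    const struct thread_state *t2,
                    unsigned char merged[NPAGES])
    {
        int p;
        for (p = 0; p < NPAGES; p++) {
            if (t1->dirty[p] && t2->dirty[p])
                return 0;  /* overlap: misspeculation, discard thread */
            merged[p] = t1->dirty[p] | t2->dirty[p];
        }
        return 1;
    }

    int main(void)
    {
        struct thread_state a = {{0}}, b = {{0}};
        unsigned char merged[NPAGES];

        a.dirty[2] = 1;  /* thread A wrote page 2 */
        b.dirty[5] = 1;  /* thread B wrote page 5 */
        printf("disjoint join: %s\n",
               join(&a, &b, merged) ? "ok" : "conflict");

        b.dirty[2] = 1;  /* now both wrote page 2 */
        printf("overlap join:  %s\n",
               join(&a, &b, merged) ? "ok" : "conflict");
        return 0;
    }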

Anyway - I hope that you can now see how such forking may be programmer invisible.
If there is a special instruction, it is explicit.
But, in DMT, the forks are typically implied by existing constructs such as
procedure CALL/RETURN.

----

I'm afraid that I can't resist a politically incorrect pun:

I have a half-written paper somewhere that describes alternative implementations
of forking, typically with variations in how state is extracted from the parent thread.

Working title: "A Whole Family of Mother Forkers".

(The paper and title are real, by the way, although I admit that I am more motivated
by the pun than by the content.)

John McCalpin
Jan 4, 2000

Bernd Paysan wrote:
>
> John McCalpin wrote:
> > The key issue here is that if the directory is in the RDRAM,
> > then accesses to data in other processor's caches has an
> > RDRAM latency in *series* with the actual intervention latency.
> > This is because you must look up the cache line in the directory
> > in order to know which processor to ask for the data.
>
> If you put your directory into RDRAM, you can also use the cache
> hierarchy of the processor to get it closer to the CPU if it is needed
> often, well, at least up into L2 (L1 is waste).

The directory is typically not visible to the processor -- it is accessed
only by the memory controller. You can build a directory cache into
the memory controller. This does not help local latency, but does provide
a slight reduction in the latency of remote dirty interventions.

> I'm quite surprised by the low latency of remote cached lines. Usually,
> when you program NUMAs, you'd rather jump through all necessary hoops to
> remove remote reads from your program (in favour of remote writes, which
> can be pipelined easily, and allow to get the available bandwidth).

For uncached stores this may be true, but you generally do not want to
do cached stores to remote memory locations. Most systems still do an
allocate on the store, so the inter-cpu network traffic is doubled for
stores compared to reads. Also, the store cannot graduate until it has
been given exclusive access to the line, which requires a round trip to
the home directory. Additional time may be required for invalidates,
depending on how relaxed the consistency protocol happens to be.


> From
> an architectural point of view, this is also easier to implement, so if
> the software guys already know how to do it that way, why not completely
> eliminate the expensive snooping or directories for cache consistency?

If you want a machine with fully user-controlled cache coherency, then
of course it could be made to go very fast. There are no good current
examples of such a machine, though the T3E is close in some ways. IBM
at one point made a non-cache-coherent parallel machine out of 4 of the
original POWER/RIOS RS/6000 boxes.

Of course, you would have to write all your software for such a machine
from scratch, especially including the O/S. Debugging can be a nightmare
as well, which is one of the reasons that IBM dropped its offering.

John McCalpin
Jan 4, 2000

Jan Vorbrueggen wrote:
>
> John McCalpin <jmcc...@austin.rr.com> writes:
>
> > The first problem is that the SRAM costs too much, so you can't really
> > make the L3 as much larger than the L2 as you would like. With wintel
> > boxes running very, very well at very low prices, other vendors can
> > scarcely afford to start adding ~$1600 for 16 MB of fast SRAM L3.
>
> If it saves me the money and especially the (programming, administrative,
> etc) hassle of buying a second computer? - I'd say then a fast SRAM L3
> becomes attractive.

It seems to me that all the vendors have decided that there are relatively
few customers in your position. Several vendors (IBM, Compaq, SGI, Sun)
offer 8 MB L2 caches on at least some of their servers, but I have not
heard a big demand for 16 MB caches. Given the increasing prevalence of
multiprocessor-capable systems, it makes a lot more sense to add another
$800 cpu to the system than to get a minor performance boost from an
extra $800 in L2 cache.

> > So far, people have generally been able to ignore L1 sizes and block
> > for L2 when doing performance tuning. I wonder how much more difficult
> > this will get when everyone has 3-levels of cache, some SRAM, some DRAM,
> > some as small as 4 MB, some as large as 64 MB (or more?). Is there a
> > reasonable way to code any more?
>
> I've heard there's some linear algebra code that systematically tries out
> all reasonable blocking variants, running for a week or so, and then has
> learnt the proper blocking for that particular machine. All you have to do
> is to extend that, possibly employing some "modern" optimization techniques
> such as a genetic algorithm on a little network of machines (0.5 8-)).

"Just a minor matter of programming"?

The trouble is that it has taken Jack Dongarra and his group decades to
get to this point for simple algorithms in dense linear algebra.
Unfortunately, dense linear algebra only accounts for a small fraction
of the identifiable computational kernels in user codes. The more
important kernels are typically sparse matrix solvers, which are much
more difficult to code, to tune, and to parallelize.

I have done some work in "brute force" attempts at optimization of this
sort, and it is not particularly satisfying. It is especially hard to
be sure that your source code generator is outputting code that is of
the "style" that the compiler knows how to best optimize. For example,
try to find a single coding of the following that compiles to excellent
code on a large range of systems (my apologies for the Fortran):

      double precision function dot (a, b, n)
      integer n
      double precision a(n), b(n)
      double precision sum
      integer i
      sum = 0.0d0
      do i = 1, n
         sum = sum + a(i)*b(i)
      end do
      dot = sum
      return
      end

I suspect that the coding problem is even worse in C....
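For comparison, the obvious C rendering. The trouble alluded to is that,
without C99's brand-new `restrict` qualifier, the compiler must assume
`a` and `b` may alias and so be conservative about reordering and
software-pipelining the loop:

    /* Plain C dot product -- semantically the Fortran code above. */
    double dot(const double *a, const double *b, int n)
    {
        double sum = 0.0;
        int i;

        /* With no aliasing information, many compilers will not
           unroll or pipeline this as aggressively as the Fortran,
           where a() and b() are guaranteed distinct. */
        for (i = 0; i < n; i++)
            sum += a[i] * b[i];
        return sum;
    }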

Chris Gray
Jan 4, 2000