
IA64 and Alpha architecture comparison


Jan Vorbrueggen

29.12.1999
If you haven't heard yet, DEC^H^H^HCompaQ have posted a nice little paper
(http://www.digital.com/hpc/ref/ref_alpha_ia64.pdf) comparing IA64 with
the Alpha architecture. Among other things, it contains very good (IMHO)
explanations of many of the ILP-enhancing techniques. Also, for the first
time I've understood the real advantage of simultaneous multithreading,
which the EV8/21464 will implement (as per the CompaQ roadmap). I wonder
why that wasn't made clearer in all those papers the ex-DEC group has
published, at least one of them in IEEE Computer (which is supposedly
directed at a general audience)!

Jan

Bernd Paysan

29.12.1999
Jan Vorbrueggen wrote:
> If you haven't heard yet, DEC^H^H^HCompaQ have posted a nice little paper
> (http://www.digital.com/hpc/ref/ref_alpha_ia64.pdf) comparing IA64 with
> the Alpha architecture.

The article is certainly quite biased in some respects. I want to pick
the memory latency hidden by OOO execution as an example:

Typical workloads either have streaming memory access (easily vectorizable
code, where memory latency can be hidden by prefetch hints) or overall very
limited parallelism. Look at the pointer-chasing parts of SPECint, e.g.
GCC. You load a pointer from a structure, and then use that pointer to
obtain a value from elsewhere to compare it with another value in memory
(DAG-subexpression comparison). Now it's clear that this sort of program
will stall after two outstanding memory requests in the queue, no matter
how OOO your architecture is, or how many speculative load hints your
compiler inserted.
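
In C, the pattern looks something like this (a minimal sketch; the
struct and field names are made up for illustration, not taken from GCC):

    /* Pointer chasing: each load depends on the one before it. */
    struct node {
        struct node *next;   /* pointer loaded from a structure */
        int         *value;  /* pointer used to reach the data  */
    };

    int find(struct node *n, int key)
    {
        while (n) {
            /* Load n->value (depends on n), then *n->value (depends
             * on that load), then compare. The dependence chain caps
             * the number of misses in flight, however large the OOO
             * window is. */
            if (*n->value == key)
                return 1;
            n = n->next;     /* the next pointer also depends on n */
        }
        return 0;
    }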

The only way around that is SMT, because then you at least have other
threads to run. I'm not sure if this works well with GCC, because
although you can do a make -j to max out the threads available, the
working set of GCC is quite large, resulting in cache thrashing (at
least to L2).

Replace GCC with your favourite database, and the picture doesn't
change that much. Still large working set, still lots of pointers, still
poor data locality.

Does anyone have data for how many outstanding misses accumulate in
typical "commercial" workloads - min, on average, max - or in scientific
workloads?

Same problem with branch prediction: it works wonderfully on loops or
for exception handling, where branches are either almost always taken or
almost always not taken. However, the branches in data-driven applications
are *the* data-driven element of that application, and if the programmer
did a good job at balancing the search tree, hit/miss are 50/50, totally
unpredictable. That's why conditional moves (or other conditional
operations) help so much here - you remove the data-dependent branch,
and only predict the looping branch that just checks for the end of the
tree. And the Alpha certainly *has* conditional moves for this sort of
code.
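
A small sketch of that transformation in C (the tree type is
hypothetical; the point is what the compiler can do with the ternary):

    struct tree {
        int          key;
        struct tree *left, *right;
    };

    struct tree *lookup(struct tree *t, int key)
    {
        while (t && t->key != key) {
            /* The 50/50 data-dependent branch becomes a conditional
             * move (e.g. Alpha CMOVLT); only the predictable loop-exit
             * test remains as a branch. */
            t = (key < t->key) ? t->left : t->right;
        }
        return t;
    }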

Bottom line: IMHO IA64 shows clearly that you lose all of the VLIW
advantages (simpler decode, shorter pipeline, easier implementation) if
you take a design-by-committee approach and try to eliminate the
downsides of VLIW by introducing every bell and whistle you've heard about.
The net result is that the implementation is significantly more
difficult than a complex OOO execution core, and the success of all the
features isn't guaranteed. If you have a 20% longer cycle time, an
architectural speedup of 20% is gone. If you have a long pipeline, code
like object-oriented programs with many indirect calls, or other
threaded-code systems, falls flat (Gforth, which is threaded code, e.g.
gains absolutely nothing overall from the 21164 to 21264 transition; the
fastest processor in our Benchres file now is the 450 MHz overclocked
Mendocino, with the 21264s paling behind the 21164As except on sieve).

The stated "low" instruction memory profile of the Alpha is also only in
comparison with IA-64, for the programs I know comparisons, Alpha
binaries are the largest (no IA-64 data available yet); compact ones
like PA-RISC is less than half the size, undercutting even CISCs like
IA-32.

Whether SMT is a silver bullet is as yet unclear to me, too. You need a
larger register file, and due to the burst-like access pattern (stall -
high ILP - stall - high ILP), dividing the register file isn't that
successful. You also need larger caches to reduce thrashing. We can start
taking bets on whether the 21264 core or the 21464 core will deliver more
performance/mm².

The only thing I'm really convinced is a good idea is DSM, Compaq's
new TLA for (CC)NUMA. But that NUMA scales better than system busses
isn't news.

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/

Zalman Stern

30.12.1999
It contains some illustrative examples, but no one should confuse this with
quantitative analysis. It's basically a marketing piece.

Which is not to say it's necessarily wrong, but there certainly isn't any
evidence provided other than trivially small examples, many of which
aren't scheduled well for IA64 anyway. The whole 33% code bloat thing, for
example, seems about as well thought out as the early code density arguments
the CISC folks threw at RISC. What goes around comes around, I guess.

The points about SMT do not consider the cost of implementation. Which in a
comparison to IA64 is fine, 'cause it's hard to imagine an SMT Alpha being
much harder or more resource-intensive than a comparable-technology IA64
implementation.

-Z-

Toon Moene

30.12.1999
Zalman Stern <zal...@netcom9.netcom.com> wrote:

> It contains some illustrative examples, but no one should confuse this with
> quantitative analysis. It's basically a marketing piece.

Yeah - now that's a good one! Digital^H^H^H^H^H^H^HCompaq engineers try to
offset the marketing hype behind a non-existent chip with some detailed
analysis, and this is what you can come up with?!?!?!?

Note the date on the document: 11th of October 1999. Certainly no one
outside Intel would have had his eyes - let alone his hands - on an Itanium
processor chip, no?

--
Toon Moene (mailto:to...@moene.indiv.nluug.nl)
Saturnushof 14, 3738 XG Maartensdijk, The Netherlands
Phone: +31 346 214290; Fax: +31 346 214286
GNU Fortran: http://gcc.gnu.org/onlinedocs/g77_news.html

Rob Young

30.12.1999
In article <84ga7c$mih$1...@nntp3.atl.mindspring.net>, Zalman Stern <zal...@netcom9.netcom.com> writes:
> It contains some illustrative examples, but no one should confuse this with
> quantitative analysis. It's basically a marketing piece.
>

How about "Executive Summary"? Says that at the top.

To presume there isn't quantitative analysis behind this piece
would be rather presumptuous, wouldn't it? If this
is marketing fluff, it seems to be rather high-level fluff.

Come to think of it... that paper is *MUCH* better than the marketing
pieces in MDRONLINE regarding Merced and IA64 hype, wouldn't you
say?

>
> Which is not to say it's necessarily wrong, but there certainly isn't any
> evidence provided other than trivially small examples, many of which
> aren't scheduled well for IA64 anyway. The whole 33% code bloat thing, for
> example, seems about as well thought out as the early code density arguments
> the CISC folks threw at RISC. What goes around comes around, I guess.
>

So, you are saying that is not accurate? What would be a more
accurate code bloat number, 25%?

>
> The points about SMT do not consider the cost of implementation. Which in a
> comparison to IA64 is fine, 'cause it's hard to imagine an SMT Alpha being
> much harder or more resource-intensive than a comparable-technology IA64
> implementation.
>

This page is a good resource:

http://www.cs.washington.edu/research/smt/

Overview:

http://www.cs.washington.edu/research/smt/papers/ieee_micro.pdf

Database analysis:

http://www.cs.washington.edu/research/smt/papers/smtdatabase.pdf
http://www.cs.washington.edu/research/smt/papers/smtdatabaseabstract.html


"For example, using a memory-intensive OLTP workload, an 8-context SMT
processor achieves a 3-fold increase in instruction throughput over a
single-threaded superscalar with similar resources."

------

Regarding cost of implementation... if the payback, according to that
marketing paper and smtdatabase.pdf, is a factor-of-3 gain
in database performance, it seems like something worth
shooting for.

I can't find it, but I seem to recall SMT adds 3% to chip
real estate... could be off there.

Rob


Zalman Stern

30.12.1999
Toon Moene <to...@moene.indiv.nluug.nl> wrote:
: Yeah - now that's a good one! Digital^H^H^H^H^H^H^HCompaq engineers try to
: offset the marketing hype behind a non-existent chip with some detailed
: analysis, and this is what you can come up with?!?!?!?

Sorry, but I call 'em as I see 'em. The document does not contain enough
information to judge whether there was a (correct) quantitative analysis or
not. I allowed that they may be right, but some of the stuff in the
document is so oversimplified (or just wrong) that I consider it somewhere
between misleading and worthless.

I've said the same thing about Intel's 8 Queens examples and such: they're
worthless for drawing architectural conclusions. You need to measure what
happens over large programs. More and more true with the increased
complexity of huge OOO windows, massive ILP, and SMT. Small examples didn't
cut it then and they certainly don't cut it now.

IA64 may or may not suck, but we can't tell from this noise now can we? And
more noise doesn't help.

: Note the date on the document: 11th of October 1999. Certainly no one
: outside Intel would have had his eyes - let alone his hands - on an Itanium
: processor chip, no?

Yeah so? The document in question rarely mentions Itanium (Merced) but is
an architectural comparison. The implication is that DEC has a lot of
experience simulating architectures with features similar to IA64. I expect
this is definitely true, but if so, I'd like to have seen more data.

There's also the detail that Itanium is scheduled to ship well before an
SMT Alpha. It'll be interesting to see what happens in the interim... It's
very possible that Intel and HP have specified a standard runtime interface
for low-level thread creation and synchronization. If so, they could add MT
or SMT in a future implementation. I doubt it, but since they aren't saying
anything about the systems programming aspects, you can't tell what they're
thinking. Assuming they are idiots is probably a bad idea. The Compaq paper
also mentions more than once that they think OOO IA64 is so difficult that
it is unlikely to happen. People used to say the same thing about
superscalar x86...

Of course it's all going to be moot after Transmeta announces, now isn't it :-)

-Z-

Zalman Stern

30.12.1999
Rob Young <you...@eisner.decus.org> wrote:
: To presume there isn't quantitative analysis behind this piece
: would be rather presumptuous, wouldn't it? If this
: is marketing fluff, it seems to be rather high-level fluff.

Who presumed any such thing? I said it isn't in the paper.

: Come to think of it... that paper is *MUCH* better than the marketing
: pieces in MDRONLINE regarding Merced and IA64 hype, wouldn't you
: say?

I don't think I read the whole thing. MDR runs a lot of fluff too. Maybe
things will improve with recent editorial changes. (Or maybe not.)

[...]
: So, you are saying that is not accurate? What would be a more
: accurate code bloat number, 25%?

No, I didn't say it's "not accurate." I said the logic they are using to
support "it must be at least 33% worse" is wrong. (E.g. it assumes that
predication bits are completely worthless. Were this so, ARM probably
wouldn't be known for good code density.) I expect it will be worse than
Alpha (and the Intel/HP folks seem to have admitted as much).

Code density will have a lot to do with the compilers, in particular
whether the optimizer chooses to aggressively pursue straight-line execution
speed, or chooses tighter code in a particular area. Put another way, I
wouldn't be surprised if you need a completely different set of compiler
switches on IA64 for number-crunching kernels and for compiling a
database. (No, I don't consider that a good thing.)

-Z-

Bernd Paysan

30.12.1999
Rob Young wrote:
> How about "Executive Summary"? Says that at the top.
>
> To presume there isn't quantitative analysis behind this piece
> would be rather presumptuous, wouldn't it? If this
> is marketing fluff, it seems to be rather high-level fluff.
>
> Come to think of it... that paper is *MUCH* better than the marketing
> pieces in MDRONLINE regarding Merced and IA64 hype, wouldn't you
> say?

I would rather describe it as a biased study. There are several pieces
which are plainly honest, and some pieces are just obviously biased,
because the author knows and likes the Alpha architecture better than
IA64 - which is no surprise at all. Given the secrecy of Intel, we don't
have to wonder why more well-founded analyses are hard to obtain. E.g.
regarding code bloat, I'd just compile a few programs and print the
size report here (especially the text segment). But there is no compiler
available now that would allow you to do that (to get a compiler, you
have to sign an NDA - although it's not binding for GCC, since the GPL
doesn't allow you to give up redistribution rights).

Toon Moene

31.12.1999
Zalman Stern <zal...@netcom17.netcom.com> wrote:

> I wrote:

> : Yeah - now that's a good one! Digital^H^H^H^H^H^H^HCompaq engineers try to
> : offset the marketing hype behind a non-existent chip with some detailed
> : analysis, and this is what you can come up with?!?!?!?

> Sorry, but I call 'em as I see 'em. The document does not contain enough
> information to judge whether there was a (correct) quantitative analysis or
> not. I allowed that they may be right, but some of the stuff in the
> document is so oversimplified (or just wrong) that I consider it somewhere
> between misleading and worthless.

> I've said the same thing about Intel's 8 Queens examples and such: they're
> worthless for drawing architectural conclusions.

Ah yes, but that's my whole point - the Alpha world has to offset the Intel
hype - but without the detailed knowledge Intel has about the choices that
went into their architecture.

So they produce a document that contains the same fallacies as the original
public Intel information. Tough luck.

What I would like to see is some information from Intel on how they're going
to beat the Alpha architecturally (or otherwise) with their "winning horse".
I simply don't see how they could do that.

See you in the next millennium ...

Paul DeMone

31.12.1999

Bernd Paysan wrote:
[snip]

> The stated "low" instruction memory profile of the Alpha is also only in
> comparison with IA-64, for the programs I know comparisons, Alpha
> binaries are the largest (no IA-64 data available yet); compact ones
> like PA-RISC is less than half the size, undercutting even CISCs like
> IA-32.

Are you comparing 32-bit PA-RISC binaries to 64-bit LP64 Alpha binaries
(which include 64-bit linkage tables)? Are you including run-time
libraries? What about compiler directives (the Alpha compiler loves to
octaword-align branch targets)? Compare apples to apples, please.

Your comment is far less useful than this Compaq white paper.

>
> Whether SMT is a silver bullet is as yet unclear to me, too. You need a
> larger register file, and due to the burst-like access pattern (stall -
> high ILP - stall - high ILP), dividing the register file isn't that
> successful. You also need larger caches to reduce thrashing. We can start
> taking bets on whether the 21264 core or the 21464 core will deliver more
> performance/mm².

And the 21264 probably sucks on this metric compared to StrongARM.
So what's your point? High-end processors always pay the price
for bleeding-edge performance in terms of performance normalized
to some cost metric like price, chip area, or power.

What interests me is how a ~300 mm2, ~140 Watt Merced compares
in performance to a ~150 mm2, 80 Watt EV68.


--
Paul W. DeMone The 801 experiment SPARCed an ARMs race of EPIC
Kanata, Ontario proportions to put more PRECISION and POWER into
dem...@mosaid.com architectures with MIPSed results but ALPHA's well
pde...@igs.net that ends well.

John Dallman

31.12.1999
In article <386BD144...@gmx.de>, bernd....@gmx.de (Bernd Paysan)
wrote:

> ... considering code bloat, I'd just compile a few programs and print
> the size report here (especially the text segment). But there is no
> compiler available now that would allow you to do that (to get a
> compiler, you have to sign an NDA - although it's not binding for GCC,
> since the GPL doesn't allow you to give up redistribution rights).

After some time staring at the architecture descriptions, I think you can
expect code size figures to oscillate quite a bit as the compiler
technology matures. Given the tendency of some of the more pointy-haired
people involved with the computer industry to stick to arbitrary decisions
they made years ago[1], I'm not surprised that Intel wants to keep size
figures under wraps for a while. When you operate with a major eye on your
stock price, paranoia sets in.

I also suspect that the dodge common among less skillful Windows
programmers of "the release compile crashes - we'll ship the debug" will
hit a serious problem with unacceptable code sizes.

[1] "Those code sizes were too big on the original compiler, so they'll be
too big now!"

---
John Dallman j...@cix.co.uk


Frank Ch. Eigler

31.12.1999

Bernd Paysan <bernd....@gmx.de> writes:

> [...] But there is no compiler available now that would allow you to


> do that (to get a compiler, you have to sign a NDA - although it's
> not binding for GCC, since the GPL doesn't allow you to give up
> redistribution rights).

This is an incorrect understanding of the interaction between NDAs
and the GPL. It is not the case that "the GPL doesn't allow you to
give up redistribution rights". Rather, the GPL prevents a
distributor from forcing a recipient to give up her redistribution
privileges.

When the recipient undertakes the NDA, she waives her own GPL
privilege to share with non-NDA third parties -- it is not the
distributor forcing her to waive those privileges. Refusing to
distribute software to non-NDA third parties is fine by the GPL too.

The bottom line is that a non-NDA individual is unlikely to get hold
of an ia64 gcc port before the NDAs are dissolved.

- FChE

TTK Ciar

31.12.1999
In article <386A9263...@gmx.de>,
Bernd Paysan <bernd....@gmx.de> wrote:
>Jan Vorbrueggen wrote:
>> If you haven't heard yet, DEC^H^H^HCompaQ have posted a nice little paper
>> (http://www.digital.com/hpc/ref/ref_alpha_ia64.pdf) comparing IA64 with
>> the Alpha architecture.
>
>The article is certainly quite biased in some respects. I want to pick
>the memory latency hidden by OOO execution as an example:
>
>Typical workloads either have streaming memory access (easily vectorizable
>code, where memory latency can be hidden by prefetch hints) or overall very
>limited parallelism. Look at the pointer-chasing parts of SPECint, e.g.
>GCC. You load a pointer from a structure, and then use that pointer to
>obtain a value from elsewhere to compare it with another value in memory
>(DAG-subexpression comparison). Now it's clear that this sort of program
>will stall after two outstanding memory requests in the queue, no matter
>how OOO your architecture is, or how many speculative load hints your
>compiler inserted.

You're right. I think OOO does hide some memory latency, but not a
whole lot when it comes to pointer-chasing apps like GCC (and many real-
life workstation applications are primarily pointer-chasers).

>The only way around that is SMT, because then you at least have other
>threads to run. I'm not sure if this works well with GCC, because
>although you can do a make -j to max out the threads available, the
>working set of GCC is quite large, resulting in cache thrashing (at
>least to L2).

I think you're thinking of OS-level multithreading. Using "make -j"
will create a lot of OS-level threads. Processor-level SMT is implemented
an abstraction layer below that, in the hardware, where the CPU will let
nondependent instruction sequences issue while a very large number of
dependent instruction sequences can block while waiting on resources. It
does not require explicit application or OS support -- i.e., even single-
threaded, single-process sequential code can take advantage of SMT.

>Replace GCC with your favourite database, and the picture doesn't
>change that much. Still large working set, still lots of pointers, still
>poor data locality.

Yep.

IMO the only practical way to overcome the "memory wall" is to integrate
main memory and processor logic into the same silicon, so you can reduce
access latency somewhat and widen the data path immensely (1Kb, 4Kb, etc.)
to maximize cache utilization per main memory hit. A large cache would
also help (my gcc working sets usually max out around 5MB). This
is equally applicable to x86, IA64, Alpha, or whatever. IBM currently
has a head start on this technology, but I can't imagine it taking long
for Intel or Compaq/Samsung to emulate them.

>The only thing I'm really convinced is a good idea is DSM, Compaq's
>new TLA for (CC)NUMA. But that NUMA scales better than system busses
>isn't news.

Yes, NUMA is the right approach for making multiprocessor systems
scale well, but Amdahl's law still applies. Whether we have one ALU or
a few dozen, we need to reduce random memory access latencies!

-- TTK


Rob Young

01.01.2000
In article <386D35CF...@igs.net>, Paul DeMone <pde...@igs.net> writes:
>
> What interests me is how a ~300 mm2, ~140 Watt Merced compares
> in performance to a ~150 mm2, 80 Watt EV68
>

Ouch.. ..

Speaking of Merced/Itanium again.. did you catch
Linley's farewell?

Good ol' Linley. Linley pushed Merced so hard you woulda
thunk he was working for Intel marketing.

Linley does a bit of a climb down:

http://www.mdronline.com/q/@20580451sllhpf/mpr/editorials/edit13_17.html

"One of Merced's problems is that its clock speed will not be
competitive with that of other 0.18-micron processors in 2000,
hampering performance. Wisely, Intel has targeted the 0.18-micron
McKinley at more than 1 GHz, making it a true Speed Demon. As long
as IA-64 compilers can match or beat RISC compilers, McKinley will
be an excellent performer."

-----

But wait... according to Intel spokespeople, Merced will indeed
get a better clock. We know Craig is very angry about the low speed,
and that always makes them get to it.

I wonder if we will see more talk about Merced being a server
part. After all, if they crank the clock to 900 MHz, wouldn't it
be running a bit hotter, hampering the whole workstation thing?

If at 700 MHz you can stuff it in a workstation, it may not
be competitive. Catch-22.

Rob


Rob Young

01.01.2000
In article <386BD144...@gmx.de>, Bernd Paysan <bernd....@gmx.de> writes:
> Rob Young wrote:
>> How about "Executive Summary"? Says that at the top.
>>
>> To presume there isn't quantitative analysis behind this piece
>> would be rather presumptuous, wouldn't it? If this
>> is marketing fluff, it seems to be rather high-level fluff.
>>
>> Come to think of it... that paper is *MUCH* better than the marketing
>> pieces in MDRONLINE regarding Merced and IA64 hype, wouldn't you
>> say?
>
> I would rather describe it as a biased study. There are several pieces
> which are plainly honest, and some pieces are just obviously biased,
> because the author knows and likes the Alpha architecture better than
> IA64 - which is no surprise at all. Given the secrecy of Intel, we don't
> have to wonder why more well-founded analyses are hard to obtain. E.g.
> regarding code bloat, I'd just compile a few programs and print the
> size report here (especially the text segment). But there is no compiler
> available now that would allow you to do that (to get a compiler, you
> have to sign an NDA - although it's not binding for GCC, since the GPL
> doesn't allow you to give up redistribution rights).
>

Which, if you look at the date on it... October 11th. What, a week
after Intel presented Merced again at MPR Forum in Oct. 1999?

This thing shows up a week ago. In the interim (Oct-Dec), SGI has
demoed a Merced box, as have others. Maybe in that timeframe
Compaq has seen silicon and has early compilers, etc. By posting
a paper from October knowing what they know in December (being
careful not to reference NDA material, even using good ol' Linley
as a reference), they didn't change anything.

So maybe the left hand doesn't know what the right hand is doing?
Or maybe, more likely, the left hand does know what the right hand
is doing, and by posting in December what they created in October
they are essentially saying this is pretty accurate. OR they
lose face when Merced surprises us all and their little
summary in hindsight appears silly. Maybe, maybe not.

Rob


Bernd Paysan

01.01.2000
Paul DeMone wrote:
> Are you comparing 32-bit PA-RISC binaries to 64-bit LP64 Alpha binaries
> (which include 64-bit linkage tables)? Are you including run-time
> libraries? What about compiler directives (the Alpha compiler loves to
> octaword-align branch targets)? Compare apples to apples, please.

I'm certainly not comparing run-time environments of different OSes,
just the text size of a dynamically linked program. Alpha compilers love
large branch-target alignments because Alphas bitterly need them;
without them you get a significant performance degradation. And to be fair:
GCC doesn't align as much as necessary, so the comparison is even
"optimistic" towards Alpha.

I can also compare a simple code snippet between PA-RISC and Alpha and
just count instructions; it may be a bit better (because the alignment
doesn't come into the picture), but it's still at least 50% more.
PA-RISC's "complicated" addressing modes and conditional execution help
to shrink the code down quite a bit.

Since the performance gap between PA-RISC and Alpha isn't that large, I
don't think it's an unfair comparison: I'm comparing the fastest RISC with
the second-fastest RISC.

Bernd Paysan

01.01.2000
Paul DeMone wrote:
> And the 21264 probably sucks on this metric compared to StrongARM.
> So what's your point? High-end processors always pay the price
> for bleeding-edge performance in terms of performance normalized
> to some cost metric like price, chip area, or power.

The entire point of SMT is that you *need* more than one thread to
utilize the processor's resources. The inherent ILP may be almost maxed
out on the 21264, so running just one thread on a 21464 won't give you much
more performance; the small increase in parallelism might be eaten up by
other tradeoffs, like keeping more function units in sync.

So you have the choice to either put an SMT core on a die *or* put
several single-threaded cores on the same die. Performance of a single
thread won't differ that much, so your metric is performance/mm².

Bernd Paysan

01.01.2000
John Dallman wrote:
> I also suspect that the dodge common among less skillful Windows
> programmers of "the release compile crashes - we'll ship the debug" will
> hit a serious problem with unacceptable code sizes.

Yeah, especially with the resulting "the IA-64 executable is too big for
one CD, let's compile to x86 instead".

It should be obvious to all compiler writers that debugging ought to
work with fully optimized code, so that optimization is tested during the
entire development. The problem is that with optimizations, the
compiler tells you "I just scheduled these 30 lines completely differently,
and removed half of your variables - but worse, your control flow is
gone", and the debugger shows that instructions can execute backwards in
time ;-).

Since several bugs go away if you insert a printf here or there to find
out what's happening, I wonder if these printfs make it into the final
code, too?

Bernd Paysan

02.01.2000
[discussion came up on comp.arch]

I'd really like to have this cleared up, so I cc'd it to Richard Stallman.

Frank Ch. Eigler wrote:
> When the recepient undertakes the NDA, she waives her own GPL
> privilege to share with non-NDA third parties -- it is not the
> distributor forcing her to waive those privileges. Refusing to
> distribute software to non-NDA third parties is fine by the GPL too.

Sorry, no, IMHO it isn't. The entire business around free software is
that you have redistribution rights as *natural rights* - noone, neither
distributor or any third party can resign that, nor yourself (can you
sell yourself as slave? No!). Otherwise, NDAs would be a easy way to get
around GPL. Let's assume that I want to "kidnap" an extended GPL'd
program, say GCC with IA-64 backend. I don't sell the enhanced GCC, I
sell NDAs. You have to sign an NDA to get my GCC version - certainly I'm
using a "straw company", so you don't have to sign the NDA with the GCC
redistributor.

This way, I could turn any GPL'd program into a proprietary program,
which is distributed only among people that signed that NDA. "Helping
neighbours" would be prohibited, because they didn't sign the NDA, and
might not even meet the conditions under which the NDA can be obtained;
NDAs can be discriminatory to anyone the offering party wants to. Read
the GNU manifesto again, and tell me if RMS really intented that. Note
that a NDA-signing community is in no way limited (other than by the
conditions to obtain the NDA), so you could let half the world sign (and
pay for) NDAs to get your derivative of a GPL'd software.

Currently, I can't get an IA-64 GCC without signing an NDA. If I had
signed an NDA, and obtained GCC, item 6 would apply (relevant phrase
underlined):

"6. Each time you redistribute the Program (or any work based on the
Program), the recipient automatically receives a license from the
original licensor to copy, distribute or modify the Program subject to
these terms and conditions. You may not impose any further
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
restrictions on the recipients' exercise of the rights granted herein.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
You are not responsible for enforcing compliance by third parties to
this License."

The NDA imposes a further restriction on the recipient's right to exercise
redistribution under the GPL, and therefore either is null and void, or
Intel may no longer use/copy/modify any other program under the GPL. It
doesn't matter in which order you accept the two legal documents; they
are just plain contradictory. Since Intel isn't the original author
of GCC, they don't even license you GCC - that's RMS and the other GCC
contributors. In other words: it's not their business.

The risk of violating the NDA WRT GCC is that Intel might not give you
further information under NDA (given the quasi-monopoly power Cygnus
Solutions has as experts in GCC, you are in a strong position ;-). If
you redistribute GCC or Trillian, they can only stand on their heads and
wag their feet, no more. GPL'd software distributed under NDA is a
pure goodwill action, not a legally binding one. I have nothing against
developers agreeing to ship only when it's ready, but if one of them
decides otherwise, I can't and won't stop him - the freedom of free
software allows work to continue even when consensus can't be reached.
I'm the coordinator of a GNU project myself, and it's more difficult to
get the co-developers to check in their modifications in time than to
hinder uncontrolled escapes ;-).

The only "open issue" here is what a "redistribution" legally is. The
GPL doesn't define that term, so I just take common copyright practice:
a redistribution is when you make the program in whole or in relevant
parts (more than "fair use") accessible to another person. In other
words: when Microsoft would shout "piracy", you redistributed a program.
Heck, they shout "piracy" even if you copy the program just to another
computer of your own, so that would be a redistribution, too.

Rob Barris

02.01.2000
In article <1999Dec31.220800.1@eisner>, you...@eisner.decus.org (Rob
Young) wrote:

> http://www.mdronline.com/q/@20580451sllhpf/mpr/editorials/edit13_17.html

Linley writes:

"In the subsequent seven years, better IC processes have greatly
improved both the IPC and the cycle time of microprocessors, leading
some vendors to claim to deliver the best of both worlds. But a chip
becomes a Speed Demon through microarchitecture design philosophy, not
IC process gains."

"With its focus on instruction-level parallelism (ILP), IA-64 seems
to be the ultimate Brainiac."

"One of Merced's problems is that its clock speed will not be
competitive with that of other 0.18-micron processors in 2000, hampering
performance. Wisely, Intel has targeted the 0.18-micron McKinley at more
than 1 GHz, making it a true Speed Demon."


What was that again, about speed demons being based on
"microarchitecture design philosophy, not IC process gains"? My head
spins. Which is it, man?

Rob

John

02.01.2000

Bernd Paysan wrote:
>
> Jan Vorbrueggen wrote:
> > If you haven't heard yet, DEC^H^H^HCompaQ have posted a nice little paper
> > (http://www.digital.com/hpc/ref/ref_alpha_ia64.pdf) comparing IA64 with
> > the Alpha architecture.
>
> The article is certainly quite biased in some respects. I want to pick
> the memory latency hidden by OOO execution as an example:

I think it is fairly widely believed that the only latency that OOO is
good at tolerating is L2 cache (hit) latency.

By the time we get to >1 GHz core speeds with the kinds of issue widths
being proposed for systems like EV8 or Power4 or McKinley, it will be
difficult to build large enough reorder buffers to tolerate the latency
of off-chip caches.

Consider some hypothetical processor for the 2003 era, with 4-way to 6-way
issue at 1.5 GHz and a very aggressively designed memory system. On-chip
caches will still be limited in size to ~2 MB (maybe 4?), so you have a choice
of either a large off-chip DRAM cache (~60 ns latency) or just going straight
to local memory. A really aggressive design might target ~100 ns for local
accesses. So L3 accesses are at 90 clocks of latency and local memory accesses
are at 150 clocks of latency. This suggests that if you expect any sort of
superscalar performance (>1 IPC sustained), then you would need reorder buffers
capable of holding ~200 instructions. I doubt that this could be done without
significantly impacting cycle time.
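
To make the arithmetic explicit, a back-of-envelope sketch in C (all
figures are the hypothetical 2003-era numbers above, not measurements):

    #include <stdio.h>

    int main(void)
    {
        double ghz    = 1.5;    /* assumed clock */
        double l3_ns  = 60.0;   /* off-chip DRAM cache latency */
        double mem_ns = 100.0;  /* local memory latency */

        printf("L3 hit: %.0f clocks\n", l3_ns * ghz);   /* ~90  */
        printf("memory: %.0f clocks\n", mem_ns * ghz);  /* ~150 */

        /* To sustain more than 1 IPC across a local memory access,
         * the reorder buffer must cover latency x IPC instructions
         * in flight; a bit above 1 IPC lands around 200 entries. */
        printf("ROB entries: ~%.0f\n", mem_ns * ghz * 1.33);
        return 0;
    }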

On the bright side, tolerating L2 latency effectively makes code generation
a whole lot easier.

Anton Ertl

02.01.2000
In article <84ga7c$mih$1...@nntp3.atl.mindspring.net>,
Zalman Stern <zal...@netcom9.netcom.com> writes:
> The whole 33% code bloat thing, for
> example, seems about as well thought out as the early code density
> arguments the CISC folks threw at RISC.

It reminds me more of an argument made in a paper (IIRC Dileep
Bhandarkar was one of the authors) around 1996 that compared the
Pentium Pro to some Alpha on an OLTP workload and argued that the
Pentium Pro had an advantage due to better code density.

> What goes around comes around, I guess.

In this case quite literally.

Since there have been some arguments about code size in this thread,
here's some data. Everything was compiled with gcc (various
versions), and I report the text sizes of .o files (this should
eliminate library differences and reduce the influence of pointer and
long sizes); the files were built when testing the latest snapshot of
Gforth:
main.o engine.o gcc version
IA32 Linux 4692 22192 2.7.2.1
IA32 Linux 6562 19836 2.7.2.3
MIPS Ultrix 4656 22928 2.4.5
MIPS Irix 7504 25360 egcs-1.1.2?
Alpha DU 7296 24368 2.7.2.2
Alpha Linux 9432 24856 egcs-1.0.3
Sparc Solaris 7486 21116 2.8.1
HPPA HP/UX 8264 19168 2.8.1

Looks like differences between different versions of the compiler have
more influence on the code size than the architecture. I am somewhat
mystified by the large differences between the very similar compiler
versions on the Linux-Intel boxes.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Chris Morgan

02.01.2000
ttk@for_mail_remove_this_and_fruit.larva.apple.shinma.org (TTK Ciar) writes:

> I think you're thinking of OS-level multithreading. Using "make -j"
> will create a lot of OS-level threads. Processor-level SMT is implemented
> an abstraction layer below that, in the hardware, where the CPU will let
> nondependent instruction sequences issue while a very large number of
> dependent instruction sequences can block while waiting on
> resources.

I think this is correct, however

>It does not require explicit application or OS support -- i.e., even
>single-threaded, single-process sequential code can take advantage
>of SMT.

I'm not sure this is true. I agree that make -j4 creates four
processes, and that SMT is one level down, but I thought that meant it
would accelerate threaded programs (e.g. pthreads used explicitly -
good webservers, for example). If I compile a 10-thousand-line FORTRAN
function, I really doubt an SMT processor can "discover" threads
within it and execute them simultaneously. However, if I use Ada or Java
(or ...) and express a threaded program, or just use the current OS's
threading interface directly via explicit creation of threads, then
the SMT machine is effectively handed a list of threads and told to run
these whenever it likes, apart from when I tell it - an easier
problem.
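
For the explicit case I mean something as simple as this pthreads
sketch (the work function is just a placeholder) - the SMT machine is
handed these threads as independent instruction streams:

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    static void *worker(void *arg)
    {
        /* independent work; on SMT, issue slots one thread leaves
         * idle during cache misses can be filled by the others */
        printf("thread %ld running\n", (long)arg);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        long i;
        for (i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }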

Of course if the SMT machine can "discover" threads in previously
single-threaded code I'm all for it!

Chris

--
Chris Morgan <cm at mihalis.net> http://mihalis.net
mihalis.net now cruising at Warp Slackware 7.0 at an altitude of
48MB. Whoo-hoo!

Alan Lehotsky

02.01.2000
In article <84nghr$h7h$1...@news.tuwien.ac.at>,
an...@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

>>>In article <84ga7c$mih$1...@nntp3.atl.mindspring.net>,
>>> Zalman Stern <zal...@netcom9.netcom.com> writes:
>>>> The whole 33% code bloat thing, for
>>>> example, seems about as well thought out as the early code density
>>>> arguments the CISC folks threw at RISC.
>>>

...


>>>
>>> main.o engine.o gcc version
>>>IA32 Linux 4692 22192 2.7.2.1
>>>IA32 Linux 6562 19836 2.7.2.3
>>>MIPS Ultrix 4656 22928 2.4.5
>>>MIPS Irix 7504 25360 egcs-1.1.2?
>>>Alpha DU 7296 24368 2.7.2.2
>>>Alpha Linux 9432 24856 egcs-1.0.3
>>>Sparc Solaris 7486 21116 2.8.1
>>>HPPA HP/UX 8264 19168 2.8.1
>>>
>>>Looks like differences between different versions of the compiler have
>>>more influence on the code size than the architecture. I am somewhat
>>>mystified by the large differences between the very similar compiler
>>>versions on the Linux-Intel boxes.

If you're comparing .o's, code size may be swamped by
artifacts of the object format. Or are you looking at
the output of the 'size' command and extracting the .text
sizes? That in turn is affected by things like constant
strings being made read-only vs. stored in .data.

One of the things I've done to the gcc backend I support is add
code that emits the size of each function as a comment into the
.s file. That makes it easier to do code-size analysis.

-- Al Lehotsky

--
Quality Software Management
http://www.tiac.net/users/lehotsky

Process Improvement | Management Consulting | Compiler Implementation

Andy Glew

02.01.2000
> Of course if the SMT machine can "discover" threads in previously
> single-threaded code I'm all for it!

SMT doesn't discover threads in previously single-threaded code.

DMT discovers threads in monolithic single-threaded code.


Toon Moene

02.01.2000
Chris Morgan <c...@mihalis.net> wrote:

> Of course if the SMT machine can "discover" threads in previously
> single-threaded code I'm all for it!

Hmm, that's not how I read the Compaq paper. In fact, I do not need it.

If SMT can speed up my (compiler-enabled) multithreaded code

(see http://www.openmp.org, OK, it seems to be down at the moment (can you
say Y2K :-) - here's the alternative:
http://www-europe.sgi.com/software/openmp/)

by a factor of 1.5 - 3 on a *single* processor, I'm all for it :-)
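
For the record, "compiler-enabled" means code like this minimal OpenMP
sketch in C (the loop body is just a placeholder) - the directive is all
it takes to get the threads an SMT core could overlap:

    #include <stdio.h>

    int main(void)
    {
        double a[1000], sum = 0.0;
        int i;

        for (i = 0; i < 1000; i++)
            a[i] = i * 0.5;

    /* the compiler turns this loop into multiple threads */
    #pragma omp parallel for reduction(+:sum)
        for (i = 0; i < 1000; i++)
            sum += a[i];

        printf("sum = %g\n", sum);
        return 0;
    }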

Anton Ertl

02.01.2000
In article <lehotsky-686552...@news.tiac.net>,
Alan Lehotsky <leho...@tiac.net> writes:
>In article <84nghr$h7h$1...@news.tuwien.ac.at>,
>an...@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>>>> main.o engine.o gcc version
>>>>IA32 Linux 4692 22192 2.7.2.1
>>>>IA32 Linux 6562 19836 2.7.2.3
>>>>MIPS Ultrix 4656 22928 2.4.5
>>>>MIPS Irix 7504 25360 egcs-1.1.2?
>>>>Alpha DU 7296 24368 2.7.2.2
>>>>Alpha Linux 9432 24856 egcs-1.0.3
>>>>Sparc Solaris 7486 21116 2.8.1
>>>>HPPA HP/UX 8264 19168 2.8.1
>>>>
>>>>Looks like differences between different versions of the compiler have
>>>>more influence on the code size than the architecture. I am somewhat
>>>>mystified by the large differences between the very similar compiler
>>>>versions on the Linux-Intel boxes.
...

> Or are you looking at
> the output of the 'size' command and extracting the .text
> sizes?

Right.

> That in turn is affected by things like constant
> strings being made read-only vs. stored in .data.

I had thought that gcc stores strings in the text section. However, it
looks like it stores strings in a section called .rodata with ELF on
Linux.

The differences in the main.o text sizes for the Linux-Intel are
mainly from different size commands: For the default output format GNU
size 2.7 reports just the text section as text size, whereas GNU size
2.9.1 reports the sum of .text and .rodata as text size.

The differences in the engine.o text sizes are mainly due to using
-m486 on one system, and not using it on the other system (Gforth's
configure automatically inserts this flag when it is run on a 486). I
think the main reason for the size difference from this flag is the
code alignment to 16-byte boundaries.

Here's a table of sizes of just the .text segment (strings seem to
reside in .rdata/.rodata on all machines except HP/UX; there I
subtracted the $LIT$ size from the text size):

               main.o  engine.o  gcc version
IA32 Linux       4692     22192  2.7.2.1 -m486
IA32 Linux       4321     19780  2.7.2.3
MIPS Ultrix      4656     22928  2.4.5
MIPS Irix        7504     25360  egcs-1.1.2?
Alpha DU         7296     24368  2.7.2.2
Alpha Linux      6976     24736  egcs-1.0.3
SPARC Solaris    4908     21012  2.8.1
HPPA HP/UX       5656     19000  2.8.1

The conclusions don't change much.

Paul DeMone

02.01.2000

Rob Barris wrote:
[snip]


> What was that again, about speed demons being based on
> "microarchitecture design philosophy, not IC process gains"? My head
> spins. Which is it, man?

Speed demon is a microarchitecture design choice.

IC process gains are a tide that raises all boats, brainiacs
and speed demons alike.

Anton Ertl

02.01.2000
In article <1999Dec31.220800.1@eisner>,
> "One of Merced's problems is that its clock speed will not be
> competitive with that of other 0.18-micron processors in 2000,
> hampering performance. Wisely, Intel has targeted the 0.18-micron
> McKinley at more than 1 GHz, making it a true Speed Demon.

So he's telling us Merced won't even reach 1 GHz in 0.18um? Given the
length of the pipeline, I would have expected it to have at least the
clock rate of a 21164 in the same technology (unfortunately the fastest
21164s are in 0.35um, but I would expect that 1 GHz should be possible
for a 0.25um 21164, and a 0.18um version should be quite a bit faster
still).

Next question: is a 0.18um 1 GHz chip a speed demon? How many chips in
that technology will miss that mark? Well, the Athlon will hit it, and
the EV68 should do it, too. I also expect some IA-32-only implementation
from Intel to hit it. PPC, HPPA, MIPS, and SPARC will probably miss
it, so 1 GHz could be seen as the dividing line between speed demons and
other chips; it's funny to see IA32 implementations as speed demons.

John Dallman

02.01.2000
In article <386E7B24...@gmx.de>, bernd....@gmx.de (Bernd Paysan)
wrote:

> It should be obvious to all compiler writers that debugging ought to
> work with fully optimized code, so that optimization is tested during the
> entire development. The problem is that with optimizations, the
> compiler tells you "I just scheduled these 30 lines completely differently,
> and removed half of your variables - but worse, your control flow is
> gone", and the debugger shows that instructions can execute backwards in
> time ;-).

You know that. I know that. Most people on comp.arch know that. Sadly,
this is one that compiler writers (or their managers) tend to rationalise
with "well, we won't add any bugs in the optimiser..." and "it's really
hard to express some of these concepts in the debug info".

I think I may try to see if I can persuade work to let me write about some
recent experiences for JILP.

---
John Dallman j...@cix.co.uk

Rob Young

03.01.2000

What about EV7's Network Interconnect?

http://www.digital.com/alphaoem/present/sld018.htm

With 32 EV7s you wonder if L3 would even be necessary. Maybe
I'm not "on target", but it seems routing at that speed and bandwidth,
and asynchronously... it looks like a fairly large L2. Some
are more remote than others, but not that remote. Wouldn't
a hypercube ensure CPUs are at most four hops from each other?
If so, then at most (acknowledging hidden overhead may add
a bit) remote L2 is 60 ns away, but "average" L2 access a little more
than half that (35-37 ns). But maybe more importantly, this
L2 would be 48 MByte or more aggregate.
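
The hop arithmetic, spelled out (a small C sketch; the 15 ns per-hop
figure is the slide's claim, the rest is my speculation):

    #include <stdio.h>

    int main(void)
    {
        double hop_ns   = 15.0; /* claimed processor-to-processor hop */
        int    max_hops = 4;    /* assumed worst case in a 32-CPU box */

        /* worst-case remote L2: 4 hops x 15 ns = 60 ns */
        printf("worst case: %.0f ns\n", hop_ns * max_hops);
        /* "average" at a little more than half the worst case */
        printf("average:   ~%.0f ns\n", hop_ns * max_hops * 0.6);
        return 0;
    }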

Also, as an aside... earlier you were wondering what
Wildfire was going to do with 64 PCI busses. From that slide
it appears each processor can handle 3 GByte of I/O per second.
32 EV7s in a box could pretty much saturate 64 PCI busses with
I/O bandwidth needs/wants. Granted, it will require a "fork-lift"
swap-out to stick EV7s in there. But it seems the box would
have a good deal of I/O to ensure usefulness post-EV6. In other words,
you're not stuck with a boat anchor. The box would support CPU upgrades
one or two generations out.

Rob


John McCalpin

03.01.2000

Rob Young wrote:
>
> In article <386F327F...@austin.rr.com>, John <jmcc...@austin.rr.com> writes:
> >
> > Consider some hypothetical processor for the 2003 era, with 4-way to 6-way
> > issue at 1.5 GHz and a very aggressively designed memory system. On-chip
> > caches will still be limited in size to ~2 MB (maybe 4?), so you have a choice
> > of either a large off-chip DRAM cache (~60 ns latency) or just going straight
> > to local memory. A really aggressive design might target ~100 ns for local
> > accesses. So L3 accesses are at 90 clocks of latency and local memory accesses
> > are at 150 clocks of latency. This suggests that if you expect any sort of
> > superscalar performance (>1 IPC sustained), then you would need reorder buffers
> > capable of holding ~200 instructions. I doubt that this could be done without
> > significantly impacting cycle time.
> >
>
> What about EV7's Network Interconnect?
>
> http://www.digital.com/alphaoem/present/sld018.htm
>
> With 32 EV7s you wonder if L3 would even be necessary. Maybe
> I'm not "on target", but it seems routing at that speed and bandwidth,
> and asynchronously... it looks like a fairly large L2. Some
> are more remote than others, but not that remote. Wouldn't
> a hypercube ensure CPUs are at most four hops from each other?
> If so, then at most (acknowledging hidden overhead may add
> a bit) remote L2 is 60 ns away, but "average" L2 access a little more
> than half that (35-37 ns). But maybe more importantly, this
> L2 would be 48 MByte or more aggregate.

There is very little detail in that EV7 presentation -- at least to someone
who has designed NUMA systems. The devil ends up being in the details, and
there are a lot of them.

A couple of points:

(1) Bandwidth Hierarchy:
    L1 <-> registers             >16 GB/s (2 loads at >1 GHz)
    L2 <-> L1                     16 GB/s peak (slide 16)
    RDRAM <-> memory controller    6 GB/s peak (slide 17)
    chip-to-chip                   2 GB/s peak per channel,
                                   8 GB/s peak per chip

    So access to a single remote L2 cache is at least 8x slower in
    bandwidth than access to your own L2 cache, and 3x slower in
    bandwidth than access to your own DRAM.

(2) Cache Coherence
    Slide 17 says that the memory controller supports "directory-based
    cache coherence". Although I like directory-based cache coherence,
    there are lots of details that are hidden from us. For example:

    (a) Is there any snooping done in the system, or is it all
        directory-based? If there is some snooping, what is the domain
        over which snooping occurs? The tradeoff is that snooping
        provides faster access to data in other processors' caches, but
        typically increases local memory latency.

        If there is no snooping, then each CPU must have a directory
        to map the status of each line in its RDRAM with respect to
        the other caches.

    (b) Where are the directories?
        The block diagrams don't show any extra connections for
        directories, so it is possible that (like the SGI Origin2000)
        the directories are held in the DRAM. This can be done either
        with extra-wide RAM (like the Origin2000), or by allocating a
        part of each RIMM to holding directory data. The latter
        approach allows the use of standard RDRAM parts, but also
        introduces a lot of complexity into the memory controller.

        The key issue here is that if the directory is in the RDRAM,
        then access to data in another processor's cache has an
        RDRAM latency in *series* with the actual intervention latency.
        This is because you must look up the cache line in the directory
        in order to know which processor to ask for the data.

(3) Latency Hierarchy
    L1:    a couple of clocks (like the EV6)
    L2:    "12 ns load-to-use" (slide 16)
    RDRAM: depends on the cache coherency protocol used.
           A fully directory-based scheme might be as low as ~100 ns
           for local accesses (note that for RDRAM accesses, the
           directory lookup can be done in parallel with the data
           access). A hybrid snooping + directory scheme will have
           higher local access latency.
    Remote RDRAM: You need to look at what they are claiming very
           carefully. On slide 18, they claim "15 ns processor to
           processor" latency. It is not clear to me what this means,
           but I am certain that it does not mean what it looks like
           it means!

    What sort of processor-to-processor transactions are you interested
    in? I do not think that the Alpha ISA supports any!

    What it supports is a cache-coherent memory space, so all that you
    can do are processor-to-memory transactions, and that memory may be
    cached on another processor's die. If the coherence is directory-
    based, then there is an RDRAM lookup to find out which remote node
    has the data, followed by an intervention request, followed by the
    L2 lookup on the remote node, followed by the return of the data,
    followed by the processing of the returned data. This is not a
    "15 ns" sequence, since it includes a 30 ns Rambus CAS latency plus
    a 12 ns L2 access. I would be really, really surprised if they can
    do this operation in less than:

        100 ns  directory lookup
         30 ns  average chip-to-chip latency for a 16p system
         12 ns  remote L2 access
         30 ns  average chip-to-chip latency for a 16p system
         12 ns  network interface to CPU core
        ------
       ~200 ns  load-to-use for remotely cached (dirty) data

    Getting to this level would be an amazing feat, since the fastest
    NUMA machine for this operation right now is the SGI Origin2000,
    and it takes about 1100 ns. The HP NUMA machines are considerably
    slower than this. Actually, I would be slightly surprised if the
    system were able to do these remote interventions in 400 ns.

    The numbers are different if the system supports some degree of
    snooping, or if the directory is not stored in the RDRAM, but the
    principle is the same. They carefully provided a "load-to-use"
    latency on slide 16, and then let you make the mistake of thinking
    that the latency on slide 18 meant something equally useful.


> Also, as an aside... earlier you were wondering what
> Wildfire was going to do with 64 PCI busses. From that slide
> it appears each processor can handle 3 GByte of I/O per second.
> 32 EV7s in a box could pretty much saturate 64 PCI busses with
> I/O bandwidth needs/wants. Granted, it will require a "fork-lift"
> swap-out to stick EV7s in there. But it seems the box would
> have a good deal of I/O to ensure usefulness post-EV6. In other words,
> you're not stuck with a boat anchor. The box would support CPU upgrades
> one or two generations out.

The issues have to do with how much system cost is incurred in order to
support the scalable I/O capability that very few customers need. I agree
that "overkill" on I/O capability is appropriate for the first version of
a server line, but if it costs too much it could hurt sales in areas where
immense I/O is not a priority. I will have to wait until the real product
details emerge before drawing any conclusions..... Maybe February?

Larry Kilgallen

03.01.2000
In article <3870B653...@austin.rr.com>, John McCalpin <jmcc...@austin.rr.com> writes:

> The issues have to do with how much system cost is incurred in order to
> support the scalable I/O capability that very few customers need. I agree
> that "overkill" on I/O capability is appropriate for the first version of
> a server line, but if it costs too much it could hurt sales in areas where
> immense I/O is not a priority. I will have to wait until the real product
> details emerge before drawing any conclusions..... Maybe February?

Is there an industry-wide venue for such details in February,
or did the vendor promise something then, or is that just an
expert estimate of when they would say more?

Rob Young

03.01.2000
In article <3870B653...@austin.rr.com>, John McCalpin <jmcc...@austin.rr.com> writes:
>

[snip all the hard questions you pose. Maybe pick at
one or two of them later. My brain hurts (me of
BS in comp sci background).]


>
>
>
>> Also, as an aside... earlier you were wondering what
>> Wildfire was going to do with 64 PCI busses. From that slide
>> it appears each processor can handle 3 GByte of I/O per second.
>> 32 EV7s in a box could pretty much saturate 64 PCI busses with
>> I/O bandwidth needs/wants. Granted, it will require a "fork-lift"
>> swap-out to stick EV7s in there. But it seems the box would
>> have a good deal of I/O to ensure usefulness post-EV6. In other words,
>> you're not stuck with a boat anchor. The box would support CPU upgrades
>> one or two generations out.
>
> The issues have to do with how much system cost is incurred in order to
> support the scalable I/O capability that very few customers need. I agree
> that "overkill" on I/O capability is appropriate for the first version of
> a server line, but if it costs too much it could hurt sales in areas where
> immense I/O is not a priority. I will have to wait until the real product
> details emerge before drawing any conclusions..... Maybe February?


Maybe not February. But regarding cost... sure. Today Compaq is
shipping multi-path fibre controllers (HSG80), 100 MByte/sec sustained
bandwidth. Maybe in 2 years they ship an HSG100 with 500 MByte/sec
sustained, going over Fibre to the busses. Currently they are
UltraSCSI busses. At that point, maybe it helps to have more PCI
busses.

Also (a more likely use for 64 busses): in a Galaxy or multiple hard
partitions of Tru64, it may make sense (especially for Tru64,
lacking QIO Server in shared memory ;-) to have a direct connect to
each controller. 64 Tru64 instances, each instance using a PCI bus,
each bus with several cards out to a bunch of switches off of which
are found a dozen or so HSG100s (if there is such a thing in a couple
of years, I am positing). 64 instances of Tru64 in the near future may
be realistic given MARVEL may have 256 CPUs (or so). Also, each
instance must be able to see more than 1 PCI bus... I think I saw
that in a slide somewhere.

Regarding package cost.. maybe later, if EV7 Wildfire ships with no L3,
that helps a great deal in keeping server costs down. Much more powerful
controllers mean fewer controllers.

Rob


rub...@my-deja.com

ungelesen,
03.01.2000, 03:00:0003.01.00
an
In article <84nrda$8...@spool.cs.wisc.edu>,

"Andy Glew" <gl...@cs.wisc.edu> wrote:
> > Of course if the SMT machine can "discover" threads in previously
> > single-threaded code I'm all for it!
>
> SMT doesn't discover threads in previously single threaded code.
>
> DMT discovers threads in monolithic single threaded code.
>
>

What is a DMT?



John McCalpin

ungelesen,
03.01.2000, 03:00:0003.01.00
an

Larry Kilgallen wrote:
>
> In article <3870B653...@austin.rr.com>, John McCalpin <jmcc...@austin.rr.com> writes:
>

> > I will have to wait until the real product details
> > emerge before drawing any conclusions..... Maybe February?
>

> Is there an industry-wide venue for such details in February,
> or did the vendor promise something then, or is that just an
> expert estimate of when they would say more ?

"February" was my recollection of when the industry analysts
thought that Compaq was going to publically reveal the system.
Then again, I am running a high fever, so this could be the
result of an hallucination. :-)

Frank Ch. Eigler

ungelesen,
03.01.2000, 03:00:0003.01.00
an

Bernd Paysan <bernd....@gmx.de> writes:

> I really like to have this cleared up, so I cc'd it to Richard Stallman.

Good, perhaps he will respond too. By the way, this is an old issue:
look up the thread `Re: GPL and "free" software' in dejanews, for
example.


> [...]


> This way, I could turn any GPL'd program into a proprietary program,

> which is distributed only among people that signed that NDA. [...]

Perhaps, though even if this practice goes against an interpretation
of the intent of the GNU Manifesto, the actual license text would
supersede that.


> Currently, I can't get an IA-64 GCC without signing an NDA. If I had
> signed an NDA, and obtained GCC, item 6 would apply [...]
>> "6. [...] You may not impose any further
>> restrictions on the recipients' exercise of the rights granted herein.

Right. Since you are hypothetically under NDA, there is a
conflict, and therefore you cannot redistribute at all. This
situation is spelled out in the GPL.


- FChE

Jonathan Thornburg

ungelesen,
03.01.2000, 03:00:0003.01.00
an
In article <386F327F...@austin.rr.com>,
John McCalpin <jmcc...@austin.rr.com> writes [[lines rewrapped]]

> Consider some hypothetical processor for the 2003 era, with 4-way to
> 6-way issue at 1.5 GHz and a very aggressively designed memory system.
> On-chip caches will still be limited in size to ~2 MB (maybe 4?), so you
> have a choice of either a large off-chip DRAM cache (~60 ns latency) or
> just going straight to local memory.

For just these reasons, I would expect such a system to use at least
one level of off-chip SRAM cache in front of the DRAM. At current SRAM
speeds (which it will be hard to speed up _too_ much, just due to power
and PCB-layout problems) that would still give 10-20 clock accesses.


> This suggests that if you expect any sort of superscalar performance
> (>1 IPC sustained), then you would need reorder buffers capable of
> holding ~200 instructions. I doubt that this could be done without
> significantly impacting cycle time.

Certainly the "memory wall" is there. But I think we can fudge things
a bit more with lots of cache levels...

--
-- Jonathan Thornburg <jth...@galileo.thp.univie.ac.at>
http://www.thp.univie.ac.at/~jthorn/home.html
Universitaet Wien (Vienna, Austria) / Institut fuer Theoretische Physik
"Stocks are now at what looks like a permanent high plateau" -- noted
economist Irving Fisher, 2 weeks before the 1929 stock market crash

Larry Kilgallen

ungelesen,
03.01.2000, 03:00:0003.01.00
an
In article <3870D7EC...@austin.rr.com>, John McCalpin <jmcc...@austin.rr.com> writes:
>
>
> Larry Kilgallen wrote:
>>
>> In article <3870B653...@austin.rr.com>, John McCalpin <jmcc...@austin.rr.com> writes:
>>
>> > I will have to wait until the real product details
>> > emerge before drawing any conclusions..... Maybe February?
>>
>> Is there an industry-wide venue for such details in February,
>> or did the vendor promise something then, or is that just an
>> expert estimate of when they would say more ?
>
> "February" was my recollection of when the industry analysts
> thought that Compaq was going to publically reveal the system.
> Then again, I am running a high fever, so this could be the
> result of an hallucination. :-)

Isn't banishment from the newsgroup the penalty for believing
industry analysts ? :-)

Terje Mathisen

ungelesen,
03.01.2000, 03:00:0003.01.00
an
rub...@my-deja.com wrote:
>
> In article <84nrda$8...@spool.cs.wisc.edu>,
> "Andy Glew" <gl...@cs.wisc.edu> wrote:
> > > Of course if the SMT machine can "discover" threads in previously
> > > single-threaded code I'm all for it!
> >
> > SMT doesn't discover threads in previously single threaded code.
> >
> > DMT discovers threads in monolithic single threaded code.
> >
> >
>
> What is a DMT?

I don't _know_, but I assumed he meant Dynamic Multi Threading, i.e. the
hw does instruction stream lookahead to discover paths that are
independent of the currently executing code, and can therefore be
started in parallel.

Terje

--
- <Terje.M...@hda.hydro.com>
Using self-discipline, see http://www.eiffel.com/discipline
"almost all programming can be viewed as an exercise in caching"

rub...@my-deja.com

ungelesen,
03.01.2000, 03:00:0003.01.00
an
In article <3870FF...@hda.hydro.com>,

Terje Mathisen <Terje.M...@hda.hydro.com> wrote:
> rub...@my-deja.com wrote:
> >
> > In article <84nrda$8...@spool.cs.wisc.edu>,
> > "Andy Glew" <gl...@cs.wisc.edu> wrote:
> > > > Of course if the SMT machine can "discover" threads in previously
> > > > single-threaded code I'm all for it!
> > >
> > > SMT doesn't discover threads in previously single threaded code.
> > >
> > > DMT discovers threads in monolithic single threaded code.
> > >
> > >
> >
> > What is a DMT?
>
> I don't _know_, but I assumed he meant Dynamic Multi Threading, i.e. the
> hw does instruction stream lookahead to discover paths that are
> independent of the currently executing code, and can therefore be
> started in parallel.
>
OK, then what is the difference between this so-called DMT and OOO hw?

John McCalpin

ungelesen,
03.01.2000, 03:00:0003.01.00
an

Jonathan Thornburg wrote:
>
> In article <386F327F...@austin.rr.com>,
> John McCalpin <jmcc...@austin.rr.com> writes [[lines rewrapped]]
> > Consider some hypothetical processor for the 2003 era, with 4-way to
> > 6-way issue at 1.5 GHz and a very aggressively designed memory system.
> > On-chip caches will still be limited in size to ~2 MB (maybe 4?), so you
> > have a choice of either a large off-chip DRAM cache (~60 ns latency) or
> > just going straight to local memory.
>
> For just these reasons, I would expect such a system to use at least
> one level of off-chip SRAM cache in front of the DRAM. At current SRAM
> speeds (which it will be hard to speed up _too_ much, just due to power
> and PCB-layout problems) that would still give 10-20 clock accesses.

The first problem is that the SRAM costs too much, so you can't really
make the L3 as much larger than the L2 as you would like. With wintel
boxes running very, very well at very low prices, other vendors can
scarcely afford to start adding ~$1600 for 16 MB of fast SRAM L3.

You can build an L3 with fast DRAM (and IBM has a very nice DDR SRAM
interface for their DRAM-in-ASIC technology), but then you run into
the second problem -- where to put the tags.

If you put the tags on chip, you get the best memory latency (i.e. fastest
time to identify a last-level cache miss), but you are stuck with burning
a fair amount of your expensive on-chip cache as external tags, and when
you choose a max size for the external cache, it is *really* fixed.
On the machines I am most familiar with (the SGI Origin2000 with MIPS
R10000/R12000 cpus), there is one (half-width?) tag RAM for each
four data RAMs, so 10% of the SRAM is tag. This is really bad if you
want to use mega-sized DRAM based caches, since your on-chip space
is limited to about 2 MB for the next few years and you would really
prefer to use only a small part for tags. You can sector the cache,
of course, but this only works well for data access patterns with large
contiguous blocks, and provides effectively a rather small cache when
the data access patterns are small, independent blocks.

If you put the tags off-chip, then you can build them out of the same
cheap DRAM that you are using for the data, but you have just added an
additional DRAM latency to the critical path for L3 cache misses.

I am not happy with either alternative.
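
To put rough numbers on the on-chip-tag option, here is a quick C
sketch; the cache size, line size, associativity, address width and
state bits below are assumptions picked for illustration, not vendor
figures:

#include <stdio.h>

int main(void)
{
    /* Assumed: 64 MB DRAM L3, 128-byte lines, 4-way set associative,
       40-bit physical addresses, 3 bits of coherence state per line. */
    long long cache_bytes = 64LL << 20;
    int line_bytes = 128;
    int ways = 4;
    long long lines = cache_bytes / line_bytes;      /* 512 Ki lines */
    long long sets = lines / ways;

    int index_bits = 0;                              /* log2(sets) */
    for (long long s = sets; s > 1; s >>= 1)
        index_bits++;
    int tag_bits = 40 - index_bits - 7;              /* 7 = log2(128) */

    long long tag_bytes = lines * (tag_bits + 3) / 8;
    printf("%lld lines, %d tag bits -> about %lld KB of on-chip tags\n",
           lines, tag_bits, tag_bytes >> 10);
    return 0;
}

With those numbers you get roughly 1.2 MB of tag storage, i.e. most of
a ~2 MB on-chip budget gone before you cache a single byte of data.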


> > This suggests that if you expect any sort of superscalar performance
> > (>1 IPC sustained), then you would need reorder buffers capable of
> > holding ~200 instructions. I doubt that this could be done without
> > significantly impacting cycle time.
>
> Certainly the "memory wall" is there. But I think we can fudge things
> a bit more with lots of cache levels...

Unfortunately, it is a bit of a fudge, and each time you add this sort
of complexity you get less and less performance benefit from it.

So far, people have generally been able to ignore L1 sizes and block
for L2 when doing performance tuning. I wonder how much more difficult
this will get when everyone has 3-levels of cache, some SRAM, some DRAM,
some as small as 4 MB, some as large as 64 MB (or more?). Is there a
reasonable way to code any more?

The message was deleted

Andy Glew

ungelesen,
03.01.2000, 03:00:0003.01.00
an
> > SMT doesn't discover threads in previously single threaded code.
> >
> > DMT discovers threads in monolithic single threaded code.
>
> What is a DMT?

DMT = Dynamic Multi-Threading.

Haitham Akkary's Ph.D. thesis at Portland State University; 1998.

Published "A Dynamic Multithreading Processor", Akkary & Driscoll,
31st Annual ACM/IEEE International Symposium on Microarchitecture,
available online to IEEE Computer Society members with an e-account.
http://computer.org/proceedings/micro/8609/8609toc.htm

May be considered an OOO/SMT, with the ability to fork "threads"
out of a uniprocessor code by copying register state and predicting
memory and register dependencies, at several simple places:
+ at subroutine calls, fork a new thread after the return
+ fork after loop
In addition, Haitham and I discussed, and he planned to do
+ fork loop bodies
although other researchers may have beaten him to it here.

Extra logic added to ensure that these speculative threads execute
correctly in the presence of dependencies between earlier threads
and later threads.

All of this is done totally transparently, without compiler support
(although that would be nice).
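
To make the fork points concrete, here is a hedged C sketch; the code
and the names in it are purely illustrative, and the forks are something
the hardware does on its own - nothing in the source changes:

#include <stdio.h>

struct node { long val; struct node *left, *right; };

static long sum_tree(const struct node *t)   /* pointer-chasing work */
{
    return t ? t->val + sum_tree(t->left) + sum_tree(t->right) : 0;
}

static long work(const struct node *root, const long *a, int n)
{
    long s = 0;

    /* Fork point 1: at the call, a speculative thread can start on
       the code after the return while sum_tree() is still off
       chasing pointers. */
    long t = sum_tree(root);

    /* Fork point 2: a speculative thread can start on the code after
       the loop exit. Fork point 3 would be the loop bodies
       themselves, if the iterations look independent. */
    for (int i = 0; i < n; i++)
        s += a[i];

    return s + t;
}

int main(void)
{
    long a[4] = { 1, 2, 3, 4 };
    printf("%ld\n", work(NULL, a, 4));   /* prints 10 */
    return 0;
}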

Performance summary: 1.15x - 1.30x over an OOO; pretty much the
same improvement that an OOO gets over an In-Order machine.
I suspect Haitham's evaluation is conservative, and that there is more
upside.

===

Anecdote: I had planned to do my Ph.D. on my own version of this
(which I called IMT, Implicitly Multi-Threaded), when I learned that
Haitham had already started. I therefore tried to skip a generation.
IMHO DMT or the like is the next generation of microarchitecture
for single threaded applications.

Andy Glew

ungelesen,
03.01.2000, 03:00:0003.01.00
an
> OK, then what is the difference between this so-called DMT and OOO hw?

Present OOO machines (Intel P6, Alpha 21264, etc.) have a single instruction
sequencer. If anything interrupts that sequencing - an I-cache miss, or a
branch misprediction - then the entire machine clogs up: you can only execute
instructions already in the window. Similarly, if there are many long-latency
instructions in the instruction window, it clogs up.

Present OOO machines are characterized by a single instruction window
that is contiguous in the dynamic instruction stream.

DMT and related microarchitectures have multiple instruction sequencers.
If one instruction sequencer clogs up, hopefully the other instruction sequencers
can keep executing.

SMT has multiple instruction sequencers, but they are explicit, programmer
visible.

DMT has multiple implicit instruction sequencers. Programmer invisible.

DMT is an example of a "skip-ahead" microarchitecture, one that attempts to skip
past a performance problem to more independent code. DMT takes advantage
of "obvious" places to skip ahead, obvious pieces of code that are guaranteed
to be executed: fork after procedure return, fork after loop exit, fork loop bodies.

Other "skip-ahead" places include "control independence" points, such as places
where control flow converges; DMT does not attack those.

===

Here is my rough sketch of microarchitecture history:

Simple In-Order
blocks on any cache miss

Scoreboarded In-Order
Does not block on cache miss
but blocks on first use of cache miss data

OOO
Does not block on use of cache miss data
if there is independent code.
Blocks on pointer chasing.
Blocks on I-cache misses and branch mispredictions

DMT -- the next generation
Does not block on I-cache misses and branch mispredictions
if a speculative thread is in execution
Does not block on pointer chasing cache misses
if a speculative thread is executing independent code

MLP???? -- the next+1 generation
Parallelizes pointer chasing cache misses, with
+ skip lists
+ data traces
Unfortunately, these MLP techniques only work for low valency
data structures.

Bernd Paysan

ungelesen,
04.01.2000, 03:00:0004.01.00
an
John McCalpin wrote:
> The key issue here is that if the directory is in the RDRAM,
> then accesses to data in other processor's caches has an
> RDRAM latency in *series* with the actual intervention latency.
> This is because you must look up the cache line in the directory
> in order to know which processor to ask for the data.

If you put your directory into RDRAM, you can also use the cache
hierarchy of the processor to get it closer to the CPU if it is needed
often, well, at least up into L2 (L1 is waste). This would give remotely
cached dirty data about the same latency as local RDRAM access - for
frequent accesses (infrequent ones have to go through TLB remapping,
which is costly enough by itself).

I'm quite surprised by the low latency of remote cached lines. Usually,
when you program NUMAs, you'd rather jump through all necessary hoops to
remove remote reads from your program (in favour of remote writes, which
can be pipelined easily, and allow you to get the available bandwidth). From
an architectural point of view, this is also easier to implement, so if
the software guys already know how to do it that way, why not completely
eliminate the expensive snooping or directories for cache consistency?

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/

Bernd Paysan

ungelesen,
04.01.2000, 03:00:0004.01.00
an
Frank Ch. Eigler wrote:
> > I really like to have this cleared up, so I cc'd it to Richard Stallman.
>
> Good, perhaps he will respond too.

He did. Since he did in private mail, I'll not cite him completely, but
he wrote that the people who work on Trillian and GCC

RMS> have a non-binding "gentleman's agreement" not to release them;

and therefore don't violate the GPL. He also concluded that in-house copies
are not distributions

RMS> but transfer outside the company is distribution.

Thus it would be possible to have under-NDA inhouse development (you
can't ship, but copying to another cubicle isn't shipping), but as soon
as two companies or persons of different companies are engaged in the
same project (and need copies to proceed), the NDA degrades to a
non-binding gentleman's agreement. It may work so well in case of IA-64,
because the gentlemen in question don't understand that it isn't binding
;-).

After all, the GPL is not an end, but a means to achieve the goals of the
GNU manifesto, and if the GPL fails somehow, a new, revised version will
come out. That's the reason why GPL'd programs typically say that you
may use any later revision of the GPL if you like.

Greg Lindahl

ungelesen,
04.01.2000, 03:00:0004.01.00
an
Bernd Paysan <bernd....@gmx.de> writes:

> The entire point of SMT is that you *need* more than one thread to
> utilize the processor's resources. The inherent ILP may be almost maxed
> out on 21264, so running just one thread on a 21464 won't give you much
> more performance; the small increased parallelism might be eaten up by
> other tradeoffs, like keeping more function units in sync.

I'm amazed that in the year 2000, people still make generalizations
like this.

It's well known that different codes have different amounts of ILP
available on a given processor. For example, a single thread of a
dense, regular matrix computation might be able to max out an entire
SMT chip, while it may take 2 or 3 threads of a compiler (lots of
branches, hard to make wide without heroic efforts) to do the same.

> So you have the choice to either put a SMT core on a die, *or* to put
> several single threading cores on the same die. Performance of a single
> thread won't differ that much, so your metric is performance/mm².

Given the above, it should be clear to you that the point of SMT is
flexibility -- threads which can use the entire core will, and they'll
see a speedup over 1 thread on a chip with multiple single-threaded
cores.

-- g


TTK Ciar

ungelesen,
04.01.2000, 03:00:0004.01.00
an
In article <3870FF...@hda.hydro.com>,
Terje Mathisen <Terje.M...@hda.hydro.com> wrote:
>> >
>> > SMT doesn't discover threads in previously single threaded code.
>> >
>> > DMT discovers threads in monolithic single threaded code.
>>
>> What is a DMT?
>
>I don't _know_, but I assumed he meant Dynamic Multi Threading, i.e. the
>hw does instruction stream lookahead to discover paths that are
>independent of the currently executing code, and can therefore be
>started in parallel.

I apologize profusely; in my earlier post I was thinking that SMT
referred to DMT (the technology IBM has been aggressively developing
to such great effect).

-- TTK


Zalman Stern

ungelesen,
04.01.2000, 03:00:0004.01.00
an
Bernd Paysan <bernd....@gmx.de> wrote:
: Yeah, especially with the resulting "the IA-64 executable is too big for
: one CD, let's compile to x86 instead".

a) Hard to imagine even doubling code size causing this. (Stuff that's CD
sized typically has lots of data relative to the amount of compiled code.)
b) IA64 code should compress beautifully given the information redundancy
therein. Compression on distribution media is an existing technology. (And
I wouldn't be surprised to learn compression vendors are already working on
special purpose compressors for IA64 code.)
c) The initial IA64 apps will be targeted at customers who think having two
CDs rather than one demonstrates that the product is "serious" :-)

-Z-

Zalman Stern

ungelesen,
04.01.2000, 03:00:0004.01.00
an
Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
: It reminds me more of an argument made in a paper (IIRC Dileep
: Bhandarkar was one of the authors) around 1996 that compared the
: Pentium Pro to some Alpha on an OLTP workload and argued that the
: Pentium Pro had an advantage due to better code density.

Were they using code sizes measured from real programs or systems? If so,
they're already much better than the stuff we're talking about here.

-Z-

Jan Vorbrueggen

ungelesen,
04.01.2000, 03:00:0004.01.00
an
John McCalpin <jmcc...@austin.rr.com> writes:

> The first problem is that the SRAM costs too much, so you can't really
> make the L3 as much larger than the L2 as you would like. With wintel
> boxes running very, very well at very low prices, other vendors can
> scarcely afford to start adding ~$1600 for 16 MB of fast SRAM L3.

If it saves me the money and especially the (programming, administrative,
etc) hassle of buying a second computer? - I'd say then a fast SRAM L3
becomes attractive.

> So far, people have generally been able to ignore L1 sizes and block
> for L2 when doing performance tuning. I wonder how much more difficult
> this will get when everyone has 3-levels of cache, some SRAM, some DRAM,
> some as small as 4 MB, some as large as 64 MB (or more?). Is there a
> reasonable way to code any more?

I've heard there's some linear algebra code that systematically tries out
all reasonable blocking variants, running for a week or so, and then has
learnt the proper blocking for that particular machine. All you have to do
is to extend that, possibly employing some "modern" optimization techniques
such as a genetic algorithm on a little network of machines (0.5 8-)).
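
As a toy illustration of that "try them all, keep the fastest" idea,
something like the following C sketch would do; the kernel, the
candidate block sizes and the timing method are all placeholders:

#include <stdio.h>
#include <time.h>

#define N 512                       /* placeholder problem size */
static double a[N][N], b[N][N], c[N][N];

static void mm_blocked(int bs)      /* plain blocked matrix multiply */
{
    for (int ii = 0; ii < N; ii += bs)
        for (int jj = 0; jj < N; jj += bs)
            for (int kk = 0; kk < N; kk += bs)
                for (int i = ii; i < ii + bs; i++)
                    for (int j = jj; j < jj + bs; j++) {
                        double s = c[i][j];
                        for (int k = kk; k < kk + bs; k++)
                            s += a[i][k] * b[k][j];
                        c[i][j] = s;
                    }
}

int main(void)
{
    int candidates[] = { 16, 32, 64, 128 };  /* must divide N evenly */
    int best = 0;
    double best_t = 1e30;

    /* Time each variant and remember the fastest; the results in c[][]
       are never used, only the elapsed time matters here. */
    for (int v = 0; v < 4; v++) {
        clock_t t0 = clock();
        mm_blocked(candidates[v]);
        double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
        if (t < best_t) { best_t = t; best = candidates[v]; }
    }
    printf("best block size: %d (%.3f s)\n", best, best_t);
    return 0;
}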

Jan

Terje Mathisen

ungelesen,
04.01.2000, 03:00:0004.01.00
an
rub...@my-deja.com wrote:
>
> In article <3870FF...@hda.hydro.com>,
> Terje Mathisen <Terje.M...@hda.hydro.com> wrote:
> > rub...@my-deja.com wrote:
> > > What is a DMT?
> >
> > I don't _know_, but I assumed he meant Dynamic Multi Threading, i.e.
> the
> > hw does instruction stream lookahead to discover paths that are
> > independent of the currently executing code, and can therefore be
> > started in parallel.
> >
> OK, then what is the difference between this so-called DMT and OOO hw?

Just that DMT would do a _far_ wider scan of upcoming instructions,
afaik.

It would probably need quite a bit of rollback (speculation) capability
as well.

Andy, where are you? Please tell us exactly what you're looking for in a
DMT cpu!

Jan Vorbrueggen

ungelesen,
04.01.2000, 03:00:0004.01.00
an
lin...@pbm.com (Greg Lindahl) writes:

> Given the above, it should be clear to you that the point of SMT is
> flexibility -- threads which can use the entire core will, and they'll
> see a speedup over 1 thread on a chip with multiple single-threaded cores.

Quite. That's what I meant in my original post: an SMT presents a single
queue to multiple consumers (here, threads) of resources (here, functional
units). And queueing theory tells us that that is the best configuration.

Jan

Larry Kilgallen

ungelesen,
04.01.2000, 03:00:0004.01.00
an
In article <84rqb9$k...@spool.cs.wisc.edu>, "Andy Glew" <gl...@cs.wisc.edu> writes:

> DMT has multiple implicit instruction sequencers. Programmer invisible.
>
> DMT is an example of a "skip-ahead" microarchitecture, one that attempts to skip
> past a performance problem to more independent code. DMT takes advantage
> of "obvious" places to skip ahead, obvious pieces of code that are guaranteed
> to be executed: fork after procedure return, fork after loop exit, fork loop bodies.

I think of the word "fork" when applied to computers as being some
Unix operating system concept, whereas I understand jump, branch
and others as referring to machine instructions.

I don't see how something can be "Programmer invisible" and be based
on constructs of a single operating system, so I presume you must
mean something else by the word "fork". Please elaborate.

Larry Kilgallen

Andy Glew

ungelesen,
04.01.2000, 03:00:0004.01.00
an
> I apologize profusely; in my earlier post I was thinking that SMT
> referred to DMT (the technology IBM has been aggressively developing
> to such great effect).

I am not aware that IBM has done any work in this area.

The multithreaded machine that IBM has published on is
an explicitly parallel machine - I would call it not even SMT,
but Interleaved Multi-Threading.

Andy Glew

ungelesen,
04.01.2000, 03:00:0004.01.00
an
Larry Kilgallen:
> "Andy Glew":

>
> > DMT has multiple implicit instruction sequencers. Programmer invisible.
> >
> > DMT is an example of a "skip-ahead" microarchitecture, one that attempts to skip
> > past a performance problem to more independent code. DMT takes advantage
> > of "obvious" places to skip ahead, obvious pieces of code that are guaranteed
> > to be executed: fork after procedure return, fork after loop exit, fork loop bodies.
>
> I think of the word "fork" when applied to computers as being some
> Unix operating system concept, whereas I understand jump, branch
> and others as referring to machine instructions.
>
> I don't see how something can be "Programmer invisible" and be based
> on constructs of a single operating system, so I presume you must
> mean something else by the word "fork". Please elaborate

By "fork" I mean "create a new thread".

The "fork/join" model of parallelism, where fork and join are hardware
concepts, is certainly quite old, and may even predate UNIX.

However, there are some similarities: "forking" a new hardware thread
on a DMT can be considered as making a copy-on-write copy of the parent
thread's register and memory state, modulo predictions of data dependency,
and with the difference that updates to an older thread will be propagated
to speculative threads.

Explicitly parallel, non-DMT, machines have also done this. Perhaps the most
explicit synthesis of hardware fork/join and UNIX fork was the Myrias machine
- a 68000 based Canadian supercomputer wannabe of the 1980s. Myrias's
compiler created parallel threads, e.g. to parallelize loops, using COW on
the virtual memory system; when it came time to join, non-overlapping
modified pages would be merged back together. There may be some Myrias
people here who can tell us what happened if there were overlaps - I conjecture
that they may have declared a misspeculation, and discarded the speculative thread.
I conjecture that Myrias's mechanism may have been totally based on UNIX COW
fork, with a corresponding join primitive added to the OS.

Anyway - I hope that you can now see how such forking may be programmer invisible.
If there is a special instruction, it is explicit.
But, in DMT, the forks are typically implied by existing constructs such as
procedure CALL/RETURN.

----

I'm afraid that I can't resist a politically incorrect pun:

I have a half-written paper somewhere that describes alternative implementations
of forking, typically with variations in how state is extracted from the parent thread.

Working title: "A Whole Family of Mother Forkers".

(The paper and title are real, by the way, although I admit that I am more motivated
by the pun than by the content.)

John McCalpin

ungelesen,
04.01.2000, 03:00:0004.01.00
an

Bernd Paysan wrote:
>
> John McCalpin wrote:
> > The key issue here is that if the directory is in the RDRAM,
> > then accesses to data in other processor's caches has an
> > RDRAM latency in *series* with the actual intervention latency.
> > This is because you must look up the cache line in the directory
> > in order to know which processor to ask for the data.
>
> If you put your directory into RDRAM, you can also use the cache
> hierarchy of the processor to get it closer to the CPU if it is needed
> often, well, at least up into L2 (L1 is waste).

The directory is typically not visible to the processor -- it is accessed
only by the memory controller. You can build a directory cache into
the memory controller. This does not help local latency, but does provide
a slight reduction in the latency of remote dirty interventions.

> I'm quite surprised by the low latency of remote cached lines. Usually,
> when you program NUMAs, you'd rather jump through all necessary hoops to
> remove remote reads from your program (in favour of remote writes, which
> can be pipelined easily, and allow you to get the available bandwidth).

For uncached stores this may be true, but you generally do not want to
do cached stores to remote memory locations. Most systems still do an
allocate on the store, so the inter-cpu network traffic is doubled for
stores compared to reads. Also, the store cannot graduate until it has
been given exclusive access to the line, which requires a round trip to
the home directory. Additional time may be required for invalidates,
depending on how relaxed the consistency protocol happens to be.


> From
> an architectural point of view, this is also easier to implement, so if
> the software guys already know how to do it that way, why not completely
> eliminate the expensive snooping or directories for cache consistency?

If you want a machine with fully user-controlled cache coherency, then
of course it could be made to go very fast. There are no good current
examples of such a machine, though the T3E is close in some ways. IBM
at one point made a non-cache-coherent parallel machine out of 4 of the
original POWER/RIOS RS/6000 boxes.

Of course, you would have to write all your software for such a machine
from scratch, especially including the O/S. Debugging can be a nightmare
as well, which is one of the reasons that IBM dropped its offering.

John McCalpin

ungelesen,
04.01.2000, 03:00:0004.01.00
an

Jan Vorbrueggen wrote:
>
> John McCalpin <jmcc...@austin.rr.com> writes:
>
> > The first problem is that the SRAM costs too much, so you can't really
> > make the L3 as much larger than the L2 as you would like. With wintel
> > boxes running very, very well at very low prices, other vendors can
> > scarcely afford to start adding ~$1600 for 16 MB of fast SRAM L3.
>
> If it saves me the money and especially the (programming, administrative,
> etc) hassle of buying a second computer? - I'd say then a fast SRAM L3
> becomes attractive.

It seems to me that all the vendors have decided that there are relatively
few customers in your position. Several vendors (IBM, Compaq, SGI, Sun)
offer 8 MB L2 caches on at least some of their servers, but I have not
heard a big demand for 16 MB caches. Given the increasing prevalence of
multiprocessor-capable systems, it makes a lot more sense to add another
$800 cpu to the system than to get a minor performance boost from an
extra $800 in L2 cache.

> > So far, people have generally been able to ignore L1 sizes and block
> > for L2 when doing performance tuning. I wonder how much more difficult
> > this will get when everyone has 3-levels of cache, some SRAM, some DRAM,
> > some as small as 4 MB, some as large as 64 MB (or more?). Is there a
> > reasonable way to code any more?
>
> I've heard there's some linear algebra code that systematically tries out
> all reasonable blocking variants, running for a week or so, and then has
> learnt the proper blocking for that particular machine. All you have to do
> is to extend that, possibly employing some "modern" optimization techniques
> such as a genetic algorithm on a little network of machines (0.5 8-)).

"Just a minor matter of programming"?

The trouble is that it has taken Jack Dongarra and his group decades to
get to this point for simple algorithms in dense linear algebra.
Unfortunately, dense linear algebra only accounts for a small fraction
of the identifiable computational kernels in user codes. The more
important kernels are typically sparse matrix solvers, which are much
more difficult to code, to tune, and to parallelize.

I have done some work in "brute force" attempts at optimization of this
sort, and it is not particularly satisfying. It is especially hard to
be sure that your source code generator is outputting code that is of
the "style" that the compiler knows how to best optimize. For example,
try to find a single coding of the following that compiles to excellent
code on a large range of systems (my apologies for the Fortran):

      double precision function dot (a,b,n)
      double precision a(n),b(n)
      integer n
      double precision sum
      sum = 0.0
      do i=1,n
         sum = sum + a(i)*b(i)
      end do
      dot = sum
      return
      end

I suspect that the coding problem is even worse in C....

Chris Gray

ungelesen,
04.01.2000, 03:00:0004.01.00
an
Andy Glew wrote:

> Explicitly parallel, non-DMT, machines have also done this. Perhaps the most
> explicit synthesis of hardware fork/join and UNIX fork was the Myrias machine
> - a 68000 based Canadian supercomputer wannabe of the 1980s. Myrias's
> compiler created parallel threads, e.g. to parallelize loops, using COW on
> the virtual memory system; when it came time to join, non-overlapping
> modified pages would be merged back together. There may be some Myrias
> people here who can tell us what happened if there were overlaps - I conjecture
> that they may have declared a misspeculation, and discarded the speculative thread.
> I conjecture that Myrias's mechanism may have been totally based on UNIX COW
> fork, with a corresponding join primitive added to the OS.

I guess that's my call. The hardware line had 3 generations in it, based
on MC68000, MC68020 and MC68040. Then we gave up trying to sell to U.S.
institutions from Canada, and concentrated on the software.

You're pretty close in the first part, Andy, but somewhat off in the latter
half. Our parallelism was explicit, rather than implicit, regardless of what
our marketing folks may have advertised. On a "join" (end of pardo in our
speak), values merged back into the parent task were:

- if no child writes, value is old parent value
- if one child writes, value is that child's value
- if multiple children write same value, value is that value
- if multiple children write different values, value is undefined

Pretty simple, actually, and works fairly well. Our debugging tools allowed
you to detect the latter case on a reference to such an undefined value. On
our own hardware, we had our own OS, so UNIX didn't come into it. Later, when
we were software-only, we did use COW semantics on top of UNIX, via lots of
'mmap()', 'mprotect()', etc. calls.
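
Chris's four merge rules are simple enough to state as code. Here is a
toy C model of merging a single word; the names and the
compare-against-parent trick are mine, not Myrias's:

#include <stdio.h>

typedef struct {
    long value;
    int undefined;    /* set when children disagree */
} merged;

static merged merge_word(long parent, const long *child, int nchildren)
{
    merged m = { parent, 0 };
    int seen_write = 0;

    for (int c = 0; c < nchildren; c++) {
        if (child[c] == parent)
            continue;                  /* rule 1: no write, keep parent */
        if (!seen_write) {
            m.value = child[c];        /* rule 2: one writer wins */
            seen_write = 1;
        } else if (child[c] != m.value) {
            m.undefined = 1;           /* rule 4: conflicting writes */
        }                              /* rule 3: same value, no conflict */
    }
    return m;
}

int main(void)
{
    long same[] = { 5, 7, 7 };   /* two children wrote the same value */
    long diff[] = { 5, 7, 9 };   /* two children wrote different values */
    merged a = merge_word(5, same, 3), b = merge_word(5, diff, 3);
    printf("same: %ld (undef=%d), diff: %ld (undef=%d)\n",
           a.value, a.undefined, b.value, b.undefined);
    return 0;
}

(Comparing against the parent's value to detect a write is only an
approximation - a child that writes back the old value is treated as
not having written - but that is also how page-diff merging tends to
work in practice.)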

--
[Don't design inefficiency in - it'll come in the implementation]
Chris Gray c...@myrias.com
Myrias Software Corp. (780) 435-1000 http://www.myrias.com

Jan Vorbrueggen

ungelesen,
04.01.2000, 03:00:0004.01.00
an
John McCalpin <jmcc...@austin.rr.com> writes:

> It seems to me that all the vendors have decided that there are relatively
> few customers in your position. Several vendors (IBM, Compaq, SGI, Sun)
> offer 8 MB L2 caches on at least some of their servers, but I have not
> heard a big demand for 16 MB caches. Given the increasing prevalence of
> multiprocessor-capable systems, it makes a lot more sense to add another
> $800 cpu to the system than to get a minor performance boost from an
> extra $800 in L2 cache.

The point is, the extra processor, especially in a "cheap" SMP system, might
not buy me much, or actually be harmful. In that case, having the option of
spending my money where it hurts (cache size/bandwidth) might be the right
decision.

> "Just a minor matter of programming"?

Just so 8-).

> The trouble is that it has taken Jack Dongarra and his group decades to
> get to this point for simple algorithms in dense linear algebra.
> Unfortunately, dense linear algebra only accounts for a small fraction
> of the identifiable computational kernels in user codes. The more
> important kernels are typically sparse matrix solvers, which are much
> more difficult to code, to tune, and to parallelize.

As an aside, I know of at least one other example: FFTW, which will even
tune itself at run time if so desired.

I see two aspects to this matter. One is creating the instrumentation, test
harness and validation so that given one algorithm, one can explore its
parameter space and select (this probably by hand) a "good" setting for
a particular machine. The other is to develop a new way of expressing the
same computation (a new algorithm, in a way) and establishing the proper
parameter dependencies. That of course remains a human's job, but how many
different ways to slay this horse are there? And in a way, only by performing
these kinds of experiments - sorely lacking in many areas of computer
"science" - can one establish the value of a certain approach.

> For example, try to find a single coding of the following that compiles to
> excellent code on a large range of systems (my apologies for the Fortran):
>
> double precision function dot (a,b,n)

I'd say use the F90/F95 intrinsic function DOT_PRODUCT (0.5 8-)).

Jan

Terje Mathisen

ungelesen,
04.01.2000, 03:00:0004.01.00
an
Jan Vorbrueggen wrote:
>
> John McCalpin <jmcc...@austin.rr.com> writes:
> > For example, try to find a single coding of the following that compiles to
> > excellent code on a large range of systems (my apologies for the Fortran):
> >
> > double precision function dot (a,b,n)
>
> I'd say use the F90/F95 intrinsic function DOT_PRODUCT (0.5 8-)).

I suspect John regrets using a dot product as his example, since this
particular problem is hard to not optimize:

With just two stride-1 input streams, and no output at all except the
return value, this function should be memory bandwidth limited on even
quite naive implementations.

The only optimizations that seem worthwhile would be having more than
one accumulator, and probably adding some prefetch hints as well:

while ((n -= 4) >= 0) {
    s0 += a[i] * b[i];
    s1 += a[i+1] * b[i+1];
    s2 += a[i+2] * b[i+2];
    s3 += a[i+3] * b[i+3];
    prefetch(a[i+8]);
    prefetch(b[i+8]);
    i += 4;
}
n += 4;
s0 += s1;
s2 += s3;
s0 += s2;
while (n-- > 0) {
    s0 += a[i] * b[i];
    i++;
}

Froilan P Montenegro

ungelesen,
04.01.2000, 03:00:0004.01.00
an
Re: IA64 and Alpha architec.. by "Andy Glew"@cs.wisc.edu
> May be considered an OOO/SMT, with the ability to fork "threads"
> out of a uniprocessor code by copying register state and predicting
> memory and register dependencies, at several simple places:
> + at subroutine calls, fork a new thread after the return
> + fork after loop
> In addition, Haitham and I discussed, and he planned to do
> + fork loop bodies
> although other researchers may have beaten him to it here.
>
> Extra logic added to ensure that these speculative threads execute
> correctly in the presence of dependencies between earlier threads
> and later threads.
[...]

Would you consider Sun's MAJC "Space Time Computing" an example of
this? While execution units aren't totally shared among threads since
each is assigned to a processing unit, future method calls can be
speculatively executed using a speculative heap, and a rollback done
when the speculative thread has a dependence on a value written in the
head thread. Perhaps not a full implementation of DMT, but the core
idea appears to be the same.

- Froilan


James Van Buskirk

ungelesen,
04.01.2000, 03:00:0004.01.00
an

Terje Mathisen wrote in message <38722F...@hda.hydro.com>...

>Jan Vorbrueggen wrote:

>> I'd say use the F90/F95 intrinsic function DOT_PRODUCT (0.5 8-)).

>With just two stride-1 input streams, and no output at all except the
>return value, this function should be memory bandwidth limited on even
>quite naive implementations.


Not true! Fortran 90 has an array passing syntax called "assumed-
shape arrays" which creates pitfalls that many early F90 compilers
fell into. LF90 2.0 implemented all of its F90 intrinsics as library
calls rather than inlining them as e.g. DVF 5.0 did. When assumed
shape arrays were passed to a subroutine and then DOT_PRODUCTs were
performed on them, the compiler had to make a copy of the
vectors in question so that the (F77-style) library functions would
work, because the DOT_PRODUCT functions assumed unit stride and
assumed-shape arrays can have any stride. And the worst part of it
was that in the copy code array addresses were calculated with integer
multiplies, so you got two integer multiplies for every FP multiply,
and this was in fact about 4X slower than memory limited code on my
90 MHz Pentium. DVF 5.0 inlined everything as mentioned above, but
at high enough optimization settings it would also choose to do
array address calculations for assumed shape arrays via integer
multiplies which proved to be expensive on my 21164. If an
implementation of F90 is naive enough and only tested in an
optimization sense on F77 code as was common in the early days of
F90 (just a couple of years ago) the programmer was exposed to all
kinds of horrors when actually counting on high performance from
F90 syntax.

Greg Lindahl

ungelesen,
04.01.2000, 03:00:0004.01.00
an
Jan Vorbrueggen <j...@mailhost.neuroinformatik.ruhr-uni-bochum.de> writes:

> I've heard there's some linear algebra code that systematically tries out
> all reasonable blocking variants, running for a week or so, and then has
> learnt the proper blocking for that particular machine. All you have to do
> is to extend that, possibly employing some "modern" optimization techniques
> such as a genetic algorithm on a little network of machines (0.5 8-)).

You're probably referring to the ATLAS project, which attacks a lot
more than just blocking. It's extremely limited in the size of the
code it can successfully optimize -- while they are expanding it to
look at more sequences, in its original incarnation it only does
matrix-matrix multiply, and all the rest of the routines are based on
that one.

It is not reasonable to use this technique on any large body of code.

-- g

John McCalpin

ungelesen,
04.01.2000, 03:00:0004.01.00
an

Terje Mathisen wrote:
>
> Jan Vorbrueggen wrote:
> >
> > John McCalpin <jmcc...@austin.rr.com> writes:
> > > For example, try to find a single coding of the following that compiles to
> > > excellent code on a large range of systems (my apologies for the Fortran):
> > >
> > > double precision function dot (a,b,n)
> >

> > I'd say use the F90/F95 intrinsic function DOT_PRODUCT (0.5 8-)).
>

> I suspect John regrets using a dot product as his example, since this
> particular problem is hard to not optimize:

No, I don't regret using this as an example. The kernel is actually
part of the "STREAM2" benchmark that I am developing, and I am not
at all happy with the way that it performs on a variety of systems.

> With just two stride-1 input streams, and no output at all except the
> return value, this function should be memory bandwidth limited on even
> quite naive implementations.

The code "should be" memory bandwidth limited, but for short vectors
it is rather easy for a compiler to do a poor job.

For "STREAM2", I am interested in L1-contained data as much as I am
interested in RAM-contained data.

Cache-contained dot-products are a reasonably interesting kernel
to look at when studying the performance of some sorts of sparse
matrix solvers.

> The only optimizations that seem worthwhile would be having more than
> one accumulator, and probably adding some prefetch hints as well:

Yes, more than one accumulator is good.

One might suspect that you need N*M accumulators, where N is the
number of floating-point functional units and M is their latency
for dependent operations. This product can be considerably larger
than 4 for some implementations.

Prefetching is also useful, but is non-trivial to optimize, since
the same code is used for vector lengths from ~30 to ~30M.

Bernd Paysan

ungelesen,
04.01.2000, 03:00:0004.01.00
an
John McCalpin wrote:
> For uncached stores this may be true, but you generally do not want to
> do cached stores to remote memory locations. Most systems still do an
> allocate on the store, so the inter-cpu network traffic is doubled for
> stores compared to reads. Also, the store cannot graduate until it has
> been given exclusive access to the line, which requires a round trip to
> the home directory. Additional time may be required for invalidates,
> depending on how relaxed the consistency protocol happens to be.

You typically try to map these regions write combined, otherwise
uncached. The software ensures that you are the single writer (and
there's only a single reader on the other side). I agree that a cache
coherency protocol just makes it slower and more difficult to use. And
yes, you can't use your old OS and your old FORTRAN programs on such a
box. It's not incremental development, and therefore it's quite likely
that this approach is doomed.

Andy Glew

ungelesen,
04.01.2000, 03:00:0004.01.00
an
> Would you consider Sun's MAJC "Space Time Computing" an example of
> this? [DMT] While execution units aren't totally shared among threads since

> each is assigned to a processing unit, future method calls can be
> speculatively executed using a speculative heap, and a rollback done
> when the speculative thread has a dependence on a value written in the
> head thread. Perhaps not a full implementation of DMT, but the core
> idea appears to be the same.
>
> - Froilan

MAJC is certainly an example of a multiple sequencer architecture,
used to do parallel speculation on what appears to the programmer
to be a single thread of execution, with hardware support for speculation.

However, MAJC is a multiple sequencer *architecture*, not microarchitecture
- it's explicitly parallel, at least to the compiler.
Or, rather, the decision of when to create threads is explicit,
but there is hardware support for detecting dependencies between threads.

In DMT both are explicit - although it's a spectrum, since DMT can certainly
benefit from compiler hints.

MAJC is like Sohi, Franklin, et al's Multiscalar - compiler visible explicit forking
+ implicit dependence checking combining to make it appear to be architecturally
single threaded to the programmer.

Certainly, the key idea is multiple sequencers executing a single threaded
program, with dependency checking hardware.


David Mosberger-Tang

ungelesen,
05.01.2000, 03:00:0005.01.00
an
>>>>> On Tue, 04 Jan 2000 01:41:00 +0100, Bernd Paysan <bernd....@gmx.de> said:

Bernd> Thus it would be possible to have under-NDA inhouse
Bernd> development (you can't ship, but copying to another cubicle
Bernd> isn't shipping), but as soon as two companies or persons of
Bernd> different companies are engaged in the same project (and need
Bernd> copies to proceed), the NDA degrades to a non-binding
Bernd> gentleman's agreement. It may work so well in case of IA-64,
Bernd> because the gentlemen in question don't understand that it
Bernd> isn't binding ;-).

Nope.

--david

Terje Mathisen

ungelesen,
05.01.2000, 03:00:0005.01.00
an
John McCalpin wrote:
>
> Terje Mathisen wrote:
> >
> > Jan Vorbrueggen wrote:
> > >
> > > John McCalpin <jmcc...@austin.rr.com> writes:
> > > > For example, try to find a single coding of the following that compiles to
> > > > excellent code on a large range of systems (my apologies for the Fortran):
> > > >
> > > > double precision function dot (a,b,n)
> > >
> > > I'd say use the F90/F95 intrinsic function DOT_PRODUCT (0.5 8-)).
> >
> > I suspect John regrets using a dot product as his example, since this
> > particular problem is hard to not optimize:
>
> No, I don't regret using this as an example. The kernel is actually
> part of the "STREAM2" benchmark that I am developing, and I am not
> at all happy with the way that it performs on a variety of systems.
>
> > With just two stride-1 input streams, and no output at all except the
> > return value, this function should be memory bandwidth limited on even
> > quite naive implementations.
>
> The code "should be" memory bandwidth limited, but for short vectors
> it is rather easy for a compiler to do a poor job.
>
> For "STREAM2", I am interested in L1-contained data as much as I am
> interested in RAM-contained data.

OK, with L1 resident operands, code scheduling becomes more interesting,
but I still believe my post covered that as well:

> > The only optimizations that seem worthwhile would be having more than
> > one accumulator, and probably adding some prefetch hints as well:
>
> Yes, more than one accumulator is good.
>
> One might suspect that you need N*M accumulators, where N is the
> number of floating-point functional units and M is their latency
> for dependent operations. This product can be considerably larger
> than 4 for some implementations.

Yes, but it should still be easy for a compiler to set up:

With L1-resident data, but not all in registers already, the same dot
product code will be limited by the number of load/store units, and
probably not by the number of fp functional units, unless you have more
than twice as many load units as FP units:

The exception is for systems without FMAC, where each of those have to
be synthesized by an FMUL + dependent FADD (i.e. x86 double prec).

On a Pentium I would need 3 (or 4) accumulators and the same number of
temporary regs to keep the FMUL products, but the PPro+ OOO core would
probably run (nearly?) as fast with a naive implementation:

next:
fld a[esi]
fmul b[esi]
add esi,8
faddp st(1),st
dec ecx
jnz next

> Prefetching is also useful, but is non-trivial to optimize, since
> the same code is used for vector lengths from ~30 to ~30M.

So if this is important, why not have 2 or 3 different versions,
selecting the proper one at runtime based on the array length?

A somewhat simpler idea, without any extra branching, would be to make
the prefetch offset distance dependent upon the size (and therefore
presumed cache level residency) of the arrays: (I assume CMOV to
eliminate the if's)

prefetch_offset = 32; // 1 cache line
if (size > 1024) prefetch_offset = 64;
if (size > 64*1024) prefetch_offset = 128;

Put those offset values in a table and initialize it during program
startup to get parameters that are tuned for the current environment.
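
A minimal C sketch of that table idea, with made-up thresholds and
distances; a calibration loop at startup could overwrite the table with
measured values:

#include <stdio.h>

static const struct { size_t limit; int dist; } pf_tab[] = {
    { 1024,       32  },   /* fits in L1: prefetch 1 line ahead */
    { 64 * 1024,  64  },   /* fits in L2: 2 lines ahead */
    { (size_t)-1, 128 },   /* main memory: 4 lines ahead */
};

static int prefetch_dist(size_t bytes)
{
    int i = 0;
    while (bytes > pf_tab[i].limit)   /* first row that covers the size */
        i++;
    return pf_tab[i].dist;
}

int main(void)
{
    printf("%d %d %d\n", prefetch_dist(512),
           prefetch_dist(32 * 1024), prefetch_dist(1 << 20));
    return 0;
}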

Terje

Jan Vorbrueggen

ungelesen,
05.01.2000, 03:00:0005.01.00
an
John McCalpin <jmcc...@austin.rr.com> writes:

> Prefetching is also useful, but is non-trivial to optimize, since
> the same code is used for vector lengths from ~30 to ~30M.

This also results in another problem, especially for short vectors: you can
easily have the case that N=8 is faster than N=7, if the code generated
happens to unroll by 4 (as an example). I found this behaviour in a commercial
BLAS1 library for the T800. My code was not only (slightly) better for N=8,
but also had strictly monotonic timing behaviour for growing N (at the
expense of additional code, obviously).

Jan

Terje Mathisen

ungelesen,
05.01.2000, 03:00:0005.01.00
an
James Van Buskirk wrote:
>
> Terje Mathisen wrote in message <38722F...@hda.hydro.com>...
>
> >Jan Vorbrueggen wrote:
>
> >> I'd say use the F90/F95 intrinsic function DOT_PRODUCT (0.5 8-)).
>
> >With just two stride-1 input streams, and no output at all except the
> >return value, this function should be memory bandwidth limited on even
> >quite naive implementations.
>
> Not true! Fortran 90 has an array passing syntax called "assumed-
> shape arrays" which creates pitfalls that many early F90 compilers
[snip]

> multiplies which proved to be expensive on my 21164. If an
> implementation of F90 is naive enough and only tested in an
> optimization sense on F77 code as was common in the early days of
> F90 (just a couple of years ago) the programmer was exposed to all
> kinds of horrors when actually counting on high performance from
> F90 syntax.

Oops indeed.

Even with non-unit stride, fairly naive code + prefetch hints would still
work very well, at a much smaller cost in code complexity.

The library dot_product() function could easily be extended with a
stride parameter; this would affect optimal prefetching a bit, but
unless the stride becomes comparable to the L1 cache set size, it should
all 'just work'.
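
A minimal C sketch of such a strided version (hypothetical name and all;
two accumulators, prefetch hints left out for brevity):

#include <stdio.h>

static double ddot_strided(const double *a, long inca,
                           const double *b, long incb, long n)
{
    double s0 = 0.0, s1 = 0.0;
    long i = 0, ia = 0, ib = 0;

    for (; i + 1 < n; i += 2) {     /* two partial sums hide FP latency */
        s0 += a[ia] * b[ib];
        s1 += a[ia + inca] * b[ib + incb];
        ia += 2 * inca;
        ib += 2 * incb;
    }
    if (i < n)                      /* odd leftover element */
        s0 += a[ia] * b[ib];
    return s0 + s1;
}

int main(void)
{
    double a[6] = { 1, 9, 2, 9, 3, 9 };   /* stride 2 picks 1, 2, 3 */
    double b[3] = { 4, 5, 6 };
    printf("%g\n", ddot_strided(a, 2, b, 1, 3));   /* 1*4+2*5+3*6 = 32 */
    return 0;
}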

Replacing (multiple) integer muls with additions instead is a _very_ old
compiler optimization.

Chris Dickson

ungelesen,
05.01.2000, 03:00:0005.01.00
an
John McCalpin wrote:
>
> Bernd Paysan wrote:
> >
> > John McCalpin wrote:
> > > The key issue here is that if the directory is in the RDRAM,
> > > then accesses to data in other processor's caches has an
> > > RDRAM latency in *series* with the actual intervention latency.
> > > This is because you must look up the cache line in the directory
> > > in order to know which processor to ask for the data.
> >
> > If you put your directory into RDRAM, you can also use the cache
> > hierarchy of the processor to get it closer to the CPU if it is needed
> > often, well, at least up into L2 (L1 is waste).
>
> The directory is typically not visible to the processor --

Quite. Quite apart from requiring the processor to have platform-specific
features, a cacheable directory would have to conform to the coherence
protocol - resulting in infinite directory lookup recursion on any
cacheable reference.

> -- it is accessed
> only by the memory controller. You can build a directory cache into
> the memory controller. This does not help local latency, but does
> provide a slight reduction in the latency of remote dirty interventions.
>

You could also have a directory cache with no actual directory backing it.
Has anyone published any studies on something like that?

------------------------------------------------------------------------
Chris Dickson MTS Performance modeling and simulation
c...@sgi.com (+1) 651 683 5375

NOTE: as of Dec 23 1999 - my old email address "c...@cray.com" will no
longer be working. Cray got "demerged", but I didn't.
------------------------------------------------------------------------

Doug Siebert

ungelesen,
05.01.2000, 03:00:0005.01.00
an
John McCalpin <jmcc...@austin.rr.com> writes:

>Jan Vorbrueggen wrote:
>>
>> If it saves me the money and especially the (programming, administrative,
>> etc) hassle of buying a second computer? - I'd say then a fast SRAM L3
>> becomes attractive.

>It seems to me that all the vendors have decided that there are relatively
>few customers in your position. Several vendors (IBM, Compaq, SGI, Sun)
>offer 8 MB L2 caches on at least some of their servers, but I have not
>heard a big demand for 16 MB caches. Given the increasing prevalence of
>multiprocessor-capable systems, it makes a lot more sense to add another
>$800 cpu to the system than to get a minor performance boost from an
>extra $800 in L2 cache.


Not to be too vendor-negative here, but with SPEC2000 out now, and the
problem sizes bigger, if it turns out that the working set of enough
things can fit into a cache of size X versus X/2 or even X/4, that alone
is IMHO sufficient motivation for vendors to produce a variant of their
machines with a cache of size X, even if it costs quite a bit more.
Their salespeople will be able to say that model gets a certain SPEC
rating, and sort of forget to mention those results were obtained on a
machine with a larger cache than the buyer is considering. Digital did
that a few years ago when they were the first with a model with an 8MB
L3 (this was on a 21164) that edged out the HP PA-8000 or PA-8200 based
workstation that had recently taken over the FP performance lead -- of
course the external L1 caches HP was using back then were fairly large
as well!

--
Douglas Siebert Director of Computing Facilities
douglas...@uiowa.edu Division of Mathematical Sciences, U of Iowa

I'm planning on being dead for most of the new millennium, how about you?

Andy Glew

ungelesen,
05.01.2000, 03:00:0005.01.00
an
Woops! In the previous post, I said
"In DMT both [thread creation and correctness detection] are explicit"
when I should have said implicit.

Corrected below.


Andy Glew <gl...@cs.wisc.edu> wrote in message news:84ufno$7...@spool.cs.wisc.edu...


> > Would you consider Sun's MAJC "Space Time Computing" an example of
> > this? [DMT] While execution units aren't totally shared among threads since
> > each is assigned to a processing unit, future method calls can be
> > speculatively executed using a speculative heap, and a rollback done
> > when the speculative thread has a dependence on a value written in the
> > head thread. Perhaps not a full implementation of DMT, but the core
> > idea appears to be the same.
> >
> > - Froilan

MAJC is certainly an example of a multiple sequencer architecture,
used to do parallel speculation on what appears to the programmer
to be a single thread of execution, with hardware support for speculation.

However, MAJC is a multiple sequencer *architecture*, not microarchitecture
- it's explicitly parallel, at least to the compiler.
Or, rather, the decision of when to create threads is explicit,
but there is hardware support for detecting dependencies between threads.

In DMT both are implicit - although it's a spectrum, since DMT can certainly
benefit from compiler hints.
Robert Harley

ungelesen,
06.01.2000, 03:00:0006.01.00
an

dsie...@icaen.uiowa.edu (Doug Siebert) writes:
> [...] Their salespeople will be able to say that model gets a certain SPEC
> rating, and sort of forget to mention those results were obtained on a
> machine with a larger cache than the buyer is considering. Digital did
> that a few years ago when they were the first with a model with an 8MB L3

Are you suggesting improper antics by Digital here? Or are you just
still bitter after 3 years that HP's SPEC lead was so short-lived due
to being surpassed by a faster Alpha?

As a happy user of that same model with an 8MB L3 and various others
with smaller caches, who never witnessed any such behaviour from
Digital, I suspect the latter. You could find plenty of faults with
Digital but this wasn't one of them. Au contraire, I had good
experiences in this area (*).

Bye,
Rob.

(*): When my Linux machine was running a bit slow due to a bad
2MB cache, a Digital guy came to my office *immediately* with a
new one in hand and installed it in minutes, no questions asked
(it turned out someone had put a cache for a 433 MHz machine in a
500 MHz one).

Anton Ertl

unread, 06.01.2000, 03:00:00
In article <84s9l7$fhs$2...@nntp2.atl.mindspring.net>,
Zalman Stern <zal...@netcom10.netcom.com> writes:
>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>: It reminds me more of an argument made in a paper (IIRC Dileep
>: Bhandarkar was one of the authors) around 1996 that compared the
>: Pentium Pro to some Alpha on an OLTP workload and argued that the
>: Pentium Pro had an advantage due to better code density.
>
>Were they using code sizes measured from real programs or systems?

I don't remember, and I haven't found the reference (maybe I was wrong
about the author). I think there were at least real performance
numbers, but given the different L2 cache sizes of the CPUs involved
that does not tell us much about the effect of code size.

> If so,
>they're already much better than the stuff we're talking about here.

That paper was certainly better wrt real numbers, but if you want real
numbers, you shouldn't read anything about IA64 (not even from Intel)
before you can buy it.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Doug Siebert

unread, 07.01.2000, 03:00:00
Robert Harley <har...@corton.inria.fr> writes:

>dsie...@icaen.uiowa.edu (Doug Siebert) writes:
>> [...] Their salespeople will be able to say that model gets a certain SPEC
>> rating, and sort of forget to mention those results were obtained on a
>> machine with a larger cache than the buyer is considering. Digital did
>> that a few years ago when they were the first with a model with an 8MB L3

>Are you suggesting improper antics by Digital here? Or are you just
>still bitter after 3 years that HP's SPEC lead was so short-lived due
>to being surpassed by a faster Alpha?


I'm not suggesting Digital did anything improper. Digital's market
position since the introduction of the Alpha has been that they sell the
fastest systems you can buy. There have been some periods of several
months when that wasn't true, after HP introduced the 8000, 8200, and
8500, and Intel the PPro and now the current Coppermine, but other than
that Alpha has held the undisputed performance lead. When that is your
sales strategy, your salespeople sell the systems that way, even when
the particular system they are trying to sell you isn't necessarily
faster than a competitive system you may be considering. (I.e., Digital
didn't do anything improper, but some salespeople do, whether they work
for Digital or any other vendor in any industry.)

I do know that Digital did hurry the intro of that 8MB model, and
published results well before it was actually available. I was told
in private email by a Digital employee that this was done to retake
the SPEC lead, which certain management types in Digital felt was
extremely important for their survival (but alone, not important enough,
it seems, given what happened to Digital). It's obviously in Digital's
interest when that is the basis upon which they are selling their systems.

IMHO, if Alpha evolution were stopped in its tracks today, their sales
in the Unix marketplace would begin to suffer as other vendors started
passing them by in performance, since that's how they market their
systems. But Sun could probably stand still with the Sparc for quite a
while before their market would start to care, since that isn't how they
market their systems.

Terry C. Shannon

unread, 07.01.2000, 03:00:00

"Doug Siebert" <dsie...@icaen.uiowa.edu> wrote in message
news:855akr$8mf$1...@server05.icaen.uiowa.edu...

<snip>

>
>
> I'm not suggesting Digital did anything improper. Digital's market
> position since the introduction of the Alpha has been that they sell the
> fastest systems you can buy. There have been some periods of several
> months after HP introduced the 8000, 8200, and 8500, and Intel the PPro
> and now current Coppermine, but other than that Alpha has held the
> undisputed performance lead.

The PPro lead was a tad vaporous... remember the 023.eqntott benchmark
"errata"?

I believe IBM briefly wrested the FP lead from Alpha with a POWER chip
back in ~1994.

Bernd Paysan

unread, 07.01.2000, 03:00:00
Doug Siebert wrote:
> IMHO, if Alpha evolution was stopped in its tracks today, their sales
> in the Unix marketplace would begin to suffer as other vendors started
> passing them by in performance, since that's how they market their
> systems. But Sun could probably stand still with the Sparc for quite a
> while before their market would start to care, since that isn't how they
> market their systems.

Sun has the software advantage. A lot of proprietary Unix software came
out on Suns long before the Alpha was available, and the vendors still
think that there are just two kinds of workstations: Suns and HPs. Even
for the EDA software that does run on Alpha, you find funny statements
like

"2. Verilog-XL 2.0.5 DOES run under DEC/Alpha OSF 1.3. Please be aware
that the DEC/Alpha platform has also entered End-Of-Life. There will be
NO
further releases of Verilog-XL on DEC/Alpha except for P0 support."

I wasn't aware that DEC had declared the Alpha platform dead in 1995
(the date of that document).

Paul DeMone

unread, 07.01.2000, 03:00:00

Bernd Paysan wrote:
[snip]

> Sun has the software advantage. A lot of proprietary Unix software came
> out on Suns long before the Alpha was available, and the vendors still
> think that there are just two kinds of workstations: Suns and HPs.

ECAD vendors are about the worst offenders. Sometimes I think they
like Suns so much because the uniprocessor performance is so mediocre
that you end up needing high-processor-count SPARC servers and have to
pay for more copies of floating licenses.

Just today I saw a new Fujitsu SPARC server go up the freight
elevator and thought we could have got a fully loaded DS20
pair of oxen for what we paid for that box of chickens. :-)

--
Paul W. DeMone The 801 experiment SPARCed an ARMs race of EPIC
Kanata, Ontario proportions to put more PRECISION and POWER into
dem...@mosaid.com architectures with MIPSed results but ALPHA's well
pde...@igs.net that ends well.

Andy Glew

unread, 07.01.2000, 03:00:00
> The PPro lead was a tad vaporous... remember the 023.eqntott benchmark
> "errata?"

Sigh. Here we go again...

Yes, there was a compiler bug that produced code which happened
to work for vectors that were a multiple of 4 in length, but which was
incorrect for vectors of length 5 (IIRC).

No, there was nothing deliberate. It was, literally, a parenthesization
error.

Moreover, once the bug was fixed, the performance changed
negligibly.

No, I didn't write the compiler bug, although I did do the evaluation
that pointed the way to this optimization. I continue to believe that
this is a legitimate optimization - as, by the way, did Digital, since
they did almost exactly the same optimization IIRC. Optimizing a scan of
16 bit quantities to process 2 at a time in a 32 bit register (or 4 at a
time in a 64 bit register) is an eminently sensible thing to do,
the sort of thing that good assembly programmers like Terje do
all the time. I wish that compilers did this more often - then maybe
JPEG and MPEG would have less assembly code.
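To make the transformation concrete, here is a minimal sketch (my own
illustration, not the eqntott code or any compiler's actual output) of
scanning 16-bit quantities two at a time in a 32-bit register. The
casts play loose with alignment and aliasing rules, as such hand
optimizations usually did:

#include <stddef.h>
#include <stdint.h>

/* Count how many 16-bit elements equal a key, two elements per
   32-bit word.  Assumes n is even and a is 4-byte aligned. */
size_t count16(const uint16_t *a, size_t n, uint16_t key)
{
    const uint32_t k2 = ((uint32_t)key << 16) | key;  /* key, twice */
    const uint32_t *p = (const uint32_t *)a;
    size_t hits = 0;
    size_t i;

    for (i = 0; i < n / 2; i++) {
        uint32_t x = p[i] ^ k2;        /* a halfword is 0 where equal */
        hits += ((x & 0xFFFFu) == 0);  /* low element matched  */
        hits += ((x >> 16) == 0);      /* high element matched */
    }
    return hits;
}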


Terje Mathisen

unread, 08.01.2000, 03:00:00

I have to agree with Andy, I'd love to see compilers that could figure
out how to combine multiple smaller elements into a larger register.

It seems like the only really common use of this is on CPUs like the
Alpha, which was designed from the beginning to work with byte values
this way.

The add-on MMX/SSE/3D-NOW instruction sets have basically no compiler
support at all, except for the way Intel's compiler allows you to enter
the opcodes directly as virtual function calls, with the compiler
handling register allocation for you.

Not a way to get portable code. :-(
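For readers who haven't seen that style, a minimal sketch (my own, not
Intel's documentation; the function name vadd4 is made up, while
_mm_add_pi16 and _mm_empty are the usual MMX intrinsic names):

#include <mmintrin.h>   /* MMX intrinsics */

/* Add four 16-bit values at once.  Each intrinsic call maps to one
   MMX opcode; the compiler allocates the MMX registers. */
void vadd4(short *dst, const short *a, const short *b)
{
    __m64 va = *(const __m64 *)a;          /* movq load */
    __m64 vb = *(const __m64 *)b;
    *(__m64 *)dst = _mm_add_pi16(va, vb);  /* paddw: 4 adds in one op */
    _mm_empty();                           /* emms: restore FP state */
}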

Chris Cox

unread, 08.01.2000, 03:00:00
In article <387705...@hda.hydro.com>, Terje Mathisen
<Terje.M...@hda.hydro.com> wrote:

> Andy Glew wrote:
> > Optimizing a scan of
> > 16 bit quantities to process 2 at a time in a 32 bit register (or 4 at a
> > time in a 64 bit register) is an eminently sensible thing to do,
> > the sort of thing that good assembly programmers like Terje do
> > all the time. I wish that compilers did this more often - then maybe
> > JPEG and MPEG would have less assembly code.
>
> I have to agree with Andy, I'd love to see compilers that could figure
> out how to combine multiple smaller elements into a larger register.
>
> It seems like the only really common use of this, is on cpus like the
> Alpha which was designed from the beginnning to work with byte values
> this way.
>
> The add-on MMX/SSE/3D-NOW instruction sets have basically no compiler
> support at all, excepting the way Intel's compiler allow you to enter
> the opcodes directly as virtual function calls, and the compiler will
> handle register allocation for you.

Have you looked at Metrowerks' 3D-NOW auto-vectorization support?
Still a bit limited, but a good start in that direction...

Also, AltiVec has some high-level language support - but most of the
operations are 'intrinsics' that later get scheduled (unlike most of the
MMX/SSE intrinsics).

Chris

Terje Mathisen

unread, 09.01.2000, 03:00:00

This is _exactly_ the same setup as Intel uses for MMX (and presumably
SSE, I haven't got a new copy of their compiler), i.e. each opcode
becomes a 'compiler intrinsic'.

Richard Stacpoole

unread, 09.01.2000, 03:00:00
In article <387859...@hda.hydro.com>, Terje Mathisen
<Terje.M...@hda.hydro.com> wrote:

> Chris Cox wrote:
> >
> > In article <387705...@hda.hydro.com>, Terje Mathisen
> > <Terje.M...@hda.hydro.com> wrote:
> > > The add-on MMX/SSE/3D-NOW instruction sets have basically no compiler
> > > support at all, excepting the way Intel's compiler allow you to enter
> > > the opcodes directly as virtual function calls, and the compiler will
> > > handle register allocation for you.
> >
> > Have you looked at Metrowerks' 3D-NOW auto-vectorization support?
> > Still a bit limited, but a good start in that direction...
> >
> > Also, AltiVec has some high-level language support - but most of the
> > operations are 'intrinsics' that later get scheduled (unlike most of the
> > MMX/SSE intrinsics).
>
> This is _exactly_ the same setup as Intel uses for MMX (and presumably
> SSE, I haven't got a new copy of their compiler), i.e. each opcode
> becomes a 'compiler intrinsic'.
>
> Terje

Excuse my ignorance, but could someone tell me another way to do it?

It seems that without a kind of explicit high level instruction support,
or some kind of hinting system, it would be very difficult for a
compiler to auto-vectorise.

Any info appreciated

Paul Hsieh

unread, 09.01.2000, 03:00:00
stac...@primus.com.au says...
> Terje Mathisen wrote:
> > Chris Cox wrote:

> > > Terje Mathisen wrote:
> > > > The add-on MMX/SSE/3D-NOW instruction sets have basically no compiler
> > > > support at all, excepting the way Intel's compiler allow you to enter
> > > > the opcodes directly as virtual function calls, and the compiler will
> > > > handle register allocation for you.
> > >
> > > Have you looked at Metrowerks' 3D-NOW auto-vectorization support?
> > > Still a bit limited, but a good start in that direction...
> > >
> > > Also, AltiVec has some high-level language support - but most of the
> > > operations are 'intrinsics' that later get scheduled (unlike most of the
> > > MMX/SSE intrinsics).
> >
> > This is _exactly_ the same setup as Intel uses for MMX (and presumably
> > SSE, I haven't got a new copy of their compiler), i.e. each opcode
> > becomes a 'compiler intrinsic'.

They also support special "classes" in which they use operator
overloading, so that you can even avoid these intrinsics if you want.

> Excuse my ignorance, but could someone tell me another way to do it?
>
> It seems that without a kind of explicit high level instruction support,
> or some kind of hinting system, it would be very difficult for a
> compiler to auto-vectorise.

Indeed, it does not auto-vectorise. You still need to write all the
vectorizable code as such yourself.

--
Paul Hsieh
http://www.pobox.com/~qed/mailme.html

Toon Moene

unread, 09.01.2000, 03:00:00
DONT.qed...@pobox.com (Paul Hsieh) wrote:

> stac...@primus.com.au says...

> > Excuse my ignorance, but could someone tell me another way to do it?
> >
> > It seems that without a kind of explicit high level instruction support,
> > or some kind of hinting system, it would be very difficult for a
> > compiler to auto-vectorise.

Hmmm, so all those autovectorising compilers by Cray, Convex, NEC, Fujitsu,
etc. are just a figment of my imagination ...

> Indeed, it does not auto-vectorise. You still need to write all the
> vectorizable code as such yourself.

Pah. That just means their compilers are behind the state of the art.

[ Before anyone asks - yes, the GNU Compiler Collection is behind the
state of the art too ]

--
Toon Moene (mailto:to...@moene.indiv.nluug.nl)
Saturnushof 14, 3738 XG Maartensdijk, The Netherlands
Phone: +31 346 214290; Fax: +31 346 214286
GNU Fortran: http://gcc.gnu.org/onlinedocs/g77_news.html

st...@blighty.com

unread, 09.01.2000, 03:00:00
In article <stacpoole-34E24...@news.primus.com.au>, Richard says...

>
>In article <387859...@hda.hydro.com>, Terje Mathisen
><Terje.M...@hda.hydro.com> wrote:

>> This is _exactly_ the same setup as Intel uses for MMX (and presumably
>> SSE, I haven't got a new copy of their compiler), i.e. each opcode
>> becomes a 'compiler intrinsic'.
>>

>> Terje


>
>Excuse my ignorance, but could someone tell me another way to do it?
>
>It seems that without a kind of explicit high level instruction support,
>or some kind of hinting system, it would be very difficult for a
>compiler to auto-vectorise.

It can't be that difficult. I was using Convex Fortran compilers over
a decade ago that would vectorise standard Fortran code quite nicely.
(The C compiler too, though not quite as well.)

They certainly vectorised loops over arrays or matrices very efficiently, and
could improve some other code structures as well.

They benefited from a few cycles round the compile - profile - add
hints/tweak code loop, but did pretty well on the first pass.

It isn't new technology. It's fairly sad that it has taken as long as it has
for the workstation/PC compiler vendors to use some of the same techniques.

Cheers,
Steve

--
-- Steve Atkins -- st...@blighty.com


Greg Lindahl

unread, 09.01.2000, 03:00:00
Richard Stacpoole <stac...@primus.com.au> writes:

> It seems that without a kind of explicit high level instruction support,
> or some kind of hinting system, it would be very difficult for a
> compiler to auto-vectorise.

Vectorizing compilers are old hat. The problem with these instruction
sets is that they have nasty run-time alignment restrictions.
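The standard workaround, in sketch form (my own illustration, not from
the post): peel scalar iterations until the pointer reaches the required
alignment, run the wide loop on aligned data, and finish with a scalar
epilogue.

#include <stddef.h>
#include <stdint.h>

/* Scale an array, with a scalar prologue that runs until a+i is
   16-byte aligned, as SSE-style aligned accesses would require. */
void scale(float *a, size_t n, float s)
{
    size_t i = 0;

    while (i < n && ((uintptr_t)(a + i) & 15) != 0) {
        a[i] *= s;                /* scalar until aligned */
        i++;
    }
    for (; i + 4 <= n; i += 4) {  /* aligned 4-wide main loop */
        a[i]     *= s;
        a[i + 1] *= s;
        a[i + 2] *= s;
        a[i + 3] *= s;
    }
    for (; i < n; i++)            /* scalar epilogue */
        a[i] *= s;
}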

-- g


Bernd Paysan

unread, 09.01.2000, 03:00:00
Toon Moene wrote:
> Pah. That just means their compilers are behind the state of the art.
>
> [ Before anyone asks - yes, the GNU Compiler Collection is behind the
> state of the art too ]

Indeed. One might be inclined to see this as a case study that
commercial development sucks, even when done in the open source area.
About 10 years ago, the Cygnus founders started with a compiler that was
better than vendor C compilers (back then, it wasn't a compiler
collection, it was just C). Now, ten years later, we have a huge
compiler collection with about 80% of the contributions coming from
slave labor*, a code fork and a re"merge" (most of the 2.8 parts got
into the egcs fork, anyway), and it's definitely behind proprietary
commercial offerings; a rewrite from scratch has been imminent for
several years now. According to the theory (the TCatB theory), this
could never happen.

*) "slave labor" is when the labor doesn't belong to the worker, but is
"compensated" by money instead.

Andy Glew

unread, 09.01.2000, 03:00:00
> Excuse my ignorance, but could someone tell me another way to do it?
>
> It seems that without a kind of explicit high level instruction support,
> or some kind of hinting system, it would be very difficult for a
> compiler to auto-vectorise.

Why should it be so hard?

FORTRAN compilers were converting loops like

      DO 99 I=1,100
   99 A(I)=B(I)+C(I)

into vector statements

VL=64
VLD V1,B
VLD V2,C
VADD V3,V1,V2
VST V3,A
VL=36
VLD V1,B+64*4
VLD V2,C+64*4
VADD V3,V1,V2
VST V3,A+64*4

when I was in high school.

FYI, the above performs a 64 element,
and then a 36 element, vector add, assuming that
the vector registers are 64 elements long.

All that we are asking is that similar compiler
vectorizing be applied to the 4 element and 8 element
vectors that are common in MMX, SSE, etc.
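As a sketch of what is being asked for (my own illustration using
Intel's SSE intrinsics; it assumes 16-byte-aligned arrays and n a
multiple of 4):

#include <stddef.h>
#include <xmmintrin.h>   /* SSE intrinsics */

/* The same A = B + C loop, strip-mined by 4 for SSE. */
void vadd(float *a, const float *b, const float *c, size_t n)
{
    size_t i;
    for (i = 0; i < n; i += 4) {
        __m128 vb = _mm_load_ps(&b[i]);           /* aligned 4-wide load */
        __m128 vc = _mm_load_ps(&c[i]);
        _mm_store_ps(&a[i], _mm_add_ps(vb, vc));  /* four adds at once */
    }
}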

Andrew Reilly

unread, 10.01.2000, 03:00:00
In article <85cah9$3ec$1...@sparky.wolfe.net>, TTK Ciar wrote:
>the task of the compiler. Consider, for instance, the C (so-called)
>function:
>
>void foo ( char *a, char *b ) {
> int i;
> for ( i = 0; a[i] != 0; i++ )
> a[i] ^= b[0];
> }
>
> Seems pretty straightforward to optimize with SIMD (or just plain
>word-sized register ops), until you realize that processing chunks of
>a[] 8 or 16 chars at a time can generate results which are different
>than those from processing chunks of a[] 1 char at a time, when a[]
>and b[] overlap.

Yeah, but that was obviously the intention of the procedure,
otherwise it would have been written:

void foo ( char *a, char *b ) {
    int i; char b0;
    b0 = b[0];
    for ( i = 0; a[i] != 0; i++ )
        a[i] ^= b0;
}

C lets you write unambiguous code just fine. You just have to
remember that it's code that you're writing, not arithmetic.
Treat it like an assembler and it will behave like one.

Of course the corollary to that is that if you _want_ to be able
to write arithmetic, or UML diagrams, or whatever, then you probably
shouldn't be using C (first).

> If you say "we had better languages than C when I was in high
>school!", then I'll be the first to agree with you. Unfortunately
>the industry has standardized on the neanderthal bastard child of
>PL/I, and there's enough momentum behind the Church of C that it'll
>be a long while before it is profitable for most compiler developers
>to stop bending their efforts to C optimization and start optimizing
>code written in a more agreeable language.

It's been asked here in this thread a little differently, but
what would a language that was easier to optimise for (while
still getting predictable results) look like? I've never used
or really seen the predecessors to C, (BCPL and B?). Would they
fit the bill? Forth anyone? What does C-- have going for it?

--
Andrew

Jesus Cea Avion

unread, 10.01.2000, 03:00:00
> Also, Fortran has a design advantage over C in that C semantics
> are defined such that the compiler must assume that any data structs
> which cannot be disproven to overlap, do overlap, and in most non-

Compilers can, of course, include a run-time check in order to detect
overlap.
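A minimal sketch of such run-time versioning (my own illustration; the
pointer comparison between possibly unrelated objects is technically
outside strict ISO C, but it reflects what the generated code would do):

#include <stddef.h>

/* Vector-friendly version: caller guarantees no overlap. */
static void add_disjoint(float *a, const float *b, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)   /* safe to vectorize */
        a[i] += b[i];
}

void add(float *a, const float *b, size_t n)
{
    size_t i;
    if (a + n <= b || b + n <= a)   /* run-time overlap test */
        add_disjoint(a, b, n);
    else
        for (i = 0; i < n; i++)     /* conservative fallback */
            a[i] += b[i];
}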

--
Jesus Cea Avion _/_/ _/_/_/ _/_/_/
jc...@argo.es http://www.argo.es/~jcea/ _/_/ _/_/ _/_/ _/_/ _/_/
_/_/ _/_/ _/_/_/_/_/
PGP Key Available at KeyServ _/_/ _/_/ _/_/ _/_/ _/_/
"Things are not so easy" _/_/ _/_/ _/_/ _/_/ _/_/ _/_/
"My name is Dump, Core Dump" _/_/_/ _/_/_/ _/_/ _/_/
"El amor es poner tu felicidad en la felicidad de otro" - Leibniz

Paul Hsieh

unread, 10.01.2000, 03:00:00
TTK Ciar writes:

> Andy Glew <gl...@cs.wisc.edu> wrote:
> >Why should it be so hard?
> >FORTRAN compilers were converting loops like
> >
> > DO 99 I=1,100
> > 99 A[I]=B[I]+C[I]
> >
> >into vector statements
> >
> [..assembly snipped..]
> >
> >when I was in high school.
>
> You're right, that's not too hard. But how often do embarrassingly
> simple code sequences like that comprise the inner loop of processor-
> intensive real-life code?

See SPECfp. I think nearly all the loops in SPECfp are "that simple". I
think the scientific community is not so much concerned with complicated
programming structures as with applying their complicated formulas to
exceedingly large amounts of data.

> [...] I pose this as a question because, even
> though I've seen a lot of code which is more complex than this, I do
> not pretend to have seen a majority of all source codes, so I am not
> sure of the answer to the question. I would be surprised if the
> answer were "quite often", or even "often enough to justify the work",
> but I'm not going to make a claim one way or the other. It's possible
> that my experiences are nonrepresentative.

As you mention subsequently, the language itself can have a lot to do
with it.

> Also, Fortran has a design advantage over C in that C semantics
> are defined such that the compiler must assume that any data structs
> which cannot be disproven to overlap, do overlap, and in most
> non-trivial code the compiler cannot disprove that *most* data
> structures do not overlap, which means that all indirectly-referenced
> data must be assumed to overlap in one way or another. This greatly
> complicates the task of the compiler. Consider, for instance, the C
> (so-called) function:
>
> void foo ( char *a, char *b ) {
> int i;
> for ( i = 0; a[i] != 0; i++ )
> a[i] ^= b[0];
> }
>
> Seems pretty straightforward to optimize with SIMD (or just plain
> word-sized register ops), until you realize that processing chunks of
> a[] 8 or 16 chars at a time can generate results which are different
> than those from processing chunks of a[] 1 char at a time, when a[]
> and b[] overlap.

Interesting -- so how hard is it to convince your compiler to do the
following (and is it worth it?):

typedef unsigned long SIMDWORD;   /* stand-in for a native SIMD word */

/* True when a and b are within one SIMDWORD of each other, i.e. when
   word-sized accesses could overlap. */
#define ALIAS_SIMD(a,b) ((unsigned long)((a) - (b) + sizeof(SIMDWORD))\
                         <= sizeof(SIMDWORD))

void foo ( char *a, char *b ) {
    int i = 0;

    /* The alias test is loop-invariant, so hoist it into one if/else. */
    if ( !ALIAS_SIMD(a,b) ) {
        /* Pointers do not alias: one SIMDWORD per iteration.  (Alignment
           and a terminator in mid-word are glossed over; only the first
           byte of each chunk is tested.) */
        for ( ; a[i] != 0; i += (int)sizeof(SIMDWORD) )
            *(SIMDWORD *)&a[i] ^= *(SIMDWORD *)&b[0];
    } else {
        /* Pointer aliasing case -- just do the code as usual. */
        for ( ; a[i] != 0; i++ )
            a[i] ^= b[0];
    }
}

Of course this is not a completely trivial problem -- the seemingly
similar code:

void foo ( char *a, char *b ) {
    int i;
    for ( i = 0; a[i] != 0; i++ )
        a[i] ^= b[i];
}

has a completely different breakdown (only a single alias test is
required).

> If you say "we had better languages than C when I was in high
> school!", then I'll be the first to agree with you. Unfortunately
> the industry has standardized on the neanderthal bastard child of
> PL/I, and there's enough momentum behind the Church of C that it'll
> be a long while before it is profitable for most compiler developers
> to stop bending their efforts to C optimization and start optimizing
> code written in a more agreeable language.
>

> >All that we are asking is that similar compiler
> >vectorizing be applied to the 4 element and 8 element
> >vectors that are common in MMX, SSE, etc.
>

> I understand your grievance, and frankly I'd like to see a lot
> more effort put into auto-vectorizing compilers. Unfortunately
> there are a few forces working against it: (1) the difficulty of
> optimizing C, (2) the arrogance of certain compiler engineers at
> a certain company who don't believe auto-vectorization is worth
> their time,

This certain company wouldn't happen to have had all their serious
competition throw in the towel in the past few years, would they?

> [...] and (3) the limited benefit of vectorizing code which
> is memory-limited anyway (ie, when you hand-code SIMD assembly,
> you know from familiarity with the algorithm that some things can
> be kept in registers, but the compiler generates lots of superfluous
> memory access, which means that most inner-loop code is starving
> for data. It doesn't matter if a superscalar processor can perform
> two unit adds per clock or if a SIMD processor can perform eight
> unit adds per clock, when data is trickling through the memory bus
> at an average of one unit data per two clock cycles! You can't
> process data faster than you receive data to process).

Well, that just means that you need to add prefetch into the compiler
too, doesn't it?
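A sketch of the idea (my own illustration; __builtin_prefetch is
spelled as GCC later provides it, and the 256-byte prefetch distance is
a made-up tuning parameter):

#include <stddef.h>

/* XOR two buffers, prefetching each 64-byte line about 256 bytes
   ahead of its use. */
void vxor(unsigned char *a, const unsigned char *b, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        if ((i & 63) == 0) {                     /* once per cache line */
            __builtin_prefetch(&a[i + 256], 1);  /* 1 = for writing */
            __builtin_prefetch(&b[i + 256], 0);  /* 0 = for reading */
        }
        a[i] ^= b[i];
    }
}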

Andy Glew

unread, 10.01.2000, 03:00:00
> You're right, that's not too hard. But how often do embarrassingly
> simple code sequences like that comprise the inner loop of processor-
> intensive real-life code?

In JPEG and MPEG, quite a lot.

In EQNTOTT, enough to start benchmark flame wars
(the really non-justifiable part of EQNTOTT optimizations
lay in some IF transformations within the loop).

In the Perfect Club...

...

Moreover, there's the effect that Michael Wolfe (of "Optimizing
SuperCompilers for SuperComputers") described, whereby programmers
learn to code loops that are automatically vectorized by the compiler.
I.e. using a HLL as a flexible "macro assembler".


Andreas Krall

unread, 10.01.2000, 03:00:00
In article <stacpoole-34E24...@news.primus.com.au>,

Richard Stacpoole <stac...@primus.com.au> writes:
> In article <387859...@hda.hydro.com>, Terje Mathisen
> <Terje.M...@hda.hydro.com> wrote:
>
>> Chris Cox wrote:
>> >
>> > Have you looked at Metrowerks' 3D-NOW auto-vectorization support?
>> > Still a bit limited, but a good start in that direction...
>> >
>> > Also, AltiVec has some high-level language support - but most of the
>> > operations are 'intrinsics' that later get scheduled (unlike most of the
>> > MMX/SSE intrinsics).
>>
>> This is _exactly_ the same setup as Intel uses for MMX (and presumably
>> SSE, I haven't got a new copy of their compiler), i.e. each opcode
>> becomes a 'compiler intrinsic'.
>>
>> Terje
>
> Excuse my ignorance, but could someone tell me another way to do it?
>
> It seems that without a kind of explicit high level instruction support,
> or some kind of hinting system, it would be very difficult for a
> compiler to auto-vectorise.
>
> Any info appreciated

There is no need for high level instruction support or some kind of
hinting system. Vectorizing compilers have been available for vector
processors for 20 years. The same algorithms can be used for AltiVec/SSE
code generation. We have developed a compiler for the SPARC-VIS
instruction set using both classic vectorisation and vectorisation by
loop unrolling and instruction grouping. A group in India has developed
a vectorizing compiler generating MMX instructions based on the SUIF
compiler suite.
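To give a flavour of the unroll-and-group technique (my own sketch, not
the actual TU Wien or SUIF compiler output): unroll a byte loop by four
and group the four byte operations into a single 32-bit word operation.
XOR has no carries between lanes, so the grouping is exact; 4-byte
alignment and n a multiple of 4 are assumed.

#include <stddef.h>
#include <stdint.h>

void bxor(uint8_t *d, const uint8_t *s, size_t n)
{
    uint32_t *dw = (uint32_t *)d;
    const uint32_t *sw = (const uint32_t *)s;
    size_t i;

    for (i = 0; i < n / 4; i++)
        dw[i] ^= sw[i];   /* four byte-XORs grouped into one word op */
}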

--
an...@complang.tuwien.ac.at Andreas Krall
http://www.complang.tuwien.ac.at/andi/ Inst. f. Computersprachen, TU Wien
tel: (+431) 58801/18511 Argentinierstr. 8/4/1851
fax: (+431) 58801/18598 A-1040 Wien AUSTRIA EUROPE

rub...@my-deja.com

unread, 10.01.2000, 03:00:00
In article <85d38a$qhf$1...@news.tuwien.ac.at>,
an...@complang.tuwien.ac.at (Andreas Krall) wrote:

> There is no need for high level instruction support or some kind of
> hinting system. Vectorizing compilers have been available for vector
> processors for 20 years. The same algorithms can be used for
> AltiVec/SSE code generation. We have developed a compiler for the
> SPARC-VIS instruction set using both classic vectorisation and
> vectorisation by loop unrolling and instruction grouping. A group in
> India has developed a vectorizing compiler generating MMX instructions
> based on the SUIF compiler suite.


Which group is that? Do they have any papers available on-line?





David Dunn

unread, 10.01.2000, 03:00:00
Terje Mathisen wrote:

> I have to agree with Andy, I'd love to see compilers that could figure
> out how to combine multiple smaller elements into a larger register.

Then complain to SPEC. They refused to make a change in
the ijpeg sources that looks something like this:

typedef int foo; /* needs to be at least 16 bits */
typedef long bar; /* needs to be at least 32 bits */

If that had been changed to short/int as the original author
had commented, every compiler on the planet would have
generated MMX-style code for ijpeg.

But it was too late and they didn't want us to cheat. Plus
I'm not sure how strongly our rep represented the case. He
always felt optimization of any type was really cheating.

Dave Dunn

Not the opinion of present, past, or future employers.

TTK Ciar

unread, 10.01.2000, 03:00:00
In article <387900A6...@gmx.de>,

Bernd Paysan <bernd....@gmx.de> wrote:
>Toon Moene wrote:
>> Pah. That just means their compilers are behind the state of the art.
>>
>> [ Before anyone asks - yes, the GNU Compiler Collection is behind the
>> state of the art too ]
>
>Indeed. One might be inclined to see this as a case study that
>commercial development sucks, even when done in the open source area.
>About 10 years ago, the Cygnus founders started with a compiler that was
>better than vendor C compilers (back then, it wasn't a compiler
>collection, it was just C). Now, ten years later, we have a huge
>compiler collection with about 80% of the contributions coming from slave
>labor*, a code fork and a re"merge" (most of the 2.8 parts got into the
>egcs fork, anyway), and it's definitely behind proprietary commercial
>offerings -

I suppose that depends on what you mean by "behind". The versions of
gcc maintained by Cygnus are less buggy than most proprietary compilers,
and gcc is the ultimate portable cross-compiler, making it the choice of
embedded systems developers and corporations which need to support the
same code base on multiple target platforms. There's nothing sweeter
than having a hugely-multiprocessor SUN or SGI box on which you can
rapidly build all of your MIPS/m68k/PPC/i386 binaries, all from the
same Makefile and all into the same filesystem.

On the other hand, gcc's optimization capability is somewhat lacking,
and its memory footprint (and hence running time) is immense. "Jack of
all trades, master of none" comes to mind; it is much easier to write
a single-host + single-target optimizing compiler than it is to maintain
a much-hacked compiler codebase which has to be correct for each of many
hundreds (thousands?) of host/target combinations.

>a rewrite from scratch has been imminent for several years
>now. According to the theory (the TCatB theory), this could never
>happen.

"The Cathedral and the Bazaar" contains much truth, but it does not
describe the entire truth. As the power of our technology increases
exponentially, so does the size of the set of interesting applications
which can be practically implemented at a given price-point, and the
demand for skilled programmers increases proportionately. Moreover,
most of the programmers entering the market are "lightweights" whose
lack of experience and narrow knowledge base makes them much less
valuable than the well-seasoned old-time coding geeks. The latter are
being aggressively sought after, and once hired they are being "spread
thin" and worked hard, giving them less time to work on open source
projects.

The same is even more true of compiler engineers. Compiler tech
is pretty dense stuff, it's getting more complicated all of the time
(making it harder for "new blood" to break in and learn enough about
it to be useful in the industry), and correspondingly there are even
fewer compiler engineers than there are other kinds of software
engineers. I'm not at all surprised that between greater pressure
to spend time on corporate work and the increasing complexity of the
compiler itself, it is becoming prohibitively difficult to advance
the GNU compiler technology on a revolutionary level, and programmer
time is being spent instead on incremental improvements.

Mind you, there is still a *lot* of incremental improvement potential
left in gcc. It will need to lose RTL eventually (or at least supplant
it, either way a lot of work), but the decision to leave that hill
unclimbed for now and focus on other things isn't simply the result of
shortsighted profit-seeking suited MBA's bullying the engineers around
-- there are many, many more worthwhile incremental improvements to be
worked on than there are code contributors. Just look at the to-do
list, we need more monkeys! :-)

>*) "slave labor" is when the labor doesn't belong to the worker, but is
>"compensated" by money instead.

No, "slave labor" is when the worker has no choice but to labor in
the manner prescribed, and to accept whatever compensation (if any) the
employer chooses to give them. Software engineers in particular are
working in the midst of a strong "seller's market"; if they don't like
what they're being asked to do, or if they don't like the way they're
being compensated, then they can wave their resumes in the air and
whistle up an army of slobbering recruiters, who will beg them to look
at other job offers.

Or if they don't like engineering for a living, they can grin and
bear it for a few years, cash in their stock options, move into the
boondocks (woohoo! cheap land!), and spend the rest of their lives
happily coding whatever open-source software they like. (It seems
really tempting some days, for sure.)

-- TTK

