NNTP-Posting-Date: Wed, 27 Aug 2003 08:37:07 -0500
From: Robert Myers <rmy...@rustuck.com>
Subject: Re: software architecture and abstraction
Date: Wed, 27 Aug 2003 09:37:11 -0400
X-Newsreader: Forte Agent 1.92/32.572
Content-Type: text/plain; charset=us-ascii
On Mon, 11 Aug 2003 05:06:54 GMT, "Andy Glew" wrote:
>"Robert Myers" <rmy...@rustuck.com> wrote:
>> P6 is anything but simple... constantly
>> redoing the same optimization only to forget it the next time it sees
>> the code (once it's gone from the trace cache) to make compiler
>> optimization easier strikes me as, well, um, er, suboptimal.
>Yup. (BTW, P6 did not have a trace cache (although I tried to
>get one in, and, so far as I know, am the originator of the term
>"trace cache" (back at Gould/Motorola)). Willamette's
>trace cache doesn't store optimizations.)
>(1) Amortizing the optimizations that the dynamic hardware performs
>- allowing them to be reused instead of recomputed - is definitely one
>of the hot topics in CPU architecture today.
>Although, writing it that way makes me wonder if it is a good idea:
>one of the current rules of thumb is that it is easier to recompute
>than look up in a storage or cache array.
If you're doing compiler optimization, the equivalent question is:
does the increase in code size resulting from encoding an elaborate
optimization cost more than the speedup the optimization creates?
The fact that no one can answer this question a priori results in
people searching the external optimization space the compiler
provides--the optimization flags--by hand.
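To make the tradeoff concrete, here's a toy C illustration (the
functions are my own, not any real compiler's output): the unrolled
version carries roughly four copies of the loop body in the
instruction stream, and whether the extra I-cache pressure is paid
back depends entirely on the workload.

#include <stddef.h>

/* Straightforward version: small code footprint. */
long sum_simple(const long *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Manually unrolled version: ~4x the loop body in the instruction
 * stream.  It may run faster on some pipelines, but the extra bytes
 * compete for I-cache space--exactly the cost that is hard to weigh
 * a priori. */
long sum_unrolled(const long *a, size_t n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)          /* remainder loop */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}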
If you're doing software-driven runtime optimization, the question is
more complicated. One would expect run-time optimization to bloat the
code, but then the optimizer has to decide how much time it is willing
to spend organizing that bloated code so it can be fetched and
executed efficiently.
In the (unreasonably) ideal case, the run-time optimizer would produce
the equivalent of compiler output on the first pass, and it would then
just be a question of the cost of running the run-time optimizer's
bloated code on subsequent passes.
In a case more closely resembling reality, I would imagine run-time
optimization making some kind of judicious balance between the cost of
producing large swaths of code that can be executed without further
modification by the run-time system and the cost of doing lots of
pattern matching and fetching of stored instruction groups.
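As a toy model of that balance (every constant and the synthetic
address stream below are invented for illustration), consider a
direct-mapped "optimization cache": once a hot block is evicted, the
work gets redone from scratch, which is exactly the forgetting I was
complaining about earlier.

#include <stdio.h>
#include <stdlib.h>

#define CACHE_SLOTS 64
#define OPT_COST    50  /* cycles to redo the optimization */
#define LOOKUP_COST  2  /* cycles to fetch a stored result */

int main(void)
{
    long cache[CACHE_SLOTS];
    long cycles = 0;

    for (int i = 0; i < CACHE_SLOTS; i++)
        cache[i] = -1;

    srand(1);
    for (int t = 0; t < 100000; t++) {
        /* 90% of references hit a small hot working set; the rest
         * are cold blocks that evict hot entries. */
        long block = (rand() % 100 < 90) ? rand() % 48
                                         : 1000 + rand() % 1000;
        int slot = (int)(block % CACHE_SLOTS);
        if (cache[slot] == block) {
            cycles += LOOKUP_COST;  /* reuse the stored work */
        } else {
            cycles += OPT_COST;     /* redo it, then store it */
            cache[slot] = block;
        }
    }
    printf("total cycles: %ld\n", cycles);
    return 0;
}

Vary OPT_COST against LOOKUP_COST and the hit rate and you get Andy's
rule of thumb back out: if recomputing is cheap enough, the storage
array isn't worth its area.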
Just thinking of software-driven runtime optimization makes my head
hurt: you have one optimization problem (optimizing the optimizer)
piled on top of another (the original code). To try to clarify what I
mean by that last sentence a little bit: the optimum optimizer will be
very problem dependent (just as the unknown in any optimization
problem is what input the code will actually be given when it is run).
Naturally, this brings us to the well-worn subject of
metaoptimization...just kidding, not going to go down that road--at
least not today. ;-).
>(2) One of the issues wrt amortizing optimization is that there
>is often no fixed schedule: the location of cache misses,
>and hence the schedule, cannot be predicted for a static
>instruction by either the compiler or hardware (although
>patterns of cache misses can be predicted for dynamic
>instructions fairly well).
I didn't respond to your post when I read it the first time because I
didn't understand exactly what you meant. Your post is so full of
good stuff that I'm going to risk betraying my absolute ignorance once
again by guessing that you mean statically-scheduled instructions vs.
dynamically-scheduled instructions, that is, an in-order processor vs.
an out-of-order processor.
I'm going to further guess that your reasoning (or *the* reason, if
you understand the matter completely) is that a statically-scheduled
instruction stream is so inflexible that the smallest pebble on the
rails results in a train wreck, whereas an out-of-order processor
rides on a cushion of scheduling flexibility, and corners more like a
car with a fancy suspension and rubber wheels than like a train with
no suspension on steel rails. That little bit of extra flexibility
makes the future worth predicting and the past worth remembering.
>(3) I'm overall in favour of compiler optimization so long as
>it is worthwhile. It's not clear to me that precisely scheduling
>the pipelines is worthwhile:
> a) the hardware to do greedy scheduling is pretty simple
The premise of Intel's silly promo book about Itanium, and one of the
few things in the book that would be of interest to a technical
reader, is that Intel went with static scheduling because they
"discovered" (and I believe they actually use that word) that the cost
of dynamic scheduling is superlinear (quadratic, if I remember
correctly, is what they claimed it to be) in the issue width.
Regardless of how well the scaling of scheduling hardware with issue
width is understood (is it really quadratic, or did they just fit the
curve to the next term in the Taylor series?), what you just said and
what Intel has said about the design premise of Itanium don't seem to
square with one another.
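For what it's worth, the usual back-of-envelope argument for the
superlinear claim goes like this: in a conventional wakeup/select
scheduler, every window entry compares its source tags against every
result tag broadcast in a cycle, and the window itself gets scaled up
with the issue width. A trivial sketch (the 16-entries-per-issue-slot
factor is pure assumption):

#include <stdio.h>

int main(void)
{
    for (int width = 1; width <= 8; width *= 2) {
        int window = 16 * width;         /* window scaled with width */
        long comparators = (long)window * width * 2;  /* 2 src tags */
        printf("issue width %d: ~%ld tag comparators\n",
               width, comparators);
    }
    return 0;
}

Doubling the width quadruples the comparator count in this model,
which is at least consistent with Intel's quadratic claim, whatever
the real constant factors turn out to be.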
> b) because the schedule changes both dynamically
> (as a result of things like cache misses)
> over time (as a result of timesharing)
> and over the long term (as a result of changes in
> microarchitecture, and even on the same microarchitecture
> because DRAMs differ)
>Similarly, having the compiler insert prefetch instructions has already
>proven to be a dead end: paraphrasing an ISCA talk,
>"Compiler inserted prefetches are never at the right place,
>and are almost always worse than hardware generated prefetches."
But prefetch hints from a compiler shouldn't hurt.
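In fact the "shouldn't hurt" property is more or less designed in:
GCC's __builtin_prefetch, for example, is documented to compile to
nothing on targets without a prefetch instruction. A minimal sketch
(the prefetch distance of 16 elements is a blind guess, which is
rather the point--a static compiler has to fix it without knowing the
machine or the miss pattern):

#include <stddef.h>

long sum_with_hints(const long *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        /* read (0), low temporal locality (1); a pure hint */
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0, 1);
        s += a[i];
    }
    return s;
}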
>Ironically, the compiler often has a *smaller* optimization window
>than the hardware prefetch mechanisms, and will soon have a
>smaller scheduling window than the hardware. Static compilation
>suffers combinatoric explosion in building large windows.
That problem is solved by making arbitrary heuristic assumptions to
make the optimization space smaller. Doing that and expecting to be
successful in the general case without any feedback from experience
wouldn't even pass the laugh test if that weren't how compilers are
almost always designed.
Can anybody offer generalizations as to how large the instruction
groups are that current Itanium compiler technology is capable of
generating? The instruction window that has to be considered would be
larger than the stream produced, but somebody has his hands on a
significant amount of interesting data: the statistics of the size of
Itanium instruction groupings that, say, the Intel compiler emits for
the usual suspects (say, something easy, like gcc).
>(4) I suspect that a microarchitecture targeted to support aggressive
>compiler optimization nowadays would not look that different
>from one that does not: it would probably have the same large
>instruction window, but it might just have hint bits so that the
>hardware didn't have to spend a long time (re)learning what the
>code was going to do.
Maybe a compiler without feedback. But a truly feedback-driven
compiler would look quite different from your standard aggressive
optimizing compiler.
>I think the only reasonable system for which such large instruction window
>support would not be justified would be in high performance technical
>computing using lots of processors, where it may not be unreasonable
>to make each individual processor less efficient, forgoing the large
>instruction window, in order to squeeze a few more processors
>on chip or into the system. But even in HPTC it is not necessarily
>a home run - lots of HPTC codes are supposedly dusty decks
>that good optimizers cannot handle. (Actually, I think dusty decks ...
I wonder why I haven't seen this issue addressed in any documents from
Intel or HP about Itanium. In fairness to them, DynamoRIO is an
attempt to build a larger instruction window on top of static
scheduling. How well it works...
And to raise the point again of actual cost: you say it's low and
Intel/HP says it's not. If it were truly low, I suspect we'd have
seen an OoO Itanium by now. Or are they holding back out of
stubbornness? They wouldn't be *that* silly, would they?
>> It also makes sense to me that once you
>> have figured out a particular way of doing things when the code
>> actually runs, you should be able to exploit that knowledge the next
>> time you see the same situation.
>But the problem is that the next time you see the same situation
>is not necessarily the next time you see the same set of static
>instructions. It is more accurately the next time you see those
>instructions, with the same configuration of cached data, with
>the same set of preceding instructions
>in the pipeline, etc. etc.
I think this is an argument for some kind of hardware dynamic
scheduling, not necessarily an argument for forgetting the past.
>Compilers attempt to approximate this "same situation"
>by replicating code - unrolling for static patterns of cache misses,
>etc. - but that has obvious costs.
Code bloat, as addressed in more detail earlier.
>And, to put it bluntly,
>compilers just plain don't do it very well. They can comprehend
>simple things like scheduling execution units; scheduling cache
>misses seems to be just plain beyond them.
>I've spent hours in meetings with compiler folk, especially
>Intel's compiler folk, trying to persuade them to use information
>such as cache miss profiles in their optimization; their response
>always amounts to they'd like to, but they are so overbooked
>doing the stuff they are currently doing.
>(Which is in part why I'd like to free them up from the simple
>stuff that is easy for hardware to do, so that they can concentrate
>on the hard stuff.)
Compilers aren't very good at a priori prediction for two reasons that
you have stated: only a relatively small bit of the optimization space
can be explored, and the best optimization is sensitive to tiny
changes in the environment that fundamentally cannot be predicted by
any amount of analysis of the source code.
If you add at least some scheduling flexibility and some runtime
feedback to the compiler, I think you can blow both of those
limitations out of the water. Experience so far with the profiling
optimization capability of icc (the Intel C compiler) is very
encouraging.
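For anyone who wants to play with the feedback idea, the loop itself
is cheap to try. A minimal sketch using GCC's profile-guided
optimization flags (exact spellings vary by compiler version, and icc
has its own equivalent); the training program is a made-up stand-in:

/* Build and run:
 *   gcc -O2 -fprofile-generate pgo_demo.c -o pgo_demo   (instrument)
 *   ./pgo_demo                                          (train)
 *   gcc -O2 -fprofile-use pgo_demo.c -o pgo_demo        (recompile)
 * The recorded profile tells the compiler which branches are actually
 * hot, shrinking the part of the optimization space it must search
 * blindly. */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    long n = (argc > 1) ? atol(argv[1]) : 10000000;
    long s = 0;
    for (long i = 0; i < n; i++)
        s += (i % 3 == 0) ? i : -i;  /* branch with a stable bias */
    printf("%ld\n", s);
    return 0;
}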