On Mar 15, 12:14 am, Ivan Godard <i...@ootbcomp.com> wrote:
> On 3/14/2013 7:36 PM, Paul A. Clayton wrote:
[snip]
> Wow, you cite a ton of interesting ideas. Shame you weren't on our
> effort several years ago :-)
It was once proposed that I would be useful for
brainstorming.
> I'm going to ignore all the things related to pre-parsing and trace
> cache where the cache is not simply a repository of bytes. Those are
> interesting but require either being able to recognize instruction
> boundaries in an arbitrary byte sequence, or require post-decoration
> techniques where the trace is built as untraced raw instructions are
> parsed and executed. I've no argument that the methods work, but they
> seem mostly of value when you must deal with legacy codes, and I have
> little experience with that.
Fill-time predecode is a fairly common technique, even
for x86. With fixed-size (aligned) instructions
finding instruction boundaries is trivial and some VLEs
facilitate low overhead boundary determination (e.g.,
S/360's two-bit length indicator in a fixed position of
the first parcel or using one or two bits in each
parcel to indicate end of instruction). Even if the
line must be significantly decoded before cache fill,
this need not greatly hinder performance (some x86
implementations do this, storing a stop/continue bit
for each byte to ease later decoding).
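As a rough illustration (my own sketch, not any
particular implementation), fill-time predecode of an
S/360-style encoding is a single pass over the line,
since the two high bits of the first byte give the
length (00 -> 2 bytes, 01/10 -> 4 bytes, 11 -> 6 bytes):

    #include <stdint.h>
    #include <stddef.h>

    /* Mark the starting byte of each instruction in a
     * line at fill time.  Assumes start_bit[] is zeroed
     * and, for simplicity, ignores instructions that
     * straddle the line boundary. */
    void predecode_line(const uint8_t *line, size_t len,
                        uint8_t *start_bit)
    {
        size_t i = 0;
        while (i < len) {
            start_bit[i] = 1;            /* boundary marker     */
            unsigned ll = line[i] >> 6;  /* two-bit length code */
            i += (ll == 0) ? 2 : (ll == 3) ? 6 : 4;
        }
    }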
[snip]
> It's clear that the extra size is irrelevant in the absence of
> thrashing. Many programs spend all their time in tight loops that fit.
> The extra cache is an attack on Amdahl's law: if only 10% of execution
> is in code that thrashes, and thrashing causes a 10X loss of throughput,
> then getting out of thrashing mode doubles the overall performance. Of
> course, some programs will thrash in the doubled cache too, and some
> can't use it, so the doubling is only valuable in codes within a window
> of working set sizes. In my experience that window is populated enough
> that I care, YMMV.
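(Spelling out the arithmetic: normalize non-thrashing
execution time to 0.9 + 0.1 = 1.0; with the thrashing
10% running 10X slower, total time is 0.9 + 0.1*10 =
1.9, so eliminating the thrashing is a ~1.9X speedup.)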
I was under the impression that Icache-busting codes
tended to be much larger rather than just a little larger.
(I think the current Itaniums have a 512KiB L2 Icache--
even allowing for Itanium's low code density, that would
probably correspond to more than 128KiB for x86--and
PA-RISC [for which Itanium is a replacement] used huge
L1 Icaches.)
> As for why existing chips don't make the i$1 bigger, I appeal to Andy or
> Mitch to suggest an answer. My understanding is that by and large the
> cache is made at least 32K for thrash prevention, and as much bigger as
> possible as will not cause an extra pipe stage under the clock and stage
> height constraints dictated by other parts of the microarchitecture. I'm
> told that power is more determined by wayness and fetch width (i.e.
> decode width and/or bank width) than raw size, but size does have a cost.
With good branch predictors and instruction buffering,
going from a two cycle Icache to a three cycle Icache
might involve a small penalty (<5%). Unfortunately, one
would pay that penalty even for the codes that are
friendly to a smaller Icache and only benefit for the
codes benefiting from the larger Icache.
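(A back-of-the-envelope illustration with made-up but
plausible numbers: if buffering and run-ahead prediction
hide the extra cycle except on redirects, and an exposed
redirect occurs perhaps once per 20-30 instructions,
the added latency costs roughly one cycle per 20-30
instructions--a few percent of CPI, consistent with the
<5% guess.)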
[snip]
> Isn't this just an i$1 pulling from an i$2?
Separation of critical and less critical portions
in this way is not just an L2 cache. The index
and way would be the same (or possibly only slightly
different) and only one tag would be checked, at
least for a simple trace cache. This is effectively
a width-pipelined Icache: fetching chunk A from the
fast section and A+1 from the slow section, with both
fetches initiated at the same time.
(Obviously, in a non-trace cache the next block
might not be sequential, so it might act more like
an L2 that bypasses L1 (requiring tag checks and
using a different index) and a more complex
placement choice might be needed [or just accept
that some taken branches sacrifice performance]. As
noted, this is just NUCA with a recognition of
prefetchability--any access which can be prefetched
can tolerate higher access latency.)
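A sketch of the intent (hypothetical structures, sizes
made up; this is the trace-cache case where A+1 really
is the next chunk):

    #include <stdint.h>

    #define SETS        1024
    #define CHUNK_WORDS 4

    /* Two same-indexed halves: a near, low-latency array
     * and a far, higher-latency array.  Both reads start
     * in the same cycle; the fast half supplies chunk A
     * immediately, the slow half supplies chunk A+1 one
     * cycle later, when it is actually needed. */
    typedef struct {
        uint32_t fast[SETS][CHUNK_WORDS];
        uint32_t slow[SETS][CHUNK_WORDS];
    } split_icache;

    void fetch_pair(const split_icache *c, uint32_t index,
                    const uint32_t **chunk_a,   /* ready cycle t   */
                    const uint32_t **chunk_a1)  /* ready cycle t+1 */
    {
        *chunk_a  = c->fast[index % SETS];
        *chunk_a1 = c->slow[(index + 1) % SETS];
    }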
[snip]
> That's an approach we never considered, in part because we don't keep
> track of branches and cannot recognize a branch in pre-decode because of
> the short, exposed, pipeline. By the time we know it's a branch we are
> already executing it :-) We also can't use a BTB (or rather, don't need
> one) for the same reason. It's not clear to me what advantage a BTIC has
> over a BTB. Needs thought.
The main advantage of a BTIC over a BTB is that the
next few instructions are available earlier, avoiding
a bubble due to Icache access latency (in some
designs). Such a cache could also reduce power use
slightly (a fraction of instructions being fetched
from a small cache) or increase bandwidth (whether by
avoiding a bubble or by fetching from the Icache in
parallel). Instruction buffering (before and/or after
decode) and branch-prediction run-ahead would reduce
its usefulness.
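To make the contrast concrete (hypothetical entry
formats; field and entry sizes are made up):

    #include <stdint.h>

    /* A BTB entry gives only the target address; the
     * Icache must then be accessed before any target
     * instructions are available. */
    typedef struct {
        uint32_t tag;       /* branch PC tag            */
        uint32_t target;    /* predicted target address */
    } btb_entry;

    #define BTIC_INSNS 2    /* "next few"; size is a guess */

    /* A BTIC entry also caches the first instructions at
     * the target, so decode can proceed while the Icache
     * is still being read for target + BTIC_INSNS. */
    typedef struct {
        uint32_t tag;
        uint32_t target;
        uint32_t insn[BTIC_INSNS];
    } btic_entry;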
(I had proposed a function entry point cache behind
the L1 Icache with a similar latency-hiding motive,
under the assumption that function calls are major
discontinuities for fetch while sequential prefetch
works fairly well within a function.)
[snip NUCA with criticality placement informed by
prefetchability]
> But how do you determine which is which? Predicteds go in fast, and slow
> is used for post-miss fetch?
My proposal was to have the more predictable fetch
blocks go in the slow portion. Fetch redirects
would be biased toward the fast portion (thus kind
of like a BTIC). Even placement by address might
provide a modest improvement (half [or some fraction
of] the time a redirect might have no bubble). The
branch predictor would predict two blocks for fetching.
With criticality/prefetchability information, placement
might be optimized.
[snip]
> Don't understand - aren't your numbers backwards (i.e. bigger should be
> slower)?
The second portion would be the same size as the first
portion but placed more remotely (thus the square-root-
rule assumption--which I am guessing would not apply
for smaller wide caches, and even with larger caches
the layout cannot have extremely fine granularity).
If the slow portion was twice as large (which might be
useful since many instruction fetches might not be
timing critical--i.e., the address is available early),
one might imagine a layout like:
SSSS S -> "slow" chunk
SFFS F -> "fast" chunk
SFFS
or perhaps (with a three times larger slow portion):
SSSS
SSFFSS
SSFFSS
Note: I am not a hardware person; I am just guessing
based on a vague distance estimate.
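(The square-root rule here is just the observation
that, for a roughly square array, the distance a wire
must span grows with the side length, i.e., with the
square root of the capacity; e.g., quadrupling capacity
roughly doubles the wire-delay component of access
time.)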
[snip]
> Cutting the number of rows doesn't buy much unless you actually shut it
> off for leakage, and I don't want to think about turning it back on.
Turning off sections of cache can be quite reasonable.
This is effectively a specialization of the core to
fit the workload (and energy and performance goals).
The reconfiguration cost would be less than for a
migration to a more appropriate core (and the benefit
would likewise be less), and the area use would be less
than with more numerous, more specialized (and less
configurable) cores (and the interconnect should be
simpler).
> Cutting wayness (assuming parallel ways) would save some power. But
For instruction fetch, I think way prediction is
sufficiently accurate that parallel read might not be
necessary. However, since parallel read still appears
to be used (though it is hard to tell from public
information), way prediction is evidently not yet
good enough.
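For concreteness, a minimal sketch of way-predicted
fetch (structures and sizes are my own invention): only
the predicted way's data is read, the tag check confirms
the prediction, and a mispredict falls back to an
all-ways probe at the cost of an extra cycle.

    #include <stdint.h>
    #include <stdbool.h>

    #define WAYS 4
    #define SETS 256

    typedef struct {
        uint32_t tag[SETS][WAYS];
        uint32_t data[SETS][WAYS];  /* stand-in for a fetch chunk */
        uint8_t  pred_way[SETS];    /* predictor: last hit way    */
    } icache;

    bool fetch(icache *c, uint32_t addr, uint32_t *out)
    {
        uint32_t set = (addr >> 6) % SETS;  /* 64B lines */
        uint32_t tag = addr >> 14;
        uint8_t  w   = c->pred_way[set];

        if (c->tag[set][w] == tag) {        /* predicted way hits */
            *out = c->data[set][w];
            return true;
        }
        for (uint8_t i = 0; i < WAYS; i++) {  /* slow path */
            if (c->tag[set][i] == tag) {
                c->pred_way[set] = i;       /* train predictor */
                *out = c->data[set][i];
                return true;                /* hit, a cycle late */
            }
        }
        return false;                       /* miss: go to L2 */
    }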
[snip]
> That's an interesting notion. There was a bunch of academic work
> involving compiler placement of code for optimal cache behavior for
> various size/wayness/etc. They got decent numbers, but as far as I know
> commercial compilers don't do it, because the necessary trace info is
> rarely available and the targets vary too much.
It is kind of frustrating that software is less
aggressively optimized. For cache optimizations, it
probably does not help that there is diversity of
targets even in the same time period, though some
optimizations are "cache oblivious". (I dislike that
term, since it implies forgetting that caches exist
rather than forgetting the specifics of the cache
implementation, though I can understand the desire
to distinguish more general techniques from cache-aware
techniques that exploit details of a particular cache
design.)
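For what the term usually denotes, a minimal example of
my own: a recursively blocked transpose that gets
reasonable locality at every level of the hierarchy
while knowing nothing about any particular cache
(contrast a cache-aware version tuned to a specific
line size or capacity):

    /* Out-of-place transpose of an n x n row-major
     * matrix: call transpose(a, b, n, 0, n, 0, n). */
    void transpose(const double *a, double *b, int n,
                   int r0, int r1, int c0, int c1)
    {
        if (r1 - r0 <= 16 && c1 - c0 <= 16) {
            for (int i = r0; i < r1; i++)      /* small base case */
                for (int j = c0; j < c1; j++)
                    b[j * n + i] = a[i * n + j];
        } else if (r1 - r0 >= c1 - c0) {       /* split longer axis */
            int rm = (r0 + r1) / 2;
            transpose(a, b, n, r0, rm, c0, c1);
            transpose(a, b, n, rm, r1, c0, c1);
        } else {
            int cm = (c0 + c1) / 2;
            transpose(a, b, n, r0, r1, c0, cm);
            transpose(a, b, n, r0, r1, cm, c1);
        }
    }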
[snip]
> Complex hell - downright weird, as weird as a Mill :-) The drawback of
> splitting (unless you are using a pre-decoded trace cache) is that
> either your fetch granularity must be smaller, or your execution average
> granularity (basic block length) must be bigger, or internal
> fragmentation will kill you.
I agree--complex, weird, and constraining. (I was
thinking smaller fetch granularity.)
[snip OoO rename/issue]
> I can see where this might be useful if you have register renaming; we
> don't, nor have registers to rename.
Hmm. Is the Mill a transport-triggered architecture?
(Even a memory-memory architecture like AT&T's CRISP
can have "registers". CRISP used a stack cache for
"registers". One paper on the development of CRISP
suggested the possibility of a global/TLS cache,
which would have made it a bit like Itanium [or
SPARC or AMD 29k {I think}].)
[snip partially merged Icache/BTB]
> That might be useful if you have a BTB. What did Andy think?
He thought it could make sense with fixed-length
instructions--not surprising, given that an Alpha
architect would be considering fixed-length
instructions. (I think the idea could be made
to work with a less classic RISC ISA, but the
complexity would be greater [and so it would be
less worthwhile, which might mean a negative value].)
[snip Icache bypass]
> Simpler to just double the cache :-) Can you find a cite or two on this?
It seems most of the work is on bypassing the data
cache (e.g., one PA-RISC implementation used a
per-page classification of non-temporal to avoid
allocation in the huge off-chip L1, providing a
modest L0 [assist?] cache). So far I have not
found any references to bypassing an Icache
(except for prefetching), but I am almost certain
that I read the idea somewhere (though I could be
misremembering and simply applying an optimization
for data accesses to instruction accesses).
(At least one StrongARM implementation had a two-way
associative "non-temporal" data cache which
likewise used page annotations to determine
allocation.)
Retaining reuse information for a sufficient
period (with just hardware) might be difficult.
If (single) reuse distance for a cache block is
too great for the capacity of the Icache, it is
perhaps likely to be too great for the L2 cache
capacity (though some extra tracking might be
useful for better replacement choices).
Page-granular marking of non-reuse might be
problematic, though allocating such blocks to a small,
low-associativity cache might be practical. However,
that might amount to little more than a way (or
way-group) preference indicator and not be
particularly useful.
[snip]
> The decode paper for CDES goes in Monday, and there's an informal
> presentation next month and another at a different venue in May.
> Somewhere along then I'll splash it here and you guys can comment on our
> baby. Better say good things - it is self evidently the most scrumptious
> baby that has ever been :-)
Rotten eggs and tomatoes do not work very well
over the Internet. :-)
Even if criticism is harsh, I suspect that the
readers here will at least find such interesting
(based on what you have stated already).