On 12/16/2012 7:09 PM, Tacit wrote:
> Andy (Super) Glew:
>> On 12/16/2012 10:28 AM, Tacit wrote:
>>
>> Some ops are fused when emitted from decoders (thus not counting against
>> things like the Intel 2111 decoder template).
>
> Intel never had such template. They have now 4111 or 4+. All of the
> mops are fused by decoder.
Intel has evaluated many, many, different decoder templates.
444
422
411
333
311
As for what actually gets shipped ... (1) memory gets hazy, (2) even if
memory isn't hazy, I can't say unless it is public, (3) I'm too lazy to
go and hunt down a reference to exactly what is public, (4) and exactly
what is public is often no longer an accurate description, in the
original sense of decoder template.
By the way, I may well have been the guy who created the term "decoder
template", as a shorthand for describing them to compiler writers.
Although I had nothing to do with how the decoder actually did this
- I think AMD's K7 decoder approach was better. Oh, and I *am* the guy
who invented the concept of separate store-address and store-data uops,
or at least who brought that concept to Intel, without further diffident
waffling language.
BTW^2, you know that the 411 or whatever template does not mean that all
single uop instructions get handled? Or, for that matter, that all
instructions of 4 or less uops get handled in the 4 slot? It's more of
a guideline, really. (Actually, it's more of an upper bound - there may
be no more wires than that, but you can always drop instructions from
the decoder "PLA"s.)
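To make the "upper bound" reading of a template concrete, here is a toy sketch in Python. Everything in it is an assumption for illustration, not a description of any shipped decoder: the function name decode_cycles, strictly in-order decode, an instruction that does not fit its slot ending the cycle and retrying in slot 0, and a flat one-cycle charge for anything too big for slot 0.

```python
def decode_cycles(uop_counts, template=(4, 1, 1)):
    """Count decode cycles for an instruction stream under a decoder template.

    uop_counts: uops per instruction, in program order.
    template:   maximum uops each decoder slot can emit; slot 0 comes first.

    Toy assumptions (not taken from any real design):
      - decode is strictly in order;
      - an instruction that does not fit its slot ends the cycle and is
        retried in slot 0 on the next cycle;
      - an instruction too big even for slot 0 is charged one cycle on its
        own, as a stand-in for a microcode path whose real cost is not
        modelled here.
    """
    cycles = 0
    i = 0
    n = len(uop_counts)
    while i < n:
        cycles += 1
        if uop_counts[i] > template[0]:
            i += 1  # hypothetical: handled alone, outside the template
            continue
        for width in template:
            if i < n and uop_counts[i] <= width:
                i += 1  # this slot accepts the instruction
            else:
                break   # slot too narrow (or stream exhausted): end the cycle
    return cycles
```

For instance, decode_cycles([1, 4, 1]) takes two cycles under 411 because the 4-uop instruction has to wait for slot 0, while decode_cycles([4, 1, 1]) takes one. A real decoder can also refuse instructions that nominally fit (length, immediates, number of inputs), which this sketch ignores.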
For example: one of the big motivations for load-alu fusion is so that
instructions like ADD_reg+=mem can be emitted in the 111 slots. But
this can be a challenge to fit, in terms of numbers of inputs and
immediates. Plus, often long instructions can only be fit into the
first slot.
For example: one of the big reasons to do store-address/data fusion is
to allow stores, MOV_mem:=reg, to be fit in the 111 slots.
But if you do store-address/data fusion, then the most common CISCy
instructions, instructions that read-modify-write memory, now are 3 uops
rather than 4 uops:
P6-style, with separate store-address and store-data:
    ADD mem += reg
        tmp := load( mem )
        tmp := tmp + reg
        store-address( mem )
        store-data( tmp )
Store-address and store-data fused at the decoder:
    ADD mem += reg
        tmp := load( mem )
        tmp := tmp + reg
        store( mem, tmp )
As usual, fused store-address+data is a bit of a challenge - worst case
3 reg inputs + immediate, or 2 reg inputs + 2 immediates. Often fused
store-address+data can only handle some cases, not the most general case
of store. But if you can handle the most general case, then it is
tempting to make what is traditionally the "4" decoder slot into a "3"
decoder slot. I.e. able to emit only 3 uops. Rather than 4.
Does Intel do this in current processors? Frankly, I don't know (and I
don't really care). I would be surprised if they had not considered it.
I would not be surprised if there were some other constraint, such
as not supporting fused store-address+data for all operand combinations,
and not wanting to spill to the slower decoder paths for the more
complicated addressing modes.
I would also not be surprised if the hardware were actually emitting 3
fused uops from the first decoder slot, but that marketing insisted on
calling it a 4111 decoder rather than a 3111 decoder, because 3 might
look bad. Or because they did not want to give away what they are doing in
the microarchitecture. Possibly for fear of patent litigation.
Possibly Agner's tests can distinguish these cases. However, since I
know of several cases where Agner reported incorrect configurations, I
would be surprised if he is accurate in all cases. I am sure that some
patent lawyers would love it if Agner is accurate in all cases.
And then if you go further, to
ADD mem += reg
tmp := load-and-store-address( mem )
tmp := store-data-and-alu( tmp + reg )
you can make the most powerful decoder slot "only" 2 uops wide.
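As a small illustration of how the fusion choice sets the width needed in the widest decoder slot, the three expansions of ADD mem += reg discussed here can be tabulated in Python. The scheme names and the helper slot0_width_needed are made up for this sketch; the uop lists simply restate the sequences given in this post.

```python
# uop sequences for "ADD mem += reg" under the three schemes discussed in
# this post. The dictionary keys are hypothetical labels, not Intel or AMD
# terminology.
RMW_EXPANSIONS = {
    "p6_split_store": ["load", "alu", "store-address", "store-data"],
    "fused_store": ["load", "alu", "store"],
    "fused_load_sta_and_std_alu": ["load+store-address", "store-data+alu"],
}


def slot0_width_needed(expansions):
    """Return, per scheme, how many uops slot 0 must emit for this one instruction."""
    return {scheme: len(uops) for scheme, uops in expansions.items()}


if __name__ == "__main__":
    for scheme, width in slot0_width_needed(RMW_EXPANSIONS).items():
        print(f"{scheme}: {width} uops")
```

This only counts uops for the one read-modify-write instruction; it says nothing about whether a given fused uop fits its input and immediate limits.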
But... you don't have to fit just reg+=mem and mem+=reg into decoder 0.
You also need to fit instructions like add-with-carry-to-memory,
and CALL, and ... Some of which you can reduce the uop count for,
some of which are more challenging.
So hence the idea of m111 - 4111, 3111, 2111 - with m+ (4+, etc.) for
instructions that you don't want to penalize too heavily by going all
the way to the microcode engine, but which don't fit in your m-uop
decoder 0.
>> Sometimes these fused ops are unfused before being placed in the
>> scheduler / reservation stations / whatever.
>
> Both Intel's and AMD's CPU keep mops fused before back-end and in
> dispatcher (including ROB/RQ) and unfuse on issuing in scheduler(s)
> and/or MAU/LSU.
>
>> Sometimes these remain fused in the scheduler, but are fired twice,
>> being emitted as two separate execution ops by the scheduler.
>
> Which CPU use this?
You may not have understood my terminology:
Your "unfuse on issuing in scheduler"
Is approximately my "fired twice"
If, say, you have a fused store-address+data uop sitting in your
scheduler, then, if you can start, say, the store-address part before
the store-data arrives, I call that a first firing of the scheduler.
And then when the store-data arrives, you emit that in a second firing
of the scheduler.
It would be quite suboptimal to fire only once, but to different pipelines.
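A minimal sketch of the difference, in Python, assuming a single fused store-address+data entry whose two halves become ready at different cycles. The function names fire_twice and fire_once and the cycle numbers are illustrative only, not a model of any particular scheduler.

```python
def fire_twice(addr_ready, data_ready):
    """Fused entry that fires each half as soon as that half's inputs are ready.

    Returns a list of (cycle, part) firings in time order.
    """
    return sorted([(addr_ready, "store-address"), (data_ready, "store-data")])


def fire_once(addr_ready, data_ready):
    """Fused entry that waits for all inputs, then sends both halves together."""
    cycle = max(addr_ready, data_ready)
    return [(cycle, "store-address"), (cycle, "store-data")]


if __name__ == "__main__":
    # Hypothetical timing: address operands ready at cycle 3, data at cycle 10.
    print(fire_twice(3, 10))
    print(fire_once(3, 10))
```

With the address ready at cycle 3 and the data at cycle 10, the double-firing entry sends the store-address at cycle 3, while the single-firing entry holds it until cycle 10.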
By the way: for many years, since the original P6, you don't "issue in
the scheduler". You "dispatch from the scheduler". This differs from
the terminology at many other companies, where dispatch is from ifetch
to the scheduler, and issue is after scheduler into execution. I still
remember when MAF and AB deliberately switched the terms to be
different from other companies - I flamed them for making a gratuitous
distortion of relatively standard terminology.
I would not be surprised if the term "issue from scheduler" is
creeping back into Intel: Intel has hired so many people from other
computer companies like DEC and Sun and AMD and IBM. (When I am feeling
snarky, I say "failed computer companies"). Atom, in particular, has
a lot of ex-DECcies.
>> And then sometimes they are emitted by the scheduler, and then flow
>> through different parts of the pipeline.
>
> How can a scheduler emit a new mop? It doesn't receive original
> instruction to do it.
I trust that you understand what I have explained above.
>> Sometimes in parallel, sometimes sequential (load-op or load-op-store).
>
> That's loads (where source is in the memory), but I was talking about
> stores (source is the register) and op-store (source is op result).
Sure.
op-store is not that common. Never was, even in the i486 generation.
And is not so much any more, after so many years of -op- and load-op
being optimized for.
I don't know the relative frequencies, but I would be surprised if
op-store is more common than load-op-store. And any pipeline that can
fit op-store as a fused uop, can also fuse load-op-store --- with only a
slight application of intelligence.
>> Plus then there is fusion of separate instructions into fewer fused ops.
>
> Yes, that's what I mean above. Accesses should always be fused,
> because CPU need to generate the address. However, officially mop is
> considered as fused only if there is data computation associated with
> memory access. MOV [m],r1 is: 1) read RF or bypass for AG-registers,
> make AG, transfer address to store Q; 2) read RF or bypass for
> (renamed) r1, transfer it to store Q. But Intel doesn't consider it as
> fused…
Official, shmofficial.
Fused store-address-data happened first. It was called fused
store-stores inside Intel forever. Heck, I called it fused
store-address-data BEFORE I went to Intel. Or maybe it was
split-store-address-data. Fusing, splitting, unfusing, unsplitting, it
all depends on your point of view.
What marketing calls it is another issue.
Then load-op fusion.
Then it really gets interesting when you fuse between separate
instructions. Adjacent first, and then not. I think this is now the
Pedantically, I would prefer to say that you "fuse" only when two
independent things are combined. Like fusing two separate instructions.
If the "sub-operations" are combined in the instruction, and then
emitted combined, then you aren't really fusing them - you are just not
splitting them apart at that pipestage - although you may split them
apart later.
But, unsplitting and unfusing is often the same: double firing from the
scheduler, or stuttering and placing into two separate scheduler and
ROB entries.
>
>> But it occurs to me that perhaps you meant "mops" == "memory ops", as
>> opposed to "micro ops" or "macro ops".
>
> No, that was micro operations :)
Intel terminology is uop. Where u should be mu, for micro.
>
>> Anyway: I like to use qualifiers like decoder-ops
>
> Yes, there are several types of them and only one name. AMD used to
> have ROPs (RISC ops) in K5 (and K6?) and now — macro-ops.
AMD had ROPs and COPs. COPs and ROPpers, as I like to say.