Memory store — mop or not?

Tacit

Dec 16, 2012, 1:28:26 PM

As we all know, most CPUs now use fused mops that are unfused before
being stored in the reservation station (in-order schedulers should
behave the same). Usually they contain a data operation and a memory
access for one operand, which is further separated into address
generation and the access itself. Suppose we need to execute a
«MOV [mem], reg» instruction, so there's no data computation. Various
x86 CPUs do that very differently (I'm using Agner Fog's data and the
Optimisation Manuals; Fog's data can be erroneous!):

1. K7 & K8: for GPRs generate only AGU mop, but no store mop; for
vectors (aligned — here and further) there is FMISC (FSTORE) port
usage, no AGU mop indicated.

2. K10: differs in that it gives 2 mops (FMUL & FMISC) for vector
stores (not sure if that's correct).

3. Bulldozer & Piledriver: for GPRs, 1 fused mop activates any of the
ALU ports and any of the AGUs; for vectors, the P3 port gets the store
mop (FPSTO).

4. Bobcat: GPRs — same as K7-K10; vectors — use FP1 port for stores.
AGU mentioned only for loads.

5. Intel CPUs (except P4s and Atom) & VIA Nano: all stores use Port 3
(SA for Nano) for store address generation and Port 4 (ST) for store
data.

6. Atom: GPR stores need 1 fused mop that goes to the ALU0 and Mem
ports; vectors only need the Mem port.

Obviously, it's impossible to store without an AGU, so 1 unfused mop per
simple store must use an AGU (even where none is indicated). In practice,
there are 3 approaches:

A. Use dedicated port for all store mops.

B. Select ONE of the computational ports for store mops.

C. Select ANY of the computational ports for store mops (but only 1
store/clock).

And here's the question: suppose we're designing a new architecture.
Which one of these options is better for GPRs and vectors, considering
port usage conflicts and implementation complexity?
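To make the trade-off between options A, B and C concrete, here is a toy Python throughput model. It is only a sketch under strong assumptions that are not in the post above: every compute mop is pinned to one port, there are no dependencies, only the store mop itself is counted (the AGU side is ignored), and all function names are made up for illustration.

```python
def cycles_dedicated(alu_per_port, stores):
    """Option A: stores use their own extra port, 1 store per clock."""
    return max(max(alu_per_port), stores)


def cycles_one_shared(alu_per_port, stores, store_port=0):
    """Option B: all stores compete with the mops pinned to ONE compute port."""
    load = list(alu_per_port)
    load[store_port] += stores
    return max(load)


def cycles_any_shared(alu_per_port, stores):
    """Option C: a store may take a free slot on ANY port, 1 store per clock."""
    # Lower bound: the busiest port, and one clock per store.
    t = max(max(alu_per_port), stores)
    # Grow t until the free slots across all ports can hold every store.
    while sum(t - n for n in alu_per_port) < stores:
        t += 1
    return t


if __name__ == "__main__":
    mix = [10, 2, 2]  # hypothetical: 10 mops pinned to port 0, 2 each to ports 1 and 2
    print(cycles_dedicated(mix, 6), cycles_one_shared(mix, 6), cycles_any_shared(mix, 6))
```

With this made-up mix, option B on the already busy port 0 needs 16 clocks, while A and C both need 10. A pays for that with an extra port, and C with a more complex scheduler.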

Andy (Super) Glew

Dec 16, 2012, 4:08:45 PM

On 12/16/2012 10:28 AM, Tacit wrote:
> As we all know, most CPUs now use fused mops that are unfused before
> being stored in the reservation station (in-order schedulers should
> behave the same).

Actually, there is more variety:

Some ops are fused when emitted from decoders (thus not counting against
things like the Intel 2111 decoder template).

Sometimes these fused ops are unfused before being placed in the
scheduler / reservation stations / whatever.

Sometimes these remain fused in the scheduler, but are fired twice,
being emitted as two separate execution ops by the scheduler.

And then sometimes they are emitted by the scheduler, and then flow
through different parts of the pipeline. Sometimes in parallel,
sometimes sequential (load-op or load-op-store).

Plus then there is fusion of separate instructions into fewer fused ops.

But it occurs to me that perhaps you meant "mops" == "memory ops", as
opposed to "micro ops" or "macro ops". In which case, well, my comment
still applies - although I would amplify this to say that
memory-ops are not the only ops that can benefit from fusion.

Anyway: I like to use qualifiers like
decoder-ops

> Usually they contain a data operation and a memory access for one
> operand, which is further separated into address generation and the
> access itself. Suppose we need to execute a «MOV [mem], reg»
> instruction, so there's no data computation. Various x86 CPUs do that
> very differently (I'm using Agner Fog's data and the Optimisation
> Manuals; Fog's data can be erroneous!):

Yes, quite often. But it is still great public info.

>
> 1. K7 & K8: for GPRs generate only AGU mop, but no store mop; for
> vectors (aligned — here and further) there is FMISC (FSTORE) port
> usage, no AGU mop indicated.
>
> 2. K10: differs in that it gives 2 mops (FMUL & FMISC) for vector
> stores (not sure if that's correct).
>
> 3. Bulldozer & Piledriver: for GPRs, 1 fused mop activates any of the
> ALU ports and any of the AGUs; for vectors, the P3 port gets the store
> mop (FPSTO).
>
> 4. Bobcat: GPRs — same as K7-K10; vectors — use FP1 port for stores.
> AGU mentioned only for loads.
>
> 5. Intel CPUs (except P4s and Atom) & VIA Nano: all stores use Port 3
> (SA for Nano) for store address generation and Port 4 (ST) for store
> data.
>
> 6. Atom: GPR stores need 1 fused mop that goes to the ALU0 and Mem
> ports; vectors only need the Mem port.
>
> Obviously, it's impossible to store without an AGU, so 1 unfused mop per
> simple store must use an AGU (even where none is indicated). In practice,
> there are 3 approaches:
>
> A. Use dedicated port for all store mops.
>
> B. Select ONE of the computational ports for store mops.
>
> C. Select ANY of the computational ports for store mops (but only 1
> store/clock).
>
> And here's the question: suppose we're designing a new architecture.
> Which one of these options is better for GPRs and vectors, considering
> port usage conflicts and implementation complexity?

It all depends...

* how many overall ports do you have (e.g. if you are one wide, period,
no options)

* do you have scatter/gather addressing
** if not, then a separate scalar AGU port for load and store addresses
is fine
** but if you have scatter/gather, especially address vectors, then it
becomes tempting to use a vector compute port for address generation.

* what is your scheduler structure - how many read ports, how many
operands...

By the way, I quite like the trick of compute-and-store-data.

If you are from the Intel camp, you tend to think of stores as being
store-address and store-data:

STORE M[basereg+indexreg*scale+imm] := srcreg
store-address StoreAddressBuffer[#] := basereg+indexreg*scale+imm
store-data StoreDataBuffer[#] := srcreg

(By the way, notice that the store-address and store-data "ops" only put
data in the store buffer. There is really a final op, at store commit,
after retirement, that does the true store:
Memory[StoreAddressBuffer[#]]:=StoreDataBuffer[#]. It's just implicit,
not explicit.)
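A minimal Python sketch of that split, for illustration only. The class and method names are hypothetical, memory is modeled as a plain dict, and a real store buffer would also handle ordering, forwarding and faults, all of which are ignored here.

```python
class StoreBuffer:
    """Toy store buffer: address and data arrive separately, commit is a third step."""

    def __init__(self):
        self.addr = {}  # entry number -> computed address
        self.data = {}  # entry number -> value to store

    def store_address(self, entry, base, index=0, scale=1, imm=0):
        # store-address op: StoreAddressBuffer[#] := basereg+indexreg*scale+imm
        self.addr[entry] = base + index * scale + imm

    def store_data(self, entry, value):
        # store-data op: StoreDataBuffer[#] := srcreg
        self.data[entry] = value

    def commit(self, entry, memory):
        # Implicit final step after retirement:
        # Memory[StoreAddressBuffer[#]] := StoreDataBuffer[#]
        memory[self.addr.pop(entry)] = self.data.pop(entry)


if __name__ == "__main__":
    memory = {}
    sb = StoreBuffer()
    sb.store_data(0, 42)                                     # data may arrive first
    sb.store_address(0, base=0x1000, index=2, scale=8, imm=4)
    sb.commit(0, memory)                                     # only now is memory written
    print(memory)
```

Neither op touches memory on its own, and the two halves can complete in either order. Only the commit step makes the store visible.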

But the store-data part so often only needs a single operand. Which is
a waste of resources, if you have a scheduler that supports multiple
operands. Like the original P6 CAM-based RS, but not necessarily for
bitmap schedulers.

Store-datas also waste an RF read port if dispatched to a computational
port that has two dedicated read ports. Or 3, or ....

Perhaps what you need are ports dedicated not to computational/store-data,
but ports dedicated to 2-input, 3-input, or 1-input ops.

Examples of 1-input:
* store-data
* NEG, NOT
* ALU with constant operands

Or - just treat RF read ports separately, not tightly coupled to
computational ports. You get most operands off the bypass.

But getting back to my original point - I quite like fusing store-data
with the computational operation that produces the value. Makes better use
of resources, if there are any that are specific to 2-operand.

This is easy to do on x86 CISC instructions

ADD to MEMORY M[basereg+indexreg*scale+imm] += src
tmp :=ld&st-addr StoreAddressBuffer[#] := basereg+indexreg*scale+imm
StoreDataBuffer[#] := tmp + src

It requires fusion between separate instructions on a RISC.

Requires only 1 load/store buffer entry - if you have a unified LSQ as
many companies do, but that doesn't help if, like Intel's P6, you have
separate LB and SBs.

Then, of course, we think about writing a register destination,
either returning the old value tmp for an RMW, or the new value tmp+src.

--
The content of this message is my personal opinion only. Although I am
an employee (currently of MIPS Technologies; in the past of companies
such as Intellectual Ventures and QIPS, Intel, AMD, Motorola, and
Gould), I reveal this only so that the reader may account for any
possible bias I may have towards my employer's products. The statements
I make here in no way represent my employers' positions on the issue,
nor am I authorized to speak on behalf of my employers, past or present.

Tacit

Dec 16, 2012, 10:09:59 PM

Andy (Super) Glew:

> On 12/16/2012 10:28 AM, Tacit wrote:
>
> Some ops are fused when emitted from decoders (thus not counting against
> things like the Intel 2111 decoder template).

Intel never had such a template. They now have 4111 or 4+. All of the
mops are fused by the decoder.

> Sometimes these fused ops are unfused before being placed in the
> scheduler / reservation stations / whatever.

Both Intel's and AMD's CPUs keep mops fused before the back-end and in
the dispatcher (including ROB/RQ) and unfuse them on issue in the
scheduler(s) and/or MAU/LSU.

> Sometimes these remain fused in the scheduler, but are fired twice,
> being emitted as two separate execution ops by the scheduler.

Which CPUs use this?

> And then sometimes they are emitted by the scheduler, and then flow
> through different parts of the pipeline.

How can a scheduler emit a new mop? It doesn't receive the original
instruction to do it.

> Sometimes in parallel, sometimes sequential (load-op or load-op-store).

That's loads (where source is in the memory), but I was talking about
stores (source is the register) and op-store (source is op result).

> Plus then there is fusion of separate instructions into fewer fused ops.

Yes, that's what I mean above. Accesses should always be fused,
because the CPU needs to generate the address. However, officially a mop
is considered fused only if there is data computation associated with
the memory access. MOV [m],r1 is: 1) read RF or bypass for AG-registers,
make AG, transfer address to store Q; 2) read RF or bypass for
(renamed) r1, transfer it to store Q. But Intel doesn't consider it
fused…

> But it occurs to me that perhaps you meant "mops" == "memory ops", as
> opposed to "micro ops" or "macro ops".

No, that was micro operations :)

> Anyway: I like to use qualifiers like decoder-ops

Yes, there are several types of them and only one name. AMD used to
have ROPs (RISC ops) in K5 (and K6?) and now — macro-ops.

Tacit

Dec 16, 2012, 10:11:13 PM

Andy (Super) Glew:

>
> It all depends...
>
> * how many overall ports do you have (e.g. if you are one wide, period,
> no options)

Say 4. But Intel dares to make 8 ports for Haswell :)

> * do you have scatter/gather addressing
> ** if not, then a separate scalar AGU port for load and store addresses
> is fine

Agreed. All modern CPUs do so, not counting K7…K12 :)

> ** but if you have scatter/gather, especially address vectors, then it
> becomes tempting to use a vector compute port for address generation.

I disagree. Vector address computation can be as easy as adding a scalar
base to an offset vector (i.e. broadcasting the scalar to a vector and
adding it to another vector), and that's the ALU's job. No need for a
separate ALU (AGU) for vector addresses.
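The broadcast-and-add idea can be sketched in a few lines of Python. The function name and the optional scale factor are assumptions for illustration, and lists stand in for vector registers.

```python
def vector_addresses(base, offsets, scale=1):
    """Gather/scatter address generation as plain vector ALU work."""
    # Step 1: broadcast the scalar base into a vector of the same length.
    base_vec = [base] * len(offsets)
    # Step 2: element-wise add of the (optionally scaled) offset vector.
    return [b + off * scale for b, off in zip(base_vec, offsets)]


if __name__ == "__main__":
    print([hex(a) for a in vector_addresses(0x1000, [0, 4, 16], scale=4)])
```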

> * what is your scheduler structure - how many read ports, how many
> operands...

Suppose it's a «tagging scheduler» with a PRF (physical RF) as in Intel's
Bridges and Bulldozers. I don't think anything else here matters. I
wonder why AMD has such different scalar and vector schedulers, both
in Athlons and Bulldozers?…

> If you are from the Intel camp, you tend to think of stores as being
> store-address and store-data:
>
> STORE M[basereg+indexreg*scale+imm] := srcreg
> store-address StoreAddressBuffer[#] := basereg+indexreg*scale+imm
> store-data StoreDataBuffer[#] := srcreg

Aren't StoreAddressBuffer[N] and StoreDataBuffer[N] in the same
Store Buffer entry N?

> There is really a final op, at store commit, after retirement, that does the true store:
> Memory[StoreAddressBuffer[#]]:=StoreDataBuffer[#]. It's just implicit,
> not explicit.)

Well, if we have OoO-MA (memory access), there is a possibility
for stores to bypass loads and (sometimes) other stores. But that
requires 2-way sync: MAU to dispatcher «I can store this data now
with no address conflicts or access violations» and dispatcher to MAU
«I can retire the originating instruction now». But that's getting more
complicated and off topic :)

> But the store-data part so often only needs a single operand. Which is
> a waste of resources, if you have a scheduler that supports multiple
> operands.

Yes, that's a side-effect. To avoid that, there can be a separate
scheduler for mops that require only 1 operand (like store), and 1 port
may be enough for it. But is it worth it? After K7-12 the only modern
CPUs that use non-unified (1-port) schedulers are ARM Cortexes.

> Store-datas also waste an RF read port if dispatched to a computational
> port that has two dedicated read ports. Or 3, or ....

No, instead it won't use the write port, because we're writing the
result to memory, not to the RF. But MOV is even simpler: read from the
RF or bypass to the store queue of the MAU.

> Perhaps what you need are ports dedicated not to computational/store-data,
> but ports dedicated to 2-input, 3-input, or 1-input ops.

That hurts if there is an imbalance in the program. BD's unified
scheduler for GPRs has 40 mops and 4 ports and in the extreme case can
hold 40 mops of the same type (e.g. MUL) that can only go to port 0.
A segmented scheduler of the same size will have four 10-mop mini-RS's
instead and can only hold 10 mops of a given type. Too cruel :)
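A toy Python illustration of that capacity argument. It is a sketch only: the function name is hypothetical, allocation is assumed to be in program order, and nothing issues during the experiment, so it measures pure buffering capacity rather than real throughput.

```python
def accepted_before_stall(op_ports, total_entries=40, num_ports=4, unified=True):
    """Count mops (listed by target port, in program order) accepted before a stall."""
    per_segment = total_entries // num_ports
    used = [0] * num_ports
    accepted = 0
    for port in op_ports:
        if unified:
            if accepted == total_entries:
                break  # the whole shared pool is full
        elif used[port] == per_segment:
            break  # this port's mini-RS is full, in-order allocation stalls
        used[port] += 1
        accepted += 1
    return accepted


if __name__ == "__main__":
    muls = [0] * 40  # 40 mops that can only go to port 0
    print(accepted_before_stall(muls, unified=True))
    print(accepted_before_stall(muls, unified=False))
```

For a perfectly balanced stream both organisations hold the same number of mops. The difference shows up only for skewed mixes like the all-MUL case.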

> Or - just treat RF read ports separately, not tightly coupled to
> computational ports. You get most operands off the bypass.

That's reads. Of course, in the case of «OP r1,r2,r3 + MOV [m],r1» the
result can be written to r1 and bypassed to the store in the same clock.
Read ports are irrelevant here.

> But getting back to my original point - I quite like fusing store-data
> with the computational operation that produces the value. Makes better use
> of resources, if there are any that are specific to 2-operand.

Yes, but I don't like load-op-store: these generate 2 accesses per
instruction, which complicates the hardware a lot (atomics, etc…).

> Requires only 1 load/store buffer entry - if you have a unified LSQ as
> many companies do, but that doesn't help if, like Intel's P6, you have
> separate LB and SBs.

AMD also has SQ and LQ in the LSU. How about ARM?…

Quadibloc

Dec 16, 2012, 10:24:25 PM

On Dec 16, 8:11 pm, Tacit <tacit.mu...@gmail.com> wrote:

> I disagree. Vector address computation can be as easy as adding a scalar
> base to an offset vector (i.e. broadcasting the scalar to a vector and
> adding it to another vector), and that's the ALU's job. No need for a
> separate ALU (AGU) for vector addresses.

I'm not sure it was what he was talking about, but if you use a
different ALU for address generation and arithmetic, your pipeline
becomes simpler.

John Savard

Andy (Super) Glew

Dec 17, 2012, 12:15:57 AM

On 12/16/2012 7:09 PM, Tacit wrote:
> Andy (Super) Glew:
>> On 12/16/2012 10:28 AM, Tacit wrote:
>>
>> Some ops are fused when emitted from decoders (thus not counting against
>> things like the Intel 2111 decoder template).
>
> Intel never had such a template. They now have 4111 or 4+. All of the
> mops are fused by the decoder.

Intel has evaluated many, many, different decoder templates.

444
422
411
333
311

As for what actually gets shipped ... (1) memory gets hazy, (2) even if
memory isn't hazy, I can't say unless it is public, (3) I'm too lazy to
go and hunt down a reference to exactly what is public, (4) and exactly
what is public is often no longer an accurate description, in the
original sense of decoder template

By the way, I may well have been the guy who created the term "decoder
template", as a shorthand for describing them to compiler writers.
Although I had nothing to do with how the decoder actually did this
- I think AMD's K7 decoder approach was better. Oh, and I *am* the guy
who invented the concept of separate store-address and store-data uops,
or at least who brought that concept to Intel, without further diffident
waffling language.

BTW^2, you know that the 411 or whatever template does not mean that all
single uop instructions get handled? Or, for that matter, that all
instructions of 4 or less uops get handled in the 4 slot? It's more of
a guideline, really. (Actually, it's more of an upper bound - there may
be no more wires than that, but you can always drop instructions from
the decoder "PLA"s.)

For example: one of the big motivations for load-alu fusion is so that
instructions like ADD_reg+=mem can be emitted in the 111 slots. But
this can be a challenge to fit, in terms of numbers of inputs and
immediates. Plus, often long instructions can only be fit into the
first slot.

For example: one of the big reasons to do store-address/data fusion is
to allow stores, MOV_mem:=reg, to be fit in the 111 slots.

But if you do store-address/data fusion, then the most common CISCy
instructions, instructions that read-modify-write memory, now are 3 uops
rather than 4 uops:

P6-style, with separate store address and store data
ADD mem += reg
tmp := load( mem )
tmp := tmp + reg
store-address( mem )
store-data( tmp )

Store-address and store-data fused at the decoder:
ADD mem += reg
tmp := load( mem )
tmp := tmp + reg
store( mem, tmp )

As usual, fused store-address+data is a bit of a challenge - worst case
3 reg inputs + immediate, or 2 reg inputs + 2 immediates. Often fused
store-address+data can only handle some cases, not the most general case
of store. But if you can handle the most general case, then it is
tempting to make what is traditionally the "4" decoder slot into a "3"
decoder slot. I.e. able to emit only 3 uops. Rather than 4.

Does Intel do this in current processors? Frankly, I don't know (and I
don't really care). I would be surprised if they had not considered it.
I would not be surprised if there were not some other constraint, such
as not supporting fused store-address+data for all operand combinations.
And not wanting to spill to the slower decoder paths if you had the
more complicated addressing modes.

I would also not be surprised if the hardware were actually emitting 3
fused uops from the first decoder slot, but that marketing insisted on
calling it a 4111 decoder rather than a 3111 decoder, because 3 might
look bad. Or because they did not want to give away what they are doing in
the microarchitecture. Possibly for fear of patent litigation.

Possibly Agner's tests can distinguish these cases. However, since I
know of several cases where Agner reported incorrect configurations, I
would be surprised if he is accurate in all cases. I am sure that some
patent lawyers would love it if Agner is accurate in all cases.

And then if you go further, to
ADD mem += reg
tmp := load-and-store-address( mem )
tmp := store-data-and-alu( tmp + reg )

you can make the most powerful decoder slot "only" 2 uops wide.
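The three decompositions of ADD mem += reg can be lined up in a small Python sketch. This is an illustration only: memory is a dict, each commented line stands for one uop, the function names are made up, and the uop counts are simply those given in the text (4, 3 and 2).

```python
def add_mem_p6(mem, addr, reg):
    """P6-style: separate store-address and store-data (4 uops)."""
    tmp = mem[addr]                 # uop 1: tmp := load(mem)
    tmp = tmp + reg                 # uop 2: tmp := tmp + reg
    store_addr = addr               # uop 3: store-address(mem)
    store_data = tmp                # uop 4: store-data(tmp)
    mem[store_addr] = store_data    # implicit commit, not a uop
    return 4


def add_mem_fused_store(mem, addr, reg):
    """Store-address and store-data fused at the decoder (3 uops)."""
    tmp = mem[addr]                 # uop 1: tmp := load(mem)
    tmp = tmp + reg                 # uop 2: tmp := tmp + reg
    mem[addr] = tmp                 # uop 3: store(mem, tmp)
    return 3


def add_mem_two_uops(mem, addr, reg):
    """load-and-store-address plus store-data-and-alu (2 uops)."""
    tmp, store_addr = mem[addr], addr   # uop 1: load-and-store-address(mem)
    mem[store_addr] = tmp + reg         # uop 2: store-data-and-alu(tmp + reg)
    return 2


if __name__ == "__main__":
    for scheme in (add_mem_p6, add_mem_fused_store, add_mem_two_uops):
        memory = {0x40: 5}
        print(scheme.__name__, scheme(memory, 0x40, 2), memory[0x40])
```

All three leave the same value in memory. Only the number of uops the decoder has to emit changes.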

But... you don't have to fit just reg+=mem and mem+=reg into decoder 0.
You also need to fit instructions like add-with-carry-to-memory,
and CALL, and ... Some of which you can reduce the uop count for,
some of which are more challenging.

So hence the idea of m111 - 4111, 3111, 2111 - with m+ (4+, etc.) for
instructions that you don't want to penalize too heavily by going all
the way to the microcode engine, but which don't fit in your m-uop
decoder 0.
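A rough Python sketch of how an m111 template limits decode throughput. The rules are simplifying assumptions, not a description of any shipped decoder: instructions are assigned to slots strictly in order, a group ends at the first instruction that does not fit its slot, and anything larger than slot 0 is handed to the microcode path and (arbitrarily) charged one cycle on its own.

```python
def decode_cycles(uop_counts, template=(4, 1, 1, 1)):
    """Cycles needed to decode instructions, given each one's uop count."""
    cycles = 0
    i = 0
    while i < len(uop_counts):
        cycles += 1
        if uop_counts[i] > template[0]:
            i += 1  # assumed: the microcode engine takes it alone this cycle
            continue
        slot = 0
        # Fill slots in order until an instruction is too big for its slot.
        while (slot < len(template) and i < len(uop_counts)
               and uop_counts[i] <= template[slot]):
            i += 1
            slot += 1
    return cycles


if __name__ == "__main__":
    print(decode_cycles([1, 1, 1, 1]))  # four single-uop instructions
    print(decode_cycles([2, 2, 2, 2]))  # unfused 2-uop ops all need slot 0
    print(decode_cycles([1, 2, 1, 1]))  # a 2-uop op in slot 1 ends the group
```

Under these rules, fusing a 2-uop instruction down to 1 uop lets it use the narrow slots, which is the motivation given above for load-alu and store-address/data fusion.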

>> Sometimes these fused ops are unfused before being placed in the
>> scheduler / reservation stations / whatever.
>
> Both Intel's and AMD's CPUs keep mops fused before the back-end and in
> the dispatcher (including ROB/RQ) and unfuse them on issue in the
> scheduler(s) and/or MAU/LSU.
>
>> Sometimes these remain fused in the scheduler, but are fired twice,
>> being emitted as two separate execution ops by the scheduler.
>
> Which CPUs use this?

You may not have understood my terminology:
Your "unfuse on issuing in scheduler"
Is approximately my "fired twice"

If, say, you have a fused store-address+data uop sitting in your
scheduler, then, if you can start, say, the store-address part before
the store-data arrives, I call that a first firing of the scheduler.
And then when the store-data arrives, you emit that in a second firing
of the scheduler.
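The "fired twice" behaviour can be sketched as a single scheduler entry with two independently tracked halves. This is a hypothetical Python model, not a description of any real RS: tags are plain strings and the class and method names are invented for illustration.

```python
class FusedStoreEntry:
    """One scheduler entry holding a fused store-address+data uop."""

    def __init__(self, addr_srcs, data_src):
        self.addr_waiting = set(addr_srcs)  # tags the address half still waits on
        self.data_waiting = {data_src}      # tag the data half still waits on
        self.addr_fired = False
        self.data_fired = False

    def wakeup(self, tag):
        # A result tag is broadcast; both halves watch for it separately.
        self.addr_waiting.discard(tag)
        self.data_waiting.discard(tag)

    def select(self):
        """Return the halves that fire this cycle (possibly none, one or both)."""
        fired = []
        if not self.addr_fired and not self.addr_waiting:
            self.addr_fired = True
            fired.append("store-address")
        if not self.data_fired and not self.data_waiting:
            self.data_fired = True
            fired.append("store-data")
        return fired

    @property
    def done(self):
        return self.addr_fired and self.data_fired


if __name__ == "__main__":
    entry = FusedStoreEntry(addr_srcs={"p1", "p2"}, data_src="p7")
    entry.wakeup("p1")
    entry.wakeup("p2")
    print(entry.select())  # first firing: address half only
    entry.wakeup("p7")
    print(entry.select())  # second firing: data half
```

The entry is freed only once both halves have fired, so it occupies one slot but keeps two separate ready conditions.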

It would be quite suboptimal to fire only once, but to different pipelines.

By the way: for many years, since the original P6, you don't "issue in
the scheduler". You "dispatch from the scheduler". This differs from
the terminology at many other companies, where dispatch is from ifetch
to the scheduler, and issue is after the scheduler into execution. I still
remember when MAF and AB deliberately switched the terms to be
different from other companies - I flamed them for making a gratuitous
distortion of relatively standard terminology.
I would not be surprised if the term "issue from scheduler" is
creeping back into Intel: Intel has hired so many people from other
computer companies like DEC and Sun and AMD and IBM. (When I am feeling
snarky, I say "failed computer companies"). Atom, in particular, has
a lot of ex-DECcies.

>> And then sometimes they are emitted by the scheduler, and then flow
>> through different parts of the pipeline.
>
> How can a scheduler emit a new mop? It doesn't receive the original
> instruction to do it.

I trust that you understand what I have explained above.

>> Sometimes in parallel, sometimes sequential (load-op or load-op-store).

>
> That's loads (where source is in the memory), but I was talking about
> stores (source is the register) and op-store (source is op result).

Sure.

op-store is not that common. Never was, even in the i486 generation.
And is not so much any more, after so many years of -op- and load-op
being optimized for.

I don't know the relative frequencies, but I would be surprised if
op-store is more common than load-op-store. And any pipeline that can
fit op-store as a fused uop, can also fuse load-op-store --- with only a
slight application of intelligence.

>> Plus then there is fusion of separate instructions into fewer fused ops.
>
> Yes, that's what I mean above. Accesses should always be fused,
> because the CPU needs to generate the address. However, officially a mop
> is considered fused only if there is data computation associated with
> the memory access. MOV [m],r1 is: 1) read RF or bypass for AG-registers,
> make AG, transfer address to store Q; 2) read RF or bypass for
> (renamed) r1, transfer it to store Q. But Intel doesn't consider it
> fused…

Official, shmofficial.

Fused store-address-data happened first. It was called fused
store-stores inside Intel forever. Heck, I called it fused
store-address-data BEFORE I went to Intel. Or maybe it was
split-store-address-data. Fusing, splitting, unfusing, unsplitting, it
all depends on your point of view.

What marketing calls it is another issue.

Then load-op fusion.

Then it really gets interesting when you fuse between separate
instructions. Adjacent first, and then not. I think this is now the

Pedantically, I would prefer to say that you "fuse" only when two
independent things are combined. Like fusing two separate instructions.

If the "sub-operations" are combined in the instruction, and then
emitted combined, then you aren't really fusing them - you are just not
splitting them apart at that pipestage - although you may split them
apart later.

But, unsplitting and unfusing is often the same: double firing from the
scheduler, or stuttering and placing into two separate scheduler and
ROB entries.

>
>> But it occurs to me that perhaps you meant "mops" == "memory ops", as
>> opposed to "micro ops" or "macro ops".
>
> No, that was micro operations :)

Intel terminology is uop. Where u should be mu, for micro.

>
>> Anyway: I like to use qualifiers like decoder-ops
>
> Yes, there are several types of them and only one name. AMD used to
> have ROPs (RISC ops) in K5 (and K6?) and now — macro-ops.

AMD had ROPs and COPs. COPs and ROPpers, as I like to say.

MitchAlsup

Dec 17, 2012, 11:57:54 AM

to an...@spam.comp-arch.net

On Sunday, December 16, 2012 11:15:57 PM UTC-6, Andy (Super) Glew wrote:
> Intel terminology is uop. Where u should be mu, for micro.

<alt>0181 = µ

Be not restricted to ASCII 1958.

One of the interesting parts of multiple firings from the same station
(entry) is that you can remember the address of the load for the soon
to be encountered store and therefore use the same TLB entry to
guarantee that the load loaded from the same memory location the store
will store to. This is significantly harder to do with µOps.

Mitch

Tacit

Dec 17, 2012, 2:29:29 PM

On 17 дек, 05:24, Quadibloc <jsav...@ecn.ab.ca> wrote:
> I'm not sure it was what he was talking about, but if you use a
> different ALU for address generation and arithmetic, your pipeline
> becomes simpler.

Correct. But how often do we need to execute 2 or 3 ALU ops per clock
(including vector AG)? Answering that would take very detailed
statistics gathering. CPU vendors have large clusters devoted to
studying real code stats in real-time execution. But no one is releasing
any results :(

Tacit

Dec 17, 2012, 2:38:52 PM

On 17 дек, 18:57, MitchAlsup <MitchAl...@aol.com> wrote:

> you can remember the address of the load for the soon
> to be encountered store

How can the CPU discover a «soon to be encountered store» with the same
address before comparing these addresses in the MAU's STLF logic?

> and therefore use the same TLB entry to guarantee
> that the load loaded from the same memory location

No big savings here (only a few nanowatts). An N-ported cache should
have an N-ported TLB anyway.

Tacit

Dec 17, 2012, 5:14:09 PM

On 17 дек, 07:15, "Andy (Super) Glew" <a...@SPAM.comp-arch.net> wrote:

> Intel has evaluated many, many, different decoder templates.
> 444, 422, 411, 333, 311

PPro through P-M had 411, but where did you get the others from? P4s had
a 1-way decoder with 1-4 mops/cl throughput. Atom has 21/2+.

> As for what actually gets shipped ... (1) memory gets hazy

What do you mean? There is renamed rSP and return stack (operated by
front-end), that's not a secret.

> (2) even if
> memory isn't hazy, I can't say unless it is public, (3) I'm too lazy to
> go and hunt down a reference to exactly what is public, (4) and exactly
> what is public is often no longer an accurate description, in the
> original sense of decoder template

OK, bottom line: just say it! :-D You're not in Intel anymore ;-)

> I think AMD's K7 decoder approach was better.

That's 222/3+, as they say.

> Oh, and I *am* the guy
> who invented the concept of separate store-address and store-data uops,
> or at least who brought that concept to Intel

Aha! So I'm at bull's eye here! :-)

> you know that the 411 or whatever template does not mean that all
> single uop instructions get handled?

Yes, 2 years ago Fog and I (while testing SB on our (iXBT) testbed,
for his manual and my detailed review) discovered that «long
nops» (starting with 0F 1F) can only be decoded by way 0 (where the
complex translator is, responsible for 2-4 µops/cl), though they generate
1 mop as expected. A designer on the SB team confirmed it later, but
didn't say whether it was deliberate or a HW flaw. Do you have more
examples?

> Actually, it's more of an upper bound - there may
> be no more wires than that

Of course, 4 µops/cl is the stated maximum of both the complex
translator and the µROM sequencer.

> For example: one of the big motivations for load-alu fusion is so that
> instructions like ADD_reg+=mem can be emitted in the 111 slots. But
> this can be a challenge to fit, in terms of numbers of inputs and
> immediates.

Exactly. Fitting 4 operations in 1 µop means it's not a RISC-like core
anymore. We can forget ideology, but surely we need to save
power and area. And growing the µop size is the wrong way, though
inevitable. The only number I know for sure is 118 bits for P6.
Later I calculated approximate full and compressed (used in the trace
cache and µop cache) µop sizes: 119 (53) for Willamette & Northwood, 138
(64) for Prescott and 139-147 (85) for SB. But these could be off by a
lot, so I'll take any corrections :-)

> often long instructions can only be fit into the first slot.

Any particular restrictions you can name for post-2000 CPUs?

> one of the big reasons to do store-address/data fusion is
> to allow stores, MOV_mem:=reg, to be fit in the 111 slots.

Yes. My original message was about executing them.

> fused store-address+data is a bit of a challenge - worst case
> 3 reg inputs + immediate, or 2 reg inputs + 2 immediates.

1) 3*4+64 bits; 2) 2*4+2*32 bits. Just a little more than MOV r64,
imm64.
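The arithmetic can be checked with a few lines of Python. The 4-bit register specifier is taken from the "3*4" and "2*4" terms above; the helper name is made up, and charging MOV r64, imm64 for one register plus one 64-bit immediate is an assumption made for the comparison.

```python
REG_BITS = 4  # assumed: 4-bit register numbers, as in the "3*4" term


def operand_bits(num_regs, imm_widths):
    """Total operand payload in bits: register specifiers plus immediates."""
    return num_regs * REG_BITS + sum(imm_widths)


if __name__ == "__main__":
    print(operand_bits(3, [64]))      # case 1: 3 reg inputs + one 64-bit immediate
    print(operand_bits(2, [32, 32]))  # case 2: 2 reg inputs + two 32-bit immediates
    print(operand_bits(1, [64]))      # MOV r64, imm64 under the same counting
```

Under these assumptions the two worst cases come to 76 and 72 bits, against 68 bits for MOV r64, imm64.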

> Often fused
> store-address+data can only handle some cases, not the most general case

Yes, SB can't microfuse if an immediate is present with RIP-relative
addressing. Also, load-ex ops that read an index register (and some
rarer cases) are unfused into pairs on writing to the IDQ (from either
the µop cache or the decoder). I.e. if there are 4 such instructions per
clock, the IDQ will receive 8 resulting µops. Probably, further µop
storages (IDQ, ROB and RS) have a compacted µop format.

> I would not be surprised if there were not some other constraint, such
> as not supporting fused store-address+data for all operand combinations.

Bingo ;-)

> I would also not be surprised if the hardware were actually emitting 3
> fused uops from the first decoder slot, but that marketing insisted on
> calling it a 4111 decoder rather than a 3111 decoder, because 3 might
> look bad. Or because they did not want to give away what they are doing in
> the microarchitecture. Possibly for fear of patent litigation.

You may be right, because according to Fog's tables, there are no
multi-µop instructions (including microcoded ones) that can execute 4
fused µops/cl, only up to 3. The only possible exception is REP CMPS
for large counts. What's more interesting is that Bulldozer is also
like this :-)

> Possibly Agner's tests can distinguish these cases. However, since I
> know of several cases where Agner reported incorrect configurations, I
> would be surprised if he is accurate in all cases.

I find his mistakes very often, but he's not fixing most of them :-)

> some patent lawyers would love it if Agner is accurate in all cases.

No easy fishing here. He and I can testify: we tested and got all
of these numbers ourselves. With 2 witnesses :-)

> And then if you go further, to
> ADD mem += reg
> tmp := load-and-store-address( mem )
> tmp := store-data-and-alu( tmp + reg )

Yes, ADD [m],r generates 2 µops, but Intel does the data op in the 1st
µop and only the store in the 2nd.

> But, unsplitting and unfusing is often the same: double firing from the
> scheduler, or stuttering and placing into two separate scheduler and
> ROB entries.

As far as I know, no modern CPU keeps split µops in the ROB/RQ.

> Your "unfuse on issuing in scheduler" Is approximately my "fired twice"

So there are no cases of fused µops still kept in RS?

> say, you have a fused store-address+data uop sitting in your scheduler

Argh! So it IS possible? Where?

> you can start, say, the store-address part before
> the store-data arrives, I call that a first firing of the scheduler.
> And then when the store-data arrives, you emit that in a second firing

I can't figure out how the source register number comparator matrix
(strangely called CAM) would work in this case. There must be separate
comparison of address and data sources, which is equivalent to using 2
entries. Right?

> for many years, since the original P6, you don't "issue in the scheduler".

Yes, dispatcher is renaming, unfusing, dispatching and retiring, and
scheduler is issuing :-) My English plays tricks here.

> You "dispatch from the scheduler".

Wrong. Just wrong %-)

> This differs from
> the terminology at many other companies, where dispatch is from ifetch
> to the scheduler, and issue is after scheduler into execution. I still
> remember when MAF and AB deliberately switched the terms to be
> different from other companies.

Stupid. A few more steps like this and your people will end up building
a Tower of Babel CPU. Who are MAF and AB?

> Intel has hired so many people from other
> computer companies like DEC and Sun and AMD and IBM. (When I am feeling
> snarky, I say "failed computer companies").

Luckily for us all, AMD hasn't failed. And I wish them good luck:
monopoly is bad for everyone.

> op-store is not that common. Never was, even in the i486 generation.
> And is not so much any more, after so many years of -op- and load-op
> being optimized for.

Because op-store needs a second opcode for the same instruction, one where
the r/m field of the ModRM byte denotes the destination. And before AVX
there was little space left in the opcode tables, plus fears of enlarging
instructions with prefixes and slowing down their decoding. But even now
nothing is changing.

> I don't know the relative frequencies, but I would be surprised if
> op-store is more common than load-op-store.

It would be much more common if it were allowed for vectors.

> And any pipeline that can
> fit op-store as a fused uop, can also fuse load-op-store --- with only a
> slight application of intelligence.

And a large impact on the memory consistency model. That's why I don't
support RMW for all instructions, but only for a selected few, specially
designed to sync threads (like all of the CASs).

> Then it really gets interesting when you fuse between separate
> instructions. Adjacent first, and then not. I think this is now the

Errm, missed a line here? :-) I didn't know of non-adjacent fusion.

> Pedantically, I would prefer to say that you "fuse" only when two
> independent things are combined. Like fusing two separate instructions.

So that's macrofusion. Also adjacent.

> If the "sub-operations" are combined in the instruction, and then
> emitted combined, then you aren't really fusing them - you are just not
> splitting them apart at that pipestage - although you may split them
> apart later.

Right, a better name would be packed µops, but that term is already
reserved for integer vectors.

MitchAlsup

Dec 17, 2012, 9:28:29 PM

On Monday, December 17, 2012 1:38:52 PM UTC-6, Tacit wrote:
> How can CPU discover «soon to be encountered store» with same address before
> comparing these addresses in MAU's STLF logic?

The Load-op-store is stored in one reservation station entry. The address
of the store has to be the address of the load--both linearly and physically.
That is; it is idempotent.

Mitch

MitchAlsup

Dec 17, 2012, 9:33:26 PM

On Monday, December 17, 2012 1:38:52 PM UTC-6, Tacit wrote:

> No big saves here (only few nanowatts). N-ported cache should have N-ported
> TLB anyway.

Time is not the question. The question is what happens when one remote CPU
is altering the page tables while the load-op-store is in operation--changing
the PTE that translates the memory address of that load-op-store.

In x86 it is illegal to have the load come from a different memory location
than the store goes to.

Now consider that the op takes a serious amount of time--like a divide.
The load may transpire, the divide commence. Then while the divide is in
progress, the TLB invalidate arrives and the TLB entry is knocked out.
When the store starts it will walk the page tables and get a DIFFERENT
translation. This is not Kosher!

So, either you have to have a way of keeping track of load-op-store
translations wrt unfinished Ld-op-Sts, or you remember the PA instead of the
LA in the store queue. All in all it's easier to remember the PA.
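
A minimal Python sketch of the hazard described above, and of the "remember the PA" fix. The page-table dict, the 4 KiB page size and all function names are assumptions invented for illustration; this is a toy model, not any real design.

```python
PAGE = 4096  # assumed page size for this toy model


def translate(page_table, la):
    """Toy page walk: map a linear address to a physical address."""
    return page_table[la // PAGE] * PAGE + la % PAGE


def rmw_retranslate(page_table, mem, la, op, remap=None):
    """Load-op-store that translates twice (the problematic variant)."""
    value = mem.get(translate(page_table, la), 0)  # load
    if remap:
        page_table.update(remap)  # remote CPU edits the PTE mid-operation
    mem[translate(page_table, la)] = op(value)  # store walks the tables again


def rmw_remember_pa(page_table, mem, la, op, remap=None):
    """Load-op-store that keeps the PA from the load for the store."""
    pa = translate(page_table, la)  # translate once
    value = mem.get(pa, 0)  # load
    if remap:
        page_table.update(remap)  # the PTE change no longer matters
    mem[pa] = op(value)  # store reuses the remembered PA


def run(rmw):
    page_table = {0: 10}  # linear page 0 -> physical page 10
    mem = {10 * PAGE + 8: 5}
    rmw(page_table, mem, 8, lambda v: v + 1, remap={0: 20})
    return mem
```

In this model `run(rmw_retranslate)` leaves the old location untouched and writes the result to a different physical page, while `run(rmw_remember_pa)` updates the location that was actually loaded.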

Mitch

Andy (Super) Glew

Dec 17, 2012, 9:37:06 PM

On 12/16/2012 7:11 PM, Tacit wrote:
> …
> Andy (Super) Glew:
>>
>> It all depends...
>>
>> * how many overall ports do you have (e.g. if you are one wide, period,
>> no options)
>
> Say 4. But Intel dares to make 8 ports for Haswell :)
>
>> * do you have scatter/gather addressing
>> ** if not, then a separate scalar AGU port for load and store addresses
>> is fine
>
> Agree. All modern CPUs do so, if not counting K7…K12 :)
>
>> ** but if you have scatter/gather, especially address vectors, then it
>> becomes tempting to use a vector compute port for address generation.
>
> Not agree. Vector address computing can be as easy as adding scalar
> base to offset vector (i.e. add scalar turned to vector to another
> vector) and that's ALU's job. No need for separate ALU (AGU) for
> vector addresses.

You keep saying "not agree" when we are in fact agreeing.

There is no _need_ for a separate AGU *ever*.

There is nothing fundamentally special about an AGU. An AGU is a
computational unit, much like an ALU - usually more tightly coupled to
the cache and TLB, perhaps with a few operations that you don't have in
a regular ALU, and probably without operations like logic that a regular
ALU has.

So, if you are cheap, the same computational logic serves as ALU and AGU.

When you have more gates, it is common to add an AGU rather than a full
ALU, because you can special case the control. Easier to do than full
superscalar.

As you get wider and wider ... you *could* use any ALU as an AGU. But
the need for wires to couple the ALU to memory ports probably means that
you may special case a few as AGUs.

When you can only afford a single vector unit, well, at that stage you
probably don't have scatter/gather or index vector addressing modes at
all. But if you did, you could use the same vector ALU as a vector AGU.

When you increase the superscalarness of the vector units, you probably
add more vector compute units, because dense vectors are more common.

But when you get serious about scatter/gather, you probably add a
dedicated vector AGU.

One possible evolutionary path:

Low end machine
SCALAR_EU = agu+alu

Superscalar-2
SCALAR_EU0 = alu
SCALAR_EU1 = agu

Superscalar-3
SCALAR_EU0 = alu
SCALAR_EU1 = alu
SCALAR_EU2 = agu

Superscalar-4
SCALAR_EU0 = alu
SCALAR_EU1 = alu
SCALAR_EU2 = alu
SCALAR_EU3 = agu
or
SCALAR_EU0 = alu
SCALAR_EU1 = alu
SCALAR_EU2 = agu+slow-alu
SCALAR_EU3 = agu

Superscalar-5
SCALAR_EU0 = alu
SCALAR_EU1 = alu
SCALAR_EU2 = alu
SCALAR_EU3 = agu
SCALAR_EU4 = agu

and for vectors

1-vector
VECTOR_EU0 = vector_alu

1-vector + s/g
VECTOR_EU0 = vector_alu + vector_agu

2-vector
VECTOR_EU0 = vector_alu
VECTOR_EU1 = vector_alu

2-vector + s/g
VECTOR_EU0 = vector_alu
VECTOR_EU1 = vector_alu + vector_agu

3-vector + s/g
VECTOR_EU0 = vector_alu
VECTOR_EU1 = vector_alu
VECTOR_EU2 = vector_agu + restricted vector_alu

4-vector + s/g
VECTOR_EU0 = vector_alu
VECTOR_EU1 = vector_alu
VECTOR_EU2 = vector_alu
VECTOR_EU3 = vector_agu

5-vector + s/g
VECTOR_EU0 = vector_alu
VECTOR_EU1 = vector_alu
VECTOR_EU2 = vector_alu
VECTOR_EU3 = vector_agu
VECTOR_EU4 = vector_agu

However, as far as I know Intel hasn't built a real scatter/gather
vector AGU yet. Larrabee took baby steps - perhaps MIC has. But AFAIK
Intel has not built a vector AGU for the main x86 yet.

>> If you are from the Intel camp, you tend to think of stores as being
>> store-address and store-data:
>>
>> STORE M[basereg+indexreg*scale+imm] := srcreg
>> store-address StoreAddressBuffer[#] := basereg+indexreg*scale+imm
>> store-data StoreDataBuffer[#] := srcreg
>
> Aren't StoreAddressBuffer[N] and StoreDataBuffer[N] in the same
> Store Buffer entry N?

That's the way we logically think of them.

However, on many chips (beginning with P6) they don't need to be
physically adjacent. So it is convenient to build separate structures.

Heck, some chips have separate IntegerStoreDataBuffer and
FloatingPointStoreDataBuffer. And ultimately PackedVectorStoreDataBuffer.

It is stupid to have all store data buffer entries capable of holding
128-512 bits, and store only a byte in them.
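
A rough Python sketch of that idea: one logical entry number N, but physically separate arrays for the address, the narrow data and the wide data. The class name, the method names and the 64-bit routing threshold are assumptions for illustration only.

```python
class SplitStoreBuffer:
    """Toy store buffer: one logical entry number, separate physical arrays."""

    def __init__(self, entries):
        self.address = [None] * entries   # StoreAddressBuffer
        self.int_data = [None] * entries  # narrow (<= 64-bit) store data
        self.vec_data = [None] * entries  # wide vector store data

    def store_address(self, n, addr):
        self.address[n] = addr

    def store_data(self, n, value, bits):
        # Route by width so a 1-byte store never occupies a wide entry.
        if bits <= 64:
            self.int_data[n] = value
        else:
            self.vec_data[n] = value

    def data(self, n):
        if self.int_data[n] is not None:
            return self.int_data[n]
        return self.vec_data[n]

    def complete(self, n):
        # The store can drain once both halves have arrived, in either order.
        return self.address[n] is not None and self.data(n) is not None
```

Note that store-data may arrive before store-address here; the entry number is the only thing tying the two halves together.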

>> Store-datas also waste an RF read port if dispatched to a computational
>> port that has two dedicated read ports. Or 3, or ....
>
> No, instead it won't use the write port. Because we're writing the
> result in the memory, not in RF. But MOV is even simpler: read from RF
> or bypass to Store queue of MAU.

Not sure what you are saying. Often you cannot take a write port and
make it a read port.

But, the fact that store data is special - 1 read port, no register data
write port, + whatever you call it to signal complete (not really a port
at all, doesn't need to exit the scheduler, unless you can take a fault
as Intel x87 can on a store data) is one big reason why Intel has had
dedicated store data pipelines for so many years. (Since P6.)

From the point of view of RF ports and bypasses, store data is
especially cheap.

However, on many schedulers it is not much cheaper than other ops.

>> Or - just treat RF read ports separately, not tightly coupled to
>> computational ports. You get most operands off the bypass.
>
> That's reads. Of course, in case of «OP r1,r2,r3 + MOV [m],r1» result
> can be written in r1 and bypassed to store at the same clock. Read
> ports are irrelevant here.

And writes.

Not every value produced needs to be written to the register file.

Not every REGISTER VALUE produced needs to be written to the register file.

>> But getting back to my original point - I quite like fusing store-data
>> with the computational operand that produces the value. Makes better use
>> of resources, if there are any that are specific to 2-operand.
>
> Yes, but I don't like load-op-store — these generate 2 accesses per
> instruction, which is complicating hardware a lot (atomics, etc…).

On x86 you already have to support them.

But my point is that although it is an RMW instruction,
in terms of microoperations it is three separate microoperations. Just
like it is now.

Three separate micro-operations, that only need 2 scheduler entries, and
only a single reorder buffer entry.

Attractive to x86. Would require inter-instruction optimization for a RISC.

Andy (Super) Glew

Dec 17, 2012, 9:40:07 PM

Yep.

I think that not doing this was one of my biggest mistakes ever.

It is one of the reasons that I like the load-and-store-address "uop".

The kluges to work around the possible inconsistency in TLB entries
between the load and store are just amazingly stupid.

Andy (Super) Glew

Dec 17, 2012, 9:46:25 PM

On 12/17/2012 11:38 AM, Tacit wrote:
> On 17 дек, 18:57, MitchAlsup <MitchAl...@aol.com> wrote:
>
>> you can remember the address of the load for the soon
>> to be encountered store
>
> How can CPU discover «soon to be encountered store» with same address
> before comparing these addresses in MAU's STLF logic?

In x86, they are guaranteed to be the same:

ADD M[addr] += reg
tmp := load M[addr]
tmp += reg
store( M[addr] := tmp )
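
The same decomposition as a small runnable Python sketch. Memory is modelled as a plain dict and the function name is made up; the point is only that a single address operand feeds both the load and the store.

```python
def add_mem_reg(mem, addr, reg):
    """ADD M[addr] += reg, spelled out as three micro-operations."""
    tmp = mem[addr]   # tmp := load M[addr]
    tmp += reg        # tmp += reg
    mem[addr] = tmp   # store( M[addr] := tmp ), same addr by construction
    return tmp
```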

>> and therefore use the same TLB entry to guarantee
>> that the load loaded from the same memory location
>
> No big saves here (only few nanowatts). N-ported cache should have
> N-ported TLB anyway.

Not power. Complexity.

You can't believe the bugs that can happen if the TLB entry can change
between the load and the store in these instructions. Especially if
LOCKed atomic RMWs.

And you probably won't also believe how ugly the kluges are when people
resist doing what Mitch described.

--

While we are at it:

What happens on your favorite RISC if in LL/SC:

reg := load-linked( M[addr] )
...
store-conditional M[addr] := new_value

the LL and SC M[addr] do not have the same physical address translation?

Shouldn't be bad if LL/SC.

But often load-linked is promoted to load-locked to guarantee
non-starvation.

Better not lock based on physical address...

Tacit

Dec 17, 2012, 10:34:07 PM

On 18 дек, 04:28, MitchAlsup <MitchAl...@aol.com> wrote:
> The Load-op-store is stored in one reservation station entry. The address
> of the store has to be the address of the load--both linearly and physically.

Oh, in that case yes. However, it's better to have a special load+store
op released by the dispatcher or scheduler to both the load and store
queues of the MAU with the same address.

Tacit

Dec 17, 2012, 10:41:20 PM

On 18 дек, 04:33, MitchAlsup <MitchAl...@aol.com> wrote:
> Time is not the question. The question is what happens when one remote CPU
> is altering the page tables while the load-op-store is in operation--changing
> the PTE that translates the memory address of that load-op-store.

I.e. memory consistency. Hard and bad.

> In x86 it is illegal to have the load come from a different memory location
> than the store goes to.

Generally yes. But that's still possible for PUSH [m], MOVS* <2
implicit addresses> and maybe some more.

> So, either you have to have a way of keeping track of load-op-store
> translations wrt unfinished Ld-op-Sts, or you remember the PA instead of the
> LA in the store queue. All in all it's easier to remember the PA.

Even better is to disable RMW for general instructions altogether.

Tacit

Dec 17, 2012, 10:51:46 PM

On 18 дек, 04:40, "Andy (Super) Glew" <a...@SPAM.comp-arch.net> wrote:
> On 12/17/2012 8:57 AM, MitchAlsup wrote:
>
> I think that not doing this was one of my biggest mistakes ever.
>
> It is one of the reasons that I like the load-and-store-address "uop".

You mean the one that generates an address for 2 accesses? Intel must
use this with RMWs.

Tacit

Dec 17, 2012, 11:39:17 PM

On 18 дек, 04:37, "Andy (Super) Glew" <a...@SPAM.comp-arch.net> wrote:

> You keep saying "not agree" when we are in fact agreeing.
>
> There is no _need_ for a separate AGU *ever*.

Oh, then I do agree in case of vector addresses, but still not for
scalar addresses (that are 100% of them all, before Xeon Phi (Knights
Corner) appeared). Explained below…

> There is nothing fundamentally special about an AGU.

Very wrong! AGU is a 1-clock latency specialized arithmetic unit, that
has 3-operand adder (probably used nowhere else in the core) and
barrel shifter for up to 3 bits «leftwards» (for index). It can't do
logic, more complex shifts and rotates, subtractions, negates and
other things ALU can. It can't use 8-bit operands, doesn't read and
write flags and doesn't write results back in RF (that's why LEAs are not
executing in same AGUs as actual address generations). So these are
very special EUs, that mostly require their own ports and (in case of
Bobcat) even section of the scheduler.
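
As a sketch of how little that unit has to compute, here is the 3-operand add with a 0 to 3 bit left shift of the index, in Python. The function name, the 64-bit wraparound and the argument names are assumptions for illustration.

```python
MASK64 = (1 << 64) - 1  # assumed 64-bit address width


def agu(base, index, scale_log2, disp):
    """Effective address = base + (index << scale_log2) + disp."""
    if not 0 <= scale_log2 <= 3:
        raise ValueError("x86 scale is 1, 2, 4 or 8, i.e. a shift of 0..3 bits")
    # One shift of at most 3 bits feeding a three-input add, nothing else.
    return (base + (index << scale_log2) + disp) & MASK64
```

For example, `agu(0x1000, 4, 3, 8)` models the address of `[0x1000 + 4*8 + 8]`.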

> As you get wider and wider ... you *could* use any ALU as an AGU. But
> the need for wires to couple the ALU to memory ports probably means that
> you may special case a few as AGUs.

After funeral of Athlons, that «special case» is actually 100%
common ;-) There are no x86 CPUs now without AGUs as separate EUs.

> But when you get serious about scatter/gather, you probably add a
> dedicated vector AGU.

It can't be very serious, because as of now it's much more cache
restricted. To be exact: address sort and combine is still slow.

> Superscalar-2
> SCALAR_EU0 = alu
> SCALAR_EU1 = agu

In case of 2-way superscalar, it must have 6-7 ports: 2 scalar and 2
vector for data (with ALUs, MULs etc.), 2 for AG (AGU) and 1 for store
data (Intel style). AMD style will be using computational ports for
store µops (as explained on top here) — and that's exactly how 2-way
Bobcat is done.

In case it's just scalar datapath, 2-way is so little.

> Superscalar-4
> SCALAR_EU0 = alu
> SCALAR_EU1 = alu
> SCALAR_EU2 = alu
> SCALAR_EU3 = agu

That's Bobcat's and Bulldozer's solution.

> SCALAR_EU0 = alu
> SCALAR_EU1 = alu
> SCALAR_EU2 = agu+slow-alu
> SCALAR_EU3 = agu

Instead AMD figured out how to upgrade AGUs to «almost ALUs» (called
AGLU) so they can do something simple with data too — but expected
only with Steamroller (in 2014).

> Superscalar-5
> SCALAR_EU0 = alu
> SCALAR_EU1 = alu
> SCALAR_EU2 = alu
> SCALAR_EU3 = agu
> SCALAR_EU4 = agu

That's SB/IB.

> However, as far as I know Intel hasn't yet built a real scatter/gather
> vector AGU yet.

They did for Phi. Slow, so they had to boost SMT to 4 threads/core.

> on many chips (beginning with P6) they don't need to be
> physically adjacent. So it is convenient to build separate structures.

Yes, but are they still addressable by the same number for both parts?

> Heck, some chips have separate IntegerStoreDataBuffer and
> FloatingPointStoreDataBuffer. And ultimately PackedVectorStoreDataBuffer.

That's smart. The only thing left to do is to make completely separate
data caches (and MAUs) for scalars and vectors. And not only because
of …

> It is stupid to have all store data buffer entries capable of holding
> 128-512 bits, and store only a byte in them.

> > No, instead it won't use the write port. Because we're writing the
> > result in the memory, not in RF. But MOV is even simpler: read from RF
> > or bypass to Store queue of MAU.
>
> Not sure what you are saying. Often you cannot take a write port and
> make it a read port.

I mean MOV [m],r is not using RF for write, so there's no need to
activate write port. But you made some point about read port
challenge.

> But, the fact that store data is special … is one big reason why Intel has had
> dedicated store data pipelines for so many years. (Since P6.)

Yes but they don't have special scheduler section for writes —
scheduler is totally unified.

> Not every REGISTER VALUE produced needs to be written to the register file.

Why is that?

Andy (Super) Glew

Dec 18, 2012, 2:58:08 AM

On 12/17/2012 2:14 PM, Tacit wrote:
> On 17 дек, 07:15, "Andy (Super) Glew" <a...@SPAM.comp-arch.net> wrote:
>
>> Intel has evaluated many, many, different decoder templates.
>> 444 > 422 > 411 > 333 > 311
>
> PPro till P-M had 411, but where did you get others from?

Wouldn't you evaluate many different alternatives?

>> As for what actually gets shipped ... (1) memory gets hazy (2) even if
>> memory isn't hazy, I can't say unless it is public, (3) I'm too lazy to
>> go and hunt down a reference to exactly what is public, (4) and exactly
>> what is public is often no longer an accurate description, in the
>> original sense of decoder template
>
> OK, bottom line: just say it! :-D You're not in Intel anymore ;-)

But I am still bound by Intel's NDAs. And by AMD's NDAs. And by MIPS'
NDAs. And by Motorola's NDAs. (Although Moto was so long ago...)

>
>> I think AMD's K7 decoder approach was better.
>
> That's 222/3+, as they say.

That's not how I would describe it.

Or, perhaps, that's not how I imagined it would be built when I first
learned about it at Intel. And when I worked at AMD it seemed much like
I had imagined it.

Mitch, can you say? Or were you always into decoded I$?

>> For example: one of the big motivations for load-alu fusion is so that
>> instructions like ADD_reg+=mem can be emitted in the 111 slots. But
>> this can be a challenge to fit, in terms of numbers of inputs and
>> immediates.
>
> Exactly. Fitting 4 operations in 1 µop means it's not RISC-like core
> anymore. While we can forget ideology, but surely we need to save
> power and area. And growing µop size is a wrong way, though
> inevitable.

A lot of those bits are just related to operands.

So, a single scheduler entry that holds
    Multiply-Add dest := A * B + C
        --- 3 inputs, 1 output
may also be able to hold
    Load dest := M[segbase + basereg+indexreg*scale+imm]
        --- 2 full inputs + 1 output
            segbase (which is cheaper than a register)
            + imm
    Load-Op dest :op= M[segbase + basereg+indexreg*scale+imm]
        --- 3 full inputs + 1 output
            segbase (which is cheaper than a register)
            + imm
    Load-Op-Store M[segbase + basereg+indexreg*scale+imm] op= reg
        --- 3 full inputs,
            no output (unless CMPXCHG or XADD)
            segbase (which is cheaper than a register)
            + imm
It's all incremental complexity, a slippery slope.

Load-Op-Store in some respects is cheaper than Load-Op.

The main additional cost is having opcode fields:

Load/Store ((8/16/32)[su])/64/128/256/512
= 2 * ((3*2)+4) = 20 => 4 or 5 bits
+ ALU opcode

i.e. costing only 5 bits extra... and probably less,
since not all instructions have RMW forms.

So, if you could add 5 bits, and save 2 138-bit uops X% of the time,
what does X% have to be?
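
One way to put rough numbers on that question, as a Python sketch. The break-even model is an assumption of this sketch, not something stated in the thread: every uop grows by the extra bits, and a fraction x of instructions would otherwise need two additional plain uops.

```python
import math


def variant_count():
    # load or store (2) x [8/16/32-bit, signed or unsigned (3*2)
    #                      + 64/128/256/512-bit (4)]
    return 2 * ((3 * 2) + 4)


def opcode_bits(variants):
    """Bits needed to encode that many distinct load/store variants."""
    return math.ceil(math.log2(variants))


def break_even_fraction(uop_bits=138, extra_bits=5, uops_saved=2):
    # Assumed model, bits stored per instruction:
    #   unfused: uop_bits * (1 + uops_saved * x)
    #   fused:   uop_bits + extra_bits
    # Setting the two equal and solving for x gives:
    return extra_bits / (uop_bits * uops_saved)
```

Under these assumptions `variant_count()` is 20, which takes 5 bits to encode, and the break-even fraction is 5/276, a little under 2% of instructions.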

> I can't figure out how the source register number comparator matrix
> (strangely called CAM) would work in this case. There must be separate
> comparison of address and data sources, which is equivalent to using 2
> entries. Right?

The P6 scheduler "should I be made ready?" logic actually was a CAM.

In the sense that every entry stored two encoded physical register
numbers, as well as other stuff. These psrcs were compared to the
pdests of instructions being written back - i.e. the content (the psrcs)
was used to address (select) an entry. If matched, then the data was
captured.
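
A toy Python model of that wakeup step, assuming nothing about the real circuit: each entry holds its encoded psrc numbers, a writeback broadcasts one pdest, and every matching source is marked ready and captures the data. Class and function names are invented for the sketch.

```python
class Entry:
    """One scheduler entry holding encoded physical source register numbers."""

    def __init__(self, psrcs):
        self.psrcs = list(psrcs)
        self.ready = [False] * len(self.psrcs)
        self.data = [None] * len(self.psrcs)


def writeback(entries, pdest, value):
    """Broadcast pdest; each entry compares it against its stored psrcs."""
    for entry in entries:
        for i, psrc in enumerate(entry.psrcs):
            if psrc == pdest:         # the content match selects the entry
                entry.ready[i] = True
                entry.data[i] = value  # capture the written-back data
```

In hardware all of these comparisons happen in parallel; the nested loops only stand in for the comparator array.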

See https://www.semipublic.comp-arch.net/wiki/CAM.

Anyway, for an equality, encoded CAM, no, firing twice is not equivalent
to having two entries. It's just a tweak on the logic.

I.e. if the equality CAM match sets a bit, psrc1.ready, psrc2.ready,
psrc3.ready

then the ordinary ready condition is psrc1.ready & psrc2.ready &
psrc3.ready.

and the store-address ready condition is
psrc1.ready & psrc2.ready

and the store-data- ready condition is psrc3.ready.

I.e.

Any_Ready :=
(is_STA & psrc1.ready & psrc2.ready)
| (is_STD & psrc3.ready)
| (psrc1.ready & psrc2.ready & psrc3.ready)

with an obvious CSE that you may or may not want to use.
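
The ready condition above, written out as a small Python sketch. The field names are assumptions: `is_sta` and `is_std` mark an entry whose store-address or store-data half may fire on its own, and the three `psrc*_ready` bits are the ones set by the CAM match.

```python
from dataclasses import dataclass


@dataclass
class SchedulerEntry:
    is_sta: bool = False       # store-address half may fire separately
    is_std: bool = False       # store-data half may fire separately
    psrc1_ready: bool = False
    psrc2_ready: bool = False
    psrc3_ready: bool = False

    def any_ready(self) -> bool:
        sta = self.is_sta and self.psrc1_ready and self.psrc2_ready
        std = self.is_std and self.psrc3_ready
        full = self.psrc1_ready and self.psrc2_ready and self.psrc3_ready
        return sta or std or full
```

A fused store with only its address sources ready reports ready (first firing), while an ordinary 3-input op in the same state does not.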

However, for a decoded CAM, in a bitmatrix scheduler, yes, firing twice
does require the equivalent of two mask-CAMs.

Or, you may have only a single entry, but you may rewrite it after the
first firing.

Therefore, encoded CAMs can fairly easily do STA then STD, or STD then
STA. I.e. they can fire in any order.

Whereas in a decoded CAM it is easiest to fire, say, the STA first,
then the STD. Not all of the systems that do STA then STD are encoded
CAMs, but often there may be one lurking underneath, e.g. using an
encoded CAM to track bypasses.

>
>> for many years, since the original P6, you don't "issue in the scheduler".
>
> Yes, dispatcher is renaming, unfusing, dispatching and retiring, and
> scheduler is issuing :-) My English plays tricks here.

No, it doesn't. They are all similar words, can almost be used
interchangeably except for conventions.

It's not your English that is confused.
It is English that is confusing.
Especially as used by computer companies.

> And a large impact on the memory consistency model. That's why I don't
> support RMW for all instructions, but only for a selected few, specially
> designed to sync threads (like all of the CASs).

Huh?

Unlocked RMWs like
Mem :op= reg
Mem :op= reg

have no special implication for memory ordering.

They are not guaranteed to be atomic.

They are only atomic if the LOCK prefix is applied.
Except for XCHG, ...

The compiler can use them safely.

>
>> Then it really gets interesting when you fuse between separate
>> instructions. Adjacent first, and then not. I think this is now the
>
> Errm, missed a line here? :-) I didn't know of non-adjacent fusion.

Not yet. But they will happen.

>> Pedantically, I would prefer to say that you "fuse" only when two
>> independent things are combined. Like fusing two separate instructions.
>
> So that's macrofusion. Also adjacent.

No. I am willing to say that you may fuse microops, if they were
independent to begin with.

But if they were never independent, that's not fusion. That's not
splitting them up in the first place.

Andy (Super) Glew

Dec 18, 2012, 4:26:46 AM

On 12/17/2012 8:39 PM, Tacit wrote:
> On 18 дек, 04:37, "Andy (Super) Glew" <a...@SPAM.comp-arch.net> wrote:
>
>> You keep saying "not agree" when we are in fact agreeing.
>>
>> There is no _need_ for a separate AGU *ever*.
>
> Oh, then I do agree in case of vector addresses, but still not for
> scalar addresses (that are 100% of them all, before Xeon Phi (Knights
> Corner) appeared). Explained below…
>
>> There is nothing fundamentally special about an AGU.
>
> Very wrong! AGU is a 1-clock latency specialized arithmetic unit, that
> has 3-operand adder (probably used nowhere else in the core) and
> barrel shifter for up to 3 bits «leftwards» (for index). It can't do
> logic, more complex shifts and rotates, subtractions, negates and
> other things ALU can. It can't use 8-bit operands, doesn't read and
> write flags and doesn't write results back in RF (that's why LEAs are not
> executing in same AGUs as actual address generations). So these are
> very special EUs, that mostly require their own ports and (in case of
> Bobcat) even section of the scheduler.

They are relatively cheap ALUs, except for the three or four input
adder. Because they are cheap

But "probably used nowhere else in the core"? How about LEA
instructions? Compilers would love to use LEA instructions more, if
they were cheap.

There is a constant debate about whether LEA should be done by non-AGU
ALUs, whether the AGU should do LEA, if so whether LEA is 1 or 2 cycles
of latency, and which cases of LEA

>> on many chips (beginning with P6) they don't need to be
>> physically adjacent. So it is convenient to build separate structures.
>
> Yes, but are they still addressable by the same number for both parts?

Yes, but I suspect for not much longer.

Again: you might have 32 Store Address Buffer entries.

32 32 bit store data buffer entries.
Which can be treated as 16 64 bit store buffer entries.
(32 bits are still more common than 64).

And maybe 48 128 bit vector store data buffer entries,
treatable as 24 256 bit store data buffer entries.

Yes - potentially more store data buffer entries than store address
buffer entries.

Why?

Because you can treat really wide speculative store data buffer entries
as fill buffers or write combining buffers. Instead of copying them to a
fill buffer / write combining buffer after leaving the store buffer,
just move the address, and then transfer directly to the cache.

Conversely, you can have multiple store address buffer entries point
into the same 128, 256, or 512 bit wide store data buffer entry. I.e,
you can be doing write combining in the store buffers.
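
A toy Python sketch of several store-address entries pointing into one wide store-data entry. The 64-byte entry width, the class name and the per-byte valid set are assumptions chosen to keep the model small.

```python
LINE = 64  # assumed width of one wide store-data entry, in bytes


class CombiningStoreBuffer:
    def __init__(self):
        self.data = {}       # line base address -> bytearray(LINE)
        self.valid = {}      # line base address -> set of written byte offsets
        self.addresses = []  # store-address entries: (address, size, line base)

    def store(self, address, payload):
        base = address - address % LINE
        offset = address - base
        if offset + len(payload) > LINE:
            raise ValueError("store crosses a wide entry in this toy model")
        # Reuse the wide data entry if one already covers this line.
        line = self.data.setdefault(base, bytearray(LINE))
        line[offset:offset + len(payload)] = payload
        self.valid.setdefault(base, set()).update(
            range(offset, offset + len(payload)))
        # Each store still gets its own address entry, pointing at the line.
        self.addresses.append((address, len(payload), base))
```

Two small stores to the same 64-byte region end up as two address entries but only one data entry, which is the write combining described above.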

>> Not every REGISTER VALUE produced needs to be written to the register file.
>
> Why is that?

A very large fraction of values created are overwritten very soon.

If you can't take an exception to see such a value, then you can keep it
in the bypasses, and not send it to the RF.

Andy (Super) Glew

Dec 18, 2012, 4:32:48 AM

Not when I was there.

I went to Intel with a bit of a RISC mindset. I imagined that I was
translating x86 to RISC microcode.

Slowly it sunk in that you can do things with microcode that you can't
do with conventional RISCs.

Consider:

tmp := load-and-store-address( Maddr )
dest := std_alu( tmp + value )

This has no equivalent on a RISC load-store microarchitecture.

Quadibloc

Dec 18, 2012, 6:36:17 AM

On Dec 17, 7:37 pm, "Andy (Super) Glew" <a...@SPAM.comp-arch.net>
wrote:

> There is no _need_ for a separate AGU *ever*.

That depends on how you define "need".

The PDP-8/S, except for the finite size of its memory, was Turing-
complete. So you don't need a fancy implementation to be able to
handle any given ISA.

However, today's microprocessors generally need, to be competitive in
their intended markets, to be superpipelined, to have Wallace Tree
multipliers, to have cache... and, thus, they have to look like a
360/195 internally, not a 360/50 or an 80386, let alone a PDP-8/S.

John Savard

Joe keane

Dec 18, 2012, 9:57:55 AM

In article <50CEAA8...@SPAM.comp-arch.net>,

Andy (Super) Glew <an...@SPAM.comp-arch.net> wrote:
>Pedantically, I would prefer to say that you "fuse" only when two
>independent things are combined. Like fusing two separate instructions.
>
>If the "sub-operations" are combined in the instruction, and then
>emitted combined, then you aren't really fusing them - you are just not
>splitting them apart at that pipestage - although you may split them
>apart later.

Too bad, the compiler has already done it for you.

I write:

*p += x;

The compiler changes it to:

tmp = *p;
tmp += x;
*p = tmp;

There's no point to 'fixing' this; since 'everyone knows' that RISC
machines are the future. RMW operands are some archaic thing.

'Everyone knows'

movl (%esi), %ecx
addl %eax, %ecx
movl %ecx, (%esi)

is faster than

addl %eax, (%esi)

MitchAlsup

Dec 18, 2012, 12:34:14 PM

to an...@spam.comp-arch.net

On Tuesday, December 18, 2012 3:32:48 AM UTC-6, Andy (Super) Glew wrote:
> I went to Intel with a bit of a RISC mindset. I imagined that I was
> translating x86 to RISC microcode.

> Slowly it sunk in that you can do things with microcode that you can't
> do with conventional RISCs.

My story is much the same. I think it is easy to put the x86 instruction set
asunder from without. Yet, working from within for a few years teaches how
much they really did get right.

[Base+index<<scale+Displacement] is one of those things.
Prefixes are one of those things.
Segmentation is not one of those things.

Mitch

Tacit

Dec 18, 2012, 2:46:37 PM

On 18 дек, 11:32, "Andy (Super) Glew" <a...@SPAM.comp-arch.net> wrote:
> I went to Intel with a bit of a RISC mindset. I imagined that I was
> translating x86 to RISC microcode.
>
> Slowly it sunk in that you can do things with microcode that you can't
> do with conventional RISCs.

In the late 80's Fairchild made the Clipper CPU with the same idea: ROM is
good for complex and most useful operations. But they didn't call it
microcode; that was the same machine code as outside of the CPU. So that
wasn't a «conventional RISC».

Tacit

Dec 18, 2012, 3:20:46 PM

On 18 дек, 11:26, "Andy (Super) Glew" <a...@SPAM.comp-arch.net> wrote:

> But "probably used nowhere else in the core"? How about LEA
> instructions?

Handled by ALU. Most complex take 2-3 clocks. SB takes 3 cl. for «LEA
r, [r+r*8+disp8]» and only to port 1 (of 3).

> Compilers would love to use LEA instructions more, if they were cheap.

So it is. Bulldozer module can fire 2 complex LEAs per clock per
thread (!) with 2 cl latency (pipelined) — probably because these
ports (EX0 & EX1) have real AGLUs. LEAs are also better for code
compaction.

> There is a constant debate about whether LEA should be done by non-AGU
> ALUs, whether the AGU should do LEA, if so whether LEA is 1 or 2 cycles
> of latency, and which cases of LEA

I stand for executing simple cases in ALUs and hard ones in special
ALU-added logic (1 is enough).

> A very large fraction of values created are overwritten very soon.
> If you can't take an exception to see such a value, then you can keep it
> in the bypasses, and not send it to the RF.

CPU will never know if that value is needed by some other instruction
still to be decoded. Like:

OP1 r1, r2, r3; — r1 here is not written in RF but bypassed to OP2
OP2 r3, r1, 123
… ; — more code here
OPx r4, r1; — nowhere to get r1 from.

However, in the case of «OP2 r1, r1, 123» the old r1 is not needed, but you
can never be sure that there will be no exceptions during OP2
execution that require saving the context. So I still don't see such
cases.

Ivan Godard

Dec 18, 2012, 3:24:01 PM

"Can't"? Why not?

What's to prevent putting memory-address and memory-data registers into
the ISA and exposing them in the macrocode?

Tacit

Dec 18, 2012, 4:27:34 PM

On 18 дек, 09:58, "Andy (Super) Glew" <a...@SPAM.comp-arch.net> wrote:
> > PPro till P-M had 411, but where did you get others from?
>
> Wouldn't you evaluate many different alternatives?

Sure I will, if only I had virtual pipeline constructor and simulator.
Are they available outside of CPU vendors?

> But I am still bound by Intel's NDAs. And by AMD's NDAs. And by MIPS'
> NDAs. And by Motorola's NDAs. (Although Moto was so long ago...)

Are they infinite and eternal? :-)

> Load dest := M[segbase + basereg+indexreg*scale+imm]
> --- 2 full inputs + 1 output
> segbase (which is cheaper than a register)
> + imm

Segbase has not been kept in µops since x86-64. In case it's not 0 (where
allowed), there will be an additional µop for the address add (segbase +
rest).

> So, if you could add 5 bits, and save 2 138-bit uops X% of the time,
> what does X% have to be?

Again, we need to gather statistics: how often would we need RMW if we
could afford it for everything? I think less than 1 per 50
instructions.

> However, for a decoded CAM, in a bitmatrix scheduler, yes, firing twice
> does require the equivalent of two mask-CAMs.

Decoded CAM would be huge and power inefficient for 3-4 ports, 3
sources, 100+ µops and 100+ PRF regs. Bulldozer FPU scheduler has 1080
8-bit comparators (6 writes * 3 sources * 60 µops).

> Therefore, encoded CAMs can fairly easily do STA then STD, or STD then
> STA. I.e. they can fire in any order.

So it is possible for a modern scheduler to keep fused µops? Then it's
obvious why Intel still sticks to a unified scheduler.

> It is English that is confusing. Especially as used by computer companies.

Yes, Intel has very strange rules for terms and acronyms. They admit
it off the record :-)

> They are only atomic if the LOCK prefix is applied.

The problem is it can be applied to many of them. Too many.

> Not yet. But they will happen.

How to fuse over a stray instruction (or µop)?

> I am willing to say that you may fuse microops, if they were
> independent to begin with.

Examples of it, please? Currently we have:
• cmp + jcc: done;
• flag-affecting ALU op + jcc: the common form of the above;
• any non-nop + nop: used by Nano;
• MOV r1,r2 + compute with r1 as modificand: used by K10 and later AMD
CPUs;
• FMUL + FADD or FMUL + FSUB (with a common register): proposed, but will
never happen, because FMA is here.

MitchAlsup

unread,

Dec 19, 2012, 12:54:44 AM12/19/12

to

On Tuesday, December 18, 2012 2:24:01 PM UTC-6, Ivan Godard wrote:
> What's to prevent putting memory-address and memory-data registers into
> the ISA and exposing them in the macrocode?

Nothing, and this is in effect what the CDC6600 ISA did.

A write to the A0-through-A5 would cause a load operation to X0-through-X5.
A write to the A6-through-A7 would cause a store operation from X6-through-X7.

This causes the compiler (or assembly language programmer) to have rather
exotic control over loads and stores, while still allowing the memory system
to service memory refs (out side of atomics) with little regard to memory
order.

{And, BTW, I still regard the 6600 as the first RISC machine.}

However, the ISA, I was commenting on, did not have either set of these
registers.

Mitch

MitchAlsup

unread,

Dec 19, 2012, 1:02:08 AM12/19/12

to

On Tuesday, December 18, 2012 3:27:34 PM UTC-6, Tacit wrote:
> How to fuse over a stray instruction (or µop)?

I built a machine, one time, where the decoder would recognize:

Mov T,X
Mov X,Y
Mov Y,T
...
op T,-,-

and convert it into a decoded instruction cache as:

Mov X,Y
Mov Y,X ; // where the lack of a semicolon on the line above
// causes there to be no defined order between
// the two moves
...
op T,-,- // This gets rid of the visibility of T:=Y

This is something one cannot do with a von Neumann ISA, but something
any peephole optimizer could find--even a hardware one.

It would also find things like:

PUSH X
PUSH Y

and change it into

ST X,[SP-4]
ST Y,[SP-8]
SUB SP,SP,8

And now all three ops can fire at the same time.

Mitch
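A minimal sketch of the PUSH rewrite described above, written as a software peephole pass rather than decoder hardware. The function name `fuse_pushes`, the textual op format and the 4-byte word size are assumptions made for illustration; nothing here is taken from the actual machine.

```python
def fuse_pushes(ops, word=4):
    """Rewrite each run of two or more consecutive PUSH ops into
    independent stores followed by one stack-pointer adjustment."""
    out, run = [], []

    def flush():
        # A lone PUSH is left untouched; a run becomes ST ... ST SUB.
        # (Pushing SP itself would need extra care and is not handled.)
        if len(run) == 1:
            out.append(f"PUSH {run[0]}")
        elif run:
            for i, reg in enumerate(run, start=1):
                out.append(f"ST {reg},[SP-{i * word}]")
            out.append(f"SUB SP,SP,{len(run) * word}")
        run.clear()

    for op in ops:
        parts = op.split()
        if len(parts) == 2 and parts[0] == "PUSH":
            run.append(parts[1])   # keep collecting the run
        else:
            flush()                # run ended: emit its replacement
            out.append(op)
    flush()
    return out


print(fuse_pushes(["PUSH X", "PUSH Y"]))
```

None of the emitted ops reads a value produced by another op in the same group, which is what lets them issue together.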

Terje Mathisen

unread,

Dec 19, 2012, 6:18:10 AM12/19/12

to

Andy (Super) Glew wrote:
> However, as far as I know Intel hasn't yet built a real scatter/gather
> vector AGU yet. Larrabee took baby steps - perhaps MIC has. But AFAIK

LRB/MIC does have scatter/gather, I believe the performance used to be
limited by the regular cache interface:

I.e. if you can transfer N 64-byte cache lines from L1$ to the cpu per
cycle, then an optimal vector gather operation can pick all the vector
entries which happened to reside in one of those N entries, in a single
cycle.

In the next cycle you pick the next set of cache lines, looping until
all is transferred.

This should work quite well for mostly dense vectors or limited stride
array access, but in the worst case you have as many cache transfers as
you have vector entries.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
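The cache-line-limited gather model above can be put into a few lines. This is only a sketch of that model under its stated assumption (N whole 64-byte lines delivered per cycle, every element in a delivered line picked up for free); `gather_cycles` and its parameters are illustrative names, not a description of any real LRB/MIC implementation.

```python
def gather_cycles(byte_addresses, lines_per_cycle=1, line_bytes=64):
    """Cycles needed for one gather if the L1 can deliver
    `lines_per_cycle` whole cache lines per cycle."""
    distinct_lines = {addr // line_bytes for addr in byte_addresses}
    # Ceiling division: a partly filled last cycle still costs a cycle.
    return -(-len(distinct_lines) // lines_per_cycle)


dense = [4 * i for i in range(16)]      # 16 floats inside one line
sparse = [64 * i for i in range(16)]    # one element per line (worst case)
print(gather_cycles(dense), gather_cycles(sparse), gather_cycles(sparse, 2))
```

The dense case costs a single transfer, while the worst case needs as many transfers as there are vector entries (divided by N).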

Paul A. Clayton

unread,

Dec 19, 2012, 9:46:07 AM12/19/12

to

On Wednesday, December 19, 2012 6:18:10 AM UTC-5, Terje Mathisen wrote:
[snip]

> I.e. if you can transfer N 64-byte cache lines from L1$ to the cpu per
> cycle, then an optimal vector gather operation can pick all the vector
> entries which happened to reside in one of those N entries, in a single
> cycle.
>
> In the next cycle you pick the next set of cache lines, looping until
> all is transferred.
>
> This should work quite well for mostly dense vectors or limited stride
> array access, but in the worst case you have as many cache transfers as
> you have vector entries.

Dense vectors (at least those using a mask) and strided accesses
seem to be special cases that can be optimized somewhat easily.
(I was a little disappointed that the Alpha Tarantula proposal
did not have special support for stride-2 accesses. ISTR
reading that ARM's Neon has good support for array of structure
accesses such that something like RGB to YUV uses fewer
instructions, providing a unit stride load scattered across
multiple registers; but I could *very* easily be misremembering!)

(BTW, is there a common term for unit stride accesses that are
scattered across vector registers? I would guess that such a
term should also include the case where the structure members
are not the same size. Supporting efficient vectorization of
array of structures seems likely to be useful. Even just
supporting two-element structures could be useful, e.g., for
complex numbers.)

I think cache banking (IIRC, the Tarantula design included
this feature) would be needed for better scatter/gather
support. For HPC, using huge pages might be a practical way
to reduce TLB capacity and bandwidth issues (allowing more
accesses to be unified), but cache tag checking overhead
could be problematic. To reduce power use for tag accesses,
it might make sense to use hash/rehash (possibly with way
group prediction, possibly based on address or even a
software hint [possibly communicated by register number])
and/or partial tags (or some other filter).

I am curious how Haswell implements gather. Tarantula
used a different consistency domain for vector accesses
(a more auxiliary processor style design), but such is
less practical for Intel. The inclusion of gather
accesses in a load-store queue would seem to be expensive
in terms of power. (Are LSQs fully associative? Using
a few index bits would seem to allow significant power
saving for ordinary operation and support bank-based
pseudo-multiporting for gather [possibly also useful
if more aggressive address generation was used--plausible
for sp-relative accesses but perhaps not too WACI for
other accesses].)

Tacit

unread,

Dec 19, 2012, 4:16:19 PM12/19/12

to

On 19 Dec, 08:02, MitchAlsup <MitchAl...@aol.com> wrote:
> I built a machine, one time, where the decoder would recognize:
> Mov T,X
> Mov X,Y
> Mov Y,T
> ...
> op T,-,-

That's rarely needed in orthogonal ISAs, especially with XCHG present.

> It would also find things like:
> PUSH X
> PUSH Y
> and change it into
> ST X,[SP-4]
> ST Y,[SP-8]
> SUB SP,SP,8

A renamed SP and a stack engine (Intel's name: stack pointer tracker)
have long been implemented everywhere.

Tacit

unread,

Dec 19, 2012, 4:51:09 PM12/19/12

to

On 19 Dec, 16:46, "Paul A. Clayton" <paaronclay...@gmail.com> wrote:
> I think cache banking (IIRC, the Tarantula design included
> this feature) would be needed for better scatter/gather
> support.

Interestingly, Haswell removed address collisions for reads (2 per
clock), which means there are 3-ported banks (2R+1W) vs. 2-ported ones in
SB/IB. Also, 32-byte MAU ports require at least doubling the bank port
width (8 to 16 bytes); otherwise 3 32-byte accesses per clock would
need 12 banks. Now (with 16-byte accesses) up to 6 (of 8) banks can be
active. But I can't figure out how to check it once we get
one :)
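The bank count in that estimate is plain arithmetic. A small sketch, assuming every access is aligned and no two accesses in the same clock share a bank (`banks_touched` is a made-up helper name):

```python
def banks_touched(accesses_per_clock, access_bytes, bank_bytes):
    """Banks that must be active in one clock if each access is aligned
    and no two accesses share a bank (an assumption of this sketch)."""
    return accesses_per_clock * (access_bytes // bank_bytes)


print(banks_touched(3, 32, 8))    # 8-byte bank ports
print(banks_touched(3, 32, 16))   # 16-byte bank ports
```

With 8-byte ports, three 32-byte accesses would keep 12 banks busy; doubling the port width brings that down to 6.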

> I am curious how Haswell implements gather.

Say you've got one. How would you test it to know?

> The inclusion of gather
> accesses in a load-store queue would seem to be expensive
> in terms of power. (Are LSQs fully associative?

Yes. As any scheduler Q. But there can be separate gather Q.

> Using
> a few index bits would seem to allow significant power
> saving for ordinary operation and support bank-based
> pseudo-multiporting for gather

Well, looks like they've just got rid of pseudo-multiporting.

Joe keane

unread,

Dec 19, 2012, 6:02:59 PM12/19/12

to

In article <db14a6e0-be05-4d08...@googlegroups.com>,

MitchAlsup <Mitch...@aol.com> wrote:
>[Base+index<<scale+Displacement] is one of those things.
>Prefixes are one of those things.
>Segmentation is not one of those things.

It seems like almost everything in the 286 was pretty bad, and almost
everything in the 386 was pretty good. [I still think it's ripped off
the 360, but whatever.]

Robert Wessel

unread,

Dec 19, 2012, 6:53:04 PM12/19/12

to

On Wed, 19 Dec 2012 23:02:59 +0000 (UTC), j...@panix.com (Joe keane)
wrote:

S/360 had base+index+displacement, but never a scale. And many
instructions, including very common ones, only had a base+displacement
form.

VAX had a base+index*scale+displacement mode, although the scale was
fixed to the operand size of the instruction (so for a word-sized add,
the scale for the index would be 2).

Robert Wessel

unread,

Dec 19, 2012, 6:55:01 PM12/19/12

to

On Tue, 18 Dec 2012 11:46:37 -0800 (PST), Tacit
<tacit...@gmail.com> wrote:

PALCode on Alpha had certain similarities, and the millicode on Z
microprocessors is actually used to implement the non-hardwired part
of the ISA using a "hard" subset of the real ISA.

Paul A. Clayton

unread,

Dec 19, 2012, 10:55:05 PM12/19/12

to

On Wednesday, December 19, 2012 4:51:09 PM UTC-5, Tacit wrote:
> On 19 Dec, 16:46, "Paul A. Clayton" <paaronclay...@gmail.com> wrote:

[snip]

>> I am curious how Haswell implements gather.
>
> Say you've got one. How would you test it to know?

Reverse engineering with a scanning tunneling
microscope? :-)

It is amazing (to me at least) how some people can
extract so much information about microarchitecture
just from observing software performance.

>> The inclusion of gather
>> accesses in a load-store queue would seem to be expensive
>> in terms of power. (Are LSQs fully associative?
>
> Yes. As any scheduler Q. But there can be separate gather Q.

Not using full associativity could ISTM provide a
significant power savings with modest performance
impact at large-ish sizes. Non-unified instruction
schedulers are not that uncommon, afterall; so
dividing other structures could be practical.

MitchAlsup

unread,

Dec 20, 2012, 1:37:30 AM12/20/12

to

On Wednesday, December 19, 2012 5:53:04 PM UTC-6, robert...@yahoo.com wrote:
> VAX had a base+index*scale+displacement mode, although the scale was
> fixed to the operand size of the instruction (so for a word-sized add,
> the scale for the index would be 2).

I got edumacated on this in my early days with Moto in the 88100 era.
Scaling is good, fixing to the operand size is bad; having a displacement
is even better.

It is too bad doing this orthogonally takes so many opcode bits!

Mitch

Robert Wessel

unread,

Dec 20, 2012, 3:06:46 AM12/20/12

to

Then throw in that you really want a few operations with three inputs,
and some with two outputs, and pretty soon you're back to VAX...

Terje Mathisen

unread,

Dec 20, 2012, 3:31:10 AM12/20/12

to

Tacit wrote:

> On 19 Dec, 16:46, "Paul A. Clayton" <paaronclay...@gmail.com> wrote:
>> I think cache banking (IIRC, the Tarantula design included
>> this feature) would be needed for better scatter/gather
>> support.
>
> Interestingly, Haswell removed address collisions for reads (2 per
> clock), which means there are 3-ported banks (2R+1W) vs 2-ported in SB/
> IB. Also, 32-byte MAU ports also require at least doubling bank port
> widths (8 to 16 bytes), otherwise 3 32-byte accesses per clock would
> need 12 banks. Now (with 16-byte accesses) there can be up to 6 (of 8)
> banks active. But I can't figure out how to check it, when we'd get
> one :)
>
>> I am curious how Haswell implements gather.
>
> Say you've got one. How would you test it to know?

(I assume Haswell has a 32-byte vector load operation available?)

1) Set up a smallish area (4-8 KB, well less than L1$ size) and fill it
with sequential float values.

2) Set up various strides/patterns of access for vector loads, i.e.
offsets of (0,1,2,3,4,5) times the vector width, then the same with
stride 2, 3, 4, etc.

3) Then you do the same with the individual loads permuted, to check if
there is any difference between having the lowest address
first/last/intermediate.

4) Finally a set of random offsets.

Each vector load can be unrolled/repeated sufficiently to get rid of the
timing overhead.
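As a rough illustration of steps 2) to 4), here is one way to generate the index patterns. It only builds the index vectors; the timed loop around a real gather instruction is not shown. The function name, the 8-lane width, the 4-byte element size and the use of element indices instead of byte offsets are all assumptions of this sketch.

```python
import random


def gather_patterns(lanes=8, elem_bytes=4, area_bytes=4096,
                    strides=(1, 2, 3, 4), seed=0):
    """Build index vectors for a gather microbenchmark: strided,
    reversed, shuffled, and fully random, all inside a small area."""
    n_elems = area_bytes // elem_bytes
    rng = random.Random(seed)          # fixed seed keeps runs repeatable
    patterns = {}
    for s in strides:
        idx = [(lane * s) % n_elems for lane in range(lanes)]
        patterns[f"stride{s}"] = idx                   # lowest address first
        patterns[f"stride{s}_reversed"] = idx[::-1]    # lowest address last
        shuffled = idx[:]
        rng.shuffle(shuffled)
        patterns[f"stride{s}_shuffled"] = shuffled     # lowest address anywhere
    patterns["random"] = [rng.randrange(n_elems) for _ in range(lanes)]
    return patterns


for name, idx in gather_patterns().items():
    print(name, idx)
```

Comparing timings for the same set of addresses in different lane orders is what separates an address-order effect from a pure cache-line-count effect.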

Andy (Super) Glew

unread,

Dec 20, 2012, 6:17:38 AM12/20/12

to Tacit

If you decode and rename these instructions together,

> OP1 r1 <- r2, r3; — r1 here is not written in RF but bypassed to OP2
> OP2 r1 <- r1, 123

then you know that the old value of r1 is overwritten.

If OP1 and OP2 are ALU instructions, ADDs, then they cannot get
exceptions or traps. (Not memory, just ALU.)

You must prevent external interrupts and breakpoints and singlestepping.

Here's the rub: most instructions cannot fault, except for loads and
stores. Floating point. etc.
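A toy version of that check, to make the conditions concrete. It is a sketch only: the `Op` record, the `can_fault` flag and the function name are invented here, it assumes readers between producer and overwriter get the value from the bypass network, and it assumes interrupts, breakpoints and single-stepping are held off for the whole group.

```python
from collections import namedtuple

# One decoded instruction: name, destination register, sources, fault flag.
Op = namedtuple("Op", "name dest srcs can_fault")


def bypass_only_results(group):
    """Indices of ops in one decode/rename group whose result never
    needs to be written to the register file."""
    skippable = []
    for i, op in enumerate(group):
        for later in group[i + 1:]:
            if later.can_fault:
                break                  # a fault here could expose op.dest
            if later.dest == op.dest:
                skippable.append(i)    # overwritten before anyone can look
                break
    return skippable


group = [Op("OP1", "r1", ("r2", "r3"), False),
         Op("OP2", "r1", ("r1",), False)]
print(bypass_only_results(group))
```

If the overwriting op can itself fault (a load into r1, say), the check gives up, since the old r1 would have to be visible to the handler.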

Andy (Super) Glew

unread,

Dec 20, 2012, 6:25:52 AM12/20/12

to

Two separate issues:

a) microcode storage, microcode privilege: ROM, holding "HAL" or
PALcode sequences. Sequences expressed in a RISC language, but probably
with special security and atomicity rules.

and

b) microcode format and ISA definition: there are things that you would
never want to put in a RISC, that are appropriate in microcode.

Or, rather: that you might imagine putting in a RISC or VLIW, but which
you would regret within a few years. But which you don't regret with
microcode.

Andy (Super) Glew

unread,

Dec 20, 2012, 6:37:58 AM12/20/12

to

On 12/18/2012 1:27 PM, Tacit wrote:
> On 18 Dec, 09:58, "Andy (Super) Glew" <a...@SPAM.comp-arch.net> wrote:

>
>> However, for a decoded CAM, in a bitmatrix scheduler, yes, firing twice
>> does require the equivalent of two mask-CAMs.
>
> Decoded CAM would be huge and power inefficient for 3-4 ports, 3
> sources, 100+ µops and 100+ PRF regs. Bulldozer FPU scheduler has 1080
> 8-bit comparators (6 writes * 3 sources * 60 µops).

Decoded CAM is not affected by #ports. In fact, decoded CAM handles
multiple inputs - 3, 4, 5, 6, as many as you want.
100x100 => roughly 10K bits, each bit roughly 4 transistors, NAND.
Plus overhead for the ANDing across the row.

Encoded CAM = inputs * outputs * uops.
1080 8 bit comparators = 8K XORs => roughly 64K transistors.
Depends on how costly your XOR is.

Decoded CAM doesn't scale well, since it is Nuops^2.
But it can be surprisingly efficient, surprisingly far.
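The back-of-envelope numbers above, spelled out. The 4 transistors per matrix bit come from the estimate itself; the 8 transistors per XOR is an assumption back-derived from "8K XORs => roughly 64K transistors". Row-AND overhead and match-line logic are ignored, and the function names are made up.

```python
def decoded_cam_transistors(rows, cols, transistors_per_bit=4):
    """Bit-matrix (decoded) wakeup: one small cell per row/column pair."""
    return rows * cols * transistors_per_bit


def encoded_cam_transistors(write_ports, sources, uops,
                            tag_bits=8, transistors_per_xor=8):
    """Tag-comparing (encoded) wakeup: one comparator per
    write port x source x scheduler entry."""
    comparators = write_ports * sources * uops
    return comparators, comparators * tag_bits * transistors_per_xor


print(decoded_cam_transistors(100, 100))
print(encoded_cam_transistors(6, 3, 60))
```

On these assumptions the 100x100 matrix comes to 40,000 transistors against 69,120 for the 1080 comparators, which is the sense in which the comparison is not cut and dried.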

I am sure the Bulldozer people did the math.
But it is not so cut and dried as you think.

(By the way, I am also sure that they did the math in Year N.
And that the math was inaccurate when they shipped, several years later
than the original target date. Q: which direction are the trends going?
A: in the direction of a) shorter wires, but b) more gates per clock.)

Andy (Super) Glew

unread,

Dec 20, 2012, 6:50:01 AM12/20/12

to

Right.

So if we had hardware that did the RMW optimization that I mentioned,
CSE'ing the address, converting the above to

ecx := load-and-store-address(esi)
ecx := std-and-add( ecx, eax )

this would be where hardware would try to do fusion between
instructions.

Mitch's hardware could do this.

Or, Ivan points out that on a TTA (I hate that term):
the compiler could emit

MAddreReg_i := esi
ecx, MDestReg_i := (ecx := MSrcReg_i) + eax

Trouble is, neither Ivan nor I know how to make a TTA cheap enough.
Look at how expensive that second instruction is in terms of bits. How
many bits are in the target regs for memory?

Note that the TTA divides the functionality differently.

It is
address :=
reg := store := (reg:=load) + reg

rather than
reg := load-and-store address
reg := storedata( reg + reg )

i.e. TTA is imbalanced

Yes, I am trying to figure out how to modify TTA to be as nice as ucode.

But it always seems to take more bits, unless you have only 1 or 2
memory target register-tuples.

And TTA is always a lossage for non-memory - I haven't found a place
where it wins.

Michael S

unread,

Dec 20, 2012, 9:53:54 AM12/20/12

to

On Dec 20, 1:02 am, j...@panix.com (Joe keane) wrote:
> In article <db14a6e0-be05-4d08...@googlegroups.com>,
>

> MitchAlsup <MitchAl...@aol.com> wrote:
> >[Base+index<<scale+Displacement] is one of those things.
> >Prefixes are one of those things.
> >Segmentation is not one of those things.
>
> It seems like almost everything in the 286 was pretty bad, and almost
> everything in the 386 was pretty good. [I still think it's ripped off
> the 360, but whatever.]

I also don't like the 286 ISA additions. But if you leave protected mode
aside and treat the 286 as a "faster 8086", it was a pretty good CPU for its
time frame and transistor budget. In particular, the 286 bus had a
well-thought-out pipelined mode, something that the 386 failed to borrow.
Maybe the 286 was not popular amongst the Intel Santa Clara engineers
working on the 386 (and even less so in Intel Oregon), but being the first
x86 CPU that was not dead slow, it served as a trailblazer for the 386.

Paul A. Clayton

unread,

Dec 20, 2012, 11:36:33 AM12/20/12

to

On Thursday, December 20, 2012 6:17:38 AM UTC-5, Andy (Super) Glew wrote:
[snip]

> If OP1 and OP2 are ALU instructions, ADDs, then they cannot get
> exceptions or traps. (Not memory, just ALU.)
>
> You must prevent external interrupts and breakpoints and singlestepping.

ISTM that one could do something like what POWER4 did
with its instruction bundling as an ROB optimization.
An exception just generates a replay (ISTR that POWER4
went into single-step mode for the instructions from
the start of the bundle to the exception point rather
than having a more complex bundle formation that
placed all those instructions in a single bundle.).

Exceptions should be sufficiently rare that
refetching from an instruction cache that did not
include such fusing would not represent excessive
overhead. Alternatively, a method to crack such
fusions could be provided--it would not have to be
fast or efficient--; this could avoid the 'need' for
an inclusive (non-fused) Icache.

Asynchronous exceptions should not be a problem.

> Here's the rub: most instructions cannot fault, except for loads and
> stores. Floating point. etc.

If one did something crazy like supporting precise
interrupts on performance counters, then any instruction
could generate an interrupt.

Ivan Godard

unread,

Dec 20, 2012, 12:13:16 PM12/20/12

to

Why is non-operand-size scaling useful? The only case I am familiar with
is A[i].f when the array element size is within one of the available
scalings and the selected field size is other than the array element
size; the canonical example is arrays of complex and the reference is
A[i].re. However, I've never heard that the case is common enough in
open code to be worth the encoding entropy, while in loops the explicit
shift vanishes into the software pipeline.

Your experience is different?

Ivan

Tacit

unread,

Dec 20, 2012, 12:28:33 PM12/20/12

to

On 20 Dec, 05:55, "Paul A. Clayton" <paaronclay...@gmail.com> wrote:
> >> I am curious how Haswell implements gather.
>
> > Say you've got one. How would you test it to know?
>
> Reverse engineering with a scanning tunneling
> microscope? :-)
>
> It is amazing (to me at least) how some people can
> extract so much information about microarchitecture
> just from observing software performance.

Well, that's what Agner and I were doing for some time :-) However,
testing is one thing, but analyzing the results is another. You can
easily miss an important conclusion while staring at a table with
thousands of numbers.

> Not using full associativity could ISTM provide a
> significant power savings with modest performance
> impact at large-ish sizes.

Correct, but we are talking about 48-72-entry queues. Making them
set-associative will cause underuse in many cases. OTOH, Intel's L1 TLB
has 64 entries and 4 ways.

Tacit

unread,

Dec 20, 2012, 12:33:38 PM12/20/12

to

On 20 Dec, 08:37, MitchAlsup <MitchAl...@aol.com> wrote:
> Scaling is good, fixing to the operand size is bad; having a displacement
> is even better.

Why is «fixing to the operand size» bad? If we have aligned word
access, will there be any cases with a scale other than the word size?
E.g. reading 8 bytes needs a scale of either 1 (no scale) or 8 (word-size
scale), but not 2 or 4. So it's enough to select between «scale
present or not», that is, «byte or word addressing». x86 could save 1
bit :-)

Tacit

unread,

Dec 20, 2012, 12:46:40 PM12/20/12

to

On 20 Dec, 10:31, Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
> > Say you've got one. How would you test it to know?
>
> (I assume Haswell has a 32-byte vector load operation available?)

Sure, even AVX1 has it. HWL will support AVX2 too.

> 1) Set up a smallish area (4-8 KB, well less than L1$ size) and fill it
> with sequential float values.

DP or SP? Why not integers?

> Each vector load can be unrolled/repeated sufficiently to get rid of the
> timing overhead.

So, we've got tons of numbers. Now what? It's possible that the
presence of banks can be made invisible even for latencies. An Intel
source said they removed bank conflicts, but not banks.

Tacit

unread,

Dec 20, 2012, 12:54:41 PM12/20/12

to

On 20 Dec, 13:17, "Andy (Super) Glew" <a...@SPAM.comp-arch.net> wrote:
> If you decode and rename these instructions together,
> > OP1 r1 <- r2, r3; — r1 here is not written in RF but bypassed to OP2
> > OP2 r1 <- r1, 123
> then you know that the old value of r1 is overwritten.

In this case, yes.

> You must prevent external interrupts and breakpoints and singlestepping.
> Here's the rub: most instructions cannot fault, except for loads and
> stores. Floating point. etc.

Can you say whether any CPU uses such a power-saving feature? (Because it
wouldn't help the speed.) It needs some look-ahead analysis: will any
future µop overwrite the current destination? If yes, can there be an
interrupt of some sort between them? I assume such checking logic can
consume more than the RF write port savings.

MitchAlsup

unread,

Dec 20, 2012, 12:58:20 PM12/20/12

to

On Thursday, December 20, 2012 11:13:16 AM UTC-6, Ivan Godard wrote:
> Why is non-operand-size scaling useful?

Consider an array of 2 word structs: array[i].field.
You want the scale factor to be sizeof( array[0] ).

Also consider when you want the index to be unscaled,
and you want memory[base+index+disp].

Mitch
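The array-of-structs case can be checked with a few lines of address arithmetic. This is a sketch with invented names; it assumes 4-byte words, so a 2-word struct is 8 bytes, and it models only the address calculation, not any particular ISA encoding.

```python
def effective_address(base, index, disp=0, scale=1):
    """Generic base + index*scale + displacement address calculation."""
    return base + index * scale + disp


WORD = 4                   # assumed word size in bytes
STRUCT_SIZE = 2 * WORD     # array of 2-word structs
FIELD_OFFSET = WORD        # the second word of each struct

base, i = 1000, 3

# Free choice of scale: scale = sizeof(array[0]), field offset as displacement.
free_scale = effective_address(base, i, disp=FIELD_OFFSET, scale=STRUCT_SIZE)

# Scale fixed to the 4-byte operand size: using i directly lands on the wrong
# element, so the index must first be doubled by a separate instruction.
fixed_wrong = effective_address(base, i, disp=FIELD_OFFSET, scale=WORD)
fixed_fixed = effective_address(base, i * 2, disp=FIELD_OFFSET, scale=WORD)

# Unscaled index: memory[base + index + disp].
unscaled = effective_address(base, i, disp=FIELD_OFFSET)

print(free_scale, fixed_wrong, fixed_fixed, unscaled)
```

The extra doubling in the fixed-scale case is the separate shift whose cost is being debated in this subthread.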

Tacit

unread,

Dec 20, 2012, 1:10:30 PM12/20/12

to

On 20 Dec, 13:37, "Andy (Super) Glew" <a...@SPAM.comp-arch.net> wrote:
> And that the math was inaccurate when they shipped, several years later
> than the original target date. Q: which direction are the trends going?
> A: in the direction of a) shorter wires, but b) more gates per clock.

They understood a), but chose to ignore b), using low-FO4 pipeline
stages to boost the frequency (at the cost of active TDP). But that's
not the main spoiler. IMO, they made several crude mistakes in the
architecture itself.

Robert Wessel

unread,

Dec 20, 2012, 1:10:59 PM12/20/12

to

On Thu, 20 Dec 2012 09:33:38 -0800 (PST), Tacit
<tacit...@gmail.com> wrote:

Lack of unaligned accesses is evil.

Tacit

unread,

Dec 20, 2012, 1:34:05 PM12/20/12

to

On 20 Dec, 16:53, Michael S <already5cho...@yahoo.com> wrote:
> I also don't like the 286 ISA additions. But if you leave protected mode
> aside and treat the 286 as a "faster 8086", it was a pretty good CPU for its
> time frame and transistor budget.

Not for its transistor budget! Other HW people and I tried to
figure out how Intel managed to waste a whole 134,000 transistors
on a 16-bit CPU (CISC, but still) with a dynamic core (i.e. not
static CMOS, but n-MOS, which takes less space for the same logic),
while there were about 10 examples of 32-bit CPUs (several
CISCs too) that took fewer than that and worked just as fast. Like
the 68000, released 3 years before the 286.

> The 286 bus had a well-thought-out pipelined mode, something that the 386 failed to borrow.

It couldn't. Parity checking, byte mask, cache could be present…

> Maybe the 286 was not popular amongst the Intel Santa Clara engineers
> working on the 386 (and even less so in Intel Oregon), but being the first
> x86 CPU that was not dead slow, it served as a trailblazer for the 386.

They could have made a «Celeron-286» by replacing the ill-fated protected
mode with an 8-bit offset for the segment base address, which would have
enabled a 24-bit memory space and saved a lot of die area and power. But
such a chip would have spoiled the glory of the 386 :-)

Ivan Godard

unread,

Dec 20, 2012, 1:52:58 PM12/20/12

to

Agreed that unscaled is necessary; the Mill has it. Otherwise, your
example for a scale-different-from-load-size is the same as the one I
gave - I just never found it to be profitable compared to an explicit
scaling shift as a separate operation.

Of course, the Mill is much wider issue than any x86, so the shift op
vanishes into a software pipeline. In open code the explicit shift adds
one ALU op to the dependency chain, but it's going to add a cycle to do
the shift when the scaling is in the load op too, so the latency impact
of an explicit shift op vs. a scale factor in the load is the same and
any difference is encoding entropy and decode cost. Often the compiler
CSE's out the shift too.

Ah! maybe that's why our experience is different: on the Mill an extra
dependent ALU op adds only one cycle to the whole chain, whereas on an
OOO the extra op adds a writeback cycle and another issue cycle besides
the execute cycle. I'd think hard about moving the scale factor into the
load if an explicit shift op cost me three or four added cycles before
the load itself went out.

I guess that we both only deal with power-of-two scale factors, i.e.
shift, and if the struct size is uncooperative we just punt and do the
multiply. That so?

Having fast dependency chains changes a lot of how you want to do
things. :-)

Tacit

unread,

Dec 20, 2012, 1:59:16 PM12/20/12

to

On 20 Dec, 19:58, MitchAlsup <MitchAl...@aol.com> wrote:
> Consider an array of 2 word structs: array[i].field.
> You want the scale factor to be sizeof( array[0] ).

That's the same as with vector addressing. Vectors will grow, as
structs do. So you'll need more bits to encode bigger scales: 16, 32,
… And 1 bit just selects «scale or not» for any word. In rare cases
LEA and shifts will help.

Tacit

unread,

Dec 20, 2012, 2:00:43 PM12/20/12

to

On 20 Dec, 20:10, Robert Wessel <robertwess...@yahoo.com> wrote:
> Lack of unaligned accesses is evil.

Agree. So you'll have it, with byte-granular addresses. Just put 0 in
that bit :-)

Robert Wessel

unread,

Dec 20, 2012, 2:15:44 PM12/20/12

to

On Thu, 20 Dec 2012 10:34:05 -0800 (PST), Tacit
<tacit...@gmail.com> wrote:

I think you're forgetting just how successful 16-bit protected mode
actually was. That was what pretty much all Windows apps ran in
from Win3.0 until Win95 started getting a bunch of native
development.

Yes the OS preferred to run in 386 mode, but exposed very, very little
of that to conventional 16 bit windows applications.

In any event, having 8-bit offset segments would have created a
noteworthy software incompatibility with prior versions. Perhaps a
mode bit.

OTOH, much of the pain of the 286 could have been eliminated with a
few simple changes. For example, allowing stores into CS (perhaps a
mode bit?) would have allowed much easier porting of real mode code
(you'd still get hit in places where you did arithmetic on selectors).
For arithmetic on selectors, you could have eased the problems by
moving the table ID and RPL to the top of the selector, and then
making the TI for the LDT 0 instead of 1, and then an OS could, for
compatibility set up an LDT that mapped 128KB with exactly the same
set of selector values as you'd use in real mode. Combine that with
the ability to store into CS, and you ought to be darn close to
executing unchanged real mode programs that could live in the first
128KB, and didn't hit I/O devices directly. Obviously 128KB would not
have been enough, but some other small extensions ought to be able to
expand that.
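To make the selector rearrangement concrete, here is a small sketch. The real 286 layout (index in bits 15..3, TI in bit 2, RPL in bits 1..0) is as documented; the rearranged layout is hypothetical, and the exact order of TI and RPL within the top three bits, along with all function names, is an assumption of this sketch.

```python
def real_mode_linear(segment, offset):
    """8086 real-mode address: segment * 16 + offset, wrapped to 20 bits."""
    return ((segment << 4) + offset) & 0xFFFFF


def decode_286_selector(sel):
    """Actual 286 selector fields."""
    return {"index": sel >> 3, "ti": (sel >> 2) & 1, "rpl": sel & 3}


def decode_hypothetical_selector(sel):
    """Hypothetical layout: RPL and TI on top, index in the low 13 bits,
    with TI = 0 meaning LDT."""
    return {"index": sel & 0x1FFF, "ti": (sel >> 13) & 1, "rpl": sel >> 14}


def hypothetical_linear(sel, offset):
    """Address under a compatibility LDT whose entry i has base i * 16."""
    fields = decode_hypothetical_selector(sel)
    if fields["ti"] != 0 or fields["rpl"] != 0:
        raise ValueError("selector is outside the compatibility range")
    return (fields["index"] << 4) + offset


for seg in (0x0000, 0x1234, 0x1FFF):
    print(hex(seg), hex(real_mode_linear(seg, 0x10)),
          hex(hypothetical_linear(seg, 0x10)))
print(0x2000 * 16)   # bytes of distinct segment bases covered
```

The 13 index bits give 8192 selectors with bases 16 bytes apart, which is where the 128KB figure comes from; with the actual layout the low three bits are not part of the index, so arithmetic on selectors does not map that way.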

That's all ancient history, but the 286 protected mode was clearly a
lot more difficult to port to from real mode x86 than was necessary.

Terje Mathisen

unread,

Dec 20, 2012, 2:24:49 PM12/20/12

to

Tacit wrote:

> On 20 Dec, 10:31, Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>>> Say you've got one. How would you test it to know?
>>
>> (I assume Haswell has a 32-byte vector load operation available?)
>
> Sure, even AVX1 have it. HWL will support AVX2 too.
>
>> 1) Set up a smallish area (4-8 KB, well less than L1$ size) and fill it
>> with sequential float values.
>
> DP or SP? Why not integers?

float or int32_t, doesn't really matter. :-)

>
>> Each vector load can be unrolled/repeated sufficiently to get rid of the
>> timing overhead.
>
> So, we've got tons of numbers. Now what? It's possible that the
> presence of banks can be made invisible even for latencies. An Intel
> source said they removed bank conflicts, but not banks.

Latencies are what you are looking for!

If we do a vector gather of 32-bit ints, then you can use those ints in
the next vector load, at which point it will be very obvious when a
pattern causes an additional cycle of latency per gather.
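The dependency trick can be modelled in a few lines. This only shows the data flow, not the timing: in a real test the loop body would be a hardware gather and the whole loop would be timed. Filling the table with sequential ints makes each gather return exactly the index vector it was given, so the pattern stays fixed while every gather still depends on the previous one. All names are made up for this sketch.

```python
def gather(table, indices):
    """Software stand-in for a vector gather of 32-bit ints."""
    return [table[i] for i in indices]


def dependent_gather_chain(pattern, table_size=1024, steps=100):
    """Run `steps` gathers where each one's indices are the previous
    gather's result, so the gathers cannot overlap in time."""
    table = list(range(table_size))    # table[i] == i
    idx = list(pattern)
    for _ in range(steps):
        idx = gather(table, idx)       # next addresses come from loaded data
    return idx


print(dependent_gather_chain([0, 16, 32, 48]))
```

Dividing the measured time by `steps` then gives a per-gather latency for that particular pattern.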

nm...@cam.ac.uk

unread,

Dec 20, 2012, 3:05:21 PM12/20/12

to

In article <t1o6d856mdmia5m3o...@4ax.com>,

Robert Wessel <robert...@yahoo.com> wrote:
>
>I think you're forgetting just how successful 16-bit protected mode
>actually was. That was what pretty much all Windows apps ran in
>from Win3.0 until Win95 started getting a bunch of native
>development.

I'm not. It was a technical horror, and its commercial success
was almost entirely due to the fact that IBM had won the marketing
war for Microsoft and the majority of the market had jumped on the
bandwagon. All right, it was an improvement over the previous
state, but that would have been hard to avoid.

Regards,
Nick Maclaren.

Robert Wessel

unread,

Dec 20, 2012, 3:15:06 PM12/20/12

to

Neither am I - having spent far too much of my life wrestling with it.
But Tacit said "ill fated", not "technical horror". iAPX432 and IPF
would be ill fated (as well as technical horrors).

But my main point was that it was worse than it had to be, for trivial
reasons.

Tacit

unread,

Dec 20, 2012, 3:48:08 PM12/20/12

to

On 20 Dec, 21:15, Robert Wessel <robertwess...@yahoo.com> wrote:
> I think you're forgetting just how successful 16-bit protected mode
> actually was. That was what pretty much all Windows apps ran in
> from Win3.0 until Win95

That's OK, but remind everyone when Win3.0 and the 286 came out? ;-) How
many people still used DOS then?

> having 8-bit offset segments would have created a
> noteworthy software incompatibility with prior versions. Perhaps a
> mode bit.

Yes, I mean that.

Tacit

unread,

Dec 20, 2012, 3:51:46 PM12/20/12

to

On 20 Dec, 21:24, Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
> Latencies are what you are looking for!
> If we do a vector gather of 32-bit ints, then you can use those ints in
> the next vector load, at which point it will be very obvious when a
> pattern causes an additional cycle of latency per gather.

I still don't see how we can measure the bank size and the number of
banks. Even if the most likely values for bank size are just 16 and 32.

Robert Wessel

unread,

Dec 20, 2012, 4:16:55 PM12/20/12

to

On Thu, 20 Dec 2012 12:48:08 -0800 (PST), Tacit
<tacit...@gmail.com> wrote:

>On 20 Dec, 21:15, Robert Wessel <robertwess...@yahoo.com> wrote:
>> I think you're forgetting just how successful 16-bit protected mode
>> actually was. That was what pretty much all Windows apps ran in
>> from Win3.0 until Win95
>
>That's OK, but remind everyone when Win3.0 and the 286 came out? ;-) How
>many people still used DOS then?

Not quite a fair question, since Win3.x required MS-DOS to run. It's
hard to say, but it's possible that between 1990 and 1995, the
majority of the world software development effort was for 16 bit x86
protected mode.

nm...@cam.ac.uk

Dec 20, 2012, 5:12:52 PM

In article <d6s6d8pv9ha7kjsg1...@4ax.com>,

Ah. Sorry for misunderstanding. Yes, agreed.

Regards,
Nick Maclaren.

Paul A. Clayton

Dec 20, 2012, 6:48:08 PM

On Thursday, December 20, 2012 1:37:30 AM UTC-5, MitchAlsup wrote:
> On Wednesday, December 19, 2012 5:53:04 PM UTC-6, robert...@yahoo.com wrote:
>> VAX had a base+index*scale+displacement mode, although the scale was
>> fixed to the operand size of the instruction (so for a word-sized add,
>> the scale for the index would be 2).
>
> I got edumacated on this in my early days with Moto in the 88100 era.

> Scaling is good, fixing to the operand size is bad; having a displacement
> is even better.
>

> It is too bad doing this orthogonally takes so many opcode bits!

Could one use the register ID to encode some of this
information? Such would increase the complexity of
register allocation (effectively introducing special
purpose registers), but with 5-bit register IDs it might
be possible to extract 3 bits for such use--even 2 bits
would support shifting the index by 0, 1S, 2S, or 4S
(where S is the log2 of the operand size). (One might
be able to add a bit or _possibly_ two if both the base
and index register IDs were used.)

Doing this for hints was suggested in Hans Vandierendonck
and Koen De Bosschere's "Implicit Hints: Embedding Hint
Bits in Programs without ISA Changes" (2010).

Such special-purposing of registers might not be so bad
for relatively uncommon operations that can be
synthesized (without register ID concerns) by two
instructions (or provided by a much larger single
instruction). (Even for somewhat common operations, such
might not be so bad if code density was extremely
important.)

Robert Wessel

Dec 20, 2012, 8:21:41 PM

Well you could, but that would by definition pretty much trash any
hope of orthogonality, wouldn't it? Whether or not that would be an
acceptable tradeoff is a different question. But yes, the compiler
guys hate that sort of thing.

Paul A. Clayton

Dec 20, 2012, 9:15:41 PM

On Thursday, December 20, 2012 8:21:41 PM UTC-5, robert...@yahoo.com wrote:
> On Thu, 20 Dec 2012 15:48:08 -0800 (PST), "Paul A. Clayton"
> <paaron...@gmail.com> wrote:
>
> >On Thursday, December 20, 2012 1:37:30 AM UTC-5, MitchAlsup wrote:

[snip]

>>> It is too bad doing this orthogonally takes so many opcode bits!
>>
>> Could one use the register ID to encode some of this
>> information? Such would increase the complexity of
>> register allocation (effectively introducing special
>> purpose registers), but with 5-bit register IDs it might
>> be possible to extract 3 bits for such use--even 2 bits
>> would support shifting the index by 0, 1S, 2S, or 4S
>> (where S is the log2 of the operand size). (One might
>> be able to add a bit or _possibly_ two if both the base
>> and index register IDs were used.)

[snip]

> Well you could, but that would by definition pretty much trash any
> hope of orthogonality, wouldn't it?

Yes, or at least trade orthogonality in register use
for greater orthogonality in shift amount relative to
operand size.

> Whether or not that would be an
> acceptable tradeoff is a different question. But yes, the compiler
> guys hate that sort of thing.

So I have read. Variable length instructions would
probably be much more attractive for compiler writers
(though for M88k, which Mitch referenced, getting VL
instructions approved might well have been as
difficult).

Unfortunately, such could not just be implemented as
a relatively simple filter that fused instructions
since at least some cases would add a temporary that
added spill/fill code and operations may have been
separated to provide more efficient superscalar
execution. (Even so, it might not be *that*
difficult to produce a filter that fused most
instances. However, I suspect that the benefit is
too small to justify such efforts.)

Terje Mathisen

Dec 21, 2012, 2:57:18 AM

Tacit wrote:

I don't know how to get an exact answer, but I would try with various
combinations of stride 1 vs stride 1 modulo possible bank size.

I.e. load 32-bit values from offsets 0,4,8,12... vs 0,36,8,44 or
0,68,8,76 etc.
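
As a rough sketch of how such index vectors could be generated: the
helper below builds unit-stride offsets and pushes every odd element
out by a candidate span (32 or 64 bytes), which reproduces the patterns
above. The bank_of() function is only an assumed model (banks
interleaved on consecutive fixed-width chunks), not a description of
any particular cache, and both function names are made up for this
example.

```python
def gather_offsets(count, skew, elem_size=4):
    """Unit-stride byte offsets with every odd element moved out by `skew` bytes."""
    return [i * elem_size + (skew if i % 2 else 0) for i in range(count)]


def bank_of(offset, bank_width, num_banks):
    """Assumed model: banks interleave on consecutive `bank_width`-byte chunks."""
    return (offset // bank_width) % num_banks


if __name__ == "__main__":
    for skew in (0, 32, 64):
        offsets = gather_offsets(4, skew)
        # Try one guess: 4-byte banks, 8 of them (a 32-byte span).
        banks = [bank_of(o, bank_width=4, num_banks=8) for o in offsets]
        print(skew, offsets, banks)
```

The timing itself would still have to be done with a dependent chain of
gathers as described earlier; this only produces the offsets to feed
into it and shows which bank each one would hit under a given guess.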

Tacit

Dec 21, 2012, 7:58:50 AM

All right, we're far from the original question, so let's get back to
it. How and where would it be best to issue a write µop (unfused, or
fused and double-issued) among the three options mentioned above
(A, B, C)? And why do CPU vendors use such different tactics for it?

Michael S

Dec 21, 2012, 9:21:17 AM

I think that for integer stores, Power7 (in ST and SMT2 modes) does the
following:
D. Stores are kept in the (unified) issue queue as a single uOP. Each
store is issued twice - to either of the two LSUs for AGEN and to either
of the two FXUs for data steering. Two stores per clock are perfectly legal.
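
A toy model of option D, just to illustrate the double-issue idea. It
assumes that operands are always ready, that the AGEN and data halves
of a store may issue independently (even in the same cycle), and that
selection is oldest-first. None of that is claimed to match the real
Power7 scheduler, and all names are invented for the sketch.

```python
from dataclasses import dataclass


@dataclass
class StoreUop:
    """One issue-queue entry that must be issued twice (AGEN half and data half)."""
    name: str
    agen_done: bool = False
    data_done: bool = False


def issue_cycle(queue, lsu_ports=2, fxu_ports=2):
    """Issue pending halves oldest-first; return stores whose halves are both done."""
    lsu_free, fxu_free = lsu_ports, fxu_ports
    for uop in queue:
        if not uop.agen_done and lsu_free:
            uop.agen_done = True  # AGEN half goes to a free LSU port
            lsu_free -= 1
        if not uop.data_done and fxu_free:
            uop.data_done = True  # data-steering half goes to a free FXU port
            fxu_free -= 1
    done = [u for u in queue if u.agen_done and u.data_done]
    queue[:] = [u for u in queue if not (u.agen_done and u.data_done)]
    return done


def cycles_to_drain(n_stores, lsu_ports=2, fxu_ports=2):
    """Cycles needed to issue both halves of n_stores back-to-back stores."""
    queue = [StoreUop(f"st{i}") for i in range(n_stores)]
    cycles = 0
    while queue:
        issue_cycle(queue, lsu_ports, fxu_ports)
        cycles += 1
    return cycles
```

With two ports of each kind the model sustains two stores per cycle;
dropping either port count to one halves the store throughput, which is
the kind of port conflict the original question is about.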

Tacit

Dec 21, 2012, 11:59:33 AM

On 21 Dec, 16:21, Michael S <already5cho...@yahoo.com> wrote:
> I think that for integer stores, Power7 (in ST and SMT2 modes) does the
> following:
> D. Stores are kept in the (unified) issue queue as a single uOP. Each
> store is issued twice - to either of the two LSUs for AGEN and to either
> of the two FXUs for data steering. Two stores per clock are perfectly legal.

OK, so it's a variant of option C: «Select ANY of the computational
ports for store mops», but with 2 stores/clock. (BTW, is the cache
itself capable of 2 stores/clock?) But other CPUs have one selected
port that «also stores» (B), or even a port that «only stores» (A).
Why is that?

Joe keane

Dec 21, 2012, 12:30:48 PM

In article <50D2FB69...@SPAM.comp-arch.net>,
Andy (Super) Glew <an...@SPAM.comp-arch.net> wrote:
>Look at how expensive that second instruction is in terms of bits. How
>many bits are in the target regs for memory?

What registers?

Let's say I write in VAX:

addl 4(r0), 8(r1)[r2]

How many registers do you need to write back?