microcode and micro-ops

Frank Yang

unread,

Sep 20, 2013, 6:54:08 PM9/20/13

to

Hi,

I am confused about the microcode and micro-ops in X86. As far as what I understand, microcode is just a piece of software code stored in the ROM for some pre-defined routines - more like a firmware - while micro-ops are the actual micro-instructions executed inside the pipeline.

Is that correct? If so, when do we need to run the microcode?

Thanks,
Frank

Paul A. Clayton

unread,

Sep 20, 2013, 11:44:09 PM9/20/13

to

Basically that is correct. Microcode is a convenient way to provide
a higher level abstraction to software that allows optimizations
specific to a hardware implementation and optimizations that would
not be safe in generic code. Microcode can also allow the providing
of an instruction that is too complex for a direct hardware
implementation when initially provided or for specific products
which instruction might later be directly implemented.

Note that the complexity applies both to the amount of hardware
required to implement the functionality and the risk of incorrect
operation. Microcode can shift the time of commitment forward,
even being able to fix products that have already shipped.
Microcode may also involve a trade-off between execution efficiency
of a special case (a hard-wired implementation would be more
efficient than a microcoded implementation) and the efficiency of
more general cases (which may be made less power efficient or
slower by adding extra control logic or extra functional units).

(The common implementations of microcode also provide guarantees
about code caching--the microcode sequence will always be "cache
resident" in the sense that the code is on-chip.)

Microcode is used for complex instructions such instructions using
the REP prefixes. Microcode can also be used for handling special
exceptional circumstances, allowing the hardware to be simplified.
Handling of floating-point denormal operands is one case where
microcode has been used by generating a trap to microcode.

Microcode can use operations and state that are not defined in the
ISA. This allows flexibility in the definition of the supported
operations and state since backward compatibility is not required
for microcode. Microcode can also perform operations in less
privileged modes which would normally require greater privilege
(such as temporarily disabling interrupts or accessing special
purpose registers) because the code is known not to abuse such
privilege.

(The Alpha ISA's Privileged Architecture Library code performed
a similar function. Such code was hyper-privileged, could run with
interrupts disabled, and could use implementation-specific features.
However, such was targeted to system-level operations not common
user-level operations like memory copies.)

(Instructions which can be broken into three (or four) or fewer
micro-ops are directly translated. Load+op instructions would be
a classic example of such instructions. Although technically
such might be called microcode, they are generally distinguished
from code sequences provided by a microcode ROM.)

(The definition of a micro-op becomes more confusing when two
instructions [a.k.a. macro-ops] can be fused into a single
micro-op, when a single micro-op in terms of ROB and
scheduler entries can be executed as two operations, and
when micro-ops from a single instruction can be fused into a
single micro-op after the instruction has been decoded into
multiple micro-ops.)

Stephen Sprunk

unread,

Sep 21, 2013, 12:03:57 AM9/21/13

to

Modern CPUs do not directly execute x86 instructions; they first decode
them into one or more micro-ops, which are primitive operations such as
you'd find on a RISC CPU. The decoders will directly spit out 1-4 (or
so) micro-ops for simple instructions, but they need to fetch a sequence
of dozens to hundreds of micro-ops from microcode for the complex ones.

S

--
Stephen Sprunk "God does not play dice." --Albert Einstein
CCIE #3723 "God is an inveterate gambler, and He throws the
K5SSS dice at every possible opportunity." --Stephen Hawking

EricP

unread,

Sep 25, 2013, 12:50:06 PM9/25/13

to

Paul A. Clayton wrote:
> On Friday, September 20, 2013 6:54:08 PM UTC-4, Frank Yang wrote:
>> Hi,
>>
>> I am confused about the microcode and micro-ops in X86. As far as
>> what I understand, microcode is just a piece of software code stored
>> in the ROM for some pre-defined routines - more like a firmware -
>> while micro-ops are the actual micro-instructions executed inside
>> the pipeline.
>>
>> Is that correct? If so, when do we need to run the microcode?
>
> Basically that is correct.

It would be interesting to see the implementation
details of the micro-ops but no one documents this.
I can't imagine there is anything terribly secret about it.
Perhaps once you define the functions then the
implementation is blitheringly obvious.

Still, it would be nice to see the details.

Eric

Michael Engel

unread,

Sep 25, 2013, 4:23:04 PM9/25/13

to

Am Mittwoch, 25. September 2013 18:50:06 UTC+2 schrieb EricP:
> It would be interesting to see the implementation
> details of the micro-ops but no one documents this.
> I can't imagine there is anything terribly secret about it.
> Perhaps once you define the functions then the
> implementation is blitheringly obvious.

The only thing coming close to this is the RISC86 instruction set used
internally by AMD's K6 processors - see, e.g., the AMD patent at
http://www.google.com/patents/US6336178

There's also an interesting book on the K6 architecture: Bruce Shriver's
"The Anatomy of a High-Performance Microprocessor: A Systems Perspective".
However, ISTR that the book does not give too many details on the RISC86
instruction set itself.

Of course, the u-instructions used by current x86/x64 CPUs from intel and
AMD are probably completely different and perhaps more low-level than the
RISC86 instruction set.

Another interesting read in this area may be the documentation of the reverse
engineering of Transmeta's Crusoe VLIW architecture, especially since trace
scheduling (used for VLIW processors) was originally born out of the need to find
a more compact microcode representation - see
http://www.realworldtech.com/crusoe-intro/

For non-x86 architectures, lots of information on the microcode ops and even
complete listings are available - e.g., Bob Supnik links to several PDP11 and
VAX microcode listings on the SimH web site:
http://simh.trailing-edge.com/dsarchive.html
and some more in the bitsavers archives.

-- Michael

EricP

unread,

Sep 26, 2013, 4:41:19 PM9/26/13

to

Michael Engel wrote:
> Am Mittwoch, 25. September 2013 18:50:06 UTC+2 schrieb EricP:
>> It would be interesting to see the implementation
>> details of the micro-ops but no one documents this.
>> I can't imagine there is anything terribly secret about it.
>> Perhaps once you define the functions then the
>> implementation is blitheringly obvious.
>
> The only thing coming close to this is the RISC86 instruction set used
> internally by AMD's K6 processors - see, e.g., the AMD patent at
> http://www.google.com/patents/US6336178

That's the idea.
They lift their skirt a bit there... now how about a full monty?

> There's also an interesting book on the K6 architecture: Bruce Shriver's
> "The Anatomy of a High-Performance Microprocessor: A Systems Perspective".
> However, ISTR that the book does not give too many details on the RISC86
> instruction set itself.
>
> Of course, the u-instructions used by current x86/x64 CPUs from intel and
> AMD are probably completely different and perhaps more low-level than the
> RISC86 instruction set.
>
> Another interesting read in this area may be the documentation of the reverse
> engineering of Transmeta's Crusoe VLIW architecture, especially since trace
> scheduling (used for VLIW processors) was originally born out of the need to find
> a more compact microcode representation - see
> http://www.realworldtech.com/crusoe-intro/
>
> For non-x86 architectures, lots of information on the microcode ops and even
> complete listings are available - e.g., Bob Supnik links to several PDP11 and
> VAX microcode listings on the SimH web site:
> http://simh.trailing-edge.com/dsarchive.html
> and some more in the bitsavers archives.

The microcode from days of yore (1970's and 80's) was pretty
straight forward using wide, flat words to control function
units and took many u-words to implement a single instruction.
Basically feedback the NextAddress and NextAddressSelect fields
into the micro sequence counter, and route the rest of the
control fields to the various function units.
VAX takes somewhat more due to complexity of instructions.
But even a simple instruction takes many micro-words to execute
and there was little or no concurrency.

I meant modern RISC-ish internal micro-ops.
The modern x86 is basically an x86 decoder connected to
the guts of an Alpha. A 'mov rax, rcx' instruction decodes
to a single micro-op and is probably implemented essentially
the same as the internals of an Alpha doing a 'mov r0, r1'.

The best I could find was from a search for
'"reorder buffer" implementation' and came up with

A Reorder Buffer Design for High Performance Processors 2012
www.scielo.org.mx/pdf/cys/v16n1/v16n1a3.pdf‎

which provides a little more detail. Also the patent

Tracking multiple dependent instructions with instruction queue pointer
mapping table linked to a multiple wakeup table by a pointer
http://www.google.ca/patents/US7464253

provides a good level of detail but of course is weak on explanation.

Eric

MitchAlsup

unread,

Oct 6, 2013, 8:17:28 PM10/6/13

to

On Friday, September 20, 2013 5:54:08 PM UTC-5, Frank Yang wrote:

> I am confused about the microcode and micro-ops in X86. As far as what I understand, microcode is just a piece of software code stored in the ROM for some pre-defined routines - more like a firmware - while micro-ops are the actual micro-instructions executed inside the pipeline.

The real trick to microcode is that you have access to resources that macrocode does not. And the ability to perform several "instructions" worth of: movement,
calculation, memory; all at once.

In Opteron we had 4 (or was it 8) temporary registers so microcoded routines
could perform data calculations without having the save and restore user registers.

Microcode effectively ran at level -1 at higher priority than even the hypervisor or security system. There was a bunch of HW (not that much realy)
whereby MC could access user registers supplied by the instruction without
decoding the instruction (aqgain) and to deliver results back to user space
in a similar manner.

There are also a few places where, if you know the correct passwords, you can
re-enable features disabled on the test head by the blowing of fuses durring
device configuration.

Mitch

mag

unread,

Oct 13, 2013, 11:45:24 AM10/13/13

to

On 2013-09-21 04:03:57 +0000, Stephen Sprunk said:

> On 20-Sep-13 17:54, Frank Yang wrote:
>> I am confused about the microcode and micro-ops in X86. As far as
>> what I understand, microcode is just a piece of software code stored
>> in the ROM for some pre-defined routines - more like a firmware -
>> while micro-ops are the actual micro-instructions executed inside the
>> pipeline.
>>
>> Is that correct? If so, when do we need to run the microcode?
>
> Modern CPUs do not directly execute x86 instructions; they first decode
> them into one or more micro-ops, which are primitive operations such as
> you'd find on a RISC CPU. The decoders will directly spit out 1-4 (or
> so) micro-ops for simple instructions, but they need to fetch a sequence
> of dozens to hundreds of micro-ops from microcode for the complex ones.
>
> S

I wonder if anyone has done a study of the average instruction:micro-op
ratios for different CPU architectures running various workloads.

From what I've read, the ratio is close to 1:1 which leads me to
believe that there are just a few instructions that need micro-op
expansion. The micro-code is there mostly to just translate a
hard-to-decode CISC instruction into a single easy-to-decode RISC
micro-operation.

Stephen Sprunk

unread,

Oct 14, 2013, 1:04:56 PM10/14/13

to

On 13-Oct-13 10:45, mag wrote:
> On 2013-09-21 04:03:57 +0000, Stephen Sprunk said:
>> Modern CPUs do not directly execute x86 instructions; they first
>> decode them into one or more micro-ops, which are primitive
>> operations such as you'd find on a RISC CPU. The decoders will
>> directly spit out 1-4 (or so) micro-ops for simple instructions,
>> but they need to fetch a sequence of dozens to hundreds of
>> micro-ops from microcode for the complex ones.
>

> I wonder if anyone has done a study of the average
> instruction:micro-op ratios for different CPU architectures running
> various workloads.

I'm sure someone has, but I don't have anything to cite at hand.

> From what I've read, the ratio is close to 1:1 which leads me to
> believe that there are just a few instructions that need micro-op
> expansion.

On x86, for instance, any normal instruction with a memory operand will
decode to 2-4 micro-ops. For instance,

ADD EAX, [EBX]

will decode to

MOV TMP, [EBX]
ADD EAX, TMP

or,

ADD [EAX], EBX

will decode to

MOV TMP, [EAX]
ADD TMP, EBX
MOV [EAX], TMP ; two uops: store-address and store-data

> The micro-code is there mostly to just translate a hard-to-decode
> CISC instruction into a single easy-to-decode RISC micro-operation.

I think you're confusing the two here. Microcode is only used for a few
(relatively) rare instructions that decode to dozens or even hundreds of
micro-ops. However, many other instructions will be decoded into a
handful of micro-ops without microcode.

Michael S

unread,

Oct 14, 2013, 6:48:49 PM10/14/13

to

On Monday, October 14, 2013 8:04:56 PM UTC+3, Stephen Sprunk wrote:
> On 13-Oct-13 10:45, mag wrote:
>
> > On 2013-09-21 04:03:57 +0000, Stephen Sprunk said:
>
> >> Modern CPUs do not directly execute x86 instructions; they first
>
> >> decode them into one or more micro-ops, which are primitive
>
> >> operations such as you'd find on a RISC CPU. The decoders will
>
> >> directly spit out 1-4 (or so) micro-ops for simple instructions,
>
> >> but they need to fetch a sequence of dozens to hundreds of
>
> >> micro-ops from microcode for the complex ones.
>
> >
>
> > I wonder if anyone has done a study of the average
>
> > instruction:micro-op ratios for different CPU architectures running
>
> > various workloads.
>
>
>
> I'm sure someone has, but I don't have anything to cite at hand.
>
>
>
> > From what I've read, the ratio is close to 1:1 which leads me to
>
> > believe that there are just a few instructions that need micro-op
>
> > expansion.
>
>
>
> On x86, for instance, any normal instruction with a memory operand will
>
> decode to 2-4 micro-ops. For instance,
>
>
>
> ADD EAX, [EBX]
> will decode to
> MOV TMP, [EBX]
> ADD EAX, TMP
>

Not that simple.
On Intel Bonnel/Saltwell "ADD EAX, [EBX]" will be executed natively, as one internal instruction.
On majority of other modern x86 cores it will be decoded as 2 micro-ops, but immediately thereafter the micro-ops will be fused together to form complex micro-op (exact terms differ between manufacturers or even different cores from the same manufacturer). Complex micro-op will be kept together, consuming one slot rather than two, in various structure of OoO machinery and splat again only much later when dispatched to actual execution.

>
>
> or,
> ADD [EAX], EBX
> will decode to
> MOV TMP, [EAX]
> ADD TMP, EBX
> MOV [EAX], TMP ; two uops: store-address and store-data
>

And 4 resulting micro-op (3 on AMD) will be fused together to form 2 complex micro-ops on BBOoO Intel core. On AMD K8/K10 they will be fused into 1 complex micro-op, according to some reports the same is true for Intel Silvermont.

>
>
> > The micro-code is there mostly to just translate a hard-to-decode
> > CISC instruction into a single easy-to-decode RISC micro-operation.
>
>
>
> I think you're confusing the two here. Microcode is only used for a few
> (relatively) rare instructions that decode to dozens or even hundreds of
> micro-ops. However, many other instructions will be decoded into a
> handful of micro-ops without microcode.
>
>

Microcode also handles micro-traps. Those could be long, but not necessarily so.

Stephen Sprunk

unread,

Oct 14, 2013, 10:40:31 PM10/14/13

to

On 14-Oct-13 17:48, Michael S wrote:
> On Monday, October 14, 2013 8:04:56 PM UTC+3, Stephen Sprunk wrote:

Please trim quoted material down to what you're replying to, especially
since you're using an interface that double-spaces the quotes.

>> On x86, for instance, any normal instruction with a memory operand
>> will decode to 2-4 micro-ops. For instance,
>>
>> ADD EAX, [EBX]
>> will decode to
>> MOV TMP, [EBX]
>> ADD EAX, TMP
>
> Not that simple. On Intel Bonnel/Saltwell "ADD EAX, [EBX]" will be
> executed natively, as one internal instruction.

Really? Got any pointers to where I can read up on that, as it flies in
the face of everything Intel has been doing for the last decade?

> On majority of other modern x86 cores it will be decoded as 2
> micro-ops, but immediately thereafter the micro-ops will be fused

> together ...

Sure, but before one understands uop fusion, one must understand what
uops are in the first place, and that's what I was trying to explain to
the previous respondent and OP.

Michael S

unread,

Oct 15, 2013, 3:56:14 AM10/15/13

to

On Tuesday, October 15, 2013 5:40:31 AM UTC+3, Stephen Sprunk wrote:
> On 14-Oct-13 17:48, Michael S wrote:
>
> > On Monday, October 14, 2013 8:04:56 PM UTC+3, Stephen Sprunk wrote:
>
>
> >> On x86, for instance, any normal instruction with a memory operand
> >> will decode to 2-4 micro-ops. For instance,
> >>
> >> ADD EAX, [EBX]
> >> will decode to
> >> MOV TMP, [EBX]
> >> ADD EAX, TMP
> >
>
> > Not that simple. On Intel Bonnel/Saltwell "ADD EAX, [EBX]" will be
> > executed natively, as one internal instruction.
>
> Really? Got any pointers to where I can read up on that,

Google for "Intel 64 and IA-32 Architectures Optimization Reference Manual" Order Number: 248966-028. Read CHAPTER 14.
Take, for example, 14.2: "The in-order pipeline differs from out-of-order pipelines by treating an IA-32 instruction with a memory operand as a single pipeline operation instead of multiple micro-operations."

> as it flies in the face of everything Intel has been doing for the last decade?
>

Intel was doing many different things "for last decade". Bonnel/Saltwell is not like most others. But "others" are also not very similar to each other. Even ignoring Bonnel/Saltwell, Knight Corners and Quark, there made 6-7 seriously dissimilar variations of BBOoO x86 cores + one NSBOoO.

>
>
> > On majority of other modern x86 cores it will be decoded as 2
> > micro-ops, but immediately thereafter the micro-ops will be fused
> > together ...
>
>
>
> Sure, but before one understands uop fusion, one must understand what
> uops are in the first place, and that's what I was trying to explain to
> the previous respondent and OP.
>

But things are *not* that simple.
Microop fusion can be seen as a stage following decoding or as an integral part of decoding. If one takes the later view, than one has to say that "ADD EAX, [EBX]" is decoded into one [complex] micro-op.
On implementation side it's probably neither-former-nor-later, so it's all the matter of conventions.

You are trying to paint all modern BBOoO cores with one 19 y.o PPro brash. Yes, that's the ancestor and yes, there are many similarities, but also plenty of differences and enhancements.

winden

unread,

Nov 23, 2013, 7:40:49 AM11/23/13

to

On Wed, 25 Sep 2013 12:50:06 -0400, EricP
<ThatWould...@thevillage.com> wrote:
> It would be interesting to see the implementation
> details of the micro-ops but no one documents this.
> I can't imagine there is anything terribly secret about it.
> Perhaps once you define the functions then the
> implementation is blitheringly obvious.

> Still, it would be nice to see the details.

Search for information about the Xerox Alto, it used microcode
extensively to concentrate a lot of functionality into a single CPU
and even allowed replacing part of it from the user program.

MitchAlsup

unread,

Nov 23, 2013, 2:04:03 PM11/23/13

to

On Monday, October 14, 2013 12:04:56 PM UTC-5, Stephen Sprunk wrote:
> On x86, for instance, any normal instruction with a memory operand will
> decode to 2-4 micro-ops. For instance,
>
> ADD EAX, [EBX]
>
> will decode to
>
> MOV TMP, [EBX]
> ADD EAX, TMP

Err, maybe intel x86s, but Athlon and Opteron would decode this into ONE (1) operation in the reservation station.

That one operation would fire once when the memory address was calculatable,
and fire a second time when the memory operand was available.

In fact an:

ADD [EAX],ECX

was a single operation (note no word "micro" attached to operation) that
would fire three times from a single place in the reservation station.

Mitch

peterf...@gmail.com

unread,

Nov 23, 2013, 4:27:38 PM11/23/13

to

Certain old patents contain a lot of information about a specific generation of µop formats for one of the x86's, including bit-level format and opcode numbers. I forget whether they were from Intel or AMD.

-Peter

EricP

unread,

Nov 24, 2013, 1:16:23 PM11/24/13

to

Thanks, but I was referring to the modern risc-ish definition
of microops not the historical 1970's usage of the term.
(fyi the Alto used 74181 alus).

Most articles just show block diagrams.
Any detailed current info publicly available is from university papers.
I don't have access to IEEE so I can't say whats inside there.

It might be nice to see the details of, say, how the scheduler
prioritizes the pending ready microop list and combines it with
the availability of function units. Because that looks a very
long signal path to me and seems it should make back to back
dependent dispatches impossible.

Eric

EricP

unread,

Nov 24, 2013, 2:30:43 PM11/24/13

to

winden wrote:
>
> Search for information about the Xerox Alto, it used microcode
> extensively to concentrate a lot of functionality into a single CPU and
> even allowed replacing part of it from the user program.

In case anyone is interested, there are Alto internal hardware
manuals from 1979 available showing all the microcode fields, etc

http://bitsavers.informatik.uni-stuttgart.de/pdf/xerox/alto/AltoHWRef.part1.pdf
http://bitsavers.informatik.uni-stuttgart.de/pdf/xerox/alto/AltoHWRef.part2.pdf

Eric

Andy (Super) Glew

unread,

Nov 24, 2013, 3:13:33 PM11/24/13

to

Way back in the mid 90s, P6 simplified the problem by having only fully
pipelined functional units, and completely non-pipelined (DIVide).

If all units are fully pipelined, you never have to schedule their
availability.

So all you need is

(1) a marker, that marks ready instructions (non-priority CAM)

and (2) a picker for each fully pipeline functional unit. Basically a
find-first circuit, for some permutation of the candidates.

Yes, the critical loop picker->marker to determine back to back ready
instructions can be a pain. Various logic tricks are possible, e.g.
using decoded rather than encoding CAMs, and other decoded bitmask
tricks to move the find-first out of the loop. Trading more gates for
logic depth.

Or see the Wmt era patents on bit matrix schedulers - the P6 scheduler
was not strictly oldest first, but that introduced glass jaws.

Subrao Palacharla and Jim Smith started a line of research on
"complexity effective" microarchitecture, a large thread of which
attacked the scheduling problem. E.g. instead of having a homogenous
dataflow scheduler, they (and others at the same time and before, olike
me) proposed schedulers that combined queues with dataflow. E.g. at
onwe point in time Wmt was supposed to be purely a timing wheel
scheduler, but ended up being queues from a larger window fronted by a
16-entry bit matrix.

Much of this should be in patents.

...

As for mixing fully pipelined with fully non-pipelined: fixed latency is
easy. Variable latency, the divider had to arbitrate into the scheduler
loop several cycles (3?) before it needed a writeback port.

Basically, fully pipelined / independent pipelines / latency known up
front is easy. But variable latency is harder to deal with, and is
usually penalized. Which saddens me: one of the things that dataflow
should do is tolerate variable latency, but the pipelining costs means
that the only variable latencies we can tolerate are comparatively long,
like cache misses and divides.

...

P6-era, i.e. simplified the problem.

E.g. one of the biggest reasons that P6 only had a single load port was
that, while 2 load ports would be easy if truly dual ported, if you have
pseudo-dual-porting by banking then you have a dynamic scheduling
problem. Whereas if you pseudo-dual-port 1 load port and 1 store port,
the stores can often be delayed when there is a bank conflict without
the hassles of repairing incorrectly scheduled dependent operations.

We also see 2 load ports + 1 store port, on systems that have true dual
porting (between the loads, => no bank conflicts) and pseudo-porting for
the store port.

...

More recent machines *are* more complicated. They truly do have dynamic
scheduling, e.g. when they have an incomplete bypass network. So on
such systems there may be a "packer" in addition to the marking and
picking. However, if the packing is a simple greedy algorithm, it can
be done with bitmasks. Truly optimal picking and packing is probably too
hard (and IMHO is better left to compiler guys like Ivan, although
recording such information in a trace cache is tempting).

--
The content of this message is my personal opinion only. Although I am
an employee (currently Imagination Technologies, which acquired MIPS; in
the past of companies such as Intellectual Ventures/QIPS, Intel, AMD,
Motorola, and Gould), I reveal this only so that the reader may account
for any possible bias I may have towards my employers' products. The
statements I make here in no way represent my employers' positions on
the issue, nor am I authorized to speak on behalf of my employers, past
or present.

peterf...@gmail.com

unread,

Nov 25, 2013, 3:40:42 AM11/25/13

to

http://www.amazon.com/Processor-Microarchitecture-Implementation-Perspective-Architecture/dp/1608454525/ref=pd_sim_b_4

If that is the book I think it is, it contains a surprisingly detailed description of the AMD K6. And if it isn't, now you know that such a book exists! :)

-Peter

EricP

unread,

Nov 26, 2013, 2:45:43 PM11/26/13

to

peterf...@gmail.com wrote:
>
> http://www.amazon.com/Processor-Microarchitecture-Implementation-Perspective-Architecture/dp/1608454525/ref=pd_sim_b_4
>
> If that is the book I think it is, it contains a surprisingly detailed description of the AMD K6. And if it isn't, now you know that such a book exists! :)
>
> -Peter

Thanks.
It's not just on the K6 - it covers each pipeline stage,
discussing alternative designs and trade offs.
It has a section on issue stage I was inquiring about,
and a section on the AMD K6 load-store pipeline.

While it is not at the logic design level and only 100 pages long,
it is still interesting.

Eric

EricP

unread,

Nov 26, 2013, 2:48:55 PM11/26/13

to

Yeah, their papers got me thinking about this
'Complexity Effective Superscalar Processors 1997' and its companion
'Quantifying the Complexity of Superscalar Processors 1997'

But their analysis didn't consider resource availability,
e.g. will an FU or shared result bus be available.

>
> Much of this should be in patents.

I stumbled onto the patents for bit matrix schedulers a while back
but patents are difficult to read so I didn't study it too much.

Searching again I now find an Intel paper 'Matrix Scheduler Reloaded 2007'
so I'll give that a read.

Hmmm a quick scan and it looks like what I was thinking for Wakeup.
Instead of CAMs use a decoder (which you need anyway for write back
select) and a crossbar (a mux tree pseudo crossbar actually)
for watchers to set their ready bits. The crossbar should have the
same worst case load as CAMs, but the average load would be just
the one or two dependents actually waiting for a result.
No speed advantage but much lower average power.

> ....

>
> As for mixing fully pipelined with fully non-pipelined: fixed latency is
> easy. Variable latency, the divider had to arbitrate into the scheduler
> loop several cycles (3?) before it needed a writeback port.

Yeah, that's the kind of thing that was bothering me.
Also early terminate multiply if high order bits are zero.

I was wondering if result reservation stations might help.
They would hold early results while waiting for the result bus,
letting the FU get on with the next op.
The advantage is that it clears ready input slots in the input reservation
stations sooner, so they can be filled with subsequent ops.

> Basically, fully pipelined / independent pipelines / latency known up
> front is easy. But variable latency is harder to deal with, and is
> usually penalized. Which saddens me: one of the things that dataflow
> should do is tolerate variable latency, but the pipelining costs means
> that the only variable latencies we can tolerate are comparatively long,
> like cache misses and divides.
>

> ....

>
> P6-era, i.e. simplified the problem.
>
> E.g. one of the biggest reasons that P6 only had a single load port was
> that, while 2 load ports would be easy if truly dual ported, if you have
> pseudo-dual-porting by banking then you have a dynamic scheduling
> problem. Whereas if you pseudo-dual-port 1 load port and 1 store port,
> the stores can often be delayed when there is a bank conflict without
> the hassles of repairing incorrectly scheduled dependent operations.
>
> We also see 2 load ports + 1 store port, on systems that have true dual
> porting (between the loads, => no bank conflicts) and pseudo-porting for
> the store port.
>

> ....

>
> More recent machines *are* more complicated. They truly do have dynamic
> scheduling, e.g. when they have an incomplete bypass network. So on
> such systems there may be a "packer" in addition to the marking and
> picking. However, if the packing is a simple greedy algorithm, it can
> be done with bitmasks. Truly optimal picking and packing is probably too
> hard (and IMHO is better left to compiler guys like Ivan, although
> recording such information in a trace cache is tempting).

Yeah thats what had me scratching my head - limiting write ports
while expanding FU's and variable length ops implies on-demand bus
arbitration, which leads to the back to back scheduling problem.

Eric

Ivan Godard

unread,

Nov 26, 2013, 3:11:14 PM11/26/13

to

On 11/26/2013 11:48 AM, EricP wrote:
> Andy (Super) Glew wrote:
>> On 11/24/2013 10:16 AM, EricP wrote:

<snip>

>> As for mixing fully pipelined with fully non-pipelined: fixed latency
>> is easy. Variable latency, the divider had to arbitrate into the
>> scheduler loop several cycles (3?) before it needed a writeback port.
>
> Yeah, that's the kind of thing that was bothering me.
> Also early terminate multiply if high order bits are zero.

Has anybody determined whether early-outs like this are actually visible
at the program level, or just monkey-puzzle for engineers?

In a software pipeline latency is irrelevant so long as it is fixed,
while variable latency mucks up the pipeline. For open code, as far as
we could tell from limited sim it appeared that early-out for multiply
could be visible in contrived test cases but was completely in the noise
in real programs. Is your experience different?

Early-out can make a difference for school-boy divide, but who does
anything but Newton-Rapheson reciprocal any more, and there's no early
out on that.

> I was wondering if result reservation stations might help.
> They would hold early results while waiting for the result bus,
> letting the FU get on with the next op.
> The advantage is that it clears ready input slots in the input reservation
> stations sooner, so they can be filled with subsequent ops.
>
>> Basically, fully pipelined / independent pipelines / latency known up
>> front is easy.

Ain't it now :-)

Ivan

Paul A. Clayton

unread,

Nov 26, 2013, 8:47:42 PM11/26/13

to

On Tuesday, November 26, 2013 3:11:14 PM UTC-5, Ivan Godard wrote:
[snip]

> Has anybody determined whether early-outs like this are actually visible
> at the program level, or just monkey-puzzle for engineers?

Side note: early-out in multiply is sometimes associated
with a lack of full pipelining. Not fully pipelining
multiply would obviously save area (and presumably static
power).

An additional question is how often early-outs are not
determinable at compile time. (ARM AArch64 has indicators
for 32-bit operands in some instructions to support
reduced energy. At 64-bit, the fraction of known half-sized
operands is probably relatively large.) For dynamic
scheduling, determining early-out sufficiently before issue
would allow the scheduler to treat such as if static.
(Substantial early out on longer latency operations can
therefore be more attractive: hardware has more time to
determine if the case is exceptional [e.g., an operation
with latency of four or eight cycles could handle a
two-cycle scheduler if early-out is determined in one
cycle] and the reduced latency is more likely to benefit
scheduling assuming limited available ILP [though the
opportunities for latency reduction are probably not
especially common].)

> In a software pipeline latency is irrelevant so long as it is fixed,
> while variable latency mucks up the pipeline. For open code, as far as
> we could tell from limited sim it appeared that early-out for multiply
> could be visible in contrived test cases but was completely in the noise
> in real programs. Is your experience different?
>
> Early-out can make a difference for school-boy divide, but who does
> anything but Newton-Rapheson reciprocal any more, and there's no early
> out on that.

Actually, at least one AMD x86 implementation used reduced
precision multiply to support early-out for the first step.
(For integer division, even using NR methods one could
provide a data-dependent early-out.)

(Some series computations might similarly exploit a reduced need
for precision in some terms [but these are perhaps almost
entirely statically known].)

For Intel's Enhanced Core Microarchitecture: "A radix-16
divider replacing previous radix-4 based divider to speedup
long-latency operations such as divisions and square roots."

Some smaller processors probably also do not use NR
division. (The older Atom used radix-2.)

Ivan Godard

unread,

Nov 26, 2013, 9:30:21 PM11/26/13

to

On 11/26/2013 5:47 PM, Paul A. Clayton wrote:
> On Tuesday, November 26, 2013 3:11:14 PM UTC-5, Ivan Godard wrote:
> [snip]
>> Has anybody determined whether early-outs like this are actually visible
>> at the program level, or just monkey-puzzle for engineers?
>
> Side note: early-out in multiply is sometimes associated
> with a lack of full pipelining. Not fully pipelining
> multiply would obviously save area (and presumably static
> power).
>
> An additional question is how often early-outs are not
> determinable at compile time. (ARM AArch64 has indicators
> for 32-bit operands in some instructions to support
> reduced energy. At 64-bit, the fraction of known half-sized
> operands is probably relatively large.)

We have our own way of dealing with that. It will be covered in the
12/11 talk.

Paul A. Clayton

unread,

Nov 26, 2013, 9:42:26 PM11/26/13

to

On Tuesday, November 26, 2013 2:48:55 PM UTC-5, EricP wrote:
[snip]

> Hmmm a quick scan and it looks like what I was thinking for Wakeup.
> Instead of CAMs use a decoder (which you need anyway for write back

Just FYI, Andy Glew calls these *decoded* CAMs as distinct
from the traditional *encoded* CAMs. He likes to group terms
and use distinguishing adjectives. (I have some sympathy for
that manner of using terms.)

[snip]

> Yeah, that's the kind of thing that was bothering me.
> Also early terminate multiply if high order bits are zero.

(High order bits of the absolute value, presumably?)

> I was wondering if result reservation stations might help.
> They would hold early results while waiting for the result bus,
> letting the FU get on with the next op.
> The advantage is that it clears ready input slots in the input reservation
> stations sooner, so they can be filled with subsequent ops.

Scheduling dependent operations is probably more of a
problem than scheduling result writeback or FU availability
and is perhaps the main motivation behind early-out (even if
the operation is not pipelined).

Even with a single-cycle scheduler, one would have to know
the latency of an operation more or less at issue time to
avoid retry for single-cycle early-out operations. (Detecting
special operands might be folded into the execution latency,
perhaps especially for zero detection.)

(The above might not be adding much; I do not understand
schedulers well and am a bit sleepy at the moment.)

Andy (Super) Glew

unread,

Nov 26, 2013, 10:21:46 PM11/26/13

to

On 11/26/2013 11:48 AM, EricP wrote:

> Andy (Super) Glew wrote:
>> On 11/24/2013 10:16 AM, EricP wrote:
> Searching again I now find an Intel paper 'Matrix Scheduler Reloaded 2007'
> so I'll give that a read.
>
> Hmmm a quick scan and it looks like what I was thinking for Wakeup.
> Instead of CAMs use a decoder (which you need anyway for write back
> select) and a crossbar (a mux tree pseudo crossbar actually)
> for watchers to set their ready bits. The crossbar should have the
> same worst case load as CAMs, but the average load would be just
> the one or two dependents actually waiting for a result.
> No speed advantage but much lower average power.

By the way, I think that you are describing what I call a "Decoded CAM".

Encoded CAM:
for several entries i, CAM ports m
match_i := AND(
port_m.encoded_input.bit_j
== encoded_entry_i.bit_j.port_m
)
where j = 0,nbits-1

Decoded CAM:
for several CAM ports m
decoded_input_m |= decode_to_1_hot( encoded_input_i )
for several entries i
match_i := AND( decoded_input.bit_jj
== decoded_entry_i.bit_jj.port_m
)
where jj = 0 to 2^nbits-1

I think of encoded and decoded CAMs as almost interchangeable.

Encoded CAMs can more easily handle large numbers of inputs (ports).

But any individual decoded CAM is more compact, log(nvalues) rather than
nvalues.

If you have ORed things together for an encoded CAM, you have lost which
port matched.

You should be able to smoothly transition between them.

(E.g. my currently favored very-large-instruction-window scheduler uses
decoded CAMs in the outer L2 scheduler, on groups of 8-16-...
instructions, but CAMing only 1 or 2 inputs per group (typically
predicted almost last to arrive). Whereas when instruction groups flow
to the inner L1 scheduler, they are broken up, individual instructions
scheduled using a bit matrix / encoded CAM.

>> As for mixing fully pipelined with fully non-pipelined: fixed latency
>> is easy. Variable latency, the divider had to arbitrate into the
>> scheduler loop several cycles (3?) before it needed a writeback port.
>
> Yeah, that's the kind of thing that was bothering me.
> Also early terminate multiply if high order bits are zero.
>
> I was wondering if result reservation stations might help.
> They would hold early results while waiting for the result bus,
> letting the FU get on with the next op.
> The advantage is that it clears ready input slots in the input reservation
> stations sooner, so they can be filled with subsequent ops.

You have hit on one of the key issues: result bus (I prefer to say
writeback port).

There is no requirement that the number of input ports be the same as
the number of writeback ports.

Encoded CAMs can OR many, many, writeback ports together.

>> More recent machines *are* more complicated. They truly do have
>> dynamic scheduling, e.g. when they have an incomplete bypass network.
>> So on such systems there may be a "packer" in addition to the marking
>> and picking. However, if the packing is a simple greedy algorithm, it
>> can be done with bitmasks. Truly optimal picking and packing is
>> probably too hard (and IMHO is better left to compiler guys like Ivan,
>> although recording such information in a trace cache is tempting).
>
> Yeah thats what had me scratching my head - limiting write ports
> while expanding FU's and variable length ops implies on-demand bus
> arbitration, which leads to the back to back scheduling problem.

You can think of the bypass network as a decoded-CAM indexed "cache" for
the register file. Usually it advances systolically, but it doesn't
need to.

You can guarantee that there is always a place in this bypass cache for
a result of fixed latency to be written. You can make this guarantee
when scheduling an instruction.

You can separate the "is a dependent instruction ready" from "where does
the dependent instruction get its results". Decoded CAM for the former,
encoded CAM for the latter. (Well, actually, some bypass controls use
decoded CAMs everywhere, although they constrain some to be 1-hots or
thermometer codes.)

Ivan/OOTBC/Mill "belt" sounds very much like this sort of bypass
structure from existing machines. One difference is that instead of
matching on register numbers, whether logical (means that the CAMs need
to handle multiple matches) or physically renamed registers (CAMs can
only match once), Ivan does away with the register numbers, and just
matches on a relative position. Key to that being the ability to
perform arbitrary remappings of the relative positions (belt indexes)
when control flow converges.

Many folks have proposed relative positions to establish dataflow. But
these folks often had RISC on the brain, and said "this is simpler - I
don't need register renaming." Ivan/OOTBC/Mill's "belt", on the other
hand, has embraced renaming/remapping, albeit at the less frequently
occurring control flow convergences.

Andy (Super) Glew

unread,

Nov 26, 2013, 10:32:44 PM11/26/13

to

On 11/26/2013 12:11 PM, Ivan Godard wrote:
> On 11/26/2013 11:48 AM, EricP wrote:
>> Andy (Super) Glew wrote:
>>> On 11/24/2013 10:16 AM, EricP wrote:
>
> <snip>
>
>>> As for mixing fully pipelined with fully non-pipelined: fixed latency
>>> is easy. Variable latency, the divider had to arbitrate into the
>>> scheduler loop several cycles (3?) before it needed a writeback port.
>>
>> Yeah, that's the kind of thing that was bothering me.
>> Also early terminate multiply if high order bits are zero.
>
> Has anybody determined whether early-outs like this are actually visible
> at the program level, or just monkey-puzzle for engineers?

They are a monkey puzzle, because there are no good solutions.

If there were good solutions... well, I have looked into this.

Much depends on your circuits, and the relative speed of wires and gates.

E.g. most integer adds could be early outed, by bit 32-16-48-8, rather
than computing all the way to bit 63.

But if wires are fast, then you are only saving one or two levels of
carry lookahead or carry select.

If you are pipelining across the bits like Wmt, then it doesn't matter
to scheduling - but you can gate upper bits.

Or you can go redundant, carry save. Less carry propagate, but you pay
when you need to go non-redundant.

Conjecture: if we could schedule variable latency, then we could take
advantage of asynch logic. As it is, asynch logic can choke up a
pipeline - I call it the asynch pipeline hazard.

But I have little hope.

Shih-Lien Lu looked at asynch logic in schedulers. I am not sure if he
published.

>
> In a software pipeline latency is irrelevant so long as it is fixed,
> while variable latency mucks up the pipeline. For open code, as far as
> we could tell from limited sim it appeared that early-out for multiply
> could be visible in contrived test cases but was completely in the noise
> in real programs. Is your experience different?

Yes, for integer multiply.

> Early-out can make a difference for school-boy divide, but who does
> anything but Newton-Rapheson reciprocal any more, and there's no early
> out on that.

High radix iterative with early out can save power.

But, anyway, early out for divide is pretty easy.

Early out for 1-2 cycle things, now that's fun.

Study I like doing (have done repeatedly):

Come up with a timing model - e.g. assume that baseline integer add is
10 hyper-cycles, and then vary around that.

See how much performance is being left on the table by fixed latencies.

Ivan Godard

unread,

Nov 27, 2013, 1:38:23 AM11/27/13

to

That's an interesting, and apt, way of looking at it. We actually use
the ability to "rename" in several other places, besides at control flow
joins. One is in the return operations, where callee values are renamed
to caller values. Another is in the replay of values from the spiller
when returning from a call or interrupt. Several more - it's quite general.

You also can see the basic belt advance step (when a cycle drops values
at the front of the belt) as a rename, except it is a bulk rename - the
entire belt contents increments their names. However, while I can
understand how you (Andy) would want to think of it as a rename, I don't
want to use the term in our explanations - too easy to mislead people
for whom "rename" means a particular kind of OOO, which we aren't.

Terje Mathisen

unread,

Nov 27, 2013, 4:19:21 AM11/27/13

to

Ivan Godard wrote:
> On 11/26/2013 11:48 AM, EricP wrote:
>> Andy (Super) Glew wrote:
>>> On 11/24/2013 10:16 AM, EricP wrote:
>
> <snip>
>
>>> As for mixing fully pipelined with fully non-pipelined: fixed latency
>>> is easy. Variable latency, the divider had to arbitrate into the
>>> scheduler loop several cycles (3?) before it needed a writeback port.
>>
>> Yeah, that's the kind of thing that was bothering me.
>> Also early terminate multiply if high order bits are zero.
>
> Has anybody determined whether early-outs like this are actually visible
> at the program level, or just monkey-puzzle for engineers?
>
> In a software pipeline latency is irrelevant so long as it is fixed,
> while variable latency mucks up the pipeline. For open code, as far as
> we could tell from limited sim it appeared that early-out for multiply
> could be visible in contrived test cases but was completely in the noise
> in real programs. Is your experience different?
>
> Early-out can make a difference for school-boy divide, but who does
> anything but Newton-Rapheson reciprocal any more, and there's no early
> out on that.

Early-out will introduce timing channel leaks for any crypto protocol
that employs any such opcodes.

I made a special version of the AES DFC code that (via a #define) would
get rid of the single key-dependent branch, it ran 3-5% slower (afair)
than the standard version but the running time was completely
key-independent.

If MUL had been early-out it would have been (effectively) impossible to
make such a version. Even replacing the MULs with lots of table lookups
would require locking the tables in L1$.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

EricP

unread,

Nov 27, 2013, 5:04:24 PM11/27/13

to

Ivan Godard wrote:
>
> Early-out can make a difference for school-boy divide, but who does
> anything but Newton-Rapheson reciprocal any more, and there's no early
> out on that.

From what I can find, Intel uses radix 16 SRT and AMD uses Goldschmidt.

Eric

MitchAlsup

unread,

Nov 27, 2013, 8:14:49 PM11/27/13

to

On Wednesday, November 27, 2013 4:04:24 PM UTC-6, EricP wrote:
> From what I can find, Intel uses radix 16 SRT and AMD uses Goldschmidt.

Graphics processors use quadradic interpolation.

Mitch

MitchAlsup

unread,

Nov 27, 2013, 8:15:01 PM11/27/13

to

On Wednesday, November 27, 2013 4:04:24 PM UTC-6, EricP wrote:

> From what I can find, Intel uses radix 16 SRT and AMD uses Goldschmidt.

Terje Mathisen

unread,

Nov 28, 2013, 2:24:48 AM11/28/13

to

All or just some?

NR with a 12+ bit starting estimate like the SSE opcode seems like a
good match for mostly low-accuracy divides.

How does the GPU version work?

MitchAlsup

unread,

Nov 28, 2013, 12:33:51 PM11/28/13

to

On Thursday, November 28, 2013 1:24:48 AM UTC-6, Terje Mathisen wrote:
> All or just some?
>
> NR with a 12+ bit starting estimate like the SSE opcode seems like a
> good match for mostly low-accuracy divides.
>
> How does the GPU version work?

This:

http://arith.polito.it/final/paper-164.pdf

will give you the jist.

Mitch

EricP

unread,

Nov 28, 2013, 1:00:45 PM11/28/13

to

That is not quite what I meant, but I see what you are doing.
I mean:

decoded_done_tag = decode_to_1_hot (done_tag);
for all match entries i
match_i = select_1_of_mux (match_tag[i], decoded_done_tag);
ready_ff[i].set = match_i;
end

The difference is that you store the match value as a vector
of N bits with one set, and detect the match using N AND gates
and an N input OR gate.

I store the match value as binary, use it with a mux to select
one of N decode lines to monitor.

As I understand it, if there were, say, 48 ops then your way
requires the decoder to drive 48 AND gates for each signal.
That's better than XORs but still a constant value.

Having the watcher ops use selector mux's means that on average
only 1 or 2 of the 48 T gates will be enabled, the rest blocked.
So that should present 1 or 2 loads, the rest just leakage loads.
The worst case is the same: 47 ops waiting on 1 so the worst
case power and timing is the same. But the average case is lower.

Also a selector mux can directly use the binary select value,
so the match field stored in the ops would be binary not 1_of_N.
For the above example, each op would have two (src1 and src2)
6 bit operand match fields not 2 * 48 bit.

Unless I'm mistaken.
Eric

EricP

unread,

Nov 28, 2013, 4:08:01 PM11/28/13

to

EricP wrote:
> I mean:
>
> decoded_done_tag = decode_to_1_hot (done_tag);
> for all match entries i
> match_i = select_1_of_mux (match_tag[i], decoded_done_tag);
> ready_ff[i].set = match_i;
> end

Actually that should be an array of mux's. Perhaps more like

decoded_done_tag = decode_to_1_hot (done_tag);
for all i in match entries
match_i = select_1_of_mux[i] (match_tag[i], decoded_done_tag);

ready_ff[i].set = match_i;
end

doing this

done_tag->decode_to_1_hot
| | | |
v v v v
sel_match_tag_0->N_to_1_mux->match_0->set.ready_ff_0
| | | |
v v v v
sel_match_tag_1->N_to_1_mux->match_1->set.ready_ff_1
| | | |
v v v v
sel_match_tag_2->N_to_1_mux->match_2->set.ready_ff_2
| | | |
v v v v
etc

Eric

Terje Mathisen

unread,

Nov 29, 2013, 4:52:41 PM11/29/13

to

Thanks!

Basically they used a medium-sized starting lookup table, followed by
bitsize optimized multipliers for a fixed-length very short poly.

The funniest part was the reference to Tang [15], that pretty much had
to be the same Peter Tang who together with Tim Coe proved that the sw
workaround we wrote for the Pentium FDIV bug would work for all possible
double precision inputs.

MitchAlsup

unread,

Nov 30, 2013, 5:09:19 PM11/30/13

to

On Friday, November 29, 2013 3:52:41 PM UTC-6, Terje Mathisen wrote:
> Basically they used a medium-sized starting lookup table, followed by
> bitsize optimized multipliers for a fixed-length very short poly.

Yep, there is a dedicated HW unit designed to perform Transcendentals to 1 ULP

There are 2 patents under Briggs (AMD n.e. ATI) and Matula (UT-dallas)
which describe the necessary numerics properties to achieve 1 ULP. One
of which gives you the tables in bit-vector form.

Mitch

MitchAlsup

unread,

Nov 30, 2013, 5:09:32 PM11/30/13

to

On Friday, November 29, 2013 3:52:41 PM UTC-6, Terje Mathisen wrote:

> Basically they used a medium-sized starting lookup table, followed by
> bitsize optimized multipliers for a fixed-length very short poly.