
CG-OoO: Energy-Efficient Coarse-Grain Out-of-Order Execution Near In-Order Energy with Near Out-of-Order Performance


joshua.l...@gmail.com
May 1, 2019, 1:52:53 AM

Here's a neat trick. CG-OoO is an architecture that is claimed to get very
close to OoO performance, but without the cost of reordering between such
a large number of buffered instructions.

https://dl.acm.org/citation.cfm?id=3151034..

“We introduce the Coarse-Grain Out-of-Order (CG-OoO) general-purpose
processor designed to achieve close to In-Order (InO) processor energy
while maintaining Out-of-Order (OoO) performance. CG-OoO is an
energy-performance-proportional architecture. Block-level code processing
is at the heart of this architecture; CG-OoO speculates, fetches,
schedules, and commits code at block-level granularity. It eliminates
unnecessary accesses to energy-consuming tables and turns large tables
into smaller, distributed tables that are cheaper to access. CG-OoO
leverages compiler-level code optimizations to deliver efficient static
code and exploits dynamic block-level and instruction-level parallelism.
CG-OoO introduces Skipahead, a complexity effective, limited out-of-order
instruction scheduling model. Through the energy efficiency techniques
applied to the compiler and processor pipeline stages, CG-OoO closes 62%
of the average energy gap between the InO and OoO baseline processors at
the same area and nearly the same performance as the OoO. This makes
CG-OoO 1.8× more efficient than the OoO on the energy-delay product
inverse metric. CG-OoO meets the OoO nominal performance while trading off
the peak scheduling performance for superior energy efficiency.”


Instead of building a single, fast core and trying to feed it all the
instructions in the reorder window, they split the problem hierarchically.

Each basic block in the window is allocated to a ‘Block Window’ (BW). Each
BW has a cheap local register file, a *small* reorder buffer (around 3-5
entries) to cover stalls, and an amortized throughput limit of 1
instruction/cycle; small groups (again around 3-5) of BWs share execution
units. On top of that sits a global network for communicating the global
registers, which undergo rename, between Block Windows, plus a reorder
buffer over whole blocks.
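
A rough structural sketch of that hierarchy (mine, in Python; the counts, depths and group sizes are illustrative guesses, not the paper's exact configuration):

from dataclasses import dataclass, field
from typing import ClassVar, Dict, List

@dataclass
class BlockWindow:
    """One basic block in flight: cheap local state, ~1 instruction/cycle."""
    ISSUE_PER_CYCLE: ClassVar[int] = 1
    local_regs: Dict[str, int] = field(default_factory=dict)  # small local register file
    local_rob: List[object] = field(default_factory=list)     # tiny (~3-5 entry) reorder buffer

@dataclass
class Cluster:
    """A small group of BWs sharing a handful of execution units."""
    windows: List[BlockWindow]
    shared_eus: int = 2

@dataclass
class Core:
    clusters: List[Cluster]
    global_regs: Dict[str, int] = field(default_factory=dict)  # renamed, carried on the global network
    block_rob: List[BlockWindow] = field(default_factory=list) # reorder buffer over whole blocks

core = Core(clusters=[Cluster(windows=[BlockWindow() for _ in range(4)]) for _ in range(2)])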

The idea is to extract lots of long-range IPC from the system, but not
focus too much on being able to handle local IPC. This is in a sense the
inverse philosophy to a VLIW, or VLIW-ish approach like the Mill, where
the core does lots of local IPC, and tries to move that longer-range
parallelism into the local scope in the compiler.


Overall this idea looks really elegant to me. They have lots of nice
numbers, a compiler, and interesting simulations. I haven't managed to
think of any reason this shouldn't work in principle. It doesn't do
everything perfectly, like extremely parallel vector workloads, but it
does everything OK, and it circumvents a lot of the challenges of both OoO
processors and in-order processors by construction.

Something worth keeping in mind is that not having lots of local IPC
doesn't matter too much, as long as your compiler is sensible. If you can
start execution early, the basic block doesn't need low latency. If you
can't start execution early, then either the later blocks don't immediately
depend on you, so you're not holding much else up, or they do, and the
code doesn't have any latent parallelism to extract anyway. The exception
is when you could be doing things in parallel but for the local reorder
buffer filling up, or all the global parallelism being pulled into the
local scope by very weird dependency graphs; but compilers can detect
that and hopefully just split the basic blocks up.


One thing I did think worth exploring is avoiding the global interconnect
and global register rename phases by having many copies of a smaller global
register file and chaining each individual register with its copies in a
circular buffer.

Diagram: https://i.imgur.com/21Icv1k.png

Red and orange lines control register forwarding, overwriting, and
readiness. The blue line is a skip connection to save latency.

It's not obvious if hardware like this would be practical, but I don't see
why not, and if it is then it gets rid of rename, global wakeup, and
global forwarding paths and their corresponding throughput and latency
limits.

MitchAlsup
May 2, 2019, 11:34:38 AM

On Wednesday, May 1, 2019 at 12:52:53 AM UTC-5, joshua....@gmail.com wrote:
> Here's a neat trick. CG-OoO is an architecture that is claimed to get very
> close to OoO performance, but without the cost of reordering between such
> a large number of buffered instructions.
>
> https://dl.acm.org/citation.cfm?id=3151034..
>
> “We introduce the Coarse-Grain Out-of-Order (CG-OoO) general-purpose
> processor designed to achieve close to In-Order (InO) processor energy
> while maintaining Out-of-Order (OoO) performance. CG-OoO is an
> energy-performance-proportional architecture. Block-level code processing
> is at the heart of this architecture; CG-OoO speculates, fetches,
> schedules, and commits code at block-level granularity. It eliminates
> unnecessary accesses to energy-consuming tables and turns large tables
> into smaller, distributed tables that are cheaper to access. CG-OoO
> leverages compiler-level code optimizations to deliver efficient static
> code and exploits dynamic block-level and instruction-level parallelism.
> CG-OoO introduces Skipahead, a complexity effective, limited out-of-order
> instruction scheduling model. Through the energy efficiency techniques
> applied to the compiler and processor pipeline stages, CG-OoO closes 62%
> of the average energy gap between the InO and OoO baseline processors at
> the same area and nearly the same performance as the OoO. This makes
> CG-OoO 1.8× more efficient than the OoO on the energy-delay product
> inverse metric. CG-OoO meets the OoO nominal performance while trading off
> the peak scheduling performance for superior energy efficiency.”

Some comparative data::
A properly designed 1-wide InOrder machine can achieve just about 1/2 the
performance of a GreatBigOoO machine at somewhere in the 1/10 to 1/16 the
area and power.

Also note:
There is middle ground between InOrder machines and Out of Order machines.
Machines can be partially ordered--out of order with respect to the program
counter but in order with respect to the function unit. This dramatically
reduces the problem of recovery (branch, exception,...) with very little
loss wrt performance.

Given a bit of compiler cleverness (such as software pipelining) one can
get essentially the same performance as the GBOoO machines at much less
cost (area and power).

None of the above do anything to harm max operating frequency; in fact the
PO and IO machines should be able to run faster than the GBOoO.

EricP
May 2, 2019, 12:59:13 PM

MitchAlsup wrote:
> On Wednesday, May 1, 2019 at 12:52:53 AM UTC-5, joshua....@gmail.com wrote:
>> Here's a neat trick. CG-OoO is an architecture that is claimed to get very
>> close to OoO performance, but without the cost of reordering between such
>> a large number of buffered instructions.
>>
>> https://dl.acm.org/citation.cfm?id=3151034..
>>
>> <snip>
>
> Some comparative data::
> A properly designed 1-wide InOrder machine can achieve just about 1/2 the
> performance of a GreatBigOoO machine at somewhere in the 1/10 to 1/16 the
> area and power.
>
> Also note:
> There is middle ground between InOrder machines and Out of Order machines.
> Machines can be partially ordered--out of order with respect to the program
> counter but in order with respect to the function unit. This dramatically
> reduces the problem of recovery (branch, exception,...) with very little
> loss wrt performance.
>
> Given a bit of compiler cleverness (such as software pipelining) one can
> get essentially the same performance as the GBOoO machines at much less
> cost (area and power).
>
> None of the above do anything to harm max operating frequency; in fact the
> PO and IO machines should be able to run faster than the GBOoO.

I came across a paper about a RISC-V implementation called Ariane
which I found interesting as it employs some of the uArch ideas
I have thought about for lightweight OoO.

The Cost of Application-Class Processing: Energy and Performance Analysis
of a Linux-ready 1.7GHz 64bit RISC-V Core in 22nm FDSOI Technology, 2019
https://arxiv.org/abs/1904.05442

- 6 stage pipeline
- in-order single issue, OoO execute & write-back, in-order commit
- variable latency execute units (ALU = 1, MUL = 2, DIV = 2..64)
- scoreboard scheduler
- 8 entry very lightweight ROB (no result values, no data ports,
no CAMs, no scheduling, etc. - just what is needed to commit in-order.)

It employs an interesting lightweight renaming mechanism:
"Write after Write (WAW) hazards in the scoreboard are resolved
through a light-weight re-naming scheme which increases the
logical register address space by one bit. Each issued instruction
toggles the MSB of its destination register address and subsequent
read addresses are re-named to read from the latest register address."
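
As a toy Python model of that scheme (my reading of the quoted description, not Ariane's actual RTL):

class MsbRename:
    """WAW disambiguation by doubling the register namespace with one extra MSB."""
    def __init__(self, n_regs=32):
        self.bits = (n_regs - 1).bit_length()  # e.g. 5 address bits for 32 registers
        self.msb = [0] * n_regs                # current extra MSB per logical register

    def rename_sources(self, srcs):
        # Subsequent reads are redirected to the latest physical copy of each register.
        return [(self.msb[r] << self.bits) | r for r in srcs]

    def rename_dest(self, rd):
        # Each issued instruction toggles the MSB of its destination register address,
        # so an older in-flight write and the new one never collide (no WAW stall).
        self.msb[rd] ^= 1
        return (self.msb[rd] << self.bits) | rd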

They measured performance at:
"On the rather small Dhrystone benchmark the mispredict
rate is 5.77% with a 128-entry BHT and a 64-entry BTB.
This results in an IPC of 0.82 for the Dhrystone benchmark."




Ivan Godard
May 2, 2019, 3:49:45 PM

Amen!

MitchAlsup
May 2, 2019, 4:22:58 PM

My scoreboard technology brought to them via Luke; Lkcl.

> - 8 entry very lightweight ROB (no result values, no data ports,
> no CAMs, no scheduling, etc. - just what is needed to commit in-order.)

Saying it has no CAMs is a grey area. Saying it has no binary CAMs is true.
The difference is in how the read reservations are allowed to complete
before the next write, keeping registers in a useful partial order. The
practical difference is that a k-ported CAM (like a reservation station or ROB)
will have k×log2(physical registers) wires moving each cycle, while the
Scoreboard only has k moving (and no XOR gates,...) (With, say, 4 ports and
64 physical registers, that is 4×6 = 24 wires toggling versus 4.)

>
> It employs an interesting lightweight renaming mechanism:
> "Write after Write (WAW) hazards in the scoreboard are resolved
> through a light-weight re-naming scheme which increases the
> logical register address space by one bit. Each issued instruction
> toggles the MSB of its destination register address and subsequent
> read addresses are re-named to read from the latest register address."

All sorts of hazards can be treated with scoreboard techniques--unlike how
Hennessy and Patterson describe the situation.
>
> They measured performance at:
> "On the rather small Dhrystone benchmark the mispredict
> rate is 5.77% with a 128-entry BHT and a 64-entry BTB.
> This results in an IPC of 0.82 for the Dhrystone benchmark."

sounds about right.

joshua.l...@gmail.com
May 2, 2019, 9:17:43 PM

On Thursday, 2 May 2019 16:34:38 UTC+1, MitchAlsup wrote:
>
> Some comparative data::
> A properly designed 1-wide InOrder machine can achieve just about 1/2 the
> performance of a GreatBigOoO machine at somewhere in the 1/10 to 1/16 the
> area and power.

The CG-OoO paper does include in-order measurements, but though they do
replicate your 1/2 performance claim, they used a 4-wide superscalar with a
70-wide register file to get there, and it runs closer to 1/2 the power per
instruction of the OoO core.

Getting that performance out of a 1-wide sounds incredibly optimistic to
me, seemingly almost impossible, so I'm curious what kind of core design
and workloads you're talking about.

Maybe a good way to put real numbers on what one can expect is to
look at modern Apple processors, which have a mix of big and little cores
that most people consider industry leading. The A12 has big 7-wide decode
OoO Vortex cores and little 3-wide decode OoO Tempest cores. According to
AnandTech[1], the smaller cores use about half the energy of the big ones
per instruction, or about 1/6 to 1/10 of the watts, but have only a third
to a quarter of the performance.

These are significantly worse power and performance numbers than your
claim. If a 1-wide in-order core could execute almost twice as fast as
their Tempest core, and still save energy, one wonders why they haven't
done it. If, rather, you only mean it's possible in principle, but Apple
don't have the skillset to execute, then it's not really a useful
alternative anyway.

[1]: https://www.anandtech.com/show/13392/the-iphone-xs-xs-max-review-unveiling-the-silicon-secrets/5

> Also note:
> There is middle ground between InOrder machines and Out of Order machines.
> Machines can be partially ordered--out of order with respect to the program
> counter but in order with respect to the function unit. This dramatically
> reduces the problem of recovery (branch, exception,...) with very little
> loss wrt performance.
>
> Given a bit of compiler cleverness (such as software pipelining) one can
> get essentially the same performance as the GBOoO machines at much less
> cost (area and power).

This sounds very similar to the skipahead technique used in the paper, and
they do test a 1-BW, 1-EU variant of the CG-OoO architecture that should be
similar to this in principle.

They find that it produces a noticeable performance increase at very low
cost, but it only closes about a fifth or so of the gap. Note that this is
still 4-wide, so execution throughput shouldn't be the bottleneck, though
the lookahead is pretty small.

Frankly this all seems very bullish. I can understand saying you can get a large improvement like this if you introduce something new, but this is just the same tech the industry has been working on for decades... well, if it's too good to be true...

MitchAlsup
May 2, 2019, 10:22:23 PM

On Thursday, May 2, 2019 at 8:17:43 PM UTC-5, joshua....@gmail.com wrote:
> On Thursday, 2 May 2019 16:34:38 UTC+1, MitchAlsup wrote:
> >
> > Some comparative data::
> > A properly designed 1-wide InOrder machine can achieve just about 1/2 the
> > performance of a GreatBigOoO machine at somewhere in the 1/10 to 1/16 the
> > area and power.
>
> The CG-OoO paper does include in-order measurements, but though they do
> replicate your 1/2 performance claim, they used a 4-wide superscalar with a
> 70-wide register file to get there, and it runs closer to 1/2 the power per
> instruction of the OoO core.

When we did Opteron, over the 5,000 4M instruction traces we had gathered and
assembled into a test collection, Opteron itself was getting 1.0 I/C while
a LBIO core I had designed was able to get almost 0.5 I/C. These were both
multi GHz designs with the same memory system from L2 outward and using the
same FU designs,...

You might claim 1.0 for Opteron is too low, but many of the traces have significant
OS footprints and these excursions damage the caches and TLBs in significant
ways. But in both cases we executed the same instruction profile with completely
different microarchitectures.
>
> Getting that performance out of a 1-wide sounds incredibly optimistic to
> me, seemingly almost impossible, so I'm curious what kind of core design
> and workloads you're talking about.

Think about this, a DRAM access was averaging 200+ cycles, L2 was in the
realm of 20+ cycles, and about 1/3-1/2 of the instructions inserted into
the execution window of Opteron were discarded due to branch misprediction
even with 95%+ branch prediction ratio. These cost time and energy without
doing anything for actual perf.

Since this was a bit more than a decade ago, I could be wrong.
Also note:: once you take 1/2 the I/C demand out of the caches, they display
less latency (so do the TLB and tablewalks.)

already...@yahoo.com
May 3, 2019, 8:35:08 AM

Opteron is not GBOoO by today's standards.
Perf/Hz of Opteron is significantly less than half of state-of-the-art GBOoO; maybe less than 1/4 compared to the newest Apple cores. Part of the gap is due to Opteron's slow exclusive L2 and the absence of an LLC, but I would guess that more than half of the difference is due to the core itself.

Opteron (the original SledgeHammer, not Shanghai or Istanbul) is in the same class as Goldmont. At best Goldmont+ and Arm Cortex-A73, but probably not quite.

MitchAlsup
May 3, 2019, 1:05:50 PM

On Wednesday, May 1, 2019 at 12:52:53 AM UTC-5, joshua....@gmail.com wrote:
> Here's a neat trick. CG-OoO is an architecture that is claimed to get very
> close to OoO performance, but without the cost of reordering between such
> a large number of buffered instructions.
>
> https://dl.acm.org/citation.cfm?id=3151034..

I looked this over, and it reminds me a lot of my Virtual Vector Facility
in that it decorates the top of a loop with an instruction that identifies
the loop and terminates the loop with another instruction.

VVF is different in that the top of loop instruction (VEC) identifies which
registers are scalar (versus vector) and are thus constant in the loop, and
at the bottom of the loop is the LOOP instruction which is equivalent to the
ADD/CMP/BC sequence at the bottom of counted loops.

loop: VEC   {VV,VV,V}    // VEC identifies next instruction as top of loop
      LD    Rv,[Rva+4]   // LOOP reverts back to this instruction
      ADD   Rva,Rva,1
      LOOP  Rv,0,0

So, my model only executes 3 instructions per loop traversal in the loop
analyzed in the paper instead of 5.

My vectorizing model, though, is aimed at more modest issue-width
architectures, but with the ability to add multiple lanes of vector
calculation, so one can perform 1, 2, 4, 8, ... loop iterations per cycle
depending on the implementation.

In addition, there are no vector register files defined in the architecture
or implemented in the microarchitecture.

Finally, I only waste an "instruction" (VEC) when I have a loop to be
vectorized, not every basic block.

Anton Ertl
May 3, 2019, 2:05:51 PM

MitchAlsup <Mitch...@aol.com> writes:
>Some comparative data::
>A properly designed 1-wide InOrder machine can achieve just about 1/2 the
>performance of a GreatBigOoO machine at somewhere in the 1/10 to 1/16 the
>area and power.

I have already presented some IPC numbers for our LaTeX benchmark in
<2019Feb...@mips.complang.tuwien.ac.at>, but I'll add some more
based on the execution time results from
<https://www.complang.tuwien.ac.at/franz/latex-bench>, and assuming
that there are 2100M instructions in the benchmark where I have not
measured instructions:

      inst.      cycles   IPC  Hardware
                  6220M  0.34  1992 Intel 486, 66 MHz, 256K L2-Cache IA-32
                  2866M  0.73  1997 Pentium MMX 233 IA-32
                  3717M  0.57  2008 Atom 330 (Bonnell) 1600MHz IA-32
                  2006M  1.04  2011 AMD E-450 (Bobcat) 1650MHz AMD64
                  2541M  0.82  2013 Celeron J1900 2416MHz (Silvermont) AMD64
                  1638M  1.28  2016 Celeron J3455 (Goldmont) AMD64
      2071M       1336M  1.55  2017 Celeron J4105 (Goldmont+) 2500MHz AMD64
 2007080563  1787418318  1.12  2005 K8 Athlon 64 X2 4400+ (2x2.2GHz, 2x1024KB L2)
 2108280684  1483902954  1.42  2007 Penryn Xeon E5450 (4x3GHz, 2x6MB L2)
 2140831178  1193597880  1.79  2011 Sandy Bridge Xeon E3 1220 (4x3.1GHz, 8MB L3)
 2140436438  1075849207  1.99  2014 Haswell Core i7-4790K (4x4GHz, 8MB L3)
 2070349870   945546455  2.19  2015 Skylake Core i5-6600K (4x3.5GHz, 6MB L3)

So the 486 has about one third the IPC of the K8. Even the two-wide
in-order Atom 330 has half the IPC of the K8. And the K8 has
half the IPC of Skylake.

In the low-power low-cost market segment, compared to the in-order
Bonnell, the OoO Bobcat has almost twice the IPC at the same clock
rate, while the OoO Silvermont has only 1.4x IPC, but higher clock
(resulting in overall slightly better performance). There is
apparently no power or clock rate advantage due to in-order, otherwise
Intel would have stayed with in-order for this market segment.

>Given a bit of compiler cleverness (such as software pipelining) one can
>get essentially the same performance as the GBOoO machines at much less
>cost (area and power).

Intel bet IA-64 on that. It did not work out.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

MitchAlsup
May 3, 2019, 3:36:18 PM

On Friday, May 3, 2019 at 1:05:51 PM UTC-5, Anton Ertl wrote:

> I have already presented some IPC numbers for our LaTeX benchmark in
> <2019Feb...@mips.complang.tuwien.ac.at>, but I'll add some more
> based on the execution time results from
> <https://www.complang.tuwien.ac.at/franz/latex-bench>, and assuming
> that there are 2100M instructions in the benchmark where I have not
> measured instructions:
>
>       inst.      cycles   IPC  Hardware
>                   6220M  0.34  1992 Intel 486, 66 MHz, 256K L2-Cache IA-32
>                   2866M  0.73  1997 Pentium MMX 233 IA-32
>                   3717M  0.57  2008 Atom 330 (Bonnell) 1600MHz IA-32
>                   2006M  1.04  2011 AMD E-450 (Bobcat) 1650MHz AMD64
>                   2541M  0.82  2013 Celeron J1900 2416MHz (Silvermont) AMD64
>                   1638M  1.28  2016 Celeron J3455 (Goldmont) AMD64
>       2071M       1336M  1.55  2017 Celeron J4105 (Goldmont+) 2500MHz AMD64
>  2007080563  1787418318  1.12  2005 K8 Athlon 64 X2 4400+ (2x2.2GHz, 2x1024KB L2)
>  2108280684  1483902954  1.42  2007 Penryn Xeon E5450 (4x3GHz, 2x6MB L2)
>  2140831178  1193597880  1.79  2011 Sandy Bridge Xeon E3 1220 (4x3.1GHz, 8MB L3)
>  2140436438  1075849207  1.99  2014 Haswell Core i7-4790K (4x4GHz, 8MB L3)
>  2070349870   945546455  2.19  2015 Skylake Core i5-6600K (4x3.5GHz, 6MB L3)

Thanks for the data.

MitchAlsup
May 3, 2019, 4:31:01 PM

On Friday, May 3, 2019 at 1:05:51 PM UTC-5, Anton Ertl wrote:
>
> I have already presented some IPC numbers for our LaTeX benchmark

a) how big is the data being processed by LaTeX?
b) what would the I/Cs be if no SIMD instructions were executed?

joshua.l...@gmail.com
May 3, 2019, 9:29:15 PM

On Friday, 3 May 2019 03:22:23 UTC+1, MitchAlsup wrote:
> On Thursday, May 2, 2019 at 8:17:43 PM UTC-5, joshua....@gmail.com wrote:
> >
> > The CG-OoO paper does include in-order measurements, but though they do
> > replicate your 1/2 performance claim, they used a 4-wide superscalar with a
> > 70-wide register file to get there, and it runs closer to 1/2 the power per
> > instruction of the OoO core.
>
> When we did Opteron, over the 5,000 4M instruction traces we had gathered and
> assembled into a test collection, Opteron itself was getting 1.0 I/C while
> a LBIO core I had designed was able to get almost 0.5 I/C. These were both
> multi GHz designs with the same memory system from L2 outward and using the
> same FU designs,...
>
> You might claim 1.0 for Opteron is too low, but many of traces have significant
> OS footprints and these excursions damage the caches and TLBs in significant
> ways. But in both cases we executed the same instruction profile with complete-
> ly different microarchitectures.
> >
> > Getting that performance out of a 1-wide sounds incredibly optimistic to
> > me, seemingly almost impossible, so I'm curious what kind of core design
> > and workloads you're talking about.
>
> Think about this, a DRAM access was averaging 200+ cycles, L2 was in the
> realm of 20+ cycles, and about 1/3-1/2 of the instructions inserted into
> the execution window of Opteron were discarded due to branch misprediction
> even with 95%+ branch prediction ratio. These cost time and energy without
> doing anything for actual perf.
>
> Since this was a bit more than a decade ago, I could be wrong.

OoO cores have moved quite a way in the last decade, I'd say.[1] Just
halving the branch miss rate doubles the amount you can look ahead. There's
not as much one can do to speed up a 1-wide in-order design without it
ceasing to be 1-wide or in-order.
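
(Rough arithmetic for intuition, under the common assumption of roughly one branch per five instructions, each mispredicted independently with probability p: the expected run between mispredicts is about 5/p instructions, so halving p roughly doubles the distance you can usefully look ahead.)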

This asymmetry in performance improvements accounts for much or all of the
discrepancy.

[1]: https://imgur.com/gallery/GcUyY8k

> > > Also note:
> > > There is middle ground between InOrder machines and Out of Order machines.
> > > Machines can be partially ordered--out of order with respect to the program
> > > counter but in order with respect to the function unit. This dramatically
> > > reduces the problem of recovery (branch, exception,...) with very little
> > > loss wrt performance.
> > >
> > > Given a bit of compiler cleverness (such as software pipelining) one can
> > > get essentially the same performance as the GBOoO machines at much less
> > > cost (area and power).
> >
> > This sounds very similar to the skipahead technique used in the paper, and
> > they do test a 1-BW, 1-EU variant of the CG-OoO architecture that should be
> > similar to this in principle.
> >
> > They find that it produces a noticeable performance increase at very low
> > cost, but it only closes about a fifth or so of the gap. Note that this is
> > still 4-wide, so execution throughput shouldn't be the bottleneck, though
> > the lookahead is pretty small.
> >
> > Frankly this all seems very bullish. I can understand saying you can
> > get a large improvement like this if you introduce something new, but this
> > is just the same tech the industry has been working on for decades... well,
> > if it's too good to be true...
>
> Also note:: once you take 1/2 the I/C demand out of the caches, they display
> less latency (so does the TLB and talbewalks.)

I'm not sure what you're saying here. I/C means instruction cache?
Why would that be better in an in-order?

I can't say I see a lot of similarity here. CG-OoO is no more specific to
loops than a ROB is. The core isn't designed for massive overall
parallelism, since it will quickly bottleneck on the decoder, and
presumably targeting that will still take dedicated vector techniques like
SIMD or such.

That said, I can't say I understand your approach that well, either (though
a pointer to a writeup, if you've got one, would help). What happens when
the loop you want to vectorize is 50 instructions long? Don't you run out
of execution units?

Paul Rubin
May 3, 2019, 10:12:22 PM

MitchAlsup <Mitch...@aol.com> writes:
> VVF is different in that the top of loop instruction (VEC) identifies
> which registers are scalar (versus vector) and are thus constant in
> the loop, and at the bottom of the loop is the LOOP instruction which
> is equivalent to the ADD/CMP/BC sequence at the bottom of counted
> loops.

Is this different from the zero-overhead loops that DSP's have had since
forever? I've figured big processors manage to do it all behind the
scenes using branch prediction.

Quadibloc
May 4, 2019, 12:43:57 AM

On Thursday, May 2, 2019 at 2:22:58 PM UTC-6, MitchAlsup wrote:

> My scoreboard technology brought to them via Luke; Lkcl.

A Google search brought me to this:

https://www.crowdsupply.com/libre-risc-v/m-class/updates/modernising-1960s-computer-technology-learning-from-the-cdc-6600

which seems to be what you're referring to.

John Savard

MitchAlsup
May 4, 2019, 9:52:02 AM

If you take 1/2 of the instruction per cycle demand out of the caches,
they inherently display less latency.

MitchAlsup
May 4, 2019, 9:53:09 AM

On Friday, May 3, 2019 at 9:12:22 PM UTC-5, Paul Rubin wrote:
Generally the big machines overloop and then back up when the loop termination
becomes known.

MitchAlsup
May 4, 2019, 9:54:02 AM

The later drawings are certainly mine.
>
> John Savard

EricP
May 4, 2019, 10:10:23 AM

MitchAlsup wrote:
> On Friday, May 3, 2019 at 8:29:15 PM UTC-5, joshua....@gmail.com wrote:
>> On Friday, 3 May 2019 03:22:23 UTC+1, MitchAlsup wrote:
>>> Also note:: once you take 1/2 the I/C demand out of the caches, they display
>>> less latency (so does the TLB and talbewalks.)
>> I'm not sure what you're saying here. I/C means instruction cache?
>> Why would that be better in an in-order?
>
> If you take 1/2 of the instruction per cycle demand out of the caches,
> they inherently display less latency.

I assume you mean that memory queue delay is a significant
amount of the total access latency.
Less requests = shorter queue = lower latency = smaller pipeline bubbles
= higher IPC.


MitchAlsup
May 4, 2019, 10:58:42 AM

Less requests = Fewer misses per cycle

EricP
May 4, 2019, 2:22:41 PM

:-) that too.

Anton's IPC table got me thinking about what IPC is actually measuring.
The table shows IPC increasing over time, but cache size also
increases over time, and one would expect that as cache size increases,
average access time decreases, and IPC goes up.
That got me wondering how much of the IPC increase is due
to larger caches, and how much to improved uArch efficiency:
an increased ability to hide pipeline bubbles and catch up after one.



MitchAlsup
May 4, 2019, 4:54:54 PM

On Saturday, May 4, 2019 at 1:22:41 PM UTC-5, EricP wrote:
> MitchAlsup wrote:
> > On Saturday, May 4, 2019 at 9:10:23 AM UTC-5, EricP wrote:
> >> MitchAlsup wrote:
> >>> On Friday, May 3, 2019 at 8:29:15 PM UTC-5, joshua....@gmail.com wrote:
> >>>> On Friday, 3 May 2019 03:22:23 UTC+1, MitchAlsup wrote:
> >>>>> Also note:: once you take 1/2 the I/C demand out of the caches, they display
> >>>>> less latency (so does the TLB and talbewalks.)
> >>>> I'm not sure what you're saying here. I/C means instruction cache?
> >>>> Why would that be better in an in-order?
> >>> If you take 1/2 of the instruction per cycle demand out of the caches,
> >>> they inherently display less latency.
> >> I assume you mean that memory queue delay is a significant
> >> amount of the total access latency.
> >> Less requests = shorter queue = lower latency = smaller pipeline bubbles
> >> = higher IPC.
> >
> > Less requests = Fewer misses per cycle
>
> :-) that too.
>
> Aton's IPC table got me thinking about what IPC is actually measuring.
> The table shows IPC increasing over time, but cache size is also
> increases over time, and one would expect that as cache size increases,

Cache size stopped at [6,8] MB '07-'15
Frequency is up only 33% '07-'15

> average access time decreases, and IPC goes up.
> That got me wondering how much of the IPC increase is due
> to larger caches, and how much to improved uArch efficiency,

More misses processed per cycle in the L2-L3s
?wider data transfers?
Better branch prediction
Faster DRAM {Both latency and BW}

> an increased ability to hide pipeline bubbles and catch up after one.

Deeper execution windows.

And other stuff.

Anton Ertl
May 5, 2019, 2:17:20 AM

EricP <ThatWould...@thevillage.com> writes:
>Aton's IPC table got me thinking about what IPC is actually measuring.
>The table shows IPC increasing over time, but cache size is also
>increases over time, and one would expect that as cache size increases,
>average access time decreases, and IPC goes up.

The LaTeX benchmark does not miss in the cache much; on a Core
i7-6700K (32KB L1, 256KB L2, 8MB L3)
<2016Jan...@mips.complang.tuwien.ac.at>:

     2 cores   1 core 2 threads   event
 868M750k967    1526M150k379      cycles
1949M558k752    1949M705k920      instructions
 402M397k933     402M422k201      r04c4 all branches retired
   4M226k552       4M560k550      r04c5 all branches mispredicted
 389M988k743     390M015k254      r82d0 all stores retired
 609M795k749     609M820k645      r81d0 all loads retired
 599M789k726     590M869k670      r01d1 load retired l1 hit
   5M708k401      11M998k206      r08d1 load retired l1 miss
   4M822k206       9M739k656      r02d1 load retired l2 hit
     897k646       2M315k712      r10d1 load retired l2 miss
     884k132       2M299k580      r04d1 load retired l3 hit
      12k898          16k628      r20d1 load retired l3 miss

All the machines I gave numbers for have at least 256K L2 cache, so
<1M cache misses for about 2G instructions. I think that, for this
benchmark, the microarchitectural improvements are the main reason for
the IPC increase.

Anton Ertl
May 5, 2019, 2:53:15 AM

MitchAlsup <Mitch...@aol.com> writes:
>On Friday, May 3, 2019 at 1:05:51 PM UTC-5, Anton Ertl wrote:
>>
>> I have already presented some IPC numbers for our LaTeX benchmark
>
>a) how big is the data being processed by LaTeX?

You can download the source files to process at
<https://www.complang.tuwien.ac.at/anton/latex-bench/>. The main
source file is 142906 bytes long, the part of the bibliography
processed by latex 7476 bytes, and it also processes the 5144 byte
.aux file, for a total of 155526. The outputs are 220048 (.dvi), 5254
(.log), and 5144 (.aux) bytes long. Latex also reads a number of
other files when processing this input. Overall, there are 582 calls
of read() during a run I just made, many of them for 4096 bytes, and
many of those actually get the 4096 bytes, so overall <2.4MB of files
is read (this includes the 155526 bytes above). There are also a
number of mmap() calls, but they seem to be due to the loader (none of
them is after the start message of the latex program).

>b) what would the I/Cs be if no SIMD instructions were executed?

Probably the same. The source code is plain C code, and the
instruction count has not changed much over time (so the
auto-vectorization did not manage to vectorize important parts of the
program). The major reason for SIMD instructions would be library or
kernel calls of stuff like memcpy(); I just looked at that with "perf
record ...; perf report", and did not find such library calls at all
(which probably means that none of the 933 samples happened in such a
call), but some kernel calls that probably use SIMD instructions:

0.21% latex [kernel.kallsyms] [k] copy_user_enhanced_fast_string
0.11% latex [kernel.kallsyms] [k] copy_page_to_iter

Very little time is spent there, so the effect of eliminating SIMD
instructions cannot be much.

anti...@math.uni.wroc.pl
May 7, 2019, 1:56:40 PM

TeX (which is at the core of this benchmark) is essentially a special
macro-processor (expander). The working data is on the order of a few
megabytes; the main part consists of (nowadays two-word) cells which
can form various dynamic data structures. The core executable is
about 160 kB, so it should fit into the L2 cache. When, in the nineties,
I looked into the correlation between execution time and benchmarks,
I found a good correlation to Dhrystone. Memory speed and cache
size/speed/organization had a much smaller effect.

To add further anecdotal evidence: I had an Athlon 64 as my main
machine, and now use a Core 2 and an i5. For my purposes the 1.8 GHz Core 2
gave slightly better performance than the Athlon 64. And the 1.7 GHz i5
performs comparably to the Core 2. That would suggest an improvement
in IPC on the order of 1.5.

Note: this affects code that AFAIK uses no SIMD operations
(but may benefit from faster buses).

--
Waldek Hebisch

lkcl
May 24, 2019, 3:38:22 PM

On Wednesday, May 1, 2019 at 1:52:53 PM UTC+8, joshua....@gmail.com wrote:

> into smaller, distributed tables that are cheaper to access. CG-OoO
> leverages compiler-level code optimizations to deliver efficient static
> code and exploits dynamic block-level and instruction-level parallelism.

Hm, this is a thumbs down for use in general purpose computing, unfortunately.
Whilst yes, compiler optimisations can be expected, if the architecture is to take off, tying it to a compiler is a mistake.
Different hardware implementors will do different things, and performance will be penalised because the implementation does not match precisely what the compiler wants.

And of course, nobody is going to recompile the binaries of a proprietary application for the "underdog". We had this with Microsoft using Intel's proprietary C compiler. AMD were *always* penalised.

If on the other hand there is no expectation to create a wider ubiquitous community, the idea has merit.

> does everything OK, and it circumvents a lot of the challenges of both OoO
> processors and in-order processors by construction.

Having begun my first core as a 6600-based engine (with thanks to Mitch for being exceptionally patient and helpful), I am finding this perspective much easier both to understand and to work with, even compared to standard pipelined in-order designs.

That having been said, as noted in Thornton's "Design of a Computer", the 6600 scoreboard is exceptionally complex in its details, yet is paradoxically extremely simple in its actual implementation and effectiveness in gate count. The 6600 scoreboard's entire gate count is noticeably lower even than one of the simplest ALUs.

Code is here if interested
http://git.libre-riscv.org/?p=soc.git;a=tree;f=src/scoreboard;h=2015241742c84ec16c132a29960117adc398a310;hb=9aade2ba12aa80e263a43fefe83f5fbbae211793

Think of what has to be done in an inorder pipelined design.

* RaW hazards have to be detected. Solution: stall.
* likewise for WaR hazards.
* likewise for WaW.
* Interrupts have to be masked during atomic operations (yuk).
* LDs / ST hazards have to be dealt with. Solution: stall.
* Branches have to be observed: the simplest solution: stall.
* DIV and other FSM style blocking units are hard to integrate, as they block dependent instructions. Solution: stall.

etc etc. Note the fact that the inorder design *still has to detect the hazards*.

This is unavoidable. The catch-all solution (stall) gets stale very quickly, and also impacts the design of the pipelines. Early-out pipelines wreak havoc, causing such scheduling nightmares that no sane in-order designer will touch them.

By contrast, I found that (with Mitch's help) I was able to implement a 6600 style augmented scoreboard, capable of dealing with RaW and WaR hazards in around 5 weeks. Adding in WaW on top of that took one day, and used the shadow system intended for branch speculation and precise exceptions to do so.

Once the shadow system had been confirmed as operational, I implemented branch speculation earlier today and will spend tomorrow testing and debugging it.

Once the initial RaW and WaR technical hurdle was cleared, progress has been very rapid, as the core concept is so cleanly designed: the Dependency Matrices take care of ALL hazards.

That's worth repeating.

6600 style Dependency Matrices take care of ALL hazards, leaving units free to execute WITHOUT STALLING.

The pipelined ALUs for example do not need to stall, they need only "cancellation" capability (and there is a way to avoid that).

Early-out and multi-path pipelines are also perfectly safe, as there is a "Management" signalling layer. The ALU just has to say when it is done. The completion time is NOT relevant.

FSM based ALUs are treated just like any of the other parallel units. Again: as the Dependency Matrices calculate the hazards, the blocking ALUs simply need to say when they have a result ready.

6600 scoreboards also come with register renaming *for free*. As in, it is an accidental inherent part of the algorithm. It does not need to be "added" and, ironically, cannot be removed, either.

Mitch's augmentations add precise exceptions and branch speculation. It is not conceptually difficult (nor costly in terms of gates): an array of latches hooks into the "Commit" phase, preventing register write whilst still allowing result *creation* until the outcome of the exception or branch is known.

Multi-issue execution, Write after Write and vector predication can all use the exact same "shadow" mechanism.
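
As a behavioural sketch of the shadow idea (mine, in Python, not the libre-riscv RTL):

class ShadowedResult:
    """A result that has been *created* but whose register write is held back."""
    def __init__(self, dest_reg, value):
        self.dest_reg, self.value = dest_reg, value
        self.shadows = set()        # ids of unresolved branches/exceptions over us
        self.cancelled = False

    def add_shadow(self, shadow_id):
        self.shadows.add(shadow_id)

    def resolve(self, shadow_id, success):
        self.shadows.discard(shadow_id)
        if not success:             # branch went the other way / exception taken:
            self.cancelled = True   # the instruction self-destructs

    def may_write_register(self):
        # Commit is gated until every shadow has resolved successfully.
        return not self.shadows and not self.cancelled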

In short: with all of the complexity centralised, taking care of literally all hazards, the rest of the design becomes *drastically* simpler and much more straightforward.

So no, I strongly disagree that an OoO design is inherently more complex than an inorder one. Even an inorder design can I believe be made much simpler to implement by having degenerate Dependency Matrices.

L.

Ivan Godard
May 24, 2019, 4:10:11 PM

On 5/24/2019 12:38 PM, lkcl wrote:

> Think of what has to be done in an inorder pipelined design.
>
> * RaW hazards have to be detected. Solution: stall.
> * likewise for WaR hazards.
> * likewise for WaW.
> * Interrupts have to be masked during atomic operations (yuk).
> * LDs / ST hazards have to be dealt with. Solution: stall.
> * Branches have to be observed: the simplest solution: stall.
> * DIV and other FSM style blocking units are hard to integrate, as they block dependent instructions. Solution: stall.
>
> etc etc. Note the fact that the inorder design *still has to detect the hazards*.
>
> This is unavoidable. The catch-all solution (stall) gets stale very quickly, and also impacts the design of the pipelines. Early out pipelines wreak havoc, cause such scheduling nightmares that no sane inorder designer will touch them.

Well, not really. What you have described is a consequence of (inOrder &
genRegs), not inOrder alone.

Now if you take the genRegs out of the act (and change a few other
legacy things) you see:

> * RaW hazards have to be detected.
can't happen.
> * likewise for WaR hazards.
can't happen
> * likewise for WaW.
can't happen
> * Interrupts have to be masked during atomic operations (yuk).
no masking with optimistic atomics
> * LDs / ST hazards have to be dealt with.
no hazards with at-retire view
> * Branches have to be observed:
no stall with skid buffers
> * DIV and other FSM style blocking units are hard to integrate, as
they block dependent instructions.
no stall with exposed FSM

So, no stalls from this list. Yes, there are predict miss and cache
miss stalls, but with or without a scoreboard the OOOs have those too.

lkcl
May 24, 2019, 4:22:30 PM

On Saturday, May 25, 2019 at 4:10:11 AM UTC+8, Ivan Godard wrote:
> On 5/24/2019 12:38 PM, lkcl wrote:
>
> > Think of what has to be done in an inorder pipelined design.
> >
> > * RaW hazards have to be detected. Solution: stall.
> > * likewise for WaR hazards.
> > * likewise for WaW.
> > * Interrupts have to be masked during atomic operations (yuk).
> > * LDs / ST hazards have to be dealt with. Solution: stall.
> > * Branches have to be observed: the simplest solution: stall.
> > * DIV and other FSM style blocking units are hard to integrate, as they block dependent instructions. Solution: stall.
> >
> > etc etc. Note the fact that the inorder design *still has to detect the hazards*.
> >
> > This is unavoidable. The catch-all solution (stall) gets stale very quickly, and also impacts the design of the pipelines. Early out pipelines wreak havoc, cause such scheduling nightmares that no sane inorder designer will touch them.
>
> Well, not really. What you have described is a consequence of (inOrder &
> genRegs), not inOrder alone.
>

By genRegs do you mean general-purpose registers? I assumed this.

Yes of course if an inorder system has no general purpose registers (per se), or interestingly if the pipelines are extremely simple same cycle combinatorial blocks (ie are not pipelines at all), there *are* no register hazards as such.

Oh, I forgot. CSRs. Control Status Registers. The typical approach to these, particularly the ones that alter execution state: worse than stall, *flush entire pipelines* typically needed!

Whereas I am envisioning that a 6600 style scoreboard and associated shadow augmentation could easily be turned to help with CSRs, by treating them as hazard-contentious material.

Anything that altered execution state that could be dangerous would be covered by shadows. Needs some more thought, just an idea at this stage.

L.

lkcl
May 24, 2019, 4:29:19 PM

On Saturday, May 25, 2019 at 4:10:11 AM UTC+8, Ivan Godard wrote:

> > * Branches have to be observed:
> no stall with skid buffers
> > * DIV and other FSM style blocking units are hard to integrate, as
> they block dependent instructions.
> no stall with exposed FSM

What's the complexity cost on each of these? Both in terms of understandability as well as implementation details.

Optimistic atomics sound scary as hell. Skid buffers I have not heard of so need to look up (itself not a good sign). Not heard the term "exposed FSM" used until today.

All of which kinda helps make my point, that an inorder design has to turn into a multi-headed hydra monster of disproportionate design complexity, in order (ha ha) to work around the fundamental limitations inherent to in-order execution.

L.

MitchAlsup
May 24, 2019, 5:15:27 PM

After a nice summary by Luke:
An important property of the SB is that calculation units don't have to be
pipelined--you just have to have enough of them to avoid stalling ISSUE.

>
> Early-out and multi-path pipelines are also perfectly safe, as there is a "Management" signalling layer. The ALU just has to say when it is done. The completion time is NOT relevant.
>
> FSM based ALUs are treated just like any of the other parallel units. Again: as the Dependency Matrices calculate the hazards, the blocking ALUs simply need to say when they have a result ready.
>
> 6600 scoreboards also come with register renaming *for free*. As in, it is an accidental inherent part of the algorithm. It does not need to be "added" and, ironically, cannot be removed, either.

Technically one is NOT renaming the registers*, just making sure that all
writes are in program order per register and that all reads of a written
register value occur prior to the next write to the register. The
register is never actually renamed; the SB keeps track of which FU will
deliver the result this FU is going to consume.

(*) R7 is always referred to as R7. But the SB keeps track of that facts
that FU[7] needs to read R7 after FU[5] writes R7 and before FU[9] attempts
to write R7.
>
> Mitch's augmentations add precise exceptions and branch speculation. It is not conceptually difficult (nor costly in terms of gates): an array of latches hooks into the "Commit" phase, preventing register write whilst still allowing result *creation* until the outcome of the exception or branch is known.

Many people believe that the most important thing about a SB is that it starts
instructions after resolving RAW hazards. This is incorrect. The important
thing about a SB is that it disallows completing instructions until WAR hazards
resolve.
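
As an illustration only (my sketch of that bookkeeping in Python, not Mitch's or the 6600's actual logic): R7 keeps its name, and per-(FU, register) pending-read/pending-write bits impose the ordering, with write completion held until the WAR reads have drained.

class Scoreboard:
    """Toy FU-vs-register dependency bookkeeping: no renaming, just ordering."""
    def __init__(self, n_fus, n_regs):
        self.rd_pend = [[False] * n_regs for _ in range(n_fus)]  # FU still needs to read reg
        self.wr_pend = [[False] * n_regs for _ in range(n_fus)]  # FU will write reg

    def issue(self, fu, srcs, dest):
        for r in srcs:
            self.rd_pend[fu][r] = True
        self.wr_pend[fu][dest] = True

    def may_read(self, fu, reg, older_fus):
        # RAW: wait for the older FU that will deliver this register's value.
        return not any(self.wr_pend[o][reg] for o in older_fus)

    def read_done(self, fu, reg):
        self.rd_pend[fu][reg] = False    # operand captured; WAR reservation released

    def may_complete_write(self, fu, reg, other_fus):
        # WAR (the key point): the write may not complete while any other FU still
        # holds an outstanding read of the old value; likewise for a pending WAW write.
        return not any(self.rd_pend[o][reg] or self.wr_pend[o][reg] for o in other_fus)

    def retire(self, fu):
        n = len(self.rd_pend[fu])
        self.rd_pend[fu] = [False] * n
        self.wr_pend[fu] = [False] * n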

Ivan Godard
May 24, 2019, 5:25:44 PM

On 5/24/2019 1:29 PM, lkcl wrote:
> On Saturday, May 25, 2019 at 4:10:11 AM UTC+8, Ivan Godard wrote:
>
>> > * Branches have to be observed:
>> no stall with skid buffers
>> > * DIV and other FSM style blocking units are hard to integrate, as
>> they block dependent instructions.
>> no stall with exposed FSM
>
> What's the complexity cost on each of these? Both in terms of understandability as well as implementation details.

Skid buffers: complexity trivial (capture FU outputs into ping-pong
buffers; restore on branch mispredict).

Exposed FSM: it's just code in the regular ISA; not even microcode

> Optimistic atomics sound scary as hell.

Transactional semantics have been around for decades; remember COBOL?
IBM has it in hardware; Intel tried it for a while but there was too
much mess trying to support both kinds of atomicity simultaneously (I
understand)

> Skid buffers I have not heard of so need to look up (itself not a good sign).

These address the delay between execution of a branch and execution of
the branched-to instruction, and incorporate any address arithmetic and the
confirmation of the executed branch against the prediction. This delay
seems to be typically two cycles, leading to delay-slot designs.

An alternative is to speculate down the predicted path and back up if
wrong, absorbing the back-up in the mis-predict overhead, but this
requires saving the state to back up to. As the amount of (potential)
back up is statically fixed, the pipeline can be arranged to save its
state at every cycle into a list of buffers (or shift regs) of length
equal to the branch realization delay. These are known as skid buffers,
analogous to a car skidding as it makes a hard stop.

Skid buffers do not work well if the rest of the design is not
idempotent, because any irreversible state changes would require special
handling to back up. Thus if the design uses genRegs it cannot be
permitted to overwrite a register during the brief speculation period;
that would require skid buffers for every register. However when all or
most actions during the speculation period are idempotent it is only
necessary to retain the transient mutable state, typically the FU outputs.
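
For what it's worth, a minimal software model of that (the depth and the captured state are assumptions, not the Mill's actual parameters):

from collections import deque

class SkidBuffer:
    """Keep the last few cycles of transient state so a mispredict can roll back."""
    def __init__(self, depth=2):                 # ~2-cycle branch realisation delay
        self.history = deque(maxlen=depth)

    def capture(self, transient_state):
        # Called once per cycle with the mutable transient state (e.g. FU outputs).
        self.history.append(dict(transient_state))

    def rollback(self, cycles_back=1):
        # On a mispredict, recover the state from `cycles_back` cycles ago;
        # everything done since then was idempotent and is simply discarded.
        return self.history[-cycles_back]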

The Mill hardware uses skid buffering in two places: by having the
physical belt be twice the size of the logical belt, and in the store
request stream in the hierarchy. There is actually one operation
(rescue) that can address into the belt skid buffers to recover
operands that would otherwise have fallen off the belt.

> Not heard the term "exposed FSM" used until today.

Just take the FSM out of the microcode and emulate it in the normal ISA
(probably augmented with a few helper operations). The DIV (etc) op's
code then interleaves/overlaps with everything else the machine width
could be doing from the rest of the code.
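
One way to picture an "exposed FSM" for divide is as ordinary code the wide machine can interleave with other work; a generic radix-2 restoring divider as a Python sketch (my example, nothing Mill-specific):

def udiv(dividend, divisor, bits=32):
    """Each loop iteration is one former 'FSM state', now just regular instructions."""
    assert divisor != 0
    rem = quo = 0
    for i in reversed(range(bits)):
        rem = (rem << 1) | ((dividend >> i) & 1)  # shift in the next dividend bit
        quo <<= 1
        if rem >= divisor:                        # trial subtract
            rem -= divisor
            quo |= 1
    return quo, rem                               # e.g. udiv(7, 2) == (3, 1)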

> All of which kinda helps make my point, that an inorder design has to turn into a multi headed hydra monster of design disproportionate complexity, in order (ha ha) to work around in-order inherent fundamental limitations.

I guess that's true if you try to tack in-order as an on-top-of addition
to the design. Try just using it as a way to discard complexity instead.
The Mill is stunningly simple once you wrap your head around everything
that we *don't* do :-)

I do appreciate your (or rather Mitch's) scoreboard design. It's a
clever way to get some parallelism from a one- or two-way-issue machine,
and far better than classic OOO in my (admittedly ignorant)
understanding. I suspect that it would be competitive with our Tin
configuration, which is wider but has to pay for static scheduling, and
the hardware cost in area and power would be similar.

It's not clear that it scales though.

MitchAlsup
May 24, 2019, 5:29:33 PM

I could quote myself: "You see: Mill really IS different".......

As to detecting and stalling--these were found to be "Not so bad" in the
1-wide IO early RISC machines. And found to be "Bad" in the 2-wide partially
ordered generation. By the time we were doing 4-6-wide OoO machines, the
only real problem was integrating FU busy into the Reservation Station
logic "Don't pick an instruction if the unit is busy". I, personally, did
not find this one difficult at all.

As to masking interrupts during ATOMIC events--I never found this necessary
--quite possibly because my ATOMIC stuff has <at least modest> forward progress
guarantees.

As to memory ordering--in my GBOoO machine we used a Memory Dependency Matrix
(similar to what Luke is doing for calculations) to resolve memory order
based on <lower order> address bit equality. We could detect "can't possibly be
the same" to resolve a conflict, and "ate" the ones we could not be sure of.
The vast majority of the time, there were no conflicts and no delay in
delivering the memory value.

joshua.l...@gmail.com
May 25, 2019, 11:15:44 AM

On Friday, 24 May 2019 20:38:22 UTC+1, lkcl wrote:
> On Wednesday, May 1, 2019 at 1:52:53 PM UTC+8, joshua....@gmail.com wrote:
>
> > into smaller, distributed tables that are cheaper to access. CG-OoO
> > leverages compiler-level code optimizations to deliver efficient static
> > code and exploits dynamic block-level and instruction-level parallelism.
>
> Hm, this is a thumbs down for use in general purpose computing, unfortunately.
> Whilst yes, compiler optimisations can be expected, if the architecture is to take off, tying it to a compiler is a mistake.
> Different hardware implementors will do different things, and performance will be penalised because the implementation does not match precisely what the compiler wants.

I'll note that the requirements are fairly lightweight, see 3.1.2.

FWIW compiled code compatibility is a cardinal sin, and should never have been a thing.

> > does everything OK, and it circumvents a lot of the challenges of both OoO
> > processors and in-order processors by construction.
>
> This is a perspective that, having begun my first core as a 6600-based engine (with thanks to Mitch for being exceptionally patient and helpful), I am finding it much easier both to understand as well as to work with, even compared to standard pipelined inorder designs.

I don't know enough about this to judge Ivan's “It's a
clever way to get some parallelism from a one- or two-way-issue machine” comment, but if so it's not really a valid alternative. CG-OoO is

joshua.l...@gmail.com
May 25, 2019, 11:19:02 AM

> clever way to get some parallelism from a one- or two-way-issue machine” comment, but if so it's not really a valid alternative. CG-OoO is...

...competing with today's wide OoO cores.

lkcl
May 25, 2019, 2:44:24 PM

On Saturday, May 25, 2019 at 5:25:44 AM UTC+8, Ivan Godard wrote:
> On 5/24/2019 1:29 PM, lkcl wrote:
> > On Saturday, May 25, 2019 at 4:10:11 AM UTC+8, Ivan Godard wrote:
> >
> >> > * Branches have to be observed:
> >> no stall with skid buffers
> >> > * DIV and other FSM style blocking units are hard to integrate, as
> >> they block dependent instructions.
> >> no stall with exposed FSM
> >
> > What's the complexity cost on each of these? Both in terms of understandability as well as implementation details.
>
> Skid buffers: complexity trivial (capture FU outputs into ping-pong
> buffers; restore on branch mispredict.

I found a different use of skid buffers, to do with cache lines, glad you clarified.

Tricks like this - considered workarounds in the software world - are precisely why I do not like inorder designs.

The augmented 6600 scoreboard has enough information to actually interleave both paths (currently implementing), cancelling one entire path once the branch is known.

This may even be nested indefinitely (multiple branches, multiple paths).

The Function Units basically *are* the skid buffers (for free).

> Exposed FSM: it's just code in the regular ISA; not even microcode

Yuk :) again, application recompilation.

> > Optimistic atomics sound scary as hell.
>
> Transactional semantics have been around for decades; remember COBOL?
> IBM has it in hardware; Intel tried it for a while but there was too
> much mess trying to support both kinds of atomicity simultaniously (I
> understand)

Need to do a bit more research before being able to assess.

> > Skid buffers I have not heard of so need to look up (itself not a good sign).
>
> These address the delay between execution of a branch and execution of
> the branched-to, and incorporates any address arithmetic and the
> confirmation of the executed branch against the prediction. This delay
> seems to be two cycles typical, leading to delay-slot designs.

See above. Sounds complex..

> An alternative is to speculate down the predicted path and back up if
> wrong, absorbing the back-up in the mis-predict overhead, but this
> requires saving the state to back up to

And in an inorder design that gets more and more complex, doesn't it?

Whereas in the augmented 6600 design all you need is a latch per Function Unit per branch to be speculated, plus a few gates per each FU.

These gates hook into the "commit" phase, preventing register write (on all shadowed instructions), so no damage may occur whilst waiting for the branch computation to take place. It hooks *directly* into *already existing* write hazard infrastructure basically.

Fail causes the instruction to self destruct.

Success frees up the write hazard.

It's real simple, given that the infrastructure is already there.

> As the amount of (potential)
> back up is statically fixed, the pipeline can be arranged to save its
> state at every cycle into a list of buffers (or shift regs) of length
> equal to the branch realization delay. This are known as skid buffers,
> analogous to a car skidding as it makes a hard stop.
>
> Skid buffers do not work well if the rest of the design is not
> idempotent, because any unreversible state changes would require special
> handling t back up. Thus if the design uses genRegs it cannot be
> permitted to overwrite a register during the brief speculation period;
> that would require skid buffers for every register.

And that is precisely and exactly what the 6600 style scoreboard does. It provides the direct equivalent of a skid buffer for every single register or other object it is designed to protect and manage.


> However when all or
> most actions during the speculation period are idempotent it is only
> necessary to retain the transient mutable state, typically the FU outputs.

Yes. The Computation Units (one per FU) have register latches and also an FU output latch.

If you want more speculation or more "free hidden reg renaming", you just add more FUs and load them up more (multi issue)

> The Mill hardware uses skid buffering in two places: by having the
> physical belt be twice the size of the logical belt, and in the store
> request stream in the hierarchy. There is actually one operation
> (rescue) that can address into the belt skid buffers to recover
> operands that would otherwise have fallen off the belt.
>
> > Not heard the term "exposed FSM" used until today.
>
> Just take the FSM out of the microcode and emulate it in the normal ISA
> (probably augmented with a few helper operations). The DIV (etc) op's
> code then interleaves/overlaps with everything else the machine width
> could be doing from the rest of the code.
>
> > All of which kinda helps make my point, that an inorder design has to turn into a multi headed hydra monster of design disproportionate complexity, in order (ha ha) to work around in-order inherent fundamental limitations.
>
> I guess that's true if you try to tack in-order as an on-top-of addition
> to the design. Try just using it as a way to discard complexity instead.
> The Mill is stunningly simple once you wrap your head around everything
> that we *don't* do :-)

:)

The one thing that I really miss from genRegs standard instruction sets is "tags" that say that a register is to be discarded after use.

Context, basically.

RISCV tries adding something like this, using FENCE instructions.

> I do appreciate your (or rather Mitch's) scoreboard design. It's a
> clever way to get some parallelism from a one- or two-way-issue machine,

I'm working on a way to get up to 4 issue, and 8 issue Vector Processing should be easy (due to the regular nature of the vectors).

Multi issue "the simple way" is I believe doable by detecting whether the src or dest regs overlap. In any given group of to-be-multi-issued instructions, at the first sign of an overlap, STOP at the instruction that would conflict. Let that be the point at which the *NEXT* cycle begins, and let the Dep Matrices sort it out (in that next cycle).

So with only non-overlapping instructions going into the Dependency Matrices in parallel, no conflicting paths will occur because the Matrices will not have any conflicts occur in either a row *or* a column, because you already made sure that would not happen by only allowing non-overlapping regs to enter the Matrices.

Quite simple and effective (and not scalable, see below)
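
As a rough software sketch of that grouping rule (unary register masks; the real thing is combinatorial logic, not a loop, and the names are made up):

/* Stop-at-first-overlap grouping for multi-issue, as a C sketch.
   rd/wr are unary (one bit per register) masks. */
#include <stdint.h>

struct issue_insn {
    uint64_t rd;    /* source registers      */
    uint64_t wr;    /* destination registers */
};

/* Returns how many of the next `count` instructions may enter the
   Dependency Matrices together this cycle: scan in program order, stop
   at the first one whose regs overlap anything already in the group. */
static int group_size(const struct issue_insn *in, int count)
{
    uint64_t rd_seen = 0, wr_seen = 0;
    int n;
    for (n = 0; n < count; n++) {
        /* RAW, WAR and WAW overlap against the group so far */
        if ((in[n].rd & wr_seen) || (in[n].wr & rd_seen) ||
            (in[n].wr & wr_seen))
            break;
        rd_seen |= in[n].rd;
        wr_seen |= in[n].wr;
    }
    return n;   /* 0..n-1 go in now; n is where the NEXT cycle begins */
}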

It should be logically easy to notice that vector instructions should not conflict as much, and the logic required to detect longer runs should also be easier than for genRegs.

To resolve (preserve) the instruction commit order, I have hooked into the *preexisting* branch/precise/shadow system, using it in effect to create a 2D bitmatrix version of a ROB / cyclic instruction buffer.

Working out if an instruction may commit in this new multi-commit context is a matter of keeping a counter per instruction.

Increment every issued instruction by 1 when a new instruction is added. Decrement when an instruction is retired.

It should be evident that in a multi issue context, "retire possible" is a simple comparison "if counter less than or equal to number of instructions allowed to commit".

In this way several instructions can be retired even though they have this bitmatrix pseudo-linkedlist which would otherwise only allow 1 of them to retire in any given clock.
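
A tiny C model of one possible reading of that counter scheme (the shadow bit-matrix that actually presents candidates oldest-first is not modelled, and all names are illustrative):

/* One reading of the per-instruction commit counter: `older` counts older,
   not-yet-retired instructions; up to `commit_width` may go per clock. */
#include <stdbool.h>

#define WINDOW 32

struct slot {
    bool valid;   /* instruction in flight                      */
    bool done;    /* result complete, waiting to be committed   */
    int  older;   /* older instructions still ahead of this one */
};

static int retire_cycle(struct slot win[WINDOW], int commit_width)
{
    int retired = 0;
    /* Counters are unique ranks, so walk them oldest-first. */
    for (int rank = 0; rank < commit_width; rank++) {
        int idx = -1;
        for (int i = 0; i < WINDOW; i++)
            if (win[i].valid && win[i].older == rank) { idx = i; break; }
        if (idx < 0 || !win[idx].done)
            break;                      /* preserve program order */
        win[idx].valid = false;         /* "retire possible": rank < width */
        retired++;
    }
    /* Everything still in flight has `retired` fewer older instructions. */
    for (int i = 0; i < WINDOW; i++)
        if (win[i].valid && win[i].older >= retired)
            win[i].older -= retired;
    return retired;
}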


> and far better than classic OOO in my (admittedly ignorant)
> understanding.

State of the art revolves around Tomasulo. Multi-issue Tomasulo is a pig to implement.

> I suspect that it would be competitive with our Tin
> configuration, which is wider but has to pay for static scheduling, and
> the hardware cost in area and power would be similar.
>
> It's not clear that it scales though.

Going into genRegs 6 or 8 issue territory (vectors are ok) would need a much more complex register overlap detection and mitigation algorithm than I am prepared to investigate at this very early stage of development.

Up to 3 or 4 however really should not be hard, and not a lot of cost ie the simple reg overlap detection should give a reasonably high bang for the buck, only starting to be ineffective above 4 instruction issue.

For schemes going beyond that, which I have only vaguely dreamed up, I expect a lot of combinatorial ripple through the Matrices, which gives me concern about the impact on max clock rate.

Overall, then, the summary is that all the tricks an in-order pipelined general-purpose-register design has to deploy still all have to be done, and done at least once each. Going beyond once for each "trick" (skid buffering) is so hairy that it is rarely done. Early-out pipelining messes with the scheduling so badly that nobody considers it.

By contrast, with the augmented 6600 design, all the tricks are still there: it's just that with the Dependency Matrices taking care of identifying hazards (all hazards), ALL units are free and clear to operate not only in parallel but also on a time completion schedule of their own choosing. WITHOUT stalling.

L.

lkcl

unread,
May 25, 2019, 2:56:33 PM5/25/19
to
It is an isolated self contained algorithm problem which is easily defined and has no need to take into account side effects.

With multiple Function Units being able to accept instructions, a Priority Picker based on which ones are still free should be perfectly sufficient.

Of course, now that it is a MULTI priority picker ("give me the N FUs out of M total as long as they are not busy"), it is a leeeetle more involved.

Yet, at the same time, it is the sort of thing that would make an excellent 1st year EE Exam question.
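
Something like this in software terms (in hardware it is a chain of find-first-set pickers; the names are made up):

/* "Give me up to n free FUs out of m": pick lowest-numbered free units. */
#include <stdint.h>

static uint64_t pick_n_free(uint64_t free_mask, int n)
{
    uint64_t picked = 0;
    for (int i = 0; i < n && free_mask != 0; i++) {
        uint64_t lowest = free_mask & (~free_mask + 1); /* lowest set bit */
        picked    |= lowest;
        free_mask &= ~lowest;
    }
    return picked;   /* at most n bits set: the FUs chosen this cycle */
}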

> As to masking interrupts during ATOMIC events--I never found this necessary
> --quite possibly because my ATOMIC stuff has <at least modest> forward progress
> guarantees.

Sigh, atomics I have not gotten into yet. Memory atomics I am hoping can hook into the LDST Matrix, and for RISCV's LR/SC I believe the "allowed to fail" rule should come to the rescue.

> As to memory ordering--in my GBOoO machine we used a Memory Dependency Matrix
> (similar to what Luke is doing for calculations) to resolve memory order
> based on <lower order> address bit equality. We could detect "can't possibly be
> the same" to resolve a conflict, and "ate" the ones we could not be sure of.

So, basically, if the lower order bits match, then it *might* be the same address (in the high bits), so rather than risk it, treat it as a clash anyway?

This would save a LOT by not having to do a huge number of full bitwise address compares, all in parallel.

> The very vast majority of the time, there were no conflicts and no delay in
> delivering the memory value.

Cool. Will definitely use that trick :)
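
Just to check I have it right, a sketch of the test, assuming 4K pages purely for illustration:

/* "Can't possibly be the same" test on untranslated page-offset bits only.
   If even the cheap low-order bits differ, the full addresses must differ,
   so no full compare (and no TLB result) is needed; if they match, treat
   the pair as a conflict and order the accesses conservatively. */
#include <stdbool.h>
#include <stdint.h>

#define PAGE_OFFSET_BITS 12                    /* assumes 4K pages */
#define OFFSET_MASK ((UINT64_C(1) << PAGE_OFFSET_BITS) - 1)

static bool cannot_possibly_alias(uint64_t addr_a, uint64_t addr_b)
{
    return (addr_a & OFFSET_MASK) != (addr_b & OFFSET_MASK);
}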

L.

lkcl

unread,
May 25, 2019, 3:09:54 PM5/25/19
to
On Saturday, May 25, 2019 at 11:15:44 PM UTC+8, joshua....@gmail.com wrote:

> FWIW compiled code compatibility is a cardinal sin, and should never have been a thing.

Sigh... should not have been, but the reality is, in the x86 (Windows) and ARM (Android) worlds, they are.

Android being based on java it was supposed not to be a thing, however java is so bad everyone does native ARM binaries.

Apple fascinatingly has gone over to LLVM, presumably in an extremely long strategic plan to move away from dependence on ARM *or* x86.

They learned the lesson I think from the pain of the transition to and from PowerPC, and now can have the same apps run on desktop as on mobile.

> I don't know enough about this to judge Ivan's “It's a
> clever way to get some parallelism from a one- or two-way-issue machine” comment, but if so it's not really a valid alternative. CG-OoO is

You will have seen my reply to Ivan just before this, I will investigate how to go beyond 4 issue in the next iteration, it is already so much to do.

Vector processing should be easy to do at least 8 issue, even potentially 16 issue on 16 bit values, because whilst the frontend is vector, the backend is SIMD ALUs.

This is starting to go a bit beyond a traditional general-purpose processor and out of the scope of this thread, so I will not elaborate further.

Point being, Ivan may have reasonably assumed that all 6600 style designs are 1 or 2 issue, because that's what the original 6600 was, back in the 1960s: single issue.

L.

MitchAlsup

unread,
May 25, 2019, 3:27:10 PM5/25/19
to
Multi-issue should begin with the current state of the Read reservations and
of the write reservations (and the FU_busy).

As each instruction is considered for issue, you take its read reservations
and OR it onto the current read reservations, and likewise for the write
reservations.

Thus, by the time you decide to issue {1,2,3,4,5} you already HAVE the
read and write reservations for that FU you are issuing into transitively
through the whole set of issuing instructions.

>
> So with only non-overlapping instructions going into the Dependency Matrices in parallel, no conflicting paths will occur because the Matrices will not have any conflicts occur in either a row *or* a column, because you already made sure that would not happen by only allowing non-overlapping regs to enter the Matrices.
>
> Quite simple and effective (and not scalable, see below)
>
> It should be logically easy to notice that vector instructions should not conflict as much, and the logic required to detect longer runs should also be easier than for genRegs.
>
> To resolve (preserve) the instruction commit order, I have hooked into the *preexisting* branch/precise/shadow system, using it in effect to create a 2D bitmatrix version of a ROB / cyclic instruction buffer.
>
> Working out if an instruction may commit in this new multi-commit context is a matter of keeping a counter per instruction.
>
> Increment every issued instruction by 1 when a new instruction is added. Decrement when an instruction is retired.
>
> It should be evident that in a multi issue context, "retire possible" is a simple comparison "if counter less than or equal to number of instructions allowed to commit".
>
> In this way several instructions can be retired even though they have this bitmatrix pseudo-linkedlist which would otherwise only allow 1 of them to retire in any given clock.
>
>
> > and far better than classic OOO in my (admittedly ignorant)
> > understanding.
>
> State of the art revolves around Tomasulo. Multi-issue Tomasulo is a pig to implement.
>
> > I suspect that it would be competitive with our Tin
> > configuration, which is wider but has to pay for static scheduling, and
> > the hardware cost in area and power would be similar.
> >
> > It's not clear that it scales though.
>
> Going into genRegs 6 or 8 issue territory (vectors are ok) would need a much more complex register overlap detection and mitigation algorithm than I am prepared to investigate at this very early stage of development.
>
> Up to 3 or 4 however really should not be hard, and not a lot of cost ie the simple reg overlap detection should give a reasonably high bang for the buck, only starting to be ineffective above 4 instruction issue.

1-issue just uses the current read and write reservations
2-issue uses the current for inst 1 and current OR inst 1 reservations for
inst 2.
3-issue uses the current for inst 1 and current OR inst 1 reservations for
inst 2 and current OR inst 1&2 reservations for inst 3
4-issue uses the current for inst 1 and current OR inst 1 reservations for
inst 2 and current OR inst 1&2 reservations for inst 3 and current
OR inst 1&2&3 for inst 4
5-issue uses the current for inst 1 and current OR inst 1 reservations for
inst 2 and current OR inst 1&2 reservations for inst 3 and current
OR inst 1&2&3 for inst 4 and current OR inst 1&2&3&4 for inst 5

Thus, 5-issue is only 1 gate <delay> harder than 1-issue over where one is
keeping track of data and control flow dependencies.
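
In software-model form the whole pass is just a running OR (names below are illustrative only):

/* Cumulative read/write reservations for an issue group of n instructions.
   Slot k is checked against the current reservations OR'd with those of
   slots 0..k-1, so the transitive set is built in one pass. */
#include <stdint.h>

struct rsv { uint64_t rd, wr; };   /* unary register masks */

static void build_issue_reservations(uint64_t cur_rd, uint64_t cur_wr,
                                     const struct rsv *group, int n,
                                     struct rsv seen[])
{
    for (int k = 0; k < n; k++) {
        seen[k].rd = cur_rd;        /* what slot k must be checked against */
        seen[k].wr = cur_wr;
        cur_rd |= group[k].rd;      /* one extra OR per additional slot    */
        cur_wr |= group[k].wr;
    }
}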

MitchAlsup

unread,
May 25, 2019, 3:29:57 PM5/25/19
to
Yes, we used the bits that were not even translated by the TLB to make this
determination. This also avoids TLB page aliasing issues.

MitchAlsup

unread,
May 25, 2019, 3:33:40 PM5/25/19
to
On Saturday, May 25, 2019 at 2:09:54 PM UTC-5, lkcl wrote:
> On Saturday, May 25, 2019 at 11:15:44 PM UTC+8, joshua....@gmail.com wrote:
>
> > FWIW compiled code compatibility is a cardinal sin, and should never have been a thing.

All you have to do is have everybody give you their source code...........
>
> Sigh... should not have been, but the reality is, in the x86 (Windows) and ARM (Android) worlds, they are.

I would not write off backwards compatibility so lightly.
>
> Android being based on java it was supposed not to be a thing, however java is so bad everyone does native ARM binaries.
>
> Apple fascinatingly has gone over to LLVM, presumably in an extremely long strategic plan to move away from dependence on ARM *or* x86.
>
> They learned the lesson I think from the pain of the transition to and from PowerPC, and now can have the same apps run on desktop as on mobile.

They can do this, they have the source code.
>
> > I don't know enough about this to judge Ivan's “It's a
> > clever way to get some parallelism from a one- or two-way-issue machine” comment, but if so it's not really a valid alternative. CG-OoO is
>
> You will have seen my reply to Ivan just before this, I will investigate how to go beyond 4 issue in the next iteration, it is already so much to do.

Perhaps you should look at chapter 9 where I show the packet cache FETCH
organization. Not only do you get wide instruction issue, but you get
search-free multiple taken branches per cycle.

joshua.l...@gmail.com

unread,
May 25, 2019, 5:28:00 PM5/25/19
to
On Saturday, 25 May 2019 20:33:40 UTC+1, MitchAlsup wrote:
>
> All you have to do is have everybody give you their source code...........

This is one of those problems that's only hard because
we've built a ton of stuff that assumes it's hard.
You don't need to ship all that much metadata to compile
to all but the most foreign of architectures; WASM
is already pretty close and that has incredibly little,
plus suffers from tight compile-time deadlines. Designed
right the performance penalty should be <1% I expect.

John Dallman

unread,
May 25, 2019, 5:48:31 PM5/25/19
to
In article <8f5961be-6303-42e3...@googlegroups.com>,
joshua.l...@gmail.com () wrote:

> You don't need to ship all that much metadata to compile
> to all but the most foreign of architectures; WASM
> is already pretty close and that has incredibly little,
> plus suffers from tight compile-time deadlines. Designed
> right the performance penalty should be <1% I expect.

However, if you're trying to optimise code at all hard, you start hitting
*bugs* in optimizers.

So it's necessary to test with every combination of compiler version and
architecture that your customers will run with. That's an N x M scale of
problem. Whereas if you compile it yourself, you can freeze the compiler
version (this takes some effort these days, but remains practical) it's
only an N-scale problem.

This approach will actually be practical for code that needs a reasonable
fraction of the achievable performance when a method has been
demonstrated, and proven in practice, of producing compilers with bug
frequencies at least two orders of magnitude better than current practice.
I'm not expecting this any time soon.

John

joshua.l...@gmail.com

unread,
May 25, 2019, 5:59:02 PM5/25/19
to
This argument would imply Java, Javascript, etc. are infeasible. In practice the architecture-specific parts are rarely the source of bugs.

Ivan Godard

unread,
May 25, 2019, 7:00:33 PM5/25/19
to
On 5/25/2019 11:44 AM, lkcl wrote:
> On Saturday, May 25, 2019 at 5:25:44 AM UTC+8, Ivan Godard wrote:
>> On 5/24/2019 1:29 PM, lkcl wrote:
>>> On Saturday, May 25, 2019 at 4:10:11 AM UTC+8, Ivan Godard wrote:
>>>
>>>> > * Branches have to be observed:
>>>> no stall with skid buffers
>>>> > * DIV and other FSM style blocking units are hard to integrate, as
>>>> they block dependent instructions.
>>>> no stall with exposed FSM
>>>
>>> What's the complexity cost on each of these? Both in terms of understandability as well as implementation details.
>>
>> Skid buffers: complexity trivial (capture FU outputs into ping-pong
>> buffers; restore on branch mispredict).
>
> I found a different use of skid buffers, to do with cache lines, glad you clarified.
>
> Tricks like this - considered workarounds in the software world - are precisely why I do not like inorder designs.

Hardly a trick - remember, high end Mills are *much* bigger than what
you are considering. It takes more than one cycle for a stall signal to
propagate across the core, so we have to provide a way to deal with
parts that have already done what they shouldn't have.

You use issue replay, so you have to synchronize retire; Mitch's system
lets you do that, but the retire point is central and you have to deal
with the delay between knowledge of completion and retire. Because you
*assume* genRegs you think that the delay is zero because it's all at
the RF, but there really is the time to get FU-result back.

Mill is *much* more asynchronous. It uses static scheduling so that it
is always known that an input will always be available when and where it
is needed, without synchronization. But then it has to deal with inputs
being created that are not in fact needed, or which are not needed yet
but will be needed after the interrupt. For that we use result replay,
which completely changes how the pipe timings work and removes all the
synch points. Those skid buffers hold operands that have already been
computed but which will not be used until later.

> The augmented 6600 scoreboard has enough information to actually interleave both paths (currently implementing), cancelling one entire path once the branch is known.

We're a statically scheduled design, so all the interleave is present
but the work is done at compile time. I'm a compiler guy so I think
that's simpler than any runtime mechanism; you're a hardware guy so no
doubt you feel the reverse :-)

> This may even be nested indefinitely (multiple branches, multiple paths).
>
> The Function Units basically *are* the skid buffers (for free).

They are for us too, except that we can buffer more than one result from
an FU in the latency latches before we ever need to move to spiller buffers.

>> Exposed FSM: it's just code in the regular ISA; not even microcode
>
> Yuk :) again, application recompilation.

Depends on what the meaning of "compilation" is :-) Our distribution
format coming from the compiler is target-independent; no recomp is
needed. It is specialized to the actual target at install time. This is
no different than microcode, except that our translation occurs once at
install, whereas with microcode the translation occurs at runtime on every
instruction.

Ours is cheaper. :-)

>>> Optimistic atomics sound scary as hell.
>>
>> Transactional semantics have been around for decades; remember COBOL?
>> IBM has it in hardware; Intel tried it for a while but there was too
>> much mess trying to support both kinds of atomicity simultaneously (I
>> understand)
>
> Need to do a bit more research before being able to assess.
>
>>> Skid buffers I have not heard of so need to look up (itself not a good sign).
>>
>> These address the delay between execution of a branch and execution of
>> the branched-to, and incorporate any address arithmetic and the
>> confirmation of the executed branch against the prediction. This delay
>> seems to be two cycles typical, leading to delay-slot designs.
>
> See above. Sounds complex..

Not really; it's just a longer physical belt extending into the spiller,
which is able to replay results as if they were coming from the original
FU.

Say you have a hardware divide with a 14-cycle latency. Issue the divide
and say the next bundle contains a taken call, which does 50 cycles of
stuff. 14 physical cycles after issue the div FU spits out a result, but
that result is only due 14 logical (in the issue frame) cycles after
issue, and because of the call, we are only one logical cycle on in the
issue frame. So the result operand stays in the div output latch, or, if
that latch winds up needed, will migrate to a holding buffer in the
spiller.

Eventually the call returns, and the spiller content gets replayed in
logical (same frame) retire cycles. Absent more calls or interrupts, 13
physical (and logical) cycles later the div result in the spiller will get
dropped on the belt. The consumers of that operand cannot tell that the
div result took a side path through the spiller while the called
function was running; everything in all the datapaths is completely as
if the call had never happened.

Hence we get to overlap execution with control flow just like an OOO
design does, except that we have no rename regs or the rest of the OOO
machinery. I do agree with you that when an IO design forces state synch
at every issue (or retire, for Mitch) then there is a performance
penalty. However, that penalty does not arise from IO, it arises from
the state synch. Once you realize that the synchronization is not
inherent in IO then all the retire penalty goes away.

In fairness, I admit that the IO issue may incur a different penalty
that an OOO may avoid. We use partial call inline and other methods for
that, but it's easy to find examples where an OOO will gain a few
percent on an equally-provisioned Mill. That's one of the reasons why we
are wider than any practical OOO: if the few percent matter in your
actual app then just go for the next higher Mill family member.

>> An alternative is to speculate down the predicted path and back up if
>> wrong, absorbing the back-up in the mis-predict overhead, but this
>> requires saving the state to back up to
>
> And in an inorder design that gets more and more complex, doesn't it?

Everything is complex until you understand it. We here have watched
Mitch work you through to understanding scoreboards; I suspect if you
worked your way through the Mill you'd find it easy too. The hard part
will be abandoning your preconceptions. But that's true for all of us,
isn't it?

> Whereas in the augmented 6600 design all you need is a latch per Function Unit per branch to be speculated, plus a few gates per each FU.
>
> These gates hook into the "commit" phase, preventing register write (on all shadowed instructions), so no damage may occur whilst waiting for the branch computation to take place. It hooks *directly* into *already existing* write hazard infrastructure basically.
>
> Fail causes the instruction to self destruct.
>
> Success frees up the write hazard.
>
> It's real simple, given that the infrastructure is already there.

Yeah, I know. It's a clever way to do retire-time sync in a design that
has to worry about hazards. Of course, why do retire-time synch? And why
have hazards?
It's very good within its assumptions.

>> and far better than classic OOO in my (admittedly ignorant)
>> understanding.
>
> State of the art revolves around Tomasulo. Multi-issue Tomasulo is a pig to implement.

I agree.

>> I suspect that it would be competitive with our Tin
>> configuration, which is wider but has to pay for static scheduling, and
>> the hardware cost in area and power would be similar.
>>
>> It's not clear that it scales though.
>
> Going into genRegs 6 or 8 issue territory (vectors are ok) would need a much more complex register overlap detection and mitigation algorithm than I am prepared to investigate at this very early stage of development.
>
> Up to 3 or 4 however really should not be hard, and not a lot of cost ie the simple reg overlap detection should give a reasonably high bang for the buck, only starting to be ineffective above 4 instruction issue.

Impressive if you can; I'm too ignorant to say.

> For schemes going beyond that, which I have only vaguely dreamed up, I expect a lot of combinatorial ripple through the Matrices, which gives me concern about the impact on max clock rate.
>
> Overall, then, the summary is that all the tricks an in-order pipelined general-purpose-register design has to deploy still all have to be done, and done at least once each. Going beyond once for each "trick" (skid buffering) is so hairy that it is rarely done. Early-out pipelining messes with the scheduling so badly that nobody considers it.
>
> By contrast, with the augmented 6600 design, all the tricks are still there: it's just that with the Dependency Matrices taking care of identifying hazards (all hazards), ALL units are free and clear to operate not only in parallel but also on a time completion schedule of their own choosing. WITHOUT stalling.

Choices are easy when everything you don't understand yet can be
rejected as being too complicated or tricky by definition :-)

John Dallman

unread,
May 25, 2019, 7:13:50 PM5/25/19
to
In article <4b3ec863-949b-45df...@googlegroups.com>,
joshua.l...@gmail.com () wrote:

> This argument would imply Java, Javascript, etc. are infeasible.

Oddly enough, those are rarely used for writing numerically intensive,
iterative, performance-critical code, and they don't try to optimise
nearly as hard as GCC, Clang or MSVC do

> In practice the architecture-specific parts are rarely the source of
> bugs.

Indeed not. The problems are usually with one compiler getting confused
about something that others are happy with, or occasionally about the
behaviour of the machine it's compiling for, in code that isn't the
slightest bit architecture-specific.

Have you actually ever dug out and reported compiler bugs?

John

lkcl

unread,
May 25, 2019, 7:19:04 PM5/25/19
to
On Sunday, May 26, 2019 at 7:00:33 AM UTC+8, Ivan Godard wrote:

> In fairness, I admit that the IO issue may incur a different penalty
> that an OOO may avoid. We use partial call inline and other methods for
> that, but it's easy to find examples where an OOO will gain a few
> percent on an equally-provisioned Mill. That one of the reasons why we
> are wider than any practical OOO: if the few percent matter in your
> actual app then just go for the next higher Mill family member.

and stomp all over the OoO design, in a scalable fashion, without too much in the way of design compromises. nice.

> >> An alternative is to speculate down the predicted path and back up if
> >> wrong, absorbing the back-up in the mis-predict overhead, but this
> >> requires saving the state to back up to
> >
> > And in an inorder design that gets more and more complex, doesn't it?
>
> Everything is complex until you understand it. We here have watched
> Mitch work you through to understanding scoreboards; I suspect if you
> worked your way through the Mill you'd find it easy too. The hard part
> will be abandoning your preconceptions. But that's true for all of us,
> isn't it?

If I had an inkling that Mill could have genRegs dropped on it to do e.g RISCV or MIPS binaries with no design compromises, I would probably have taken you up on that :)


> > It's real simple, given that the infrastructure is already there.
>
> Yeah, I know. It's a clever way to do retire-time sync in a design that
> has to worry about hazards. Of course, why do retire-time synch? And why
> have hazards?

Ha, very true! Unfortunately, I decided to tag on to RISCV (still considering MIPS) which inherently brings arbitrarily compiled binaries beyond the hw designer control, genRegs and thus hazards... sigh

L.



Ivan Godard

unread,
May 25, 2019, 7:32:33 PM5/25/19
to
On 5/25/2019 4:19 PM, lkcl wrote:
> On Sunday, May 26, 2019 at 7:00:33 AM UTC+8, Ivan Godard wrote:
>
>> In fairness, I admit that the IO issue may incur a different penalty
>> that an OOO may avoid. We use partial call inline and other methods for
>> that, but it's easy to find examples where an OOO will gain a few
>> percent on an equally-provisioned Mill. That one of the reasons why we
>> are wider than any practical OOO: if the few percent matter in your
>> actual app then just go for the next higher Mill family member.
>
> and stomp all over the OoO design, in a scalable fashion, without too much in the way of design compromises. nice.
>
>>>> An alternative is to speculate down the predicted path and back up if
>>>> wrong, absorbing the back-up in the mis-predict overhead, but this
>>>> requires saving the state to back up to
>>>
>>> And in an inorder design that gets more and more complex, doesn't it?
>>
>> Everything is complex until you understand it. We here have watched
>> Mitch work you through to understanding scoreboards; I suspect if you
>> worked your way through the Mill you'd find it easy too. The hard part
>> will be abandoning your preconceptions. But that's true for all of us,
>> isn't it?
>
> If I had an inkling that Mill could have genRegs dropped on it to do e.g RISCV or MIPS binaries with no design compromises, I would probably have taken you up on that :)

What's in a name? A binary, by any other name, would run as well.

(Sorry about that)

It's been a rather public secret that one use for the Mill's width is to
run foreign codes.

>>> It's real simple, given that the infrastructure is already there.
>>
>> Yeah, I know. It's a clever way to do retire-time sync in a design that
>> has to worry about hazards. Of course, why do retire-time synch? And why
>> have hazards?
>
> Ha, very true! Unfortunately, I decided to tag on to RISCV (still considering MIPS) which inherently brings arbitrarily compiled binaries beyond the hw designer control, genRegs and thus hazards... sigh

You really should look into binary translation a bit more.

lkcl

unread,
May 25, 2019, 7:38:04 PM5/25/19
to
On Sunday, May 26, 2019 at 3:27:10 AM UTC+8, MitchAlsup wrote:

> Multi-issue should begin with the current state of the Read reservations and
> of the write reservations (and the FU_busy).
>
> As each instruction is considered for issue, you take its read reservations
> and OR it onto the current read reservations, and likewise for the write
> reservations.

These being in unary (the reg numbers) it is just ORing of vectors.

> Thus, by the time you decide to issue {1,2,3,4,5} you already HAVE the
> read and write reservations for that FU you are issuing into transitively
> through the whole set of issuing instructions.

Transitive: a links b, b links c, therefore a links c.

Surely it can't be that simple. I thought that at the very least it would be necessary to work out the exclusions, instead.

Busy priority picking aside, making I2 depend entirely on I1's read *and* write regs (and so on) would mean that I2 would just be paused until I1 was cleared of hazards.

Setting up the actual dependencies does NOT mean that the regs get read or written.

So if say R1 has a read dependency on I1 and I3, and I2 writes to R1 in between, I2 would ALSO have a write hazard which would prevent I3 from reading R1 until AFTER it had written, yet I1 would...

dang me, I think it actually works.

I have yet to ascertain if any corruption could occur in the implementation, assumptions in the RTL as to the timing on write matrix and read matrix entries, between two clock cycles... I guess I will just find out.

If correct, however, it is so simple in theory that I am genuinely confused as to why this technique is not more widely known or deployed.

L.

MitchAlsup

unread,
May 25, 2019, 8:07:45 PM5/25/19
to
On Saturday, May 25, 2019 at 6:13:50 PM UTC-5, John Dallman wrote:
> In article <4b3ec863-949b-45df...@googlegroups.com>,
> joshua.l...@gmail.com () wrote:
>
> > This argument would imply Java, Javascript, etc. are infeasible.
>
> Oddly enough, those are rarely used for writing numerically intensive,

Because the FP semantics of JAVA are unacceptable.

joshua.l...@gmail.com

unread,
May 25, 2019, 9:03:36 PM5/25/19
to
On Sunday, 26 May 2019 00:13:50 UTC+1, John Dallman wrote:
> In article <4b3ec863-949b-45df...@googlegroups.com>,
> joshua.l...@gmail.com () wrote:
>
> > This argument would imply Java, Javascript, etc. are infeasible.
>
> Oddly enough, those are rarely used for writing numerically intensive,
> iterative, performance-critical code, and they don't try to optimise
> nearly as hard as GCC, Clang or MSVC do

Because Java is a poor language for performance-intensive code,
not any fault of the optimizer.

> > In practice the architecture-specific parts are rarely the source of
> > bugs.
>
> Indeed not. The problems are usually with one compiler getting confused
> about something that others are happy with, or occasionally about the
> behaviour of the machine it's compiling for, in code that isn't the
> slightest bit architecture-specific.

That's what I meant by ‘architecture-specific parts’. This really
isn't common, because it's not the hard part of the problem.
People don't test Javascript on every architecture, and if it does
crash on some other architecture, that's the compiler's problem
to fix.

> Have you actually ever dug out and reported compiler bugs?

Yes, though the only bug I've had in architecture-specific
codegen was already fixed in a newer version of the compiler.

Terje Mathisen

unread,
May 26, 2019, 2:23:55 AM5/26/19
to
I'm pretty sure this is the only general Usenet group with members who
are just as likely to have reported actual CPU hw bugs as compiler bugs. :-)

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Terje Mathisen

unread,
May 26, 2019, 2:26:30 AM5/26/19
to
MitchAlsup wrote:
> On Saturday, May 25, 2019 at 6:13:50 PM UTC-5, John Dallman wrote:
>> In article <4b3ec863-949b-45df...@googlegroups.com>,
>> joshua.l...@gmail.com () wrote:
>>
>>> This argument would imply Java, Javascript, etc. are infeasible.
>>
>> Oddly enough, those are rarely used for writing numerically intensive,
>
> Because the FP semantics of JAVA are unacceptable.
>

The Java FP rules are very close to the opposite end of the spectrum
compared to Fortran.

"Bit-for-bit" binary (as well as bug) compatibility is much more
important than actual performance.

Anton Ertl

unread,
May 26, 2019, 3:57:27 AM5/26/19
to
MitchAlsup <Mitch...@aol.com> writes:
>On Saturday, May 25, 2019 at 2:09:54 PM UTC-5, lkcl wrote:
>> On Saturday, May 25, 2019 at 11:15:44 PM UTC+8, joshua....@gmail.com wrote:
>>
>> > FWIW compiled code compatibility is a cardinal sin, and should never
>> > have been a thing.
>
>All you have to do is have everybody give you their source code...........

That does not help at least in case of C source code, because C
compiler maintainers proudly do not maintain source code
compatibility, not even for the same architecture, much less for a
different one (where I don't expect the same level of compatibility).
I guess that C++ is in the same boat.

Machine code is a much more long-lived distribution format than the
source code of most programming languages. I think that's because it
is commercially much more important for the processor companies than
source code longevity is for the compiler companies, because when a
new compiler version fails to compile the source code as intended,
compiler users just stay with old compiler versions and/or work around
the compiler breakage, while in case of a processor company that's
much harder, so a processor company that does not run binary code for
their old processors gets blackballed and eventually goes out of
business.

And that's why instruction sets tend to be strongly defined, while
compiler maintainers love undefined behaviour.

Note how HP, Apple and others provided binary compatibility in their
architecture transitions.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

John Dallman

unread,
May 26, 2019, 6:09:10 AM5/26/19
to
In article <9015d8fa-1a8d-4728...@googlegroups.com>,
joshua.l...@gmail.com () wrote:

> That's what I meant by _architecture-specific parts_. This really
> isn't common, because it's not the hard part of the problem.

Depends on what you mean by /common/, but if you have a few million lines
of code and test it pretty thoroughly, it happens often enough that
dealing with it is a significant activity.

> People don't test Javascript on every architecture, and if it does
> crash on some other architecture, that's the compiler's problem
> to fix.

It is, but somebody has to get the bug into the compiler organisation,
and I'm the one with the screaming customers, because they've paid my
employers. That means I need to be able to reproduce it, which in an N*M
situation means I need to be able to get hold of the compiler and
hardware combination they're using.

> > Have you actually ever dug out and reported compiler bugs?
> Yes, though the only bug I've had in architecture-specific
> codegen was already fixed in a newer version of the compiler.

I've had around 150 over 24 years in architecture-specific code
generation, very few of them already fixed, many of those in beta
compilers (because it's easier and quicker to get bugs fixed in beta).

John

John Dallman

unread,
May 26, 2019, 6:09:11 AM5/26/19
to
> You don't need to ship all that much metadata to compile
> to all but the most foreign of architectures; WASM
> is already pretty close and that has incredibly little,
> plus suffers from tight compile-time deadlines. Designed
> right the performance penalty should be <1% I expect.

I have not tried WASM, but another part of the company, which does
somewhat similar code on a smaller scale, reports their WASM runs between
a quarter and a ninth of the speed of native code.

John

already...@yahoo.com

unread,
May 26, 2019, 10:01:19 AM5/26/19
to
On non-parallelizable and non-vectorizable source code?
Or non-parallelizable, but vectorizable ?


Stephen Fuld

unread,
May 26, 2019, 10:05:48 AM5/26/19
to
On 5/26/2019 12:41 AM, Anton Ertl wrote:
> MitchAlsup <Mitch...@aol.com> writes:
>> On Saturday, May 25, 2019 at 2:09:54 PM UTC-5, lkcl wrote:
>>> On Saturday, May 25, 2019 at 11:15:44 PM UTC+8, joshua....@gmail.com wrote:
>>>
>>>> FWIW compiled code compatibility is a cardinal sin, and should never
>>>> have been a thing.
>>
>> All you have to do is have everybody give you their source code...........
>
> That does not help at least in case of C source code, because C
> compiler maintainers proudly do not maintain source code
> compatibility, not even for the same architecture, much less for a
> different one (where I don't expect the same level of compatibility).
> I guess that C++ is in the same boat.

Is that true for a commercial C compiler as well as the various free ones?


> Machine code is a much more long-lived distribution format than the
> source code of most programming languages. I think that's because it
> is commercially much more important for the processor companies than
> source code longevity is for the compiler companies, because when a
> new compiler version fails to compile the source code as intended,
> compiler users just stay with old compiler versions and/or work around
> the compiler breakage, while in case of a processor company that's
> much harder, so a processor company that does not run binary code for
> their old processors gets blackballed and eventually goes out of
> business.


I guess my question above posits that it may be the difference between a
commercial company versus freely-available software doing the support,
rather than the difference between hardware and software.


--
- Stephen Fuld
(e-mail address disguised to prevent spam)

already...@yahoo.com

unread,
May 26, 2019, 10:09:23 AM5/26/19
to
I reported quite a few cases of bad optimizations, but no cases of broken language semantics.
The closest to a semantics bug was a case where a gcc attribute didn't do what was expected from a reasonable reading of the imprecise language of the manual. Still, we probably have to put it in the "documentation ambiguity" bag rather than "broken semantics".

already...@yahoo.com

unread,
May 26, 2019, 10:11:58 AM5/26/19
to
On Sunday, May 26, 2019 at 5:05:48 PM UTC+3, Stephen Fuld wrote:
> On 5/26/2019 12:41 AM, Anton Ertl wrote:
> > MitchAlsup <Mitch...@aol.com> writes:
> >> On Saturday, May 25, 2019 at 2:09:54 PM UTC-5, lkcl wrote:
> >>> On Saturday, May 25, 2019 at 11:15:44 PM UTC+8, joshua....@gmail.com wrote:
> >>>
> >>>> FWIW compiled code compatibility is a cardinal sin, and should never
> >>>> have been a thing.
> >>
> >> All you have to do is have everybody give you their source code...........
> >
> > That does not help at least in case of C source code, because C
> > compiler maintainers proudly do not maintain source code
> > compatibility, not even for the same architecture, much less for a
> > different one (where I don't expect the same level of compatibility).
> > I guess that C++ is in the same boat.
>
> Is that true for a commercial C compiler as well as the various free ones?
>

"Commercial C compilers" are dying breed. Even Microsoft's now is a free compiler with non-free add-ons.

John Dallman

unread,
May 26, 2019, 10:35:11 AM5/26/19
to
In article <qce6fq$tta$1...@dont-email.me>, SF...@alumni.cmu.edu.invalid
(Stephen Fuld) wrote:

> Is that true for a commercial C compiler as well as the various
> free ones?

Of the ones I've dealt with in recent years, Microsoft keep a fair
measure of consistency from version to version of their compiler,
although it isn't always perfect.

IBM did not change their C or C++ compilers for AIX very much for a long
time, but are now engaged in replacing the front ends with Clang, while
keeping their own backend. A few hiccups are to be expected.

Oracle decided not to upgrade the classic Solaris C++ run-time to support
C++11, but adopted the G++ run-time instead. That meant that code
compiled as C++11 cannot, as a practical matter, exist in the same
process as code compiled as C++98. It isn't actually impossible, but
there are so many restrictions that it isn't practical.

Oracle also switched to requiring GCC-style C++ stack unwind information
for C++11, which means you can't throw C++11 exceptions through C code
compiled without the -fexceptions option. However, none of the Solaris C
compilers up to Studio 12.5 inclusive /had/ a -fexceptions option. It was
added in an update to 12.5, after we pointed out that the lack of it
meant you couldn't throw C++11 exceptions through C code at all, and this
made it impossible to produce some products on Solaris.

> I guess my question above posits that it may be the difference
> between a commercial company versus freely-available software doing the
> support, rather than the difference between hardware and software.

There used to be characteristically different kinds of behaviour from
free and commercial compilers, but it's now much more of a question of
unique behaviour per supplier.

John

John Dallman

unread,
May 26, 2019, 10:35:11 AM5/26/19
to
In article <86d19ae0-3e41-41d4...@googlegroups.com>,
already...@yahoo.com () wrote:

> > I have not tried WASM, but another part of the company, which does
> > somewhat similar code on a smaller scale, reports their WASM runs
> > between a quarter and a ninth of the speed of native code.
> On non-parallelizable and non-vectorizable source code?
> Or non-parallelizable, but vectorizable ?

I'll ask.

John

joshua.l...@gmail.com

unread,
May 26, 2019, 11:09:12 AM5/26/19
to
Numbers like that suggest it's because WASM doesn't do SIMD yet.

WASM wants to be interpretable, fast to compile to native
code, and fast to compile to semi-efficient Javascript.
A more traditional IR could just leave vectorization
to the backend.

Anton Ertl

unread,
May 26, 2019, 11:11:51 AM5/26/19
to
Stephen Fuld <SF...@alumni.cmu.edu.invalid> writes:
>On 5/26/2019 12:41 AM, Anton Ertl wrote:
>> MitchAlsup <Mitch...@aol.com> writes:
>>> On Saturday, May 25, 2019 at 2:09:54 PM UTC-5, lkcl wrote:
>>>> On Saturday, May 25, 2019 at 11:15:44 PM UTC+8, joshua....@gmail.com wrote:
>>>>
>>>>> FWIW compiled code compatibility is a cardinal sin, and should never
>>>>> have been a thing.
>>>
>>> All you have to do is have everybody give you their source code...........
>>
>> That does not help at least in case of C source code, because C
>> compiler maintainers proudly do not maintain source code
>> compatibility, not even for the same architecture, much less for a
>> different one (where I don't expect the same level of compatibility).
>> I guess that C++ is in the same boat.
>
>Is that true for a commercial C compiler as well as the various free ones?

I assume you mean proprietary C compilers vs. free ones. The most
popular free C compilers are also highly commercial, and that may be
part of the problem (see below).

Anyway, some of the things I heard from proprietary C compilers have
sounded hope-inspiring, while others sounded as bad as gcc and LLVM.
But I have no first-hand experience with these compilers.

There was a talk at 35c3 <https://media.ccc.de/v/35c3-9788-memsad>
about how to reliably and portably erase memory. This talk included
discussions of the situation on proprietary compilers, and IIRC the
situation there is just as bad as for the free ones. BTW, there is so
much to talk about on this topic that the presenter consumed all the
time, including the Q&A slot, just with his presentation.
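
For anyone who has not watched it, the canonical example is a cleanup memset that dead-store elimination may legally delete; a minimal sketch, with the volatile-function-pointer workaround that is commonly suggested (explicit_bzero/memset_s exist but are not universally available):

#include <string.h>

void handle_password(void)
{
    char pw[64];
    /* ... obtain and use pw ... */
    memset(pw, 0, sizeof pw);   /* pw is dead here, so the compiler may
                                   remove this store entirely            */
}

/* Calling memset through a volatile function pointer stops the compiler
   from proving the call side-effect free, so the wipe survives. */
static void *(*const volatile memset_v)(void *, int, size_t) = memset;

void handle_password_scrubbed(void)
{
    char pw[64];
    /* ... obtain and use pw ... */
    memset_v(pw, 0, sizeof pw);
}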

>> Machine code is a much more long-lived distribution format than the
>> source code of most programming languages. I think that's because it
>> is commercially much more important for the processor companies than
>> source code longevity is for the compiler companies, because when a
>> new compiler version fails to compile the source code as intended,
>> compiler users just stay with old compiler versions and/or work around
>> the compiler breakage, while in case of a processor company that's
>> much harder, so a processor company that does not run binary code for
>> their old processors gets blackballed and eventually goes out of
>> business.
>
>
>I guess my question above posits that it may be the difference between a
>commercial company versus freely-available software doing the support,
>rather than the difference between hardware and software.

One theory I have is that the commercial nature and business model of
gcc and LLVM development encourages this kind of behaviour: There is
enough money flowing into gcc and LLVM development that they invest
some of it into "optimization". When the "optimization" breaks some
source code, the paying customers complain and get their regressions
fixed, while the others are advised to "fix" their code; ideally this
kind of maintenance and support will drive up the number of paying
customers.

However, if compilers with a less support-oriented business model
exhibit similar behaviour, as suggested by (my memory of) the 35c3
talk, that's counterevidence for this theory.

Another theory I have is that the C compiler maintainers (maybe not
everyone, but those with policy-making power) consider good benchmark
results more important than compiling existing programs that worked
with an earlier compiler version as intended; the maintainers are in a
filter bubble where all these programs are considered broken, and
everyone who thinks otherwise is, at best, uninformed. And it's not
actual benchmark results (where all these "optimizations" have a
minimal effect that can be replicated with a few small source-code
changes [wang+12]), but just an improvement on some demo code that
they hope will translate into improved performance. (Why did I
mention benchmarks? Because they fix their regressions for
benchmarks, but not (without additional impetus) for other code.)

@InProceedings{wang+12,
  author    = {Xi Wang and Haogang Chen and Alvin Cheung and Zhihao Jia and Nickolai Zeldovich and M. Frans Kaashoek},
  title     = {Undefined Behavior: What Happened to My Code?},
  booktitle = {Asia-Pacific Workshop on Systems (APSYS'12)},
  year      = {2012},
  url1      = {http://homes.cs.washington.edu/~akcheung/getFile.php?file=apsys12.pdf},
  url2      = {http://people.csail.mit.edu/nickolai/papers/wang-undef-2012-08-21.pdf}
}

MitchAlsup

unread,
May 26, 2019, 4:17:28 PM5/26/19
to
On Sunday, May 26, 2019 at 10:11:51 AM UTC-5, Anton Ertl wrote:
> Stephen Fuld <SF...@alumni.cmu.edu.invalid> writes:
> >On 5/26/2019 12:41 AM, Anton Ertl wrote:
Of course, if their compiler does not run benchmarks faster than somebody
else's compiler, they lose a sale.

However, if their compiler does not run the customer applications, they
still have time to fix the problem, and have money in the bank.

joshua.l...@gmail.com

unread,
May 26, 2019, 7:44:34 PM5/26/19
to
On Sunday, 26 May 2019 16:11:51 UTC+1, Anton Ertl wrote:
>
> Another theory I have is that the C compiler maintainers (maybe not
> everyone, but those with policy-making power) consider good benchmark
> results more important than compiling existing programs that worked
> with an earlier compiler version as intended; the maintainers are in a
> filter bubble where all these programs are considered broken, and
> everyone who thinks otherwise is, at best, uninformed. And it's not
> actual benchmark results (where all these "optimizations" have a
> minimal effect that can be replicated with a few small source-code
> changes [wang+12]), but just an improvement on some demo code that
> they hope will translate into improved performance. (Why did I
> mention benchmarks? Because they fix their regressions for
> benchmarks, but not (without additional impetus) for other code.)

This is all a very uncharitable view of things. Much more plausible is that
compiler maintainers consider the language standards a good-enough,
widely-agreed-upon contract to provide to programmers, in a space where
there's no other useful alternative. C and C++ are so memory unsafe that
literally any change to the generated code is visible somehow; there needs
to be *some* specification of what things are guaranteed and what are
incidental, and this specification needs to be something different vendors
can cooperate on, something that has some degree of authority in the eyes
of the programmer, and something that's had a good deal of thought put into
it.

Now, you and I might not like the C standard's specific choice of undefined
behaviours, but surely you'll agree that it's at least good that we have
something on which we can hang our hats. If every compiler had an
arbitrary, different set of guarantees without any kind of minimal subset
one could target in confidence, life would be even more chaotic than it
already is.

Bruce Hoult

unread,
May 26, 2019, 10:42:35 PM5/26/19
to
I co-authored a paper that showed a rather simple JIT can run RISC-V code on x86_64 at around 0.5 to 0.8 of the speed of optimised x86_64 binaries.

https://carrv.github.io/2017/papers/clark-rv8-carrv2017.pdf

Stephen Fuld

unread,
May 28, 2019, 1:51:49 AM5/28/19
to
Thanks John. It will be interesting to see how the commercial companies,
with their (at least somewhat) better long-term compatibility policies,
will deal with the integration of open source pieces, with probably less
of that kind of commitment.

John Dallman

unread,
Jun 3, 2019, 4:18:44 PM6/3/19
to
> > I have not tried WASM, but another part of the company, which does
> > somewhat similar code on a smaller scale, reports their WASM runs
> > between a quarter and a ninth of the speed of native code.
> On non-parallelizable and non-vectorizable source code?
> Or non-parallelizable, but vectorizable ?

When I asked them about it, they took another look, and disowned their
previous figures. An overall time for a large set of tests on 32-bit WASM
in node.js was 50% higher than native code on 64-bit Windows, but there's
considerable variation inside that. They didn't want to take the time to
investigate, since they don't have many WASM customers, and those aren't
complaining.

The code is not doing stuff amenable to auto-parallelisation or
vectorisation.

John

Quadibloc

unread,
Jun 3, 2019, 7:15:15 PM6/3/19
to
On Thursday, May 2, 2019 at 9:34:38 AM UTC-6, MitchAlsup wrote:

> Some comparative data::
> A properly designed 1-wide InOrder machine can achieve just about 1/2 the
> performance of a GreatBigOoO machine at somewhere in the 1/10 to 1/16 the
> area and power.

I have no reason to doubt that this is true.

However, if sixteen processors, running in parallel, are _not_ as valuable as *one* processor that has twice the performance... because most (or many) problems are hard (at least for us at present) to parallelize, then that fact is not terribly useful.

> Given a bit of compiler cleverness (such as software pipelining) one can
> get essentially the same performance as the GBOoO machines at much less
> cost (area and power).

This, on the other hand, is useful.

If it weren't for another fact you've pointed out - that OoO is important for
dealing with cache misses - it would even be "obvious" how to do that.

After all, this is why RISC computers tend to have register banks of 32
registers. So that they can mix together multiple sequences of instructions
using different registers, so that OoO is less necessary for dealing with
register re-use.

Given, though, that today RISC processors are implemented OoO, and the Itanium
has singularly failed to take the world by storm, there must be more to it than
I realize.

If it were possible to nearly match OoO performance at much less cost, and
people knew how to do it, I would expect that this would actually get done. Perhaps naive of me, of course.

John Savard

Bruce Hoult

unread,
Jun 4, 2019, 2:39:15 AM6/4/19
to
On Monday, June 3, 2019 at 4:15:15 PM UTC-7, Quadibloc wrote:
> Given, though, that today RISC processors are implemented OoO, and the Itanium
> has singularly failed to take the world by storm, there must be more to it than
> I realize.
>
> If it were possible to nearly match OoO performance at much less cost, and
> people knew how to do it, I would expect that this would actually get done. Perhaps naive of me, of course.

Almost all the code running on today's high performance mobile devices is spending almost all of its time running on the dual-issue in-order ARM A53 core.

Dual-issue adds maybe 15% size/cost/energy to the overall core complex, but gives about 40% IPC increase, so it's a pretty good win, as they knew in original Pentium and PowerPC days, and as Motorola/Apple came back to with the G3 and early G4 after they proved to be as good or better than out of order -- at least once branch prediction improved sufficiently.

Anton Ertl

unread,
Jun 4, 2019, 8:39:14 AM6/4/19
to
Bruce Hoult <bruce...@gmail.com> writes:
>Almost all the code running on today's high performance mobile devices is
>spending almost all of its time running on the dual-issue in-order ARM A53
>core.

On what basis do you make this claim? And even if it is true, what is
its relevance?

>Dual-issue adds maybe 15% size/cost/energy to the overall core complex

The 486 has 1.2M transistors; the Pentium has 3.1 million transistors.
According to <https://en.wikichip.org/wiki/intel/80486/486dx2-66>, the
486DX2-66 dissipates 4.88W, while the same-process Pentium 60
dissipated IIRC 13W (the smaller-process Pentium 75 dissipates 8.1W
according to
<https://en.wikipedia.org/wiki/List_of_CPU_power_dissipation_figures#Pentium>).
As for the cost, a computer with a Pentium 60 cost about twice as much
in 1993 as a computer with a 486DX2-66.

>as Motorola/Apple came back to with the G3 and early G4 after they proved
>to be as good or better than out of order -- at least once branch
>prediction improved sufficiently.

The Steve Jobs reality distortion field is still in action, more than
a decade after he himself decided to switch from PowerPCs to Intel.

We now also have an Odroid N2 with 4 1.8GHz Cortex-A73s (2-wide OoO)
and 2 1.9GHz Cortex-A53s (2-wide in-order). Here are the LaTeX
benchmark results on this machine:

2.480s 1.9GHz Cortex-A53 Odroid N2 Ubuntu 18.04
1.204s 1.8GHz Cortex-A73 Odroid N2 Ubuntu 18.04

This is similar to 2-wide AMD64 CPUs:

2.368s Intel Atom 330, 1.6GHz, 512K L2 Zotac ION A, Debian 9 64bit
1.216s AMD E-450 1650MHz (Lenovo Thinkpad X121e), Ubuntu 11.10 64-bit
1.052s Celeron J1900 (Silvermont) 2416MHz (Shuttle XS35V4) Ubuntu16.10

Quadibloc

unread,
Jun 5, 2019, 1:14:17 AM6/5/19
to
On Tuesday, June 4, 2019 at 6:39:14 AM UTC-6, Anton Ertl wrote:
> Bruce Hoult <bruce...@gmail.com> writes:

> >as Motorola/Apple came back to with the G3 and early G4 after they proved
> >to be as good or better than out of order -- at least once branch
> >prediction improved sufficiently.

> The Steve Jobs reality distortion field is still in action, more than
> a decade after he himself decided to switch from PowerPCs to Intel.

My understanding was that, at the time, he made that switch because of a lack of
availability of suitable laptop chips for the PowerPC ISA, not because of any
lack of performance from PowerPC chips. However, I wasn't aware that the desktop
chips available at that time were in-order; if so, I would expect that to be an
issue in terms of performance.

John Savard

EricP

unread,
Jun 5, 2019, 2:04:55 AM6/5/19
to
It was money. It's always money.
Give me this much price break or I take my ball and my bat
and my customer base elsewhere. Then they take that offer
across the street and say "better it or I go with them".


John Dallman

unread,
Jun 5, 2019, 2:16:55 AM6/5/19
to
In article <a4f8fc56-5dc1-4906...@googlegroups.com>,
jsa...@ecn.ab.ca (Quadibloc) wrote:

> My understanding was that, at the time, he made that switch because
> of a lack of availability of suitable laptop chips for the PowerPC
> ISA, not because of any lack of performance from PowerPC chips.

That was a major factor. The fastest PowerPC chips of the time were
Power PC 970 from IBM (https://en.wikipedia.org/wiki/PowerPC_970). Apple
called them "G4". They were based on IBM POWER4, and while they weren't
as fast as Intel NetBurst and Core microarchitectures of 2005-6, they
were only about one generation behind. Before the advent of AMD Athlon
and the Intel/AMD clockspeed race, PowerPC 970 had been pretty
competitive.

IBM had never managed to reduce power usage enough for laptops, nor to
reach the promised clock speeds.

> However, I wasn't aware that the desktop chips available at that
> time were in-order; if so, I would expect that to be an issue in
> terms of performance.

They weren't. NetBurst and Core were both OoO.

John

Anton Ertl

unread,
Jun 5, 2019, 2:54:47 AM6/5/19
to
Quadibloc <jsa...@ecn.ab.ca> writes:
>On Tuesday, June 4, 2019 at 6:39:14 AM UTC-6, Anton Ertl wrote:
>> Bruce Hoult <bruce...@gmail.com> writes:
>
>> >as Motorola/Apple came back to with the G3 and early G4 after they proved to be as good or better than out of order -- at least once branch prediction improved sufficiently.
>
>> The Steve Jobs reality distortion field is still in action, more than
>> a decade after he himself decided to switch from PowerPCs to Intel.
>
>My understanding was that, at the time, he made that switch because of a lack of
>availability of suitable laptop chips for the PowerPC ISA, not because of any
>lack of performance from PowerPC chips.

PPC 7450 family chips were available and suitable for laptops; they
just were not performance-competitive with what Intel offered at the
time.

>However, I wasn't aware that the desktop
>chips available at that time were in-order

AFAIK the PPC 970 was derived from Power 4, which is an OoO
implementation.

Anyway, that was some time after the PPC 750 (called G3 by Apple) and
PPC 7400 (early G4). While Intel and AMD were competing in the GHz
race with OoO CPUs (reached by both in March 2000), AIM introduced the
500MHz PPC 7400 in February 2000.

Just to give an idea of the performance of these machines, for the
Latex benchmark:

Machine                                                       time(s)
- Athlon (Thunderbird) 800, Abit KT7, PC100-333, RedHat 5.1     2.49
- apple powerbook g4/667, 667MHz 7455 256KB L2, Debian          4.030
- apple powerbook ("pismo"), G3 (750), 400 MHz, teTeX 1.0       5.940

The Thunderbird was available in June 2000, the PPC 7455 in January
2002, the PPC750 400MHz in 1999.

Megol

unread,
Jun 5, 2019, 2:56:19 AM6/5/19
to
G3 and G4 used (shallow) OoO execution. Am I missing something?

already...@yahoo.com

unread,
Jun 5, 2019, 7:47:47 AM6/5/19
to
On Wednesday, June 5, 2019 at 9:16:55 AM UTC+3, John Dallman wrote:
> In article <a4f8fc56-5dc1-4906...@googlegroups.com>,
> jsa...@ecn.ab.ca (Quadibloc) wrote:
>
> > My understanding was that, at the time, he made that switch because
> > of a lack of availability of suitable laptop chips for the PowerPC
> > ISA, not because of any lack of performance from PowerPC chips.
>
> That was a major factor. The fastest PowerPC chips of the time were
> Power PC 970 from IBM (https://en.wikipedia.org/wiki/PowerPC_970). Apple
> called them "G4".

G5

> They were based on IBM POWER4,

At the core microarchitecture level, yes.
Different tuning of the process, for a much higher (than POWER4) clock
frequency. And a dramatically weaker uncore.

IBM never made a POWER5-based version. Probably they knew in advance that Jobs was going to drop them.

> and while they weren't
> as fast as Intel NetBurst and Core microarchitectures of 2005-6, they
> were only about one generation behind. Before the advent of AMD Athlon
> and the Intel/AMD clockspeed race, PowerPC 970 had been pretty
> competitive.
>
> IBM had never managed to reduce power usage enough for laptops, not to
> reach the promised clock speeds.

Considering its POWER4 roots, I'd say the clock speed was quite high.

already...@yahoo.com

unread,
Jun 5, 2019, 8:27:39 AM6/5/19
to
Both you and Bruce are referring to the PPC/MPC 7xx and 7xxx as "in-order". Technically, that's incorrect. Their ancestor, the PPC 603e, was an in-order design; the 7xx and 7xxx are only mostly in-order. They have a few rename registers. Both IUs have single-entry reservation stations, and the LSU has a dual-entry reservation station. On the 7xxx there is also the VPU with 4 EUs and, correspondingly, 4 reservation stations.

Yes, by comparison to the PPro, MIPS R10K, or PPC 970, the reordering done in the 7xx is minimal, but if you compare it to its direct predecessor, the PPC 604 (which itself wasn't an OoO monster), then the OoO aspect of the G3/G4, while significantly reduced, is still apparent.

