
CG-OoO: Energy-Efficient Coarse-Grain Out-of-Order Execution Near In-Order Energy with Near Out-of-Order Performance


joshua.l...@gmail.com

May 1, 2019, 1:52:53 AM
Here's a neat trick. CG-OoO is an architecture that is claimed to get very
close to OoO performance, but without the cost of reordering between such
a large number of buffered instructions.

https://dl.acm.org/citation.cfm?id=3151034..

“We introduce the Coarse-Grain Out-of-Order (CG-OoO) general-purpose
processor designed to achieve close to In-Order (InO) processor energy
while maintaining Out-of-Order (OoO) performance. CG-OoO is an
energy-performance-proportional architecture. Block-level code processing
is at the heart of this architecture; CG-OoO speculates, fetches,
schedules, and commits code at block-level granularity. It eliminates
unnecessary accesses to energy-consuming tables and turns large tables
into smaller, distributed tables that are cheaper to access. CG-OoO
leverages compiler-level code optimizations to deliver efficient static
code and exploits dynamic block-level and instruction-level parallelism.
CG-OoO introduces Skipahead, a complexity effective, limited out-of-order
instruction scheduling model. Through the energy efficiency techniques
applied to the compiler and processor pipeline stages, CG-OoO closes 62%
of the average energy gap between the InO and OoO baseline processors at
the same area and nearly the same performance as the OoO. This makes
CG-OoO 1.8× more efficient than the OoO on the energy-delay product
inverse metric. CG-OoO meets the OoO nominal performance while trading off
the peak scheduling performance for superior energy efficiency.”


Instead of building a single, fast core and trying to feed it all the
instructions in the reorder window, they split the problem hierarchically.

Each basic block in the window is allocated to a ‘Block Window’ (BW). Each
BW has a cheap local register file, a *small* reorder buffer (3-5ish
entries) to cover stalls, and an amortized throughput limit of 1
instruction/cycle; small groups (3-5ish) of BWs share execution units.
There is then a global network for communicating the global registers,
which undergo rename, between Block Windows, plus a reorder buffer for the
blocks themselves.
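
To make the hierarchy concrete, here is a rough structural sketch (the
names and sizes are my own shorthand from reading the paper, not anything
the authors publish):

from dataclasses import dataclass, field
from typing import List

@dataclass
class BlockWindow:
    local_regs: List[int] = field(default_factory=lambda: [0] * 8)  # cheap local register file
    buffered_ops: List[str] = field(default_factory=list)           # small (~3-5 entry) buffer to cover stalls
    # amortised limit: roughly one instruction issued per cycle per BW

@dataclass
class BWCluster:
    windows: List[BlockWindow]   # a handful of BWs...
    shared_eus: List[str]        # ...sharing a small group of execution units

@dataclass
class CGOoOCore:
    clusters: List[BWCluster]
    global_regs: List[int]       # renamed, communicated over the global network
    block_rob: List[str]         # reorder buffer at basic-block granularity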

The idea is to extract lots of long-range IPC from the system, but not
focus too much on being able to handle local IPC. This is in a sense the
inverse philosophy to a VLIW, or VLIW-ish approach like the Mill, where
the core does lots of local IPC, and tries to move that longer-range
parallelism into the local scope in the compiler.


Overall this idea looks really elegant to me. They have lots of nice
numbers, a compiler, and interesting simulations. I haven't managed to
think of any reason this shouldn't work in principle. It doesn't do
everything perfectly, like extremely parallel vector workloads, but it
does everything OK, and it circumvents a lot of the challenges of both OoO
processors and in-order processors by construction.

Something worth keeping in mind is that not having lots of local IPC
doesn't matter too much, as long as your compiler is sensible. If you can
start execution early, the basic block doesn't need low latency. If you
can't start execution early, then either the later blocks don't immediately
depend on you, so you're not holding much else up, or they do, and the
code doesn't have any latent parallelism to extract anyway. The exception
is when you could be doing things in parallel but for the local reorder
buffer filling up, or when all the global parallelism gets pulled into the
local scope and produces very weird dependency graphs; but compilers can
detect that and hopefully just split the basic blocks up.


One thing I did think worth exploring is avoiding the global interconnect
and global register rename phases by having many copies of a smaller global
register file and chaining each individual register with its copies in a
circular buffer.

Diagram: https://i.imgur.com/21Icv1k.png

Red and orange lines control register forwarding, overwriting, and
readiness. The blue line is a skip connection to save latency.

It's not obvious if hardware like this would be practical, but I don't see
why not, and if it is then it gets rid of rename, global wakeup, and
global forwarding paths and their corresponding throughput and latency
limits.

MitchAlsup

May 2, 2019, 11:34:38 AM
On Wednesday, May 1, 2019 at 12:52:53 AM UTC-5, joshua....@gmail.com wrote:
> Here's a neat trick. CG-OoO is an architecture that is claimed to get very
> close to OoO performance, but without the cost of reordering between such
> a large number of buffered instructions.
>
> https://dl.acm.org/citation.cfm?id=3151034..
>
> “We introduce the Coarse-Grain Out-of-Order (CG-OoO) general-purpose
> processor designed to achieve close to In-Order (InO) processor energy
> while maintaining Out-of-Order (OoO) performance. CG-OoO is an
> energy-performance-proportional architecture. Block-level code processing
> is at the heart of this architecture; CG-OoO speculates, fetches,
> schedules, and commits code at block-level granularity. It eliminates
> unnecessary accesses to energy-consuming tables and turns large tables
> into smaller, distributed tables that are cheaper to access. CG-OoO
> leverages compiler-level code optimizations to deliver efficient static
> code and exploits dynamic block-level and instruction-level parallelism.
> CG-OoO introduces Skipahead, a complexity effective, limited out-of-order
> instruction scheduling model. Through the energy efficiency techniques
> applied to the compiler and processor pipeline stages, CG-OoO closes 62%
> of the average energy gap between the InO and OoO baseline processors at
> the same area and nearly the same performance as the OoO. This makes
> CG-OoO 1.8× more efficient than the OoO on the energy-delay product
> inverse metric. CG-OoO meets the OoO nominal performance while trading off
> the peak scheduling performance for superior energy efficiency.”

Some comparative data::
A properly designed 1-wide InOrder machine can achieve just about 1/2 the
performance of a GreatBigOoO machine at somewhere in the 1/10 to 1/16 the
area and power.

Also note:
There is middle ground between InOrder machines and Out of Order machines.
Machines can be partially ordered--out of order with respect to the program
counter but in order with respect to the function unit. This dramatically
reduces the problem of recovery (branch, exception,...) with very little
loss wrt performance.

Given a bit of compiler cleverness (such as software pipelining) one can
get essentially the same performance as the GBOoO machines at much less
cost (area and power).

None of the above do anything to harm max operating frequency; in fact the
PO and IO machines should be able to run faster than the GBOoO.

EricP

May 2, 2019, 12:59:13 PM
MitchAlsup wrote:
> On Wednesday, May 1, 2019 at 12:52:53 AM UTC-5, joshua....@gmail.com wrote:
>> Here's a neat trick. CG-OoO is an architecture that is claimed to get very
>> close to OoO performance, but without the cost of reordering between such
>> a large number of buffered instructions.
>>
>> https://dl.acm.org/citation.cfm?id=3151034..
>>
>> <snip>
>
> Some comparative data::
> A properly designed 1-wide InOrder machine can achieve just about 1/2 the
> performance of a GreatBigOoO machine at somewhere in the 1/10 to 1/16 the
> area and power.
>
> Also note:
> There is middle ground between InOrder machines and Out of Order machines.
> Machines can be partially ordered--out of order with respect to the program
> counter but in order with respect to the function unit. This dramatically
> reduces the problem of recovery (branch, exception,...) with very little
> loss wrt performance.
>
> Given a bit of compiler cleverness (such as software pipelining) one can
> get essentially the same performance as the GBOoO machines at much less
> cost (area and power).
>
> None of the above do anything to harm max operating frequency; in fact the
> PO and IO machines should be able to run faster than the GBOoO.

I came across a paper about a RISC-V implementation called Ariane
which I found interesting as it employs some of the uArch ideas
I have thought about for lightweight OoO.

The Cost of Application-Class Processing: Energy and Performance Analysis
of a Linux-ready 1.7GHz 64bit RISC-V Core in 22nm FDSOI Technology, 2019
https://arxiv.org/abs/1904.05442

- 6 stage pipeline
- in-order single issue, OoO execute & write-back, in-order commit
- variable latency execute units (ALU = 1, MUL = 2, DIV = 2..64)
- scoreboard scheduler
- 8 entry very lightweight ROB (no result values, no data ports,
no CAMs, no scheduling, etc. - just what is needed to commit in-order.)

It employs an interesting lightweight renaming mechanism:
"Write after Write (WAW) hazards in the scoreboard are resolved
through a light-weight re-naming scheme which increases the
logical register address space by one bit. Each issued instruction
toggles the MSB of its destination register address and subsequent
read addresses are re-named to read from the latest register address."
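
A toy version of that scheme as I read it (my own sketch, not the actual
Ariane RTL):

NUM_REGS = 32

class WawRename:
    """Logical register i lives at one of two scoreboard addresses (MSB 0 or 1)."""
    def __init__(self):
        self.msb = [0] * NUM_REGS          # current "colour" of each architectural register

    def issue(self, rd, rs1, rs2):
        # reads are renamed to the address of the most recent write
        rs1_addr = (self.msb[rs1] << 5) | rs1
        rs2_addr = (self.msb[rs2] << 5) | rs2
        # the destination toggles its MSB, so an older in-flight write to the
        # same logical register can no longer collide with this one (WAW resolved)
        self.msb[rd] ^= 1
        rd_addr = (self.msb[rd] << 5) | rd
        return rd_addr, rs1_addr, rs2_addr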

They measured performance at:
"On the rather small Dhrystone benchmark the mispredict
rate is 5.77% with a 128-entry BHT and a 64-entry BTB.
This results in an IPC of 0.82 for the Dhrystone benchmark."




Ivan Godard

May 2, 2019, 3:49:45 PM
Amen!

MitchAlsup

May 2, 2019, 4:22:58 PM
My scoreboard technology was brought to them via Luke (lkcl).

> - 8 entry very lightweight ROB (no result values, no data ports,
> no CAMs, no scheduling, etc. - just what is needed to commit in-order.)

Saying it has no CAMs is a grey area. Saying it has no binary CAMs is true.
The difference is in how the read reservations are allowed to complete
before the next write, keeping registers in a useful partial order. The
difference is that a k-ported CAM (like a reservation station or ROB)
will have k×log2(physical registers) wires moving each cycle, while the
Scoreboard only has k moving (and no XOR gates,...)
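
For a rough sense of scale (my own illustrative numbers, not from any
particular design): with 64 physical registers and k = 4 ports, the CAM
broadcasts 4 × log2(64) = 24 tag wires toggling every cycle, whereas the
scoreboard moves just 4.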

>
> It employs an interesting lightweight renaming mechanism:
> "Write after Write (WAW) hazards in the scoreboard are resolved
> through a light-weight re-naming scheme which increases the
> logical register address space by one bit. Each issued instruction
> toggles the MSB of its destination register address and subsequent
> read addresses are re-named to read from the latest register address."

All sorts of hazards can be treated with scoreboard techniques--unlike how
Hennessy and Patterson describe the situation.
>
> They measured performance at:
> "On the rather small Dhrystone benchmark the mispredict
> rate is 5.77% with a 128-entry BHT and a 64-entry BTB.
> This results in an IPC of 0.82 for the Dhrystone benchmark."

sounds about right.

joshua.l...@gmail.com

May 2, 2019, 9:17:43 PM
On Thursday, 2 May 2019 16:34:38 UTC+1, MitchAlsup wrote:
>
> Some comparative data::
> A properly designed 1-wide InOrder machine can achieve just about 1/2 the
> performance of a GreatBigOoO machine at somewhere in the 1/10 to 1/16 the
> area and power.

The CG-OoO paper does include in-order measurements, and while they do
replicate your 1/2 performance claim, they used a 4-wide superscalar with a
70-wide register file to get there, and it runs closer to 1/2 the power per
instruction of the OoO core.

Getting that performance out of a 1-wide sounds incredibly optimistic to
me, seemingly almost impossible, so I'm curious what kind of core design
and workloads you're talking about.

Maybe a good way to extract real numbers from what one can expect is to
look at modern Apple processors, which have a mix of big and little cores
that most people consider industry leading. The A12 has big 7-wide decode
OoO Vortex cores and little 3-wide decode OoO Tempest cores. According to
AnandTech[1], the smaller cores use about half the energy of the big ones
per instruction, or about 1/6 to 1/10 the watts, but have a third to a
quarter of the performance.

These are significantly worse power and performance numbers than your
claim. If a 1-wide in-order core could execute almost twice as fast as
their Tempest core, and still save energy, one wonders why they haven't
done it. If instead you only mean it's possible in principle, but Apple
don't have the skillset to execute, then it's not really a useful
alternative anyway.

[1]: https://www.anandtech.com/show/13392/the-iphone-xs-xs-max-review-unveiling-the-silicon-secrets/5

> Also note:
> There is middle ground between InOrder machines and Out of Order machines.
> Machines can be partially ordered--out of order with respect to the program
> counter but in order with respect to the function unit. This dramatically
> reduces the problem of recovery (branch, exception,...) with very little
> loss wrt performance.
>
> Given a bit of compiler cleverness (such as software pipelining) one can
> get essentially the same performance as the GBOoO machines at much less
> cost (area and power).

This sounds very similar to the skipahead technique used in the paper, and
they do test a 1-BW, 1-EU variant of the CG-OoO architecture that should be
similar to this in principle.

They find that it produces a noticeable performance increase at very low
cost, but it only closes about a fifth or so of the gap. Note that this is
still 4-wide, so execution throughput shouldn't be the bottleneck, though
the lookahead is pretty small.

Frankly this all seems very bullish. I can understand saying you can get a large improvement like this if you introduce something new, but this is just the same tech the industry has been working on for decades... well, if it's too good to be true...

MitchAlsup

May 2, 2019, 10:22:23 PM
On Thursday, May 2, 2019 at 8:17:43 PM UTC-5, joshua....@gmail.com wrote:
> On Thursday, 2 May 2019 16:34:38 UTC+1, MitchAlsup wrote:
> >
> > Some comparative data::
> > A properly designed 1-wide InOrder machine can achieve just about 1/2 the
> > performance of a GreatBigOoO machine at somewhere in the 1/10 to 1/16 the
> > area and power.
>
> The CG-OoO paper does include in-order measurements, but though they do
> replicate your 1/2 performance claim, they used a 4-wide superscalar with a
> 70-wide register file to get there, and it runs closer to 1/2 the power per
> instruction of the OoO core.

When we did Opteron, over the 5,000 4M instruction traces we had gathered and
assembled into a test collection, Opteron itself was getting 1.0 I/C while
a LBIO core I had designed was able to get almost 0.5 I/C. These were both
multi GHz designs with the same memory system from L2 outward and using the
same FU designs,...

You might claim 1.0 for Opteron is too low, but many of the traces have
significant OS footprints and these excursions damage the caches and TLBs
in significant ways. But in both cases we executed the same instruction
profile with completely different microarchitectures.
>
> Getting that performance out of a 1-wide sounds incredibly optimistic to
> me, seemingly almost impossible, so I'm curious what kind of core design
> and workloads you're talking about.

Think about this, a DRAM access was averaging 200+ cycles, L2 was in the
realm of 20+ cycles, and about 1/3-1/2 of the instructions inserted into
the execution window of Opteron were discarded due to branch misprediction
even with 95%+ branch prediction ratio. These cost time and energy without
doing anything for actual perf.

Since this was a bit more than a decade ago, I could be wrong.
Also note:: once you take 1/2 the I/C demand out of the caches, they display
less latency (so does the TLB and tablewalks.)

already...@yahoo.com

May 3, 2019, 8:35:08 AM
Opteron is not GBOoO by today's standards.
Perf/Hz of Opteron is significantly less than half of state-of-the-art GBOoO. Maybe less than 1/4 if compared to the newest Apple. Part of the gap is due to Opteron's slow exclusive L2 and absence of LLC, but I would guess that more than half of the difference is due to the core itself.

Opteron (original SledgeHammer, not Shanghai or Istanbul) is in the same class as Goldmont. At best Goldmont+ and Arm Cortex A73, but probably not quite.

MitchAlsup

May 3, 2019, 1:05:50 PM
On Wednesday, May 1, 2019 at 12:52:53 AM UTC-5, joshua....@gmail.com wrote:
> Here's a neat trick. CG-OoO is an architecture that is claimed to get very
> close to OoO performance, but without the cost of reordering between such
> a large number of buffered instructions.
>
> https://dl.acm.org/citation.cfm?id=3151034..

I looked this over, and it reminds me a lot of my Virtual Vector Facility
in that it decorates the top of a loop with an instruction that identifies
the loop and terminates the loop with another instruction.

VVF is different in that the top of loop instruction (VEC) identifies which
registers are scalar (versus vector) and are thus constant in the loop, and
at the bottom of the loop is the LOOP instruction which is equivalent to the
ADD/CMP/BC sequence at the bottom of counted loops.

loop: VEC   {VV,VV,V}    // VEC identifies next instruction as top of loop
      LD    Rv,[Rva+4]   // LOOP reverts back to this instruction
      ADD   Rva,Rva,1
      LOOP  Rv,0,0

So, my model only executes 3 instructions per loop traversal in the loop
analyzed in the paper instead of 5.

My vectorizing model is aimed at more modest issue-width architectures,
but with the ability to add multiple lanes of vector calculation, so one
can perform 1, 2, 4, 8, ... loops/cycle depending on the implementation.

In addition, there are no vector register files defined in the architecture
or implemented in the microarchitecture.

Finally, I only waste an "instruction" (VEC) when I have a loop to be
vectorized, not every basic block.

Anton Ertl

May 3, 2019, 2:05:51 PM
MitchAlsup <Mitch...@aol.com> writes:
>Some comparative data::
>A properly designed 1-wide InOrder machine can achieve just about 1/2 the
>performance of a GreatBigOoO machine at somewhere in the 1/10 to 1/16 the
>area and power.

I have already presented some IPC numbers for our LaTeX benchmark in
<2019Feb...@mips.complang.tuwien.ac.at>, but I'll add some more
based on the execution time results from
<https://www.complang.tuwien.ac.at/franz/latex-bench>, and assuming
that there are 2100M instructions in the benchmark where I have not
measured instructions:

inst.       cycles      IPC   Hardware
            6220M       0.34  1992 Intel 486, 66 MHz, 256K L2-Cache IA-32
            2866M       0.73  1997 Pentium MMX 233 IA-32
            3717M       0.57  2008 Atom 330 (Bonnell) 1600MHz IA-32
            2006M       1.04  2011 AMD E-450 (Bobcat) 1650MHz AMD64
            2541M       0.82  2013 Celeron J1900 2416MHz (Silvermont) AMD64
            1638M       1.28  2016 Celeron J3455 (Goldmont) AMD64
2071M       1336M       1.55  2017 Celeron J4105 (Goldmont+) 2500MHz AMD64
2007080563  1787418318  1.12  2005 K8 Athlon 64 X2 4400+ (2x2.2GHz, 2x1024KB L2)
2108280684  1483902954  1.42  2007 Penryn Xeon E5450 (4x3GHz, 2x6MB L2)
2140831178  1193597880  1.79  2011 Sandy Bridge Xeon E3 1220 (4x3.1GHz, 8MB L3)
2140436438  1075849207  1.99  2014 Haswell Core i7-4790K (4x4GHz, 8MB L3)
2070349870   945546455  2.19  2015 Skylake Core i5-6600K (4x3.5GHz, 6MB L3)

So the 486 has 3 times lower IPC than the K8. Even the two-wide
in-order Atom 330 has two times lower IPC than the K8. And the K8 has
two times lower IPC than Skylake.

In the low-power low-cost market segment, compared to the in-order
Bonnell, the OoO Bobcat has almost twice the IPC at the same clock
rate, while the OoO Silvermont has only 1.4x IPC, but higher clock
(resulting in overall slightly better performance). There is
apparently no power or clock rate advantage due to in-order, otherwise
Intel would have stayed with in-order for this market segment.

>Given a bit of compiler cleverness (such as software pipelining) one can
>get essentially the same performance as the GBOoO machines at much less
>cost (area and power).

Intel bet IA-64 on that. It did not work out.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

MitchAlsup

May 3, 2019, 3:36:18 PM
On Friday, May 3, 2019 at 1:05:51 PM UTC-5, Anton Ertl wrote:

> I have already presented some IPC numbers for our LaTeX benchmark in
> <2019Feb...@mips.complang.tuwien.ac.at>, but I'll add some more
> based on the execution time results from
> <https://www.complang.tuwien.ac.at/franz/latex-bench>, and assuming
> that there are 2100M instructions in the benchmark where I have not
> measured instructions:
>
> inst. cycles IPC Hardware
> 6220M 0.34 1992 Intel 486, 66 MHz, 256K L2-Cache IA-32
> 2866M 0.73 1997 Pentium MMX 233 IA-32
> 3717M 0.57 2008 Atom 330 (Bonnell) 1600MHz IA-32
> 2006M 1.04 2011 AMD E-450 (Bobcat) 1650MHz AMD64
> 2541M 0.82 2013 Celeron J1900 2416MHz (Silvermont) AMD64
> 1638M 1.28 2016 Celeron J3455 (Goldmont) AMD64
> 2071M 1336M 1.55 2017 Celeron J4105 (Goldmont+) 2500MHz AMD64
> 2007080563 1787418318 1.12 2005 K8 Athlon 64 X2 4400+ (2x2.2GHz, 2x1024KB L2)
> 2108280684 1483902954 1.42 2007 Penryn Xeon E5450 (4x3GHz, 2x6MB L2)
> 2140831178 1193597880 1.79 2011 Sandy Bridge Xeon E3 1220 (4x3.1GHz, 8MB L3)
> 2140436438 1075849207 1.99 2014 Haswell Core i7-4790K (4x4GHz, 8MB L3)
> 2070349870 945546455 2.19 2015 Skylake Core i5-6600K (4x3.5GHz, 6MB L3)

Thanks for the data.

MitchAlsup

May 3, 2019, 4:31:01 PM
On Friday, May 3, 2019 at 1:05:51 PM UTC-5, Anton Ertl wrote:
>
> I have already presented some IPC numbers for our LaTeX benchmark

a) how big is the data being processed by LaTeX?
b) what would the I/Cs be if no SIMD instructions were executed?

joshua.l...@gmail.com

May 3, 2019, 9:29:15 PM
On Friday, 3 May 2019 03:22:23 UTC+1, MitchAlsup wrote:
> On Thursday, May 2, 2019 at 8:17:43 PM UTC-5, joshua....@gmail.com wrote:
> >
> > The CG-OoO paper does include in-order measurements, but though they do
> > replicate your 1/2 performance claim, they used a 4-wide superscalar with a
> > 70-wide register file to get there, and it runs closer to 1/2 the power per
> > instruction of the OoO core.
>
> When we did Opteron, over the 5,000 4M instruction traces we had gathered and
> assembled into a test collection, Opteron itself was getting 1.0 I/C while
> a LBIO core I had designed was able to get almost 0.5 I/C. These were both
> multi GHz designs with the same memory system from L2 outward and using the
> same FU designs,...
>
> You might claim 1.0 for Opteron is too low, but many of traces have significant
> OS footprints and these excursions damage the caches and TLBs in significant
> ways. But in both cases we executed the same instruction profile with complete-
> ly different microarchitectures.
> >
> > Getting that performance out of a 1-wide sounds incredibly optimistic to
> > me, seemingly almost impossible, so I'm curious what kind of core design
> > and workloads you're talking about.
>
> Think about this, a DRAM access was averaging 200+ cycles, L2 was in the
> realm of 20+ cycles, and about 1/3-1/2 of the instructions inserted into
> the execution window of Opteron were discarded due to branch misprediction
> even with 95%+ branch prediction ratio. These cost time and energy without
> doing anything for actual perf.
>
> Since this was a bit more than a decade ago, I could be wrong.

OoO cores have moved quite a way in the last decade, I'd say.[1] Just
halving the branch miss rate doubles the amount you can look ahead. There's
not as much one can do to speed up a 1-wide in-order design without it
ceasing to be 1-wide or in-order.

This asymmetry in performance improvements accounts for much or all of the
discrepancy.

[1]: https://imgur.com/gallery/GcUyY8k

> > > Also note:
> > > There is middle ground between InOrder machines and Out of Order machines.
> > > Machines can be partially ordered--out of order with respect to the program
> > > counter but in order with respect to the function unit. This dramatically
> > > reduces the problem of recovery (branch, exception,...) with very little
> > > loss wrt performance.
> > >
> > > Given a bit of compiler cleverness (such as software pipelining) one can
> > > get essentially the same performance as the GBOoO machines at much less
> > > cost (area and power).
> >
> > This sounds very similar to the skipahead technique used in the paper, and
> > they do test a 1-BW, 1-EU variant of the CG-OoO architecture that should be
> > similar to this in principle.
> >
> > They find that it produces a noticeable performance increase at very low
> > cost, but it only closes about a fifth or so of the gap. Note that this is
> > still 4-wide, so execution throughput shouldn't be the bottleneck, though
> > the lookahead is pretty small.
> >
> > Frankly this all seems very bullish. I can understand saying you can
> > get a large improvement like this if you introduce something new, but
> > this is just the same tech the industry has been working on for
> > decades... well, if it's too good to be true...
>
> Also note:: once you take 1/2 the I/C demand out of the caches, they display
> less latency (so does the TLB and talbewalks.)

I'm not sure what you're saying here. I/C means instruction cache?
Why would that be better in an in-order?

I can't say I see a lot of similarity here. CG-OoO is no more specific to
loops than a ROB is. The core isn't designed for massive overall
parallelism, since it will quickly bottleneck on the decoder, and
presumably targeting that will still take dedicated vector techniques like
SIMD or such.

That said, I can't say I understand your approach that well, either (though
a pointer to a writeup, if you've got one, would help). What happens when
the loop you want to vectorize is 50 instructions long? Don't you run out
of execution units?

Paul Rubin

May 3, 2019, 10:12:22 PM
MitchAlsup <Mitch...@aol.com> writes:
> VVF is different in that the top of loop instruction (VEC) identifies
> which registers are scalar (versus vector) and are thus constant in
> the loop, and at the bottom of the loop is the LOOP instruction which
> is equivalent to the ADD/CMP/BC sequence at the bottom of counted
> loops.

Is this different from the zero-overhead loops that DSPs have had since
forever? I'd figured big processors manage to do it all behind the
scenes using branch prediction.

Quadibloc

May 4, 2019, 12:43:57 AM
On Thursday, May 2, 2019 at 2:22:58 PM UTC-6, MitchAlsup wrote:

> My scoreboard technology brought to them via Luke; Lkcl.

A Google search brought me to this:

https://www.crowdsupply.com/libre-risc-v/m-class/updates/modernising-1960s-computer-technology-learning-from-the-cdc-6600

which seems to be what you're referring to.

John Savard

MitchAlsup

May 4, 2019, 9:52:02 AM
If you take 1/2 of the instruction per cycle demand out of the caches,
they inherently display less latency.

MitchAlsup

May 4, 2019, 9:53:09 AM
On Friday, May 3, 2019 at 9:12:22 PM UTC-5, Paul Rubin wrote:
Generally the big machines overloop and then back up when the loop termination
becomes known.

MitchAlsup

May 4, 2019, 9:54:02 AM
The later drawings are certainly mine.
>
> John Savard

EricP

May 4, 2019, 10:10:23 AM
MitchAlsup wrote:
> On Friday, May 3, 2019 at 8:29:15 PM UTC-5, joshua....@gmail.com wrote:
>> On Friday, 3 May 2019 03:22:23 UTC+1, MitchAlsup wrote:
>>> Also note:: once you take 1/2 the I/C demand out of the caches, they display
>>> less latency (so does the TLB and talbewalks.)
>> I'm not sure what you're saying here. I/C means instruction cache?
>> Why would that be better in an in-order?
>
> If you take 1/2 of the instruction per cycle demand out of the caches,
> they inherently display less latency.

I assume you mean that memory queue delay is a significant
amount of the total access latency.
Less requests = shorter queue = lower latency = smaller pipeline bubbles
= higher IPC.
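
As a back-of-envelope illustration (a simple M/M/1-style approximation
with my own made-up numbers):

def avg_queue_wait(service_cycles, utilisation):
    # mean waiting time grows as utilisation / (1 - utilisation)
    return service_cycles * utilisation / (1.0 - utilisation)

print(avg_queue_wait(4, 0.8))   # heavy demand: ~16 cycles spent queued
print(avg_queue_wait(4, 0.4))   # half the demand: ~2.7 cycles spent queued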


MitchAlsup

May 4, 2019, 10:58:42 AM
Less requests = Fewer misses per cycle

EricP

May 4, 2019, 2:22:41 PM
:-) that too.

Anton's IPC table got me thinking about what IPC is actually measuring.
The table shows IPC increasing over time, but cache sizes also
increase over time, and one would expect that as cache size increases,
average access time decreases, and IPC goes up.
That got me wondering how much of the IPC increase is due
to larger caches, and how much to improved uArch efficiency,
i.e. an increased ability to hide pipeline bubbles and catch up after one.



MitchAlsup

May 4, 2019, 4:54:54 PM
On Saturday, May 4, 2019 at 1:22:41 PM UTC-5, EricP wrote:
> MitchAlsup wrote:
> > On Saturday, May 4, 2019 at 9:10:23 AM UTC-5, EricP wrote:
> >> MitchAlsup wrote:
> >>> On Friday, May 3, 2019 at 8:29:15 PM UTC-5, joshua....@gmail.com wrote:
> >>>> On Friday, 3 May 2019 03:22:23 UTC+1, MitchAlsup wrote:
> >>>>> Also note:: once you take 1/2 the I/C demand out of the caches, they display
> >>>>> less latency (so does the TLB and talbewalks.)
> >>>> I'm not sure what you're saying here. I/C means instruction cache?
> >>>> Why would that be better in an in-order?
> >>> If you take 1/2 of the instruction per cycle demand out of the caches,
> >>> they inherently display less latency.
> >> I assume you mean that memory queue delay is a significant
> >> amount of the total access latency.
> >> Less requests = shorter queue = lower latency = smaller pipeline bubbles
> >> = higher IPC.
> >
> > Less requests = Fewer misses per cycle
>
> :-) that too.
>
> Aton's IPC table got me thinking about what IPC is actually measuring.
> The table shows IPC increasing over time, but cache size is also
> increases over time, and one would expect that as cache size increases,

Cache size stopped at [6,8] MB '07-'15
Frequency is up only 33% '07-'15

> average access time decreases, and IPC goes up.
> That got me wondering how much of the IPC increase is due
> to larger caches, and how much to improved uArch efficiency,

More misses processed per cycle in the L2-L3s
?wider data transfers?
Better branch prediction
Faster DRAM {Both latency and BW}

> an increased ability to hide pipeline bubbles and catch up after one.

Deeper execution windows.

And other stuff.

Anton Ertl

May 5, 2019, 2:17:20 AM
EricP <ThatWould...@thevillage.com> writes:
>Aton's IPC table got me thinking about what IPC is actually measuring.
>The table shows IPC increasing over time, but cache size is also
>increases over time, and one would expect that as cache size increases,
>average access time decreases, and IPC goes up.

The LaTeX benchmark does not miss in the cache much; on a Core
i7-6700K (32KB L1, 256KB L2, 8MB L3)
<2016Jan...@mips.complang.tuwien.ac.at>:

2 cores        1 core 2 threads   event
 868M750k967    1526M150k379      cycles
1949M558k752    1949M705k920      instructions
 402M397k933     402M422k201      r04c4 all branches retired
   4M226k552       4M560k550      r04c5 all branches mispredicted
 389M988k743     390M015k254      r82d0 all stores retired
 609M795k749     609M820k645      r81d0 all loads retired
 599M789k726     590M869k670      r01d1 load retired l1 hit
   5M708k401      11M998k206      r08d1 load retired l1 miss
   4M822k206       9M739k656      r02d1 load retired l2 hit
     897k646       2M315k712      r10d1 load retired l2 miss
     884k132       2M299k580      r04d1 load retired l3 hit
      12k898          16k628      r20d1 load retired l3 miss

All the machines I gave numbers for have at least 256K L2 cache, so
<1M cache misses for about 2G instructions. I think that, for this
benchmark, the microarchitectural improvements are the main reason for
the IPC increase.

Anton Ertl

May 5, 2019, 2:53:15 AM
MitchAlsup <Mitch...@aol.com> writes:
>On Friday, May 3, 2019 at 1:05:51 PM UTC-5, Anton Ertl wrote:
>>
>> I have already presented some IPC numbers for our LaTeX benchmark
>
>a) how big is the data being processed by LaTeX?

You can download the source files to process at
<https://www.complang.tuwien.ac.at/anton/latex-bench/>. The main
source file is 142906 bytes long, the part of the bibliography
processed by latex 7476 bytes, and it also processes the 5144 byte
.aux file, for a total of 155526. The outputs are 220048 (.dvi), 5254
(.log), and 5144 (.aux) bytes long. Latex also reads a number of
other files when processing this input. Overall, there are 582 calls
of read() during a run I just made, many of them for 4096 bytes, and
many of those actually get the 4096 bytes, so overall <2.4MB of files
is read (this includes the 155526 bytes above). There are also a
number of mmap() calls, but they seem to be due to the loader (none of
them is after the start message of the latex program).

>b) what would the I/Cs be if no SIMD instructions were executed?

Probably the same. The source code is plain C code, and the
instruction count has not changed much over time (so the
auto-vectorization did not manage to vectorize important parts of the
program). The major reason for SIMD instructions would be library or
kernel calls of stuff like memcpy(); I just looked at that with "perf
record ...; perf report", and did not find such library calls at all
(which probably means that none of the 933 samples happened in such a
call), but some kernel calls that probably use SIMD instructions:

0.21% latex [kernel.kallsyms] [k] copy_user_enhanced_fast_string
0.11% latex [kernel.kallsyms] [k] copy_page_to_iter

Very little time is spent there, so the effect of eliminating SIMD
instructions cannot be much.

anti...@math.uni.wroc.pl

May 7, 2019, 1:56:40 PM
TeX (which is at the core of this benchmark) is essentially a special
macro-processor (expander). Working data is on the order of a few
megabytes; the main part consists of (nowadays two-word) cells which
can form various dynamic data structures. The core executable is
about 160 kB, so it should fit into the L2 cache. When I looked, in
the nineties, into the correlation between execution time and
benchmarks, I found a good correlation to Dhrystone. Memory speed and
cache size/speed/organization had a much smaller effect.

To add further anecdotal evidence: I had an Athlon 64 as my main
machine, and now use a Core 2 and an i5. For my purposes the 1.8 GHz
Core 2 gave slightly better performance than the Athlon 64, and the
1.7 GHz i5 performs comparably to the Core 2. That would suggest an
improvement in IPC on the order of 1.5.

Note: this affects code that AFAIK uses no SIMD operations
(but may benefit from faster buses).

--
Waldek Hebisch

lkcl

May 24, 2019, 3:38:22 PM
On Wednesday, May 1, 2019 at 1:52:53 PM UTC+8, joshua....@gmail.com wrote:

> into smaller, distributed tables that are cheaper to access. CG-OoO
> leverages compiler-level code optimizations to deliver efficient static
> code and exploits dynamic block-level and instruction-level parallelism.

Hm, this is a thumbs down for use in general purpose computing, unfortunately.
Whilst yes, compiler optimisations can be expected, if the architecture is to take off, tying it to a compiler is a mistake.
Different hardware implementors will do different things, and performance will be penalised because the implementation does not match precisely what the compiler wants.

And of course, nobody is going to recompile the binaries of a proprietary application for the "underdog". We had this with Microsoft using Intel's proprietary C compiler; AMD were *always* penalised.

If on the other hand there is no expectation to create a wider ubiquitous community, the idea has merit.

> does everything OK, and it circumvents a lot of the challenges of both OoO
> processors and in-order processors by construction.

Having begun my first core as a 6600-based engine (with thanks to Mitch for being exceptionally patient and helpful), I am finding this perspective much easier both to understand and to work with, even compared to standard pipelined in-order designs.

That having been said, as noted in Thornton's "Design of a Computer", the 6600 scoreboard is exceptionally complex in its details, yet is paradoxically extremely simple in its actual implementation and effectiveness in gate count. The 6600 scoreboard's entire gate count is noticeably lower even than one of the simplest ALUs.

Code is here if interested
http://git.libre-riscv.org/?p=soc.git;a=tree;f=src/scoreboard;h=2015241742c84ec16c132a29960117adc398a310;hb=9aade2ba12aa80e263a43fefe83f5fbbae211793

Think of what has to be done in an inorder pipelined design.

* RaW hazards have to be detected. Solution: stall.
* likewise for WaR hazards.
* likewise for WaW.
* Interrupts have to be masked during atomic operations (yuk).
* LDs / ST hazards have to be dealt with. Solution: stall.
* Branches have to be observed: the simplest solution: stall.
* DIV and other FSM style blocking units are hard to integrate, as they block dependent instructions. Solution: stall.

etc etc. Note the fact that the inorder design *still has to detect the hazards*.

This is unavoidable. The catch-all solution (stall) gets stale very quickly, and also impacts the design of the pipelines. Early-out pipelines wreak havoc, causing such scheduling nightmares that no sane in-order designer will touch them.

By contrast, I found that (with Mitch's help) I was able to implement a 6600 style augmented scoreboard, capable of dealing with RaW and WaR hazards in around 5 weeks. Adding in WaW on top of that took one day, and used the shadow system intended for branch speculation and precise exceptions to do so.

Once the shadow system had been confirmed as operational, I implemented branch speculation earlier today and will spend tomorrow testing and debugging it.

Once the initial RaW and WaR technical hurdle was cleared, progress has been very rapid, as the core concept is so cleanly designed: the Dependency Matrices take care of ALL hazards.

That's worth repeating.

6600 style Dependency Matrices take care of ALL hazards, leaving units free to execute WITHOUT STALLING.
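
To give a flavour of what the matrices hold, here is a heavily simplified
sketch of my own (not the nmigen code linked above): one bit per (Function
Unit, register) records outstanding reads and writes, and those bits gate
issue and write-back.

NUM_FUS, NUM_REGS = 8, 32

class DepMatrices:
    def __init__(self):
        self.rd_pending = [[0] * NUM_REGS for _ in range(NUM_FUS)]
        self.wr_pending = [[0] * NUM_REGS for _ in range(NUM_FUS)]

    def issue(self, fu, srcs, dest):
        for r in srcs:
            self.rd_pending[fu][r] = 1       # RaW: wait until r's producer delivers it
        self.wr_pending[fu][dest] = 1

    def write_allowed(self, fu, dest, older_fus):
        # WaR: an FU may not commit its write while any older FU still has
        # an outstanding read of the same register
        return not any(self.rd_pending[o][dest] for o in older_fus)

    def retire(self, fu):
        self.rd_pending[fu] = [0] * NUM_REGS
        self.wr_pending[fu] = [0] * NUM_REGS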

The pipelined ALUs for example do not need to stall, they need only "cancellation" capability (and there is a way to avoid that).

Early-out and multi-path pipelines are also perfectly safe, as there is a "Management" signalling layer. The ALU just has to say when it is done. The completion time is NOT relevant.

FSM based ALUs are treated just like any of the other parallel units. Again: as the Dependency Matrices calculate the hazards, the blocking ALUs simply need to say when they have a result ready.

6600 scoreboards also come with register renaming *for free*. As in, it is an accidental inherent part of the algorithm. It does not need to be "added" and, ironically, cannot be removed, either.

Mitch's augmentations add precise exceptions and branch speculation. It is not conceptually difficult (nor costly in terms of gates): an array of latches hooks into the "Commit" phase, preventing register write whilst still allowing result *creation* until the outcome of the exception or branch is known.

Multi-issue execution, Write after Write and vector predication can all use the exact same "shadow" mechanism.

In short: with all of the complexity centralised, taking care of literally all hazards, the rest of the design becomes *drastically* simpler and much more straightforward.

So no, I strongly disagree that an OoO design is inherently more complex than an inorder one. Even an inorder design can I believe be made much simpler to implement by having degenerate Dependency Matrices.

L.

Ivan Godard

May 24, 2019, 4:10:11 PM
On 5/24/2019 12:38 PM, lkcl wrote:

> Think of what has to be done in an inorder pipelined design.
>
> * RaW hazards have to be detected. Solution: stall.
> * likewise for WaR hazards.
> * likewise for WaW.
> * Interrupts have to be masked during atomic operations (yuk).
> * LDs / ST hazards have to be dealt with. Solution: stall.
> * Branches have to be observed: the simplest solution: stall.
> * DIV and other FSM style blocking units are hard to integrate, as they block dependent instructions. Solution: stall.
>
> etc etc. Note the fact that the inorder design *still has to detect the hazards*.
>
> This is unavoidable. The catch-all solution (stall) gets stale very quickly, and also impacts the design of the pipelines. Early out pipelines wreak havoc, cause such scheduling nightmares that no sane inorder designer will touch them.

Well, not really. What you have described is a consequence of (inOrder &
genRegs), not inOrder alone.

Now if you take the genRegs out of the act (and change a few other
legacy things) you see:

> * RaW hazards have to be detected.
can't happen.
> * likewise for WaR hazards.
can't happen
> * likewise for WaW.
can't happen
> * Interrupts have to be masked during atomic operations (yuk).
no masking with optimistic atomics
> * LDs / ST hazards have to be dealt with.
no hazards with at-retire view
> * Branches have to be observed:
no stall with skid buffers
> * DIV and other FSM style blocking units are hard to integrate, as
they block dependent instructions.
no stall with exposed FSM

So, no stalls from this list. Yes, there are predict-miss and cache-miss
stalls, but with or without a scoreboard the OOOs have those too.

lkcl

May 24, 2019, 4:22:30 PM
On Saturday, May 25, 2019 at 4:10:11 AM UTC+8, Ivan Godard wrote:
> On 5/24/2019 12:38 PM, lkcl wrote:
>
> > Think of what has to be done in an inorder pipelined design.
> >
> > * RaW hazards have to be detected. Solution: stall.
> > * likewise for WaR hazards.
> > * likewise for WaW.
> > * Interrupts have to be masked during atomic operations (yuk).
> > * LDs / ST hazards have to be dealt with. Solution: stall.
> > * Branches have to be observed: the simplest solution: stall.
> > * DIV and other FSM style blocking units are hard to integrate, as they block dependent instructions. Solution: stall.
> >
> > etc etc. Note the fact that the inorder design *still has to detect the hazards*.
> >
> > This is unavoidable. The catch-all solution (stall) gets stale very quickly, and also impacts the design of the pipelines. Early out pipelines wreak havoc, cause such scheduling nightmares that no sane inorder designer will touch them.
>
> Well, not really. What you have described is a consequence of (inOrder &
> genRegs), not inOrder alone.
>

By genRegs do you mean general-purpose registers? I assumed this.

Yes of course if an inorder system has no general purpose registers (per se), or interestingly if the pipelines are extremely simple same cycle combinatorial blocks (ie are not pipelines at all), there *are* no register hazards as such.

Oh, I forgot: CSRs. Control Status Registers. The typical approach to these, particularly the ones that alter execution state, is worse than a stall: a *full pipeline flush* is typically needed!

Whereas I am envisioning that a 6600 style scoreboard and associated shadow augmentation could easily be turned to help with CSRs, by treating them as hazard-contentious material.

Anything that altered execution state that could be dangerous would be covered by shadows. Needs some more thought, just an idea at this stage.

L.

lkcl

May 24, 2019, 4:29:19 PM
On Saturday, May 25, 2019 at 4:10:11 AM UTC+8, Ivan Godard wrote:

> > * Branches have to be observed:
> no stall with skid buffers
> > * DIV and other FSM style blocking units are hard to integrate, as
> they block dependent instructions.
> no stall with exposed FSM

What's the complexity cost on each of these? Both in terms of understandability as well as implementation details.

Optimistic atomics sound scary as hell. Skid buffers I have not heard of so need to look up (itself not a good sign). Not heard the term "exposed FSM" used until today.

All of which kinda helps make my point, that an inorder design has to turn into a multi-headed hydra monster of disproportionate design complexity, in order (ha ha) to work around the fundamental limitations inherent to in-order execution.

L.

MitchAlsup

May 24, 2019, 5:15:27 PM
After a nice summary by Luke:
An important property of the SB is that calculation units don't have to be
pipelined--you just have to have enough of them to avoid stalling ISSUE.

>
> Early-out and multi-path pipelines are also perfectly safe, as there is a "Management" signalling layer. The ALU just has to say when it is done. The completion time is NOT relevant.
>
> FSM based ALUs are treated just like any of the other parallel units. Again: as the Dependency Matrices calculate the hazards, the blocking ALUs simply need to say when they have a result ready.
>
> 6600 scoreboards also come with register renaming *for free*. As in, it is an accidental inherent part of the algorithm. It does not need to be "added" and, ironically, cannot be removed, either.

Technically one is NOT renaming the registers*, just making sure that all
writes are in program order per register and that all reads to a written
register value must occur prior to the next write to the register. The
register is never actually renamed, the SB keeps track of which FU will
deliver the result this FU is going to consume.

(*) R7 is always referred to as R7. But the SB keeps track of the fact
that FU[7] needs to read R7 after FU[5] writes R7 and before FU[9] attempts
to write R7.
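
Roughly, in pseudo-code (an illustrative sketch of the bookkeeping only,
not any particular implementation):

class Scoreboard:
    def __init__(self, num_regs=32):
        self.producer = [None] * num_regs    # which FU will deliver each register's next value

    def issue(self, fu, srcs, dest):
        # the issuing FU waits on the FUs currently producing its sources,
        # not on a renamed physical register -- R7 is still just R7
        wait_on = {self.producer[r] for r in srcs if self.producer[r] is not None}
        self.producer[dest] = fu             # writes stay in program order per register
        return wait_on

    def deliver(self, fu, dest):
        if self.producer[dest] == fu:        # only the newest writer clears the entry
            self.producer[dest] = None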
>
> Mitch's augmentations add precise exceptions and branch speculation. It is not conceptually difficult (nor costly in terms of gates): an array of latches hooks into the "Commit" phase, preventing register write whilst still allowing result *creation* until the outcome of the exception or branch is known.

Many people believe that the most important thing about a SB is that it starts
instructions after resolving RAW hazards. This is incorrect. The important
thing about a SB is that it disallows completing instructions until WAR hazards
resolve.

Ivan Godard

May 24, 2019, 5:25:44 PM
On 5/24/2019 1:29 PM, lkcl wrote:
> On Saturday, May 25, 2019 at 4:10:11 AM UTC+8, Ivan Godard wrote:
>
>> > * Branches have to be observed:
>> no stall with skid buffers
>> > * DIV and other FSM style blocking units are hard to integrate, as
>> they block dependent instructions.
>> no stall with exposed FSM
>
> What's the complexity cost on each of these? Both in terms of understandability as well as implementation details.

Skid buffers: complexity trivial (capture FU outputs into ping-pong
buffers; restore on branch mispredict).

Exposed FSM: it's just code in the regular ISA; not even microcode

> Optimistic atomics sound scary as hell.

Transactional semantics have been around for decades; remember COBOL?
IBM has it in hardware; Intel tried it for a while but there was too
much mess trying to support both kinds of atomicity simultaneously (I
understand).

> Skid buffers I have not heard of so need to look up (itself not a good sign).

These address the delay between execution of a branch and execution of
the branched-to instruction, and incorporate any address arithmetic and the
confirmation of the executed branch against the prediction. This delay
seems to be typically two cycles, leading to delay-slot designs.

An alternative is to speculate down the predicted path and back up if
wrong, absorbing the back-up in the mis-predict overhead, but this
requires saving the state to back up to. As the amount of (potential)
back up is statically fixed, the pipeline can be arranged to save its
state at every cycle into a list of buffers (or shift regs) of length
equal to the branch realization delay. This are known as skid buffers,
analogous to a car skidding as it makes a hard stop.

Skid buffers do not work well if the rest of the design is not
idempotent, because any irreversible state changes would require special
handling to back up. Thus if the design uses genRegs it cannot be
permitted to overwrite a register during the brief speculation period;
that would require skid buffers for every register. However when all or
most actions during the speculation period are idempotent it is only
necessary to retain the transient mutable state, typically the FU outputs.
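
A toy software model of that capture/restore behaviour (my own sketch to
illustrate the idea, not Mill internals):

from collections import deque

class SkidBuffer:
    def __init__(self, depth=2):            # assumed: 2-cycle branch resolution delay
        self.history = deque(maxlen=depth)

    def capture(self, fu_outputs):
        # called every cycle with the transient mutable state (the FU outputs)
        self.history.append(dict(fu_outputs))

    def restore(self, cycles_back):
        # on a mispredict, hand back the state from `cycles_back` cycles ago
        return self.history[-cycles_back]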

The Mill hardware uses skid buffering in two places: by having the
physical belt be twice the size of the logical belt, and in the store
request stream in the hierarchy. There is actually one operation
(rescue) that can address into the belt skid buffers to recover
operands that would otherwise have fallen off the belt.

> Not heard the term "exposed FSM" used until today.

Just take the FSM out of the microcode and emulate it in the normal ISA
(probably augmented with a few helper operations). The DIV (etc) op's
code then interleaves/overlaps with everything else the machine width
could be doing from the rest of the code.
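
As a rough illustration of the idea (my own sketch of a restoring divide,
nothing Mill-specific), each step below is ordinary code that a scheduler
is free to interleave with unrelated work:

def udiv_exposed(dividend, divisor, bits=32):
    rem, quot = 0, 0
    for i in reversed(range(bits)):
        rem = (rem << 1) | ((dividend >> i) & 1)   # shift in the next dividend bit
        if rem >= divisor:                          # one exposed "divide step" per iteration
            rem -= divisor
            quot |= 1 << i
    return quot, rem

assert udiv_exposed(100, 7) == (14, 2)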

> All of which kinda helps make my point, that an inorder design has to turn into a multi headed hydra monster of design disproportionate complexity, in order (ha ha) to work around in-order inherent fundamental limitations.

I guess that's true if you try to tack in-order as an on-top-of addition
to the design. Try just using it as a way to discard complexity instead.
The Mill is stunningly simple once you wrap your head around everything
that we *don't* do :-)

I do appreciate your (or rather Mitch's) scoreboard design. It's a
clever way to get some parallelism from a one- or two-way-issue machine,
and far better than classic OOO in my (admittedly ignorant)
understanding. I suspect that it would be competitive with our Tin
configuration, which is wider but has to pay for static scheduling, and
the hardware cost in area and power would be similar.

It's not clear that it scales though.

MitchAlsup

May 24, 2019, 5:29:33 PM
I could quote myself: "You see: Mill really IS different".......

As to detecting and stalling--these were found to be "Not so bad" in the
1-wide IO early RISC machines. And found to be "Bad" in the 2-wide partially
ordered generation. By the time we were doing 4-6-wide OoO machines, the
only real problem was integrating FU busy into the Reservation Station
logic "Don't pick an instruction if the unit is busy". I, personally, did
not find this one difficult at all.

As to masking interrupts during ATOMIC events--I never found this necessary
--quite possibly because my ATOMIC stuff has <at least modest> forward progress
guarantees.

As to memory ordering--in my GBOoO machine we used a Memory Dependency Matrix
(similar to what Luke is doing for calculations) to resolve memory order
based on <lower order> address bit equality. We could detect "can't possibly be
the same" to resolve a conflict, and "ate" the ones we could not be sure of.
The very vast majority of the time, there were no conflicts and no delay in
delivering the memory value.

joshua.l...@gmail.com

May 25, 2019, 11:15:44 AM
On Friday, 24 May 2019 20:38:22 UTC+1, lkcl wrote:
> On Wednesday, May 1, 2019 at 1:52:53 PM UTC+8, joshua....@gmail.com wrote:
>
> > into smaller, distributed tables that are cheaper to access. CG-OoO
> > leverages compiler-level code optimizations to deliver efficient static
> > code and exploits dynamic block-level and instruction-level parallelism.
>
> Hm, this is a thumbs down for use in general purpose computing, unfortunately.
> Whilst yes, compiler optimisations can be expected, if the architecture is to take off, tying it to a compiler is a mistake.
> Different hardware implementors will do different things, and performance will be penalised because the implementation does not match precisely what the compiler wants.

I'll note that the requirements are fairly lightweight, see 3.1.2.

FWIW compiled code compatibility is a cardinal sin, and should never have been a thing.

> > does everything OK, and it circumvents a lot of the challenges of both OoO
> > processors and in-order processors by construction.
>
> This is a perspective that, having begun my first core as a 6600-based engine (with thanks to Mitch for being exceptionally patient and helpful), I am finding it much easier both to understand as well as to work with, even compared to standard pipelined inorder designs.

I don't know enough about this to judge Ivan's “It's a
clever way to get some parallelism from a one- or two-way-issue machine” comment, but if so it's not really a valid alternative. CG-OoO is

joshua.l...@gmail.com

May 25, 2019, 11:19:02 AM
> clever way to get some parallelism from a one- or two-way-issue machine” comment, but if so it's not really a valid alternative. CG-OoO is...

...competing with today's wide OoO cores.

lkcl

May 25, 2019, 2:44:24 PM
On Saturday, May 25, 2019 at 5:25:44 AM UTC+8, Ivan Godard wrote:
> On 5/24/2019 1:29 PM, lkcl wrote:
> > On Saturday, May 25, 2019 at 4:10:11 AM UTC+8, Ivan Godard wrote:
> >
> >> > * Branches have to be observed:
> >> no stall with skid buffers
> >> > * DIV and other FSM style blocking units are hard to integrate, as
> >> they block dependent instructions.
> >> no stall with exposed FSM
> >
> > What's the complexity cost on each of these? Both in terms of understandability as well as implementation details.
>
> Skid buffers: complexity trivial (capture FU outputs into ping-pong
> buffers; restore on branch mispredict.

I found a different use of skid buffers, to do with cache lines, glad you clarified.

Tricks like this - considered workarounds in the software world - are precisely why I do not like inorder designs.

The augmented 6600 scoreboard has enough information to actually interleave both paths (currently implementing), cancelling one entire path once the branch is known.

This may even be nested indefinitely (multiple branches, multiple paths).

The Function Units basically *are* the skid buffers (for free).

> Exposed FSM: it's just code in the regular ISA; not even microcode

Yuk :) again, application recompilation.

> > Optimistic atomics sound scary as hell.
>
> Transactional semantics have been around for decades; remember COBOL?
> IBM has it in hardware; Intel tried it for a while but there was too
> much mess trying to support both kinds of atomicity simultaniously (I
> understand)

Need to do a bit more research before being able to assess.

> > Skid buffers I have not heard of so need to look up (itself not a good sign).
>
> These address the delay between execution of a branch and execution of
> the branched-to, and incorporates any address arithmetic and the
> confirmation of the executed branch against the prediction. This delay
> seems to be two cycles typical, leading to delay-slot designs.

See above. Sounds complex.

> An alternative is to speculate down the predicted path and back up if
> wrong, absorbing the back-up in the mis-predict overhead, but this
> requires saving the state to back up to

And in an inorder design that gets more and more complex, doesn't it?

Whereas in the augmented 6600 design all you need is a latch per Function Unit per branch to be speculated, plus a few gates per each FU.

These gates hook into the "commit" phase, preventing register write (on all shadowed instructions), so no damage may occur whilst waiting for the branch computation to take place. It hooks *directly* into *already existing* write hazard infrastructure basically.

Fail causes the instruction to self destruct.

Success frees up the write hazard.

It's real simple, given that the infrastructure is already there.
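
In outline, a toy model of that shadow gating (my own sketch, not the actual code):

class ShadowedOp:
    def __init__(self, unresolved_branches):
        self.shadows = set(unresolved_branches)   # branches this op is speculated past
        self.cancelled = False

    def branch_resolved(self, branch, prediction_was_correct):
        if branch in self.shadows:
            if prediction_was_correct:
                self.shadows.discard(branch)      # success: write hazard freed up
            else:
                self.cancelled = True             # fail: instruction self-destructs

    def may_write_register(self):
        # result *creation* is never blocked; only the commit (register write) is
        return not self.cancelled and not self.shadows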

> As the amount of (potential)
> back up is statically fixed, the pipeline can be arranged to save its
> state at every cycle into a list of buffers (or shift regs) of length
> equal to the branch realization delay. This are known as skid buffers,
> analogous to a car skidding as it makes a hard stop.
>
> Skid buffers do not work well if the rest of the design is not
> idempotent, because any unreversible state changes would require special
> handling t back up. Thus if the design uses genRegs it cannot be
> permitted to overwrite a register during the brief speculation period;
> that would require skid buffers for every register.

And that is precisely and exactly what the 6600-style scoreboard does: it provides the direct equivalent of a skid buffer for every single register or other object it is designed to protect and manage.


> However when all or
> most actions during the speculation period are idempotent it is only
> necessary to retain the transient mutable state, typically the FU outputs.

Yes. The Computation Units (one per FU) have register latches and also an FU output latch.

If you want more speculation or more "free hidden reg renaming", you just add more FUs and load them up more (multi issue)

> The Mill hardware uses skid buffering in two places: by having the
> physical belt be twice the size of the logical belt, and in the store
> request stream in the hierarchy. There is actually one operation
> (rescue) that can address into the belt skid buffers to recover
> operands that would otherwise have fallen off the belt.
>
> > Not heard the term "exposed FSM" used until today.
>
> Just take the FSM out of the microcode and emulate it in the normal ISA
> (probably augmented with a few helper operations). The DIV (etc) op's
> code then interleaves/overlaps with everything else the machine width
> could be doing from the rest of the code.
>
> > All of which kinda helps make my point, that an inorder design has to turn into a multi headed hydra monster of design disproportionate complexity, in order (ha ha) to work around in-order inherent fundamental limitations.
>
> I guess that's true if you try to tack in-order as an on-top-of addition
> to the design. Try just using it as a way to discard complexity instead.
> The Mill is stunningly simple once you wrap your head around everything
> that we *don't* do :-)

:)

The one thing that I really miss from genRegs standard instruction sets is "tags" that say that a register is to be discarded after use.

Context, basically.

RISCV tries adding something like this, using FENCE instructions.

> I do appreciate your (or rather Mitch's) scoreboard design. It's a
> clever way to get some parallelism from a one- or two-way-issue machine,

I'm working on a way to get up to 4 issue, and 8 issue Vector Processing should be easy (due to the regular nature of the vectors).

Multi issue "the simple way" is I believe doable by detecting whether the src or dest regs overlap. In any given group of to-be-multi-issued instructions, at the first sign of an overlap, STOP at the instruction that would conflict. Let that be the point at which the *NEXT* cycle begins, and let the Dep Matrices sort it out (in that next cycle).

So with only non-overlapping instructions entering the Dependency Matrices in parallel, no conflicting paths can occur: the Matrices will not see a conflict in either a row *or* a column, because only non-overlapping regs were allowed in to begin with.

Quite simple and effective (and not scalable, see below)
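To make that concrete, here is a rough software sketch of the cutoff (names and data layout are purely illustrative, not the actual implementation): walk the candidates in program order, accumulating their register sets, and stop at the first one that overlaps the group so far.

def issue_group(candidates, max_issue=4):
    """candidates: list of (src_regs, dest_regs) as sets, in program order."""
    srcs_so_far, dests_so_far = set(), set()
    group = []
    for srcs, dests in candidates[:max_issue]:
        raw = srcs & dests_so_far          # read-after-write overlap
        war = dests & srcs_so_far          # write-after-read overlap
        waw = dests & dests_so_far         # write-after-write overlap
        if raw or war or waw:
            break                          # conflict: the next cycle starts here
        group.append((srcs, dests))
        srcs_so_far |= srcs
        dests_so_far |= dests
    return group                           # these enter the Matrices together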

It should be easy to see that vector instructions do not conflict as much, and the logic required to detect longer runs should also be simpler than for genRegs.

To resolve (preserve) the instruction commit order, I have hooked into the *preexisting* branch/precise/shadow system, using it in effect to create a 2D bitmatrix version of a ROB / cyclic instruction buffer.

Working out if an instruction may commit in this new multi-commit context is a matter of keeping a counter per instruction.

Increment every issued instruction by 1 when a new instruction is added. Decrement when an instruction is retired.

It should be evident that in a multi issue context, "retire possible" is a simple comparison "if counter less than or equal to number of instructions allowed to commit".

In this way several instructions can be retired even though they have this bitmatrix pseudo-linkedlist which would otherwise only allow 1 of them to retire in any given clock.
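As a rough sanity check, here is one possible software reading of the counter scheme (the exact increment/decrement convention here is my assumption, taking the counter as the number of *older* still-unretired instructions; the real implementation may differ):

class InFlight:
    def __init__(self, older_unretired):
        self.counter = older_unretired   # set when the instruction is issued
        self.done = False                # result computed, waiting to retire

def issue(rob):
    op = InFlight(older_unretired=len(rob))   # counter = #instructions in flight
    rob.append(op)
    return op

def retire(rob, commit_width=4):
    rob.sort(key=lambda op: op.counter)  # oldest first
    retiring = []
    for op in rob:
        if not op.done or op.counter >= commit_width:
            break                        # only a contiguous oldest group retires
        retiring.append(op)
    for op in retiring:
        rob.remove(op)
    for op in rob:
        op.counter -= len(retiring)      # everyone remaining moves up
    return retiring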


> and far better than classic OOO in my (admittedly ignorant)
> understanding.

State of the art revolves around Tomasulo. Multi issue Tomasulo is a pig to implement.

> I suspect that it would be competitive with our Tin
> configuration, which is wider but has to pay for static scheduling, and
> the hardware cost in area and power would be similar.
>
> It's not clear that it scales though.

Going into genRegs 6 or 8 issue territory (vectors are ok) would need a much more complex register overlap detection and mitigation algorithm than I am prepared to investigate at this very early stage of development.

Up to 3 or 4 however really should not be hard, and not a lot of cost, i.e. the simple reg overlap detection should give a reasonably high bang for the buck, only starting to become ineffective above 4-instruction issue.

Schemes for going beyond that I have only vaguely dreamed up; I expect a lot of combinatorial ripple through the Matrices, which gives me concern about the impact on max clock rate.

Overall, then, the summary is that all the tricks an inorder pipelined general-purpose-register design has to deploy really do have to be done, and at least once each. Going beyond once for each "trick" (skid buffering) is so hairy that it is rarely done. Early-out pipelining messes with the scheduling so badly that nobody considers it.

By contrast, with the augmented 6600 design, all the tricks are still there: it's just that with the Dependency Matrices taking care of identifying hazards (all hazards), ALL units are free and clear to operate not only in parallel but also on a time completion schedule of their own choosing. WITHOUT stalling.

L.

lkcl

unread,
May 25, 2019, 2:56:33 PM5/25/19
to
It is an isolated, self-contained algorithmic problem which is easily defined and has no need to take side effects into account.

With multiple Function Units being able to accept instructions, a Priority Picker based on which ones are still free should be perfectly sufficient.

Of course, now that it is a MULTI priority picker ("give me the N FUs out of M total as long as they are not busy"), it is a leeeetle more involved.

Yet, at the same time, it is the sort of thing that would make an excellent 1st year EE Exam question.
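A rough software rendering of that picker (purely illustrative; in hardware it would be N chained priority pickers, each one masking out the grant of the previous):

def multi_priority_pick(free_mask, n):
    """free_mask: integer bitmask, bit i set when FU i is free.
    Returns the indices of up to n free FUs, lowest index first."""
    grants = []
    for _ in range(n):
        if free_mask == 0:
            break
        lowest = free_mask & -free_mask        # isolate the lowest set bit
        grants.append(lowest.bit_length() - 1)
        free_mask &= ~lowest                   # remove it for the next pick
    return grants

# e.g. FUs 1, 3, 4 and 6 free, two grants wanted -> FUs 1 and 3
assert multi_priority_pick(0b01011010, 2) == [1, 3]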

> As to masking interrupts during ATOMIC events--I never found this necessary
> --quite possibly because my ATOMIC stuff has <at least modest> forward progress
> guarantees.

Sigh, atomics I have not gotten into yet. Memory atomics I am hoping can hook into the LDST Matrix, and for RISCV LR/SC I believe the "allowed to fail" rule should come to the rescue.

> As to memory ordering--in my GBOoO machine we used a Memory Dependency Matrix
> (similar to what Luke is doing for calculations) to resolve memory order
> based on <lower order> address bit equality. We could detect "can't possibly be
> the same" to resolve a conflict, and "ate" the ones we could not be sure of.

So, basically, if the lower order bits match, then it *might* be the same address (in the high bits), so rather than risk it, treat it as a clash anyway?

This would save a LOT by not having to do a huge number of full bitwise address compares, all in parallel.

> The very vast majority of the time, there were no conflicts and no delay in
> delivering the memory value.

Cool. Will definitely use that trick :)
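For my own notes, a minimal sketch of the check as I understand it (the page size and access granularity below are my assumptions, not from Mitch's design): equal low bits means "might alias, treat as a clash"; unequal means "provably different, no ordering needed".

PAGE_OFFSET_BITS = 12          # assuming 4 KiB pages (untranslated offset bits)
GRANULE_SHIFT    = 3           # assuming naturally aligned 8-byte accesses

def may_conflict(addr_a, addr_b):
    mask = (1 << PAGE_OFFSET_BITS) - 1
    # compare only the low-order, untranslated bits, at access granularity;
    # false positives merely delay the access, false negatives cannot happen
    return (addr_a & mask) >> GRANULE_SHIFT == (addr_b & mask) >> GRANULE_SHIFT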

L.

lkcl

unread,
May 25, 2019, 3:09:54 PM5/25/19
to
On Saturday, May 25, 2019 at 11:15:44 PM UTC+8, joshua....@gmail.com wrote:

> FWIW compiled code compatibility is a cardinal sin, and should never have been a thing.

Sigh.. it should not have been, but the reality is that in the x86 (Windows) and ARM (Android) worlds, it is.

Android being based on Java, it was supposed not to be a thing; however, Java is so bad that everyone ships native ARM binaries.

Apple fascinatingly has gone over to LLVM, presumably in an extremely long strategic plan to move away from dependence on ARM *or* x86.

They learned the lesson I think from the pain of the transition to and from PowerPC, and now can have the same apps run on desktop as on mobile.

> I don't know enough about this to judge Ivan's “It's a
> clever way to get some parallelism from a one- or two-way-issue machine” comment, but if so it's not really a valid alternative. CG-OoO is

You will have seen my reply to Ivan just before this, I will investigate how to go beyond 4 issue in the next iteration, it is already so much to do.

Vector processing should easily manage at least 8 issue, potentially even 16 issue on 16-bit values, because whilst the frontend is vector, the backend is SIMD ALUs.

This is starting to go a bit beyond a traditional general purpose processor and out of scope of this thread, so I will not elaborate further.

Point being, Ivan may have reasonably assumed that all 6600 style designs are 1 or 2 issue, because that's what the original 6600 was, back in the 1960s: single issue.

L.

MitchAlsup

unread,
May 25, 2019, 3:27:10 PM5/25/19
to
Multi-issue should begin with the current state of the Read reservations and
of the write reservations (and the FU_busy).

As each instruction is considered for issue, you take its read reservations
and OR it onto the current read reservations, and likewise for the write
reservations.

Thus, by the time you decide to issue {1,2,3,4,5} you already HAVE the
read and write reservations for that FU you are issuing into transitively
through the whole set of issuing instructions.

>
> So with only non-overlapping instructions going into the Dependency Matrices in parallel, no conflicting paths will occur because the Matrices will not have any conflicts occur in either a row *or* a column, because you already made sure that would not happen by only allowing non-overlapping regs to enter the Matrices.
>
> Quite simple and effective (and not scalable, see below)
>
> It should be logically easy to notice that vector instructions should not conflict as much, and the logic required to detect longer runs should also be easier than for genRegs.
>
> To resolve (preserve) the instruction commit order, I have hooked into the *preexisting* branch/precise/shadow system, using it in effect to create a 2D bitmatrix version of a ROB / cyclic instruction buffer.
>
> Working out if an instruction may commit in this new multi-commit context is a matter of keeping a counter per instruction.
>
> Increment every issued instruction by 1 when a new instruction is added. Decrement when an instruction is retired.
>
> It should be evident that in a multi issue context, "retire possible" is a simple comparison "if counter less than or equal to number of instructions allowed to commit".
>
> In this way several instructions can be retired even though they have this bitmatrix pseudo-linkedlist which would otherwise only allow 1 of them to retire in any given clock.
>
>
> > and far better than classic OOO in my (admittedly ignorant)
> > understanding.
>
> State of the art revolves around Tomasulo. Milti issue Tomasulo is a pig to implement.
>
> > I suspect that it would be competitive with our Tin
> > configuration, which is wider but has to pay for static scheduling, and
> > the hardware cost in area and power would be similar.
> >
> > It's not clear that it scales though.
>
> Going into genRegs 6 or 8 issue territory (vectors are ok) would need a much more complex register overlap detection and mitigation algorithm than I am prepared to investigate at this very early stage of development.
>
> Up to 3 or 4 however really should not be hard, and not a lot of cost ie the simple reg overlap detection should give a reasonably high bang for the buck, only starting to be ineffective above 4 instruction issue.

1-issue just uses the current read and write reservations
2-issue uses the current for inst 1 and current OR inst 1 reservations for
inst 2.
3-issue uses the current for inst 1 and current OR inst 1 reservations for
inst 2 and current OR inst 1&2 reservations for inst 3
4-issue uses the current for inst 1 and current OR inst 1 reservations for
inst 2 and current OR inst 1&2 reservations for inst 3 and current
OR inst 1&2&3 for inst 4
5-issue uses the current for inst 1 and current OR inst 1 reservations for
inst 2 and current OR inst 1&2 reservations for inst 3 and current
OR inst 1&2&3 for inst 4 and current OR inst 1&2&3&4 for inst 5

Thus, 5-issue is only 1 gate <delay> harder than 1-issue over where one is
keeping track of data and control flow dependencies.
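In software terms, roughly (the bitmask representation and names here are illustrative only): each instruction in the issue group is handed the current reservations OR'd with those of every earlier instruction in the same group, so slot k costs just one extra OR over slot k-1.

def issue_reservations(insns, cur_reads, cur_writes):
    """insns: list of (read_mask, write_mask) register bitmasks, program order.
    cur_reads/cur_writes: reservations already held by in-flight FUs.
    Returns the (reads, writes) each instruction is issued against, plus
    the accumulated reservations after the whole group."""
    per_insn = []
    reads, writes = cur_reads, cur_writes
    for rd, wr in insns:
        per_insn.append((reads, writes))   # what this issue slot must respect
        reads |= rd                        # accumulate transitively for the
        writes |= wr                       # next slot in the group
    return per_insn, reads, writes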

MitchAlsup

unread,
May 25, 2019, 3:29:57 PM5/25/19
to
Yes, we used the bits that were not even translated by the TLB to make this
determination. This also avoids TLB page aliasing issues.

MitchAlsup

unread,
May 25, 2019, 3:33:40 PM5/25/19
to
On Saturday, May 25, 2019 at 2:09:54 PM UTC-5, lkcl wrote:
> On Saturday, May 25, 2019 at 11:15:44 PM UTC+8, joshua....@gmail.com wrote:
>
> > FWIW compiled code compatibility is a cardinal sin, and should never have been a thing.

All you have to do is have everybody give you their source code...........
>
> Sigh.. ahould not have, reality is, in x86 (windows) and ARM (android) worlds, they are.

I would not write off backwards compatibility so lightly.
>
> Android being based on java it was supposed not to be a thing, however java is so bad everyone does native ARM binaries.
>
> Apple fascinatingly has gone over to LLVM, presumably in an extremely long strategic plan to move away from dependence on ARM *or* x86.
>
> They learned the lesson I think from the pain of the transition to and from PowerPC, and now can have the same apps run on desktop as on mobile.

They can do this, they have the source code.
>
> > I don't know enough about this to judge Ivan's “It's a
> > clever way to get some parallelism from a one- or two-way-issue machine” comment, but if so it's not really a valid alternative. CG-OoO is
>
> You will have seen my reply to Ivan just before this, I will investigate how to go beyond 4 issue in the next iteration, it is already so much to do.

Perhaps you should look at chapter 9 where I show the packet cache FETCH
organization. Not only do you get wide instruction issue, but you get
search-free multiple taken branches per cycle.

joshua.l...@gmail.com

unread,
May 25, 2019, 5:28:00 PM5/25/19
to
On Saturday, 25 May 2019 20:33:40 UTC+1, MitchAlsup wrote:
>
> All you have to do is have everybody give you their source code...........

This is one of those problems that's only hard because
we've built a ton of stuff that assumes it's hard.
You don't need to ship all that much metadata to compile
to all but the most foreign of architectures; WASM
is already pretty close and that has incredibly little,
plus suffers from tight compile-time deadlines. Designed
right the performance penalty should be <1% I expect.

John Dallman

unread,
May 25, 2019, 5:48:31 PM5/25/19
to
In article <8f5961be-6303-42e3...@googlegroups.com>,
joshua.l...@gmail.com () wrote:

> You don't need to ship all that much metadata to compile
> to all but the most foreign of architectures; WASM
> is already pretty close and that has incredibly little,
> plus suffers from tight compile-time deadlines. Designed
> right the performance penalty should be <1% I expect.

However, if you're trying to optimise code at all hard, you start hitting
*bugs* in optimizers.

So it's necessary to test with every combination of compiler version and
architecture that your customers will run with. That's an N x M scale of
problem. Whereas if you compile it yourself, you can freeze the compiler
version (this takes some effort these days, but remains practical), so it's
only an N-scale problem.

This approach will actually be practical, for code that needs a reasonable
fraction of the achievable performance, once a method of producing compilers
with bug frequencies at least two orders of magnitude better than current
practice has been demonstrated and proven.
I'm not expecting this any time soon.

John

joshua.l...@gmail.com

unread,
May 25, 2019, 5:59:02 PM5/25/19
to
This argument would imply Java, Javascript, etc. are infeasible. In practice the architecture-specific parts are rarely the source of bugs.

Ivan Godard

unread,
May 25, 2019, 7:00:33 PM5/25/19