chained multi-issue reg-renaming in the same clock cycle: is it possible?

luke.l...@gmail.com

unread,

Apr 15, 2023, 4:30:00 AM4/15/23

to

an interesting question has come up where i don't know
if it's even possible (in reasonable gate propagation time
given how large Dependency Matrices can get).

the scenario is:

* a very small loop (LD-COMPUTE-ST-BRcond) of 4 ops
* each instruction (for the sake of simplicity) is Scalar
* the multi-issue width is 12 to 16

the desire here is to obviously fit *four* simultaneous
copies of that loop into the same clock cycle, but to
do so requires *THREE* Write-after-Write register
renames in a row (four on the next clock cycle if one
is still "in the system").

my question is: is this even possible, without having to
go to massive-wide register operations as a work-around?

my current understanding of how WaW reg-rename
works is that it would be perfectly possible (without
huge gate cascades) to do at least one reg-renaming
in the same clock cycle, but that only gets 50%
utilisation.

has anyone encountered or attempted this before, or
have any good ideas? i know VVM auto-vectorisation
would solve this because the LOOP construct marks
the LD-COMPUTE-ST loop-counter, load, and store
registers such that hazards can be dropped on those.

any other ideas?

l.

EricP

unread,

Apr 15, 2023, 11:02:22 AM4/15/23

to

I'm not sure what you are asking as rename is a separate
stage from execute.

Does "three write-after-write register renames" mean
three concurrent back-to-back result forwarding's,
whereby results are forwarded from uOp to uOp triggering an
immediate wake-up & launch of dest uOps with no intervening clocks?

Anton Ertl

unread,

Apr 15, 2023, 11:13:43 AM4/15/23

to

"luke.l...@gmail.com" <luke.l...@gmail.com> writes:
>* a very small loop (LD-COMPUTE-ST-BRcond) of 4 ops
>* each instruction (for the sake of simplicity) is Scalar
>* the multi-issue width is 12 to 16
>
>the desire here is to obviously fit *four* simultaneous
>copies of that loop into the same clock cycle, but to
>do so requires *THREE* Write-after-Write register
>renames in a row (four on the next clock cycle if one
>is still "in the system").
>
>my question is: is this even possible, without having to
>go to massive-wide register operations as a work-around?

Issue width 12 has not been demonstrated in a register-renaming CPU
yet. The Zen 4 register renamer can process 6 macro-ops per cycle,
and Intel's register renamer can process 6 macro-ops IIRC. The
numbers I remember for ARM and Apple (but not sure which structure
they refer to) are in the 7-9 range.

>my current understanding of how WaW reg-rename
>works is that it would be perfectly possible (without
>huge gate cascades) to do at least one reg-renaming
>in the same clock cycle, but that only gets 50%
>utilisation.

I never heard of or experienced any limitations in that area. A
slightly different, but probably related problem: Both Intel and AMD
manage to process a chain of 6 *dependent* mov reg,reg instructions in
a cycle into nothing in the renamer.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

luke.l...@gmail.com

unread,

Apr 15, 2023, 12:41:59 PM4/15/23

to

On Saturday, April 15, 2023 at 4:02:22 PM UTC+1, EricP wrote:

> I'm not sure what you are asking as rename is a separate
> stage from execute.

indeed. it's usually performed just after Decode phase

> Does "three write-after-write register renames" mean
> three concurrent back-to-back result forwarding's,
> whereby results are forwarded from uOp to uOp triggering an
> immediate wake-up & launch of dest uOps with no intervening clocks?

no.

three write-after-write register renames need (in the scenario
i am envisioning) to occur right there, right then, just after
Decode, having been detected as part of the Hazard Dependency
Matrix setup.

in one issue-batch:

op1 | op2 | op3 | op4 | op5 | op6
LD r0, r1| ADD r0,r0,1 | ST r0,r1 | LD r0, r1 | ADD r0,r0,1 | ST r0,r1

those need to *not* stall just because r0 is loaded calced
stored twice in the same clock cycle.

a *single* reg-rename (on r0 and r1) is perfectly reasonably achievable.

but *multiple* reg-renames in the same clock cycle, to avoid the
Write-after-Write Hazards? that's what i don't know is possible.

l.

luke.l...@gmail.com

unread,

Apr 15, 2023, 12:45:57 PM4/15/23

to

On Saturday, April 15, 2023 at 4:13:43 PM UTC+1, Anton Ertl wrote:

> Issue width 12 has not been demonstrated in a register-renaming CPU
> yet.

the M1 has 8, and sustains 100% throughput.

> The Zen 4 register renamer can process 6 macro-ops per cycle,
> and Intel's register renamer can process 6 macro-ops IIRC. The
> numbers I remember for ARM and Apple (but not sure which structure
> they refer to) are in the 7-9 range.

good to know. ah, i know: inline loop-unrolled assembler would
hit a reg-renamer pretty hard even on 8-wide OoO issue.

> >my current understanding of how WaW reg-rename
> >works is that it would be perfectly possible (without
> >huge gate cascades) to do at least one reg-renaming
> >in the same clock cycle, but that only gets 50%
> >utilisation.
> I never heard of or experienced any limitations in that area. A
> slightly different, but probably related problem: Both Intel and AMD
> manage to process a chain of 6 *dependent* mov reg,reg instructions in
> a cycle into nothing in the renamer.

this would not surprise me, they would be forced to because the
number of regs in x86 is so small, i heard that Intel actually does
RISC micro-coding at the back-end, translating x86 into an internal
RISC ISA.

l.

luke.l...@gmail.com

unread,

Apr 15, 2023, 12:50:35 PM4/15/23

to

On Saturday, April 15, 2023 at 4:02:22 PM UTC+1, EricP wrote:

> Does "three write-after-write register renames" mean
> three concurrent back-to-back result forwarding's,
> whereby results are forwarded from uOp to uOp triggering an
> immediate wake-up & launch of dest uOps with no intervening clocks?

a much simpler example:

QTY 6of loop-unrolled repetitions of "ADD r0, r0, 1" on an 8-wide
multi-issue OoO system. these should all be renamed:

1st ADD reads r0, stores in r0.1 (temporary, Reservation Station)
2nd ADD reads r0.1, stores in r0.2
...
6th ADD reads r0.5, stores in r0

but although the ADDs themselves should be pipeline-chained,
actually getting them *into* the Reservation Stations at the
Decode Phase (and moving on to the next batch of 8
fetched instructions) should take *one* clock cycle, not 4 or 5.

l.

Scott Lurndal

unread,

Apr 15, 2023, 1:39:30 PM4/15/23

to

"luke.l...@gmail.com" <luke.l...@gmail.com> writes:
>On Saturday, April 15, 2023 at 4:13:43=E2=80=AFPM UTC+1, Anton Ertl wrote:
>
>> Issue width 12 has not been demonstrated in a register-renaming CPU=20
>> yet.=20

>
>the M1 has 8, and sustains 100% throughput.
>

>> The Zen 4 register renamer can process 6 macro-ops per cycle,=20
>> and Intel's register renamer can process 6 macro-ops IIRC. The=20
>> numbers I remember for ARM and Apple (but not sure which structure=20

>> they refer to) are in the 7-9 range.
>
>good to know. ah, i know: inline loop-unrolled assembler would
>hit a reg-renamer pretty hard even on 8-wide OoO issue.
>

>> >my current understanding of how WaW reg-rename=20
>> >works is that it would be perfectly possible (without=20
>> >huge gate cascades) to do at least one reg-renaming=20
>> >in the same clock cycle, but that only gets 50%=20
>> >utilisation.
>> I never heard of or experienced any limitations in that area. A=20
>> slightly different, but probably related problem: Both Intel and AMD=20
>> manage to process a chain of 6 *dependent* mov reg,reg instructions in=20
>> a cycle into nothing in the renamer.=20

>
>this would not surprise me, they would be forced to because the
>number of regs in x86 is so small,

There are quite a few more in x86_64.

EricP

unread,

Apr 16, 2023, 3:03:36 PM4/16/23

to

luke.l...@gmail.com wrote:
> On Saturday, April 15, 2023 at 4:02:22 PM UTC+1, EricP wrote:
>
>> Does "three write-after-write register renames" mean
>> three concurrent back-to-back result forwarding's,
>> whereby results are forwarded from uOp to uOp triggering an
>> immediate wake-up & launch of dest uOps with no intervening clocks?
>
> a much simpler example:
>
> QTY 6of loop-unrolled repetitions of "ADD r0, r0, 1" on an 8-wide
> multi-issue OoO system. these should all be renamed:
>
> 1st ADD reads r0, stores in r0.1 (temporary, Reservation Station)
> 2nd ADD reads r0.1, stores in r0.2

> ....

> 6th ADD reads r0.5, stores in r0
>
> but although the ADDs themselves should be pipeline-chained,
> actually getting them *into* the Reservation Stations at the
> Decode Phase (and moving on to the next batch of 8
> fetched instructions) should take *one* clock cycle, not 4 or 5.
>
> l.

One answer is brute force: each parallel Rename lane propagates its
changes to higher lanes. Each higher lane checks if any of its source
Architecture Register Numbers (ARN) is a dest ARN of a lower lane,
and if so use that lower lane's new dest rename assignment.

Lane-1 checks if its source ARN's equal lane-0 dest ARN,
Lane-2 checks lane-0, lane-1, Lane-3 checks lane-0, lane-1, lane-2.
The gate cost is 1+2+3... = (N2+N)/2 = O(N2) the number of lanes,
and the max propagation delay is serial across the rename lanes.

Assuming the instructions have two source and one dest ARN,
and assuming the rename state table is an ARN-indexed SRAM,
and assuming a classic design of ROB + commit Architecture Register File,
then the rename table needs 3R1W ports for each rename lane
plus write ports to track each result write and ARN commit.
And bag full of logic ensures that the rename state table is updated
with correct value for renaming multiple dest ARN's in the same clock.
(Rename state table checkpoint and restore cost extra.)

EricP

unread,

Apr 16, 2023, 3:11:12 PM4/16/23

to

EricP wrote:
>
> Lane-1 checks if its source ARN's equal lane-0 dest ARN,
> Lane-2 checks lane-0, lane-1, Lane-3 checks lane-0, lane-1, lane-2.
> The gate cost is 1+2+3... = (N2+N)/2 = O(N2) the number of lanes,

= (N^2+N)/2 = O(N^2) the number of lanes

(the ^ got lost somewhere).

luke.l...@gmail.com

unread,

Apr 16, 2023, 5:53:56 PM4/16/23

to

On Sunday, April 16, 2023 at 8:03:36 PM UTC+1, EricP wrote:

> One answer is brute force: each parallel Rename lane propagates its
> changes to higher lanes. Each higher lane checks if any of its source
> Architecture Register Numbers (ARN) is a dest ARN of a lower lane,
> and if so use that lower lane's new dest rename assignment.

all the information about which are reads and which writes being easy to decode

> Lane-1 checks if its source ARN's equal lane-0 dest ARN,
> Lane-2 checks lane-0, lane-1, Lane-3 checks lane-0, lane-1, lane-2.
> The gate cost is 1+2+3... = (N2+N)/2 = O(N2) the number of lanes,
> and the max propagation delay is serial across the rename lanes.

ahh... because it's transitive, it's serial?

instruction 2 depends on instruction 1
instruction 3 depends on 2 and 1
...
...

> Assuming the instructions have two source and one dest ARN,
> and assuming the rename state table is an ARN-indexed SRAM,
> and assuming a classic design of ROB + commit Architecture Register File,
> then the rename table needs 3R1W ports for each rename lane
> plus write ports to track each result write and ARN commit.

which isn't massively expensive or unachievable.

> And bag full of logic ensures that the rename state table is updated
> with correct value for renaming multiple dest ARN's in the same clock.
> (Rename state table checkpoint and restore cost extra.)

awesome.

eric i really appreciate your insights. thank you.

l.

BGB

unread,

Apr 16, 2023, 6:07:58 PM4/16/23

to

On 4/15/2023 3:29 AM, luke.l...@gmail.com wrote:
> an interesting question has come up where i don't know
> if it's even possible (in reasonable gate propagation time
> given how large Dependency Matrices can get).
>
> the scenario is:
>
> * a very small loop (LD-COMPUTE-ST-BRcond) of 4 ops
> * each instruction (for the sake of simplicity) is Scalar
> * the multi-issue width is 12 to 16
>

Possible: Probably...
But clock speed is gonna suck...

Not sure if/how OoO machines avoid the issue where increasing width
leads to massively expanding cost.

IOW: Why my BJX2 core is 3-wide, but why I effectively shelved ideas for
trying to go wider for the time being:
1=>2 or 2=>3 are arguably modest jumps;
But, 3=>4 or beyond, stuff starts to get a bit unreasonable.

As far as I know, there is no real good way to avoid the x^2 cost-curve
of widening the register file and similar.

> the desire here is to obviously fit *four* simultaneous
> copies of that loop into the same clock cycle, but to
> do so requires *THREE* Write-after-Write register
> renames in a row (four on the next clock cycle if one
> is still "in the system").
>
> my question is: is this even possible, without having to
> go to massive-wide register operations as a work-around?
>
> my current understanding of how WaW reg-rename
> works is that it would be perfectly possible (without
> huge gate cascades) to do at least one reg-renaming
> in the same clock cycle, but that only gets 50%
> utilisation.
>
> has anyone encountered or attempted this before, or
> have any good ideas? i know VVM auto-vectorisation
> would solve this because the LOOP construct marks
> the LD-COMPUTE-ST loop-counter, load, and store
> registers such that hazards can be dropped on those.
>
> any other ideas?
>

If you have a lot of registers, and leave all the scheduling up to the
compiler, one does not need register renaming.

Making the compiler "not suck" is difficult though.

I can often get pretty large speedups by writing things in ASM in my
case; contrast with some other ISA's where ASM doesn't accomplish much.

My past (small scale) experiments with trying to generate code for ARM
yielded atrociously bad performance (so, there is still some "secret
sauce" that seems to be missing in my case).

But, it could be that these are "two sides to the same coin".

I am not aware of any big "smoking gun" issue for why my code generators
suck so bad, seemingly it is all "death by 1000 paper cuts" style issues.

But, in other news:
Had worked on hacking my FM Synth module design to do PCM mixing;
Modified my MIDI backend to effectively do wavetable music synthesis (in
a hacky way), but theoretically could also now play S3M music and
similar in hardware (assuming my Verilog tweaks for this work, still
untested).

Ended up mostly going with 11025 A-Law for the wavetable patches for
now, but seemingly this does effect the quality of the sound some
(better than 8kHz; 16kHz would have left the wavetable "a bit large").

In this case, ended up putting the wavetable in a package based on the
Doom WAD (IWAD) format; with "SND_xxxx" for the wavetable patches, and
"PATCHIDX" for a combined index of audio patches (as a single big table
covering all of the MIDI instruments). Design was a bit "quick and
dirty". The WAD lumps are then copied into "audio RAM".

Not entirely clear as of yet (from the API design sense) how it will
accommodate something like an S3M player. Same basic MIDI
command-structure would work, but would need a "good" way for the client
program to submit customized instrument patches.

Though, possibly would use a combined instrument number space, say:
0..127: Normal MIDI Instruments
128..255: MIDI Percussive Instruments
256+: User-defined.
With patches submitted as waveform data (likely WAVEFORMATEX + data),
and instruments as additional entries into the patch table. May need to
be tied to a context.

The wavetable MIDI isn't entirely a win though vs the OPL-like FM
synthesis (the FM synthesis is "more authentic" to the original sound of
the Doom music and similar).

Another possible option would be allowing the program to submit
instruments in bulk form as a WAD image, but this would be a bit tacky
and require a program to compose a WAD in-RAM for something like an S3M
player.

Otherwise, was faced with an annoyance that seemingly hardly anyone
sells ~ 505nm LEDs in a 5mm form factor (465nm, 525nm, and combined
465nm+525nm LEDs would not work for what I want them for; but some
combined-color LEDs would be useful as well).

Mostly consider wanting to try to do a "more empirical" test for whether
or not I can in-fact see 505nm and 465nm+525nm as different colors...

Conventional wisdom is that they should look like the same color (as
opposed to two separate "cyan" and an "essence-of-gray" style colors).

( Was mostly looking on Amazon, but no one on here seems to be selling
the types of LEDs I would want for this... Would seemingly need to go
through the hassle of buying LEDs from DigiKey or similar... ).

Well, either that or I am gradually slipping into insanity, one of the
two...

Can mostly ignore 590nm and 570nm, as while these colors of LED are
seemingly more common, I don't seem to notice anything in these areas
which seems worth investigating (there are no "unusual" colors between
green and red, only up near the blue end of the color range where stuff
diverges).

But, even then, even if I see it, it may not matter if it is still the
case if no one else sees the difference.

Like, a kind of seemingly "incredibly moot" type of insanity...

luke.l...@gmail.com

unread,

Apr 16, 2023, 6:52:53 PM4/16/23

to

On Sunday, April 16, 2023 at 11:07:58 PM UTC+1, BGB wrote:
> On 4/15/2023 3:29 AM, luke.l...@gmail.com wrote:
> > * a very small loop (LD-COMPUTE-ST-BRcond) of 4 ops
> > * each instruction (for the sake of simplicity) is Scalar
> > * the multi-issue width is 12 to 16
> >
> Possible: Probably...
> But clock speed is gonna suck...

one of the requirements is "not sucking" :)

> As far as I know, there is no real good way to avoid the x^2 cost-curve
> of widening the register file and similar.

striping. QTY 4of separate regfiles, every 4th register has
direct access to every 4th regfile, and *indirect* access
via either a cyclic shift register or multiplexer bus.

issue may also be similarly striped, i saw a paper on
tomasulo algorithm reserving every modulo-4 ROB entry
for every modulo-4 instruction issued.

l.

Anton Ertl

unread,

Apr 17, 2023, 3:55:45 AM4/17/23

to

"luke.l...@gmail.com" <luke.l...@gmail.com> writes:
>On Sunday, April 16, 2023 at 8:03:36=E2=80=AFPM UTC+1, EricP wrote:
>
>> One answer is brute force: each parallel Rename lane propagates its=20
>> changes to higher lanes. Each higher lane checks if any of its source=20
>> Architecture Register Numbers (ARN) is a dest ARN of a lower lane,=20
>> and if so use that lower lane's new dest rename assignment.=20
>
>all the information about which are reads and which writes being easy to de=
>code
>
>> Lane-1 checks if its source ARN's equal lane-0 dest ARN,=20
>> Lane-2 checks lane-0, lane-1, Lane-3 checks lane-0, lane-1, lane-2.=20
>> The gate cost is 1+2+3... =3D (N2+N)/2 =3D O(N2) the number of lanes,=20
>> and the max propagation delay is serial across the rename lanes.=20

>
>ahh... because it's transitive, it's serial?

Only if you implement it so. You can also implement it as a tree
(O(ln N) time complexity):

overwrite(arn, instructions) returns (flag, prn)
{
if |instructions|=1 then // base case
if instructions[0] writes to arn then
return yes, instructions[0].prn
else
return no, dontcare
else
instructions1, instructions2 = split(instructions)
flag1, prn1 = overwrite(arn,instructions1) || // parallel
flag2, prn2 = overwrite(arn,instructions2)
if flag2 then
return flag2, prn2
else
return flag1, prn1

Quadibloc

unread,

Apr 18, 2023, 12:46:06 AM4/18/23

to

On Saturday, April 15, 2023 at 2:30:00 AM UTC-6, luke.l...@gmail.com wrote:

> the desire here is to obviously fit *four* simultaneous
> copies of that loop into the same clock cycle, but to
> do so requires *THREE* Write-after-Write register
> renames in a row (four on the next clock cycle if one
> is still "in the system").

I am not nearly as knowledgeable about such matters
as some of those who have already replied.

However, I think that there would be ways in which this
could be made to work. Register renaming belongs with
instruction decoding, and thus can take place prior to
execution itself.

So you convert to micro-ops that include renaming, and
then do the loop on the cached micro-ops, and then
no cycles are required for rename on the second and subsequent
iterations of the loop.

The problem is, though, if you want to do an iteration in
one cycle, and floating-point arithmetic takes, say, 11 cycles
or so to finish, you're _not_ going to get by with just enough
register renames to let *four* sets of arithmetic operations
take place concurrently.

In an ISA that doesn't have Mitch Alsup's VVM, though, the
machine language to micro-op translator could still note
the existence of a loop, so with short loops, in a conventional
architecture, one _could_ do anything that VVM can do. This
sort of trickery is the sort of thing Mitch Alsup often talks
about as a possibility.

So if, instead of thinking in terms of register renames, one
frames the entire loop in terms of result forwarding, with no
actual register use at all, an optimal sequence of micro-ops
that involves no delays due to WAW hazards should indeed
be at least a theoretical possibility.

John Savard

luke.l...@gmail.com

unread,

Apr 18, 2023, 3:49:50 AM4/18/23

to

On Tuesday, April 18, 2023 at 5:46:06 AM UTC+1, Quadibloc wrote:
> On Saturday, April 15, 2023 at 2:30:00 AM UTC-6, luke.l...@gmail.com wrote:
>
> > the desire here is to obviously fit *four* simultaneous
> > copies of that loop into the same clock cycle, but to
> > do so requires *THREE* Write-after-Write register
> > renames in a row (four on the next clock cycle if one
> > is still "in the system").
> I am not nearly as knowledgeable about such matters
> as some of those who have already replied.
>
> However, I think that there would be ways in which this
> could be made to work. Register renaming belongs with
> instruction decoding, and thus can take place prior to
> execution itself.

yes.

> So you convert to micro-ops that include renaming, and
> then do the loop on the cached micro-ops, and then
> no cycles are required for rename on the second and subsequent
> iterations of the loop.

this still leaves the method of detection which Eric kindly pointed
out is achievable (whew).

but yes for SVP64 the loop is already expressed (like x86 REP)
so actual micro-ops are ironically the scalar operations themselves.

thus in SVP64 even on single-issue one instruction may result in
say 32 back-end SIMD operations.

but that takes up 32 registers (64 if 2 operands, 96 if not an overwrite)
and that's what multi-issue fixes, you can use less register names
have multi-issue put many more smaller blocks of processing into
RSes.

> The problem is, though, if you want to do an iteration in
> one cycle, and floating-point arithmetic takes, say, 11 cycles
> or so to finish, you're _not_ going to get by with just enough
> register renames to let *four* sets of arithmetic operations
> take place concurrently.

they need to get into Reservation Stations.
they don't have to complete immediately.
but if they are not in RSes on the same clock there
is no chance of acheving > 1 IPC.

> In an ISA that doesn't have Mitch Alsup's VVM, though, the
> machine language to micro-op translator could still note
> the existence of a loop, so with short loops, in a conventional
> architecture, one _could_ do anything that VVM can do. This
> sort of trickery is the sort of thing Mitch Alsup often talks
> about as a possibility.

SVP64 Looping expresses this possibility in reality but does
so explicitly with an actual instruction equivalent to x86 "REP"
and bringing z80 LDIR and CPIR capability.

l.

MitchAlsup

unread,

Apr 20, 2023, 5:36:01 PM4/20/23

to

On Saturday, April 15, 2023 at 3:30:00 AM UTC-5, luke.l...@gmail.com wrote:
> an interesting question has come up where i don't know
> if it's even possible (in reasonable gate propagation time
> given how large Dependency Matrices can get).
>
> the scenario is:
>
> * a very small loop (LD-COMPUTE-ST-BRcond) of 4 ops
> * each instruction (for the sake of simplicity) is Scalar
> * the multi-issue width is 12 to 16
>
> the desire here is to obviously fit *four* simultaneous
> copies of that loop into the same clock cycle, but to
> do so requires *THREE* Write-after-Write register
> renames in a row (four on the next clock cycle if one
> is still "in the system").
>
> my question is: is this even possible, without having to
> go to massive-wide register operations as a work-around?
<

Luke:: use the force..........
<
A quick examination indicates this is not even a problem.
Just give each result a name (from the rename pool) {you
have to be able to do this anyway, so it falls out for free.}
<
As for the 3 that get elided, this is a traded off in the
register rename pool. With a sufficiently big pool, do
nothing and the problem solves itself. With a more
conservative pool, you can detect this in the cycle
after decode {¿issue, insert, queue, sched, execute?}
and return the invisible registers back to the pool.

>
> my current understanding of how WaW reg-rename
> works is that it would be perfectly possible (without
> huge gate cascades) to do at least one reg-renaming
> in the same clock cycle, but that only gets 50%
> utilisation.
<

A) give each destination register a new name.
B) when you have time, return the elided writes to the
.....pool.

MitchAlsup

unread,

Apr 20, 2023, 5:37:17 PM4/20/23

to

Yes, if you try to do it the hard way it is BigO( n^2 ).
<
If you get clever, the check logic is quadratic, but the gate logic
is linear.

BGB

unread,

Apr 21, 2023, 4:05:36 AM4/21/23

to

My compiler still sucks...

API issues not resolved.

Did end up remaking a "quick and dirty" set of replacement patches to
hopefully avoid copyright issues from using patches that were apparently
derived from roughly 30 year old "Gravis Ultra Sound" patches.

My improvised replacement set doesn't exactly sound "good", but is
basically usable as a "proof of concept" for testing the wavetable approach.

Have also ended up adding some helper instructions for A-Law encoding,
in the form of a 4x Binary16 to 4x A-Law converter instruction (reducing
the computational cost vs doing it using normal integer math),
effectively using the SIMD converter ops and FPU for most of the
heavy-lifting.

This was intended to reduce the CPU cost of sending PCM audio data to
the sound device (which accepts audio data in A-Law form rather than PCM).

Also went and added a few color-cell ops as well:

* RGB5MINMAX Rm, Rn
Takes 4x RGB555 colors, and selects the minimum and maximum colors
according to luma.

* RGB5CCENC Rm, Ro, Rn
Classifies each RGB555 pixel from Rm according to the Min and Max given
in Ro, adding the selector output bits to Rn (while also shifting Rn
right by 8 bits).

These could be used to hopefully make 640x400 mode more practical by
speeding up the color-cell encoding process. Though, there are limits,
as redrawing a 640x400 framebuffer, etc, will still take a significant
amount of memory bandwidth.

But, if I could do a 640x400 "GUI" with a redraw speed of more than a
few frames per-second, that would be good (still not entirely sure how
early 90s PC's made 640x480 practical; as they would have likely faced
similar performance constraints).

Ended up basically handling luma by bit-shuffling, say:
G4 G3 R4 B4 - G2 R3 B3 G1 - R2 B2 G0 R1

Mostly because bit-shuffling and comparing a slightly larger value has
less latency than trying to "actually calculate the luma", and still
gives a similar result for relative comparisons.

>
>
>
> Otherwise, was faced with an annoyance that seemingly hardly anyone
> sells ~ 505nm LEDs in a 5mm form factor (465nm, 525nm, and combined
> 465nm+525nm LEDs would not work for what I want them for; but some
> combined-color LEDs would be useful as well).
>
> Mostly consider wanting to try to do a "more empirical" test for whether
> or not I can in-fact see 505nm and 465nm+525nm as different colors...
>
> Conventional wisdom is that they should look like the same color (as
> opposed to two separate "cyan" and an "essence-of-gray" style colors).
>

After getting a few LEDs, initial results were not what was expected:
505nm and 525nm were "almost" the same color, both "obviously green".
The 505nm LEDs were a slightly brighter green than the 525nm LEDs.
But, both appeared green to me, and nearly the same color.
However, 465nm was a match for the mystery color.
The 465nm LEDs were sold as "blue", but they are not blue...
They are a color that people often call blue, but it is not blue.

I have now ordered some different 465nm LEDs (for contrast), some 400nm
LEDs, and some 440nm LEDs.

Claimed descriptions were:
465nm, "blue" but not really (it is the imposter...);
440nm, "royal blue" or "pink" (WTF? *1);
400nm, UV / invisible(?) (*2).

*1: Not sure how this works, when "pink" is not even a color on the rainbow.

More so, as apparently 440nm is what is used for the blue in monitors,
and presumably they would not have done this if it were "pink".

In any case, whatever monitors use, seems to match with the "higher"
blue, and not with the "blue" LEDs. Whatever exactly is going on (it is
possible the LEDs were not 465nm, so I ordered some different 465nm blue
LEDs in case this was what was going on).

Some other stuff describes 440nm as "Royal Blue" or "Indigo", which at
least makes more sense.

Then again, another possibility is that the LEDs are actually pink (due
to a pigment), but (by extension) not actually 440nm (in which case, I
will be annoyed; me trying to find monochromatic 440nm).

*2: Will see if this is true. I suspect it is probably if-anything going
to be the black-light effect of lighting up things with a haze: roughly
"blue-violet" or "indigo" colored (with the "UV fluorescence effect"
only really visible on camera; because the camera doesn't seem to see
all the indigo haze).

IME, there is generally no actual purple or violet in the rainbow, it
just sort of stops with sort of an indigo-blue color.

Though, I have notices that some objects with this color tend to show up
as black on cameras (and some other blue objects show up as purple on
camera).

But, I guess maybe I will know more when the next batch of LEDs shows up.

Though, a lot of this does seem to shoot a hole in the "maybe rods are
contributing to my color vision?" theory (505nm would have made sense
for this; 465nm does not...).

> ( Was mostly looking on Amazon, but no one on here seems to be selling
> the types of LEDs I would want for this... Would seemingly need to go
> through the hassle of buying LEDs from DigiKey or similar... ).
>

Ended up buying them from "LED Supply".

Annoying as this color turned out to be annoyingly expensive.
Also annoying that it just turned out to be plain ol' green...

luke.l...@gmail.com

unread,

Apr 21, 2023, 1:36:37 PM4/21/23

to

On Thursday, April 20, 2023 at 10:36:01 PM UTC+1, MitchAlsup wrote:

> Luke:: use the force..........

you can imagine what happened at stonyhurst boarding school
when Star Wars first came out...

> As for the 3 that get elided, this is a traded off in the
> register rename pool. With a sufficiently big pool, do
> nothing and the problem solves itself. With a more
> conservative pool, you can detect this in the cycle
> after decode {¿issue, insert, queue, sched, execute?}
> and return the invisible registers back to the pool.

slight chang3 of topic for a moment..

i remember you said a couple weeks back you managed to eliminate
multi-ported CAMs by aligning ROB numbers with Scoreboard-esque
Function Unit numbers: does that basically well, the 1:1 relationship
eliminates even the need for the CAM, but in essence the number
of ROB row entries *equals* the number of Reservation Stations
in effect, is that right?

if so, i can then foresee a new spanner-in-the-works: the allocation
of ROB entries in Tomasulo is an incremental affair, rotating round
robin, but if ROBs are tied to RSes then finding "first free RS" is a bit
more complex, needing to skip over multiple entries based on
Function Unit type.

i don't think however what you and Eric have kindly outlined would
be impacted by the above, i.e. Multi-Issue Tomasulo *and* reg-renaming
should work perfectly well?

l.

MitchAlsup

unread,

Apr 21, 2023, 2:19:20 PM4/21/23

to

On Friday, April 21, 2023 at 12:36:37 PM UTC-5, luke.l...@gmail.com wrote:
> On Thursday, April 20, 2023 at 10:36:01 PM UTC+1, MitchAlsup wrote:
>
> > Luke:: use the force..........
>
> you can imagine what happened at stonyhurst boarding school
> when Star Wars first came out...
> > As for the 3 that get elided, this is a traded off in the
> > register rename pool. With a sufficiently big pool, do
> > nothing and the problem solves itself. With a more
> > conservative pool, you can detect this in the cycle
> > after decode {¿issue, insert, queue, sched, execute?}
> > and return the invisible registers back to the pool.
<
> slight chang3 of topic for a moment..
>
> i remember you said a couple weeks back you managed to eliminate
> multi-ported CAMs by aligning ROB numbers with Scoreboard-esque
> Function Unit numbers: does that basically well, the 1:1 relationship
> eliminates even the need for the CAM, but in essence the number
> of ROB row entries *equals* the number of Reservation Stations
> in effect, is that right?
<

If the operand waiting entry in the RS knows which function unit delivers
the result, then you place a multiplexer on the tag busses, and now, this
RS-operand entry only watches 1 result tag bus.
<
So the number of tag busses remains the same, but the number of CAM
comparitors is 1 per operand.
<
You can go farther:
<
If the RS operand entry knows which FU and which number, then the
comparisons can be done with an AND gate--just like in a dependency
matrix; eliminating the CAM.

>
> if so, i can then foresee a new spanner-in-the-works: the allocation
> of ROB entries in Tomasulo is an incremental affair, rotating round
> robin, but if ROBs are tied to RSes then finding "first free RS" is a bit
> more complex, needing to skip over multiple entries based on
> Function Unit type.
<

Let me explain it in terms of a physical register file::
<
given a number of function units and a number of physical registers,
consider dividing the physical registers into the number of pools of
the function units. Now, each function unit has a small pool of rename
registers. So, once you decide the FU, you have a rename register sitting
their waiting to be consumed. Presto--complete parallelism.

>
> i don't think however what you and Eric have kindly outlined would
> be impacted by the above, i.e. Multi-Issue Tomasulo *and* reg-renaming
> should work perfectly well?
<

I had no problem with it in 1991......
>
> l.

BGB

unread,

Apr 21, 2023, 2:43:22 PM4/21/23

to

On 4/16/2023 5:52 PM, luke.l...@gmail.com wrote:
> On Sunday, April 16, 2023 at 11:07:58 PM UTC+1, BGB wrote:
>> On 4/15/2023 3:29 AM, luke.l...@gmail.com wrote:
>>> * a very small loop (LD-COMPUTE-ST-BRcond) of 4 ops
>>> * each instruction (for the sake of simplicity) is Scalar
>>> * the multi-issue width is 12 to 16
>>>
>> Possible: Probably...
>> But clock speed is gonna suck...
>
> one of the requirements is "not sucking" :)
>

Not sucking is hard.

In my case, I am sorta managing 50 MHz.

A few of the "big shiny RISC-V cores", ironically, have a hard time
being clocked much over 25 or 30 MHz...

Apparently similar for some of the ARM on FPGA cores as well...

Getting 100 MHz is harder though, as then one is basically on the edge
of things like the latency of adding two numbers; and the general
inability to have single-cycle access in many cases to block-RAM arrays.

reg[127:0] arrA[63:0]; //goes in LUTRAM, it is fine...
reg[127:0] arrB[511:0]; //Oh Crap...
reg[127:0] arrC[8191:0]; //Oh Crap...

So, at 50MHz, both arrA and arrB can be "easily" accessed in a single
cycle, but at 100MHz, only arrA works.

Whereas arrC will generally need 2 or 3 cycles (works OK for L2 cache,
not so good for L1).

In my early testing, the advantage of a bigger L1 cache more than offset
the "loss" in terms of needing to have a lower clock speed.

Hence why in my case, I ended up with a big 3-wide 64-bit core running
at 50MHz, rather than a 1-wide 32-bit core running at 100 MHz.

The 1-wide core might have ran Doom better, in theory, apart from the
issue that 2K L1 caches would have been killed performance...

Well, either that, or have a longer pipeline with 4 or 5 cycle
memory-access instructions...

Also ironically, with the Spartan and Artix, the clock-speed has a
slight negative correlation with FPGA size at the same speed grade (so,
say, the XC7A100T is slightly slower than the XC7S50, and the XC7A200T
is slightly slower than the XC7A100T).

But, one can fit a whole lot more into an XC7A100T or XC7A200T than in
an XC7S50, so it is mostly a worthwhile tradeoff.

...

>> As far as I know, there is no real good way to avoid the x^2 cost-curve
>> of widening the register file and similar.
>
> striping. QTY 4of separate regfiles, every 4th register has
> direct access to every 4th regfile, and *indirect* access
> via either a cyclic shift register or multiplexer bus.
>
> issue may also be similarly striped, i saw a paper on
> tomasulo algorithm reserving every modulo-4 ROB entry
> for every modulo-4 instruction issued.
>

I guess maybe this can work for register renaming...
For a more conventional ISA design, this would suck.

When I was designing WEX6W, one idea was to try to split the regfile
into 2x 6R4W, rather than have a monolithic 12R6W register file.

In this case, operations would have been organized (for 4/5/6 bundles):
OP3B | OP3A | OP2B | OP2A | OP1B | OP1A

With A having an affinity for R0..R31 and B having an affinity for
R32..R63, with a penalty for using a register outside the set (the
operations would need to trigger an interlock rather than forwarding).

Bundles would also only be able to handle a single store (per A or B) to
a register outside of the corresponding A|B set.

I ended up mostly not doing this, as cost and complexity was still
getting absurd even without the full forwarding.

Also there was a considered SMT mode that would have split the register
file into 2x 32 registers (so XGPR would not have been usable in SMT
mode). This idea was also dropped.

Even if it were implemented, unclear how it could have been used.

My compiler can't even make effective use of 2 and 3 wide bundles.
My ASM code can mostly make use of 2-wide bundles, but I mostly ended up
writing 2-wide ASM because 3-wide ASM is much more of a pain to write
and mentally manage the instruction scheduling (more so as 3-wide is
much more subject to "which instructions are allowed in which
combinations" issues).

Much wider would likely be "nearly impossible".

As noted, BGBCC mostly uses a strategy like:
Emit code like it was a "Plain Ol' RISC";
Shuffle instructions after the fact;
Bundle any instructions which have ended up in the right combinations.

Though, this strategy only really tends to get an average bundle length
of around 1.20 to 1.35 instructions, which is "kinda suck"...

It is also painfully bad at the whole "instruction scheduling" thing, so
ends up basically paying much of the interlock penalty for nearly every
multi-cycle instruction (at least short of manually organizing things in
the C code to try to avoid this; and writing C that kinda resembles ASM).

...

robf...@gmail.com

unread,

Apr 22, 2023, 4:49:43 AM4/22/23

to

I have not had much luck getting a superscalar to clock over 20 MHz in
an FPGA. One issue is the size of the core slows things down. Even at
20 MHz though performance is like that of a 40 to 50 MHz scalar core.
One nice thing about a lower clock speed is that there are fewer issues
with some instructions which can be more complex because more
clock space is available. I got a in-order PowerPC clone to work a little
faster than 20 MHz.

Even though slow an FPGA is great for experimenting with things like
register renaming.
One issue with the FPGA is it cannot handle non-
clocked logic at all. So some logic that relies on a circuit settling in
a loop cannot be done very well in an FPGA.

For my most recent core I am starting with a simple scalar
sequential machine. It should be able to run at 40 MH+, but likely
taking 3 or more clocks per instruction. I have not been able to
isolate a signal that is missing and causing about half the core
to be omitted when built.

I think getting a VLIW machine working would be very difficult
to do. Getting the compiler to generate code making use of the
machine would be a significant part of it. Risc with a lot of registers
makes it easier on the compiler.

BGB

unread,

Apr 22, 2023, 5:44:25 PM4/22/23

to

Yeah.

There are a few reasons I went with VLIW rather than superscalar...
Both VLIW and an in-order superscalar require similar logic from the
compiler in order to be used efficiently, and the main cost of VLIW here
reduces to losing 1 bit of instruction entropy and some additional logic
in the compiler to detect if/when instructions can run in parallel.

Superscalar effectively requires a big glob of pattern recognition early
in the pipeline, which seems like a roadblock.

Had considered support for superscalar RISC-V by using a lookup to
classify instructions as "valid prefix" and "valid suffix" and also
logic to check for register clashes, and then behaving as if there were
a "virtual WEX bit" based on this.

Hadn't got around to finishing this. RISC-V support on the BJX2 core is
still mostly untested, and will still be limited to in-order operation
for the time being (and, the design of superscalar mechanism would only
be able to give 2-wide operation for 32-bit instructions only).

Running POWER ISA code on the BJX2 pipeline would be a bit more of a
stretch though (I added RISC-V as it was already pretty close to a
direct subset of BJX2 at that point).

Things like "modulo scheduling" could in theory help with a VLIW
machine, but in my experience modulo-scheduling could likely also help
with some big OoO machines as well (faking it manually in the C code
being a moderately effective optimization strategy on x86-64 machines as
well).

Apparently, clang supports this optimization, but this sort of thing is
currently a bit out of scope of what I can currently manage in BGBCC.

Though, have observed that this strategy seems to be counter-productive
on ARM machines (where it seems to often be faster to not try to
manually modulo-schedule the loops). Though, this may depend on the ARM
core (possibly an OoO ARM core might fare better; most of the ones I had
tested on had been in-order superscalar).

Though, I wouldn't expect there to be all that huge of a difference
between AArch64 and BJX2 on this front, where manual modulo-scheduling
is generally effective on BJX2.

But, as noted in my case on BJX2, latency is sort of like:
1-cycle:
Basic converter ops, like sign and zero extension;
MOV reg/reg, imm/reg, ...
...
2-cycle:
Most ALU ops (ADD/SUB/CMPxx/etc);
Some ALU ops could be made 1-cycle, but "worth the cost?".
More complex converter-class instructions ('CONV2').
Many of the FPU and SIMD format converters go here.
3-cycle:
MUL (32-bit only);
"low-precision" SIMD-FPU ops (Binary16, opt Binary32*);
Memory Loads;
The newer RGB5MINMAX and RGB5CCENC instructions;
...

*: The support for Binary32 is optional, and was pulled off by fiddling
the FPU to "barely" give correct-ish Binary32 results, with the notable
restriction that it is truncate-only rounding.

RGB5MINMAX was basically:
Cycle 1: Figure out the RGB555 Y values;
Cycle 2: Compare and select based on Y values;
Cycle 3: Deliver output from Cycle 2.

Initial attempt had routed this through the CONV2 path, but it was bad
for cost and timing, so I had reworked it to share the RGB5CCENC module,
which also had similar logic (both needed to find Y based on RGB555
values, ...).

RGB5CCENC was basically:
Cycle 1:
Figure out the RGB555 Y values (for pixels);
Figure out the RGB555 Y values (for Mid, Lo-Sel, Hi-Sel);
Cycle 2: Compare and generate selector indices based on Y values;
Cycle 3: Deliver output from Cycle 2.

Some longer non-pipelined cases:
6-cycle: FADD/FMUL/etc (main FPU)
10-cycle: FP-SIMD via main FPU ("high precision").
40-cycle: Integer DIVx.L and MODx.L
80-cylce: Integer DIVx.Q and MODx.Q, 64-bit MULx.Q, ...
120-cycle: FDIV.
480-cycle: FSQRT.

For integer divide and modulo, it is mostly a toss-up between the ISA
instruction and "just doing it in software".

For 64-bit integer multiply, doing it in software is still faster.

For floating-point divide, doing it in software is faster, but the
hardware FDIV is able to give more accurate results (software N-R
seemingly being unable to correctly converge the last few low-order bits).

The HW FDIV was based on noting that I could basically hack the "Binary
Long Division" hardware divider to also handle floating-point by making
it run ~ 50% longer (and adding a little extra special-case logic for
the exponents).

The attempt at hardware FSQRT is basically a boat anchor (doing it in
software is significantly faster). Not aware of any "good/cheap" way to
do SQRT in hardware.

> Even though slow an FPGA is great for experimenting with things like
> register renaming.
> One issue with the FPGA is it cannot handle non-
> clocked logic at all. So some logic that relies on a circuit settling in
> a loop cannot be done very well in an FPGA.
>

Yeah.

> For my most recent core I am starting with a simple scalar
> sequential machine. It should be able to run at 40 MH+, but likely
> taking 3 or more clocks per instruction. I have not been able to
> isolate a signal that is missing and causing about half the core
> to be omitted when built.
>

In early design attempts, I was trying for 100MHz, but was mostly being
defeated. Eventually, I dropped to 50, and have mostly settled here.

Some people have gotten 200 MHz cores to work, but these were mostly 8
and 16 bit designs (typically with no external RAM interface, so only
using block-RAM as RAM).

Though, apparently, some people have done Commodore 64 clones using the
XC7A100T and XC7A200T this way, since the FPGA has more Block-RAM than
the entire memory space of the C64 (then mostly implementing the 6502
ISA and various peripheral devices).

Not sure of the point of a "massively overclocked C64" though.

Doing it on an FPGA makes more sense than some other people trying to do
it more "authentically" with DIP chips though (and trying to source a
bunch of chips that have been out-of-production almost as long as I have
been alive).

Then apparently people have done "replacement clone" DIP chips, that are
effectively just FPGAs on small PCBs made to mimic the original DIP
pinouts (but, then one ends up with a C64 clone mostly running off a
bunch smaller FPGAs rather than a single bigger FPGA).

But, I guess my project is mostly different in that it wasn't
particularly motivated by "childhood nostalgia", and some of the major
games that I played when I was still young (namely Quake 1 and 2), are
still a little heavyweight for the BJX2 core.

Others, like Half-Life, never had their source code released (there is
partial source, such as for the game-logic, etc, but it is not under
anything like GPL or similar). In theory, someone could do a clone of
the GoldSrc engine based on a modified GLQuake or Quake2 engine or similar.

Well, and game consoles like the PlayStation, with games like Mega-Man
X4/Legends/etc, FF7, etc. Or DreamCast with games like Sonic Adventure
and similar.

Well, and sadly, my software OpenGL rasterizer still falls short of
matching the performance of the GPU in the original PlayStation.

Though, possibly, doing something like a game with graphics on-par to
something like Mega-Man Legends could work (the scenes were mostly made
out of a limited number of fairly large polygons).

Ironically, games like "Resident Evil" and similar would be easier, as
they were mostly using low-detail 3D models over static image backgrounds.

Or, other possible possible trickery involving cube-maps or skyboxes to
fake a bigger and more detailed scene than could actually be rendered
directly (stuff near the camera being 3D rendered, and anything else
offloaded to what is effectively a skybox).

> I think getting a VLIW machine working would be very difficult
> to do. Getting the compiler to generate code making use of the
> machine would be a significant part of it. Risc with a lot of registers
> makes it easier on the compiler.
>

Yeah.

I have done a VLIW ISA, but with the compiler mostly treating it like a
RISC.

This is kinda crappy, but basically works...

Sadly, does mean that "performance sensitive parts" end up needing to
fall back to ASM far more often than would be needed on a more modern
system (on a modern PC, whether there is any real speedup from ASM being
a bit hit or miss).

...

MitchAlsup

unread,

Apr 22, 2023, 7:11:33 PM4/22/23

to

On Saturday, April 22, 2023 at 4:44:25 PM UTC-5, BGB wrote:
> On 4/22/2023 3:49 AM, robf...@gmail.com wrote:

> Yeah.
>
> There are a few reasons I went with VLIW rather than superscalar...
> Both VLIW and an in-order superscalar require similar logic from the
> compiler in order to be used efficiently, and the main cost of VLIW here
> reduces to losing 1 bit of instruction entropy and some additional logic
> in the compiler to detect if/when instructions can run in parallel.
>
> Superscalar effectively requires a big glob of pattern recognition early
> in the pipeline, which seems like a roadblock.
<

It is "such a massive roadblock" that my ISA has to look at <gasp> all
of 6-bits to determine where an instruction gets routed (its function unit)
in unary, one gate later you know if a conflict is present.

>
> Had considered support for superscalar RISC-V by using a lookup to
> classify instructions as "valid prefix" and "valid suffix" and also
> logic to check for register clashes, and then behaving as if there were
> a "virtual WEX bit" based on this.
>
> Hadn't got around to finishing this. RISC-V support on the BJX2 core is
> still mostly untested, and will still be limited to in-order operation
> for the time being (and, the design of superscalar mechanism would only
> be able to give 2-wide operation for 32-bit instructions only).
>
>
> Running POWER ISA code on the BJX2 pipeline would be a bit more of a
> stretch though (I added RISC-V as it was already pretty close to a
> direct subset of BJX2 at that point).
>
>
>
> Things like "modulo scheduling" could in theory help with a VLIW
> machine, but in my experience modulo-scheduling could likely also help
> with some big OoO machines as well (faking it manually in the C code
> being a moderately effective optimization strategy on x86-64 machines as
> well).
<

Modulo scheduling reduces the number of resources in flight to get
loops-with-recurrences running smoothly. It helps mono-scalar and larger
in-order pipelines, and does no harm to OoO designs whatsoever.

>
> Apparently, clang supports this optimization, but this sort of thing is
> currently a bit out of scope of what I can currently manage in BGBCC.
>
>
> Though, have observed that this strategy seems to be counter-productive
> on ARM machines (where it seems to often be faster to not try to
> manually modulo-schedule the loops). Though, this may depend on the ARM
> core (possibly an OoO ARM core might fare better; most of the ones I had
> tested on had been in-order superscalar).
>
> Though, I wouldn't expect there to be all that huge of a difference
> between AArch64 and BJX2 on this front, where manual modulo-scheduling
> is generally effective on BJX2.
>
>
>
> But, as noted in my case on BJX2, latency is sort of like:
> 1-cycle:
> Basic converter ops, like sign and zero extension;
> MOV reg/reg, imm/reg, ...
<

Are these the zero calculation-cost "calculations"
from the set {FABS, IMOV, FMOV, FNEG, FcopySign, INVERT} ?

<
> ...
> 2-cycle:
> Most ALU ops (ADD/SUB/CMPxx/etc);
> Some ALU ops could be made 1-cycle, but "worth the cost?".
<

Warning "Will Robinson":: the slope is very slippery; but candidates
are from the set {AND, OR, XOR, <<1, <<2, >>1, >>2, ROT <<1, ROT >>1}

<
> More complex converter-class instructions ('CONV2').
> Many of the FPU and SIMD format converters go here.
> 3-cycle:
> MUL (32-bit only);
> "low-precision" SIMD-FPU ops (Binary16, opt Binary32*);
> Memory Loads;
<

Given that integer ADD is 2-cycles {in your definition of the list ordinality}
I find it interesting that you get::
{
route AGEN adder to SRAMs address decoder,
SRAM access (1-full-cycle)
SRAM route to data-path; tag address compare;
LD align; Set-Selection;
Drive result bus
}
in 1 more cycle.

If I were to take the machine I am designing AND it happened that I
had a calculation unit which could do FADD or FMUL or FMAC in
64-bits and in a 6-cycle pipeline then::
IDIV IMOD is 24-28 cycles
FDIV is 24-26 cycles
SQRT is 27-32 cycles
in a pipeline where::
LD latency is 4 cycles
IMUL is 6 cycles
<
If I stated the latencies appropriate to 4-cycle {FADD, FMUL, FMAC}
IDIV IMOD is 17 cycles
FDIV is 17 cycles
SQRT is 22-cycles
in a pipeline where::
LD latency is 3-cycles
IMUL is 4 cycles.

>
> For integer divide and modulo, it is mostly a toss-up between the ISA
> instruction and "just doing it in software".
>
> For 64-bit integer multiply, doing it in software is still faster.
>
> For floating-point divide, doing it in software is faster, but the
> hardware FDIV is able to give more accurate results (software N-R
> seemingly being unable to correctly converge the last few low-order bits).
<

This is a consequence of not calculating all of the partial product bits
and then summing to a correct result. N-R only converges absolutely
when the integer parts of the arithmetic are correct. Any error here,
which is similar to the uncorrected error of Goldschmidt, prevents
convergence to correctness.
>

BGB

unread,

Apr 22, 2023, 8:32:43 PM4/22/23

to

On 4/22/2023 6:11 PM, MitchAlsup wrote:
> On Saturday, April 22, 2023 at 4:44:25 PM UTC-5, BGB wrote:
>> On 4/22/2023 3:49 AM, robf...@gmail.com wrote:
>
>
>> Yeah.
>>
>> There are a few reasons I went with VLIW rather than superscalar...
>> Both VLIW and an in-order superscalar require similar logic from the
>> compiler in order to be used efficiently, and the main cost of VLIW here
>> reduces to losing 1 bit of instruction entropy and some additional logic
>> in the compiler to detect if/when instructions can run in parallel.
>>
>> Superscalar effectively requires a big glob of pattern recognition early
>> in the pipeline, which seems like a roadblock.
> <
> It is "such a massive roadblock" that my ISA has to look at <gasp> all
> of 6-bits to determine where an instruction gets routed (its function unit)
> in unary, one gate later you know if a conflict is present.

It would require a slightly bigger lookup table for RISC-V or BJX2,
mostly because instructions are not organized by prefix/suffix class;
nor by which function-unit handles them.

Also routing to FU's gets a bit ad-hoc (or subject to vary based on
which features are enabled or disabled), ...

>>
>> Had considered support for superscalar RISC-V by using a lookup to
>> classify instructions as "valid prefix" and "valid suffix" and also
>> logic to check for register clashes, and then behaving as if there were
>> a "virtual WEX bit" based on this.
>>
>> Hadn't got around to finishing this. RISC-V support on the BJX2 core is
>> still mostly untested, and will still be limited to in-order operation
>> for the time being (and, the design of superscalar mechanism would only
>> be able to give 2-wide operation for 32-bit instructions only).
>>
>>
>> Running POWER ISA code on the BJX2 pipeline would be a bit more of a
>> stretch though (I added RISC-V as it was already pretty close to a
>> direct subset of BJX2 at that point).
>>
>>
>>
>> Things like "modulo scheduling" could in theory help with a VLIW
>> machine, but in my experience modulo-scheduling could likely also help
>> with some big OoO machines as well (faking it manually in the C code
>> being a moderately effective optimization strategy on x86-64 machines as
>> well).
> <
> Modulo scheduling reduces the number of resources in flight to get
> loops-with-recurrences running smoothly. It helps mono-scalar and larger
> in-order pipelines, and does no harm to OoO designs whatsoever.

Yes.

As noted, it seems to help with both x86-64 and BJX2 in my experience.

Not so much with ARM IME, but less sure as to why.
At least, on ARMv8, theoretically there should not be any detrimental
effect to modulo scheduling the loops.

>>
>> Apparently, clang supports this optimization, but this sort of thing is
>> currently a bit out of scope of what I can currently manage in BGBCC.
>>
>>
>> Though, have observed that this strategy seems to be counter-productive
>> on ARM machines (where it seems to often be faster to not try to
>> manually modulo-schedule the loops). Though, this may depend on the ARM
>> core (possibly an OoO ARM core might fare better; most of the ones I had
>> tested on had been in-order superscalar).
>>
>> Though, I wouldn't expect there to be all that huge of a difference
>> between AArch64 and BJX2 on this front, where manual modulo-scheduling
>> is generally effective on BJX2.
>>
>>
>>
>> But, as noted in my case on BJX2, latency is sort of like:
>> 1-cycle:
>> Basic converter ops, like sign and zero extension;
>> MOV reg/reg, imm/reg, ...
> <
> Are these the zero calculation-cost "calculations"
> from the set {FABS, IMOV, FMOV, FNEG, FcopySign, INVERT} ?
> <

Yes, FABS and FNEG and similar are also 1-cycle ops.
My list was not exactly exhaustive...

>> ...
>> 2-cycle:
>> Most ALU ops (ADD/SUB/CMPxx/etc);
>> Some ALU ops could be made 1-cycle, but "worth the cost?".
> <
> Warning "Will Robinson":: the slope is very slippery; but candidates
> are from the set {AND, OR, XOR, <<1, <<2, >>1, >>2, ROT <<1, ROT >>1}
> <

Yeah.
32-bit ADDS.L / SUBS.L
AND/OR/XOR
...
Could all be turned into 1-cycle ops.

However, most are not high enough on the ranking to show
significant/obvious benefit from doing so.

It would "maybe" reduce the average interlock penalty from ~ 7% to
around 5% or 6% of the total clock cycles, but I don't expect much more
than this (with most of the rest of the interlock penalty being due to
memory loads).

>> More complex converter-class instructions ('CONV2').
>> Many of the FPU and SIMD format converters go here.
>> 3-cycle:
>> MUL (32-bit only);
>> "low-precision" SIMD-FPU ops (Binary16, opt Binary32*);
>> Memory Loads;
> <
> Given that integer ADD is 2-cycles {in your definition of the list ordinality}
> I find it interesting that you get::
> {
> route AGEN adder to SRAMs address decoder,
> SRAM access (1-full-cycle)

Both happen in EX1, L1 SRAM is accessed on 1 clock-edge.
Cache is direct-mapped, so only the low order bits need to reach the
L1's SRAM during this clock cycle.

This can roughly handle a 16K or (maybe) 32K array at 50MHz; 64K is
basically no-go.

Contrast, the L2 cache uses multiple clock-cycles to access the SRAM
array, so can support a somewhat larger array.

> SRAM route to data-path; tag address compare;
> LD align; Set-Selection;

Mostly happens in EX2.
The "check for L1 cache miss and stall pipeline" logic is one of the
major "tight" paths in the core. Anything touching this pathway is
basically a mine-field.

> Drive result bus
> }
> in 1 more cycle.

Final result handling is in EX3, say:
Final sign/zero extension;
Single -> Double (FMOV.S);
Pixel Extraction (LDTEX).

> <

The 2-cycle ADD was originally partly an artifact.
Originally I had not discovered some tricks to make the adder faster,
and a naive 64-bit add has a fairly high latency.

Though, the trick only really helps with larger adders, so I can *also*
support 128-bit ADD in 2 clock cycles (effectively by having both the
Lane 1 and 2 ALUs being able to combine into a larger virtual ALU).

But, I can note that:
ADD, SUB, CMPxx, PADDx, ...
Are all basically routed through the same logic.

But, I can also note that getting the L1 D$ to not blow out timing
constraints is a semi-constant annoyance...

Poke at something, somewhere random, and once again logic in the L1 D$
has decided to start failing timing again...

It would be easier here if I made the pipeline longer so I could do
4-cycle memory loads.

Part of the "magic" that allows 3-cycle IMUL and 6-cycle FMUL is the
DSP48 hard-logic. If not for this, these would not likely be possible.

As noted, DSP48 can effectively do:
18s*18s => 36s
17u*17u => 34u
With either optionally being added to a 48-bit accumulator value.

Had noted I could get acceptable 6-cycle FMUL by using 6x DSP48's and
some extra bit-twiddly for the low-order-bits.

> <
> If I stated the latencies appropriate to 4-cycle {FADD, FMUL, FMAC}
> IDIV IMOD is 17 cycles
> FDIV is 17 cycles
> SQRT is 22-cycles
> in a pipeline where::
> LD latency is 3-cycles
> IMUL is 4 cycles.

OK.

>>
>> For integer divide and modulo, it is mostly a toss-up between the ISA
>> instruction and "just doing it in software".
>>
>> For 64-bit integer multiply, doing it in software is still faster.
>>
>> For floating-point divide, doing it in software is faster, but the
>> hardware FDIV is able to give more accurate results (software N-R
>> seemingly being unable to correctly converge the last few low-order bits).
> <
> This is a consequence of not calculating all of the partial product bits
> and then summing to a correct result. N-R only converges absolutely
> when the integer parts of the arithmetic are correct. Any error here,
> which is similar to the uncorrected error of Goldschmidt, prevents
> convergence to correctness.

So, possibly a consequence then of my "only calculate high-order bits
and let all the low-order bits fall off the bottom" FMUL design?...

In my case, it is only "most of the time" that one will get a
correctly-rounded FMUL result (where "most of the time" in the sense
that one has ~ 2.5 bits below the ULP, below which the low parts of the
multiplier are effectively "hanging into the void").

...

Well, and the FADD also effectively has the low order bits falling into
the void as well (shift right into oblivion). But, the mantissa is wide
enough to handle 64-bit integer conversion, so in this case one
effectively has around an extra 12 bits below the ULP.

...

MitchAlsup

unread,

Apr 22, 2023, 9:09:01 PM4/22/23

to

Yes.

Scott Lurndal

unread,

Apr 23, 2023, 10:05:49 AM4/23/23

to

BGB <cr8...@gmail.com> writes:
>On 4/22/2023 3:49 AM, robf...@gmail.com wrote:

>>> ...
>> I have not had much luck getting a superscalar to clock over 20 MHz in
>> an FPGA. One issue is the size of the core slows things down. Even at
>> 20 MHz though performance is like that of a 40 to 50 MHz scalar core.
>> One nice thing about a lower clock speed is that there are fewer issues
>> with some instructions which can be more complex because more
>> clock space is available. I got a in-order PowerPC clone to work a little
>> faster than 20 MHz.
>>
>
>Yeah.
>
>There are a few reasons I went with VLIW rather than superscalar...
>Both VLIW and an in-order superscalar require similar logic from the
>compiler in order to be used efficiently, and the main cost of VLIW here
>reduces to losing 1 bit of instruction entropy and some additional logic
>in the compiler to detect if/when instructions can run in parallel.

I think there is pretty clear evidence (c.f. Itanic) that expecting compilers
to fully utilize the capabilities of a general purpose VLIW instruction
set is a very difficult problem, even with highly skilled (SGI/HP/Intel)
compiler engineers.

BGB

unread,

Apr 23, 2023, 3:48:07 PM4/23/23

to

There is an "easier problem" and a "hard problem":
"easy problem": Make VLIW performance competitive with in-order superscalar;
"hard problem": Make VLIW performance competitive with OoO.

The easy problem is "easy" because the compiler still needs to do the
same work in both cases; getting instructions into the correct order
that parallel execution is possible, the only real difference being that
for VLIW, the compiler needs to flag which instructions can be run in
parallel (and know the width of the target machine and which
instructions are allowed to run in what combinations).

The "hard" problem is, well, a lot harder. This is the one that Intel
would have needed to defeat.

Competing with OoO is less of an issue:
For the most part, OoO cores are too large to fit on Spartan or Artix
class FPGAs (and, even if they did, the timing situation would likely
not be good).

I am also assuming different scenarios:
Embedded processors;
Processors in cases where Moore's Law has hit a wall, and it becomes
more attractive to start trying shrink the CPUs again (rather than
trying to maximize single-threaded performance).

Itanium also suffered from "particularly awful" code density. I have
managed to avoid this, mostly by sticking with 32-bit instructions and
variable-length bundles (more like what is common in DSP architectures).

But, I guess, more like Itanium than DSPs:
BJX2 does still perform register interlock checking;
There are no delay slots.

For example, in a more traditional VLIW (such as TMS320C6x), one might
need to NOP pad a Load or similar if there was nothing to do in the
meantime, and there might be two or more delay-slot cycles following a
branch, ... I didn't go this route, and instead the CPU will stall the
pipeline for as many cycles are needed for the conflicting operation to
finish.

There are similarities and differences in the predication mechanism:
IA-64 used an array of predicate-bit registers;
It was effectively "Bit is 1, execute; Bit is 0, no-execute".
BJX2 generally only uses a single bit (SR.T);
Instructions are "?T" (Execute if SR.T==1) or "?F" (If SR.T==0).
...

The main competition then is with in-order superscalar cores, but most
of these end up with the limitation of typically only running at 25 or
30 MHz (eg: SweRV and friends have this issue).

It still gets impressive looking Dhrystone numbers, but as far as I can
tell, much of this is because GCC tends to be somewhat better at code
generation (particularly in the Dhrystone case).

Things like Doom performance are still "kinda suck" though.

OTOH:
Got around to testing my color-cell instructions, and they mostly serve
their purpose.

Running Doom in the 640x400 color-cell mode (in a "Window") goes from
6fps to around 20fps.

With the helper ops, color-cell encoder is roughly 6% of the CPU time
(down from 70%).

However, transforming a 4x4 cell still leaves around 40% of the
clock-cycles in the encoder as interlock penalties. I may consider
reworking it to transform multiple color-cells at a time (say: 8x4
pixels or 16x4 pixels); which could potentially reduce the amount of
cycles spent on interlock penalties.

Note that 640x400 screen redraw at 20fps will require shoving roughly 10
MB/s through the color-cell encoder. Or ~ 5 megapixels/sec.

Scaling for CPU use, it seems I am not too far off from the theoretical
limits of the design of the color-cell helper instructions.

Theoretically, the encoder instructions could also be used to help with
encoding the RPZA or CRAM video formats as well. Though, real-time video
capture from the BJX2 core isn't an immediate priority.

More it is a case of color-cell being used to reduce video-memory and
video-memory-bandwidth requirements (a 640x400 or 640x480 hi-color
bitmapped frambuffer would need too much memory bandwidth; and is too
big to fit into the L2 cache).

The helper instructions don't currently directly cover the 800x600 mode
(which uses a color-cell format with 1-bit selectors vs 2-bit), or the
1024x768 mode (1-bit monochrome, with an optional Bayer Matrix sub-mode
to mimic full color output, in addition to a range of fixed-color
options). Though, admittedly, haven't really used these modes much thus far.

IIRC, did test them on actual hardware, and the monitor I was testing
with seemed to accept them (despite the non-standard timings), though
fails to correctly identify the intended resolution. Then again, the
monitor identifies the 320x200 mode as 720x400, so, ...

Still not confirmed whether they would work on an actual CRT, vs an LCD.

Even then, I have had mixed results with the LCD's I have tried, like
another LCD had a hard time aligning the image and would generally cut
off part of the screen (effectively "letter boxing" the image after
misidentifying the 320x200 mode as 640x360 ...).

Also with the current monitor, the temporal dithering feature seems to
not work correctly, as the screen will not update the pixels (until a
more major change occurs), causing it to hold an arbitrary combination
of fixed dither patterns.

...

Still don't know yet if I will develop a GUI for this...

MitchAlsup

unread,

Apr 23, 2023, 4:37:53 PM4/23/23

to

On Sunday, April 23, 2023 at 2:48:07 PM UTC-5, BGB wrote:
> On 4/23/2023 9:05 AM, Scott Lurndal wrote:
> > BGB <cr8...@gmail.com> writes:
> >> On 4/22/2023 3:49 AM, robf...@gmail.com wrote:
> >

> There is an "easier problem" and a "hard problem":
> "easy problem": Make VLIW performance competitive with in-order superscalar;
> "hard problem": Make VLIW performance competitive with OoO.
>
> The easy problem is "easy" because the compiler still needs to do the
> same work in both cases; getting instructions into the correct order
> that parallel execution is possible, the only real difference being that
> for VLIW, the compiler needs to flag which instructions can be run in
> parallel (and know the width of the target machine and which
> instructions are allowed to run in what combinations).
>
> The "hard" problem is, well, a lot harder. This is the one that Intel
> would have needed to defeat.
>
> Competing with OoO is less of an issue:
> For the most part, OoO cores are too large to fit on Spartan or Artix
> class FPGAs (and, even if they did, the timing situation would likely
> not be good).
>

On the other hand a 6-wide 96-deep GBOoO machine is 2mm^2 in 5nm.

>
> I am also assuming different scenarios:
> Embedded processors;
> Processors in cases where Moore's Law has hit a wall, and it becomes
> more attractive to start trying shrink the CPUs again (rather than
> trying to maximize single-threaded performance).
>

When you don't need/want that kind of performance you can lose the
G & B parts and still keep the OoO; 2-wide OoO, hit-under-miss caching
and get significantly more than 2-wide IO. Here the CPU is 4× smaller
then the GBOoO but still 3× bigger than the 1-wide IO.

>
> Itanium also suffered from "particularly awful" code density. I have
> managed to avoid this, mostly by sticking with 32-bit instructions and
> variable-length bundles (more like what is common in DSP architectures).
<

Itanium failed primarily because <drum roll> too many cooks in the kitchen.
>

BGB

unread,

Apr 23, 2023, 6:25:20 PM4/23/23

to

On 4/23/2023 3:37 PM, MitchAlsup wrote:
> On Sunday, April 23, 2023 at 2:48:07 PM UTC-5, BGB wrote:
>> On 4/23/2023 9:05 AM, Scott Lurndal wrote:
>>> BGB <cr8...@gmail.com> writes:
>>>> On 4/22/2023 3:49 AM, robf...@gmail.com wrote:
>>>
>
>> There is an "easier problem" and a "hard problem":
>> "easy problem": Make VLIW performance competitive with in-order superscalar;
>> "hard problem": Make VLIW performance competitive with OoO.
>>
>> The easy problem is "easy" because the compiler still needs to do the
>> same work in both cases; getting instructions into the correct order
>> that parallel execution is possible, the only real difference being that
>> for VLIW, the compiler needs to flag which instructions can be run in
>> parallel (and know the width of the target machine and which
>> instructions are allowed to run in what combinations).
>>
>> The "hard" problem is, well, a lot harder. This is the one that Intel
>> would have needed to defeat.
>>
>> Competing with OoO is less of an issue:
>> For the most part, OoO cores are too large to fit on Spartan or Artix
>> class FPGAs (and, even if they did, the timing situation would likely
>> not be good).
>>
> On the other hand a 6-wide 96-deep GBOoO machine is 2mm^2 in 5nm.

Still probably not going to fit into an typical Artix-7...

There are some "really big" FPGA's with 50x as many LUTs, but the cost
for the FPGA is absurd, and they are not going to be supported by the
freeware version of Vivado in any case.

>>
>> I am also assuming different scenarios:
>> Embedded processors;
>> Processors in cases where Moore's Law has hit a wall, and it becomes
>> more attractive to start trying shrink the CPUs again (rather than
>> trying to maximize single-threaded performance).
>>
> When you don't need/want that kind of performance you can lose the
> G & B parts and still keep the OoO; 2-wide OoO, hit-under-miss caching
> and get significantly more than 2-wide IO. Here the CPU is 4× smaller
> then the GBOoO but still 3× bigger than the 1-wide IO.

OK.

My current core is pretty heavy vs the smaller cores I have pulled off,
but is also pretty feature rich. Much more of the cost seems to be due
to things which are independent of core width (for things like the L1
caches, FPU and SIMD units, ...), than due to parts which are added for
each lane (such as the additional Lane 2/3 ALUs).

Whereas, while a 32-bit integer-only core can be a fair-bit smaller, it
is a lot more limited.

But, I guess it is partly a question of what one compares to.

I suspect a similar core with a double-precision FPU and 4-wide Binary32
SIMD unit would still have a comparably similar cost even if it were
only 1-wide (given L1 caches + FPU + SIMD unit represents roughly half
of the total LUT cost for the core).

This also represented an issue for the design of a "GPU core", which
would still end up needing most of the "expensive parts" of my main core
(and about the only major part I could drop was the Double-Precision
FPU, but this didn't really save enough to make it viable).

And, throwing a 1-wide integer-only core at the problem wouldn't really
work, ...

Though, I am at least "back in business" for going dual-core on the
XC7A200T...

Some features originally intended to help with the GPU core have ended
up in the main core, as they still sorta help.

Some features, like LDTEX, do technically help in terms of performance,
but are still debatable in terms of their added cost.

And, LDTEX is still mostly effectively limited to square power-of-2 UTX2
compressed textures stored in Morton-order. For non-square textures, it
is still necessary to fall back to using raster order and manually using
the BLKUTX2 instruction or similar instead (texel addressing logic in a
raster-order texture being a more complicated problem than in a
Morton-order texture).

Does also mean one needs alternate versions of span drawing loops for
combinations of texture format and blending mode.

Texture Formats:
UTX2+Morton (Square, Compressed);
UTX2+Raster (Non-Square, Compressed);
Uncompressed Raster (RGB555A, Uncompressed).

Blending Modes:
Direct Texture, No-Alpha-Blend, No Z-Test
Direct Texture, No-Alpha-Blend, Z-Test
Texture + Modulation, No-Alpha-Blend, No Z-Test
Texture + Modulation, No-Alpha-Blend, Z-Test
Texture + Modulation, Alpha-Blend, No Z-Test
Texture + Modulation, Alpha-Blend, Z-Test
Texture + Modulation, Alpha Test, No Z-Test
Texture + Modulation, Alpha Test, Z-Test
...

These being relevant for:
(GL_ONE, GL_ZERO), (GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA)
Whereas, say:
(GL_ZERO, GL_ONE_MINUS_SRC_COLOR)
Requiring a whole different set of span functions.

And, other combinations needing to fall back to "generic" (but somewhat
slower) span-drawing functions (effectively needing to invoke function
pointers for texture-fetch, blending, and drawing the resulting pixel to
the framebuffer).

Where, say, there is a big mess of function pointers that needs to be
updated whenever a non-trivial change is made:
Program has bound a different texture;
Enabling or disabling flags like GL_DEPTH_TEST, GL_ALPHA_TEST, ...
...

With only a small sliver of a few of the common texture span-drawing
functions actually using the LDTEX instruction (since, as noted, it only
works for square-Morton-order-compressed-textures).

Note that in the front-end API, textures would be uploaded using DXT1 or
DXT5, but are then quietly transformed into UTX2 internally (mostly
because the UTX2 decoding logic is cheaper than S3TC would have been).

>>
>> Itanium also suffered from "particularly awful" code density. I have
>> managed to avoid this, mostly by sticking with 32-bit instructions and
>> variable-length bundles (more like what is common in DSP architectures).
> <
> Itanium failed primarily because <drum roll> too many cooks in the kitchen.

Probably that too.
The thing was absurdly over-engineered.

It was also implausibly expensive.
Combined with the weak performance, this was a killer.

Itanium might have done better it it were less absurd, and cheap...

Though, it would still not likely have been able to compete in ARM's
original part of the market segment (say, for PDAs and older style
feature phones).

It would more likely have needed to try to compete with PowerPC in terms
of early and mid 2000s game-consoles and "set-top boxes".

But, no one is going to put a CPU that costs $k into a game console...

luke.l...@gmail.com

unread,

Apr 23, 2023, 7:59:50 PM4/23/23

to

On Sunday, April 23, 2023 at 11:25:20 PM UTC+1, BGB wrote:

> but is also pretty feature rich. Much more of the cost seems to be due
> to things which are independent of core width (for things like the L1
> caches,

CAMs in FPGAs are always expensive to implement, unless there is
a special CAM block (like the DSP block).

> FPU and SIMD units, ...),

yep - now you know why i designed that Dynamic Partitioned SIMD
unit i mentioned a few months ago. estimated 50% extra resources
you can use 64-bit wide single-arithmetical-ALU for everything: 2x32
4x16 8x8 - all dynamically selectable.

add is easy: just add extra bits at the partition points, use carry-rollover
to "bleed" the 7th bit's carry over to the 9th bit (etc.). the rest follow the same
kind of principle.

l.

MitchAlsup

unread,

Apr 23, 2023, 8:46:57 PM4/23/23

to

Take your 64-bit adder and stretch it to 72 bits.
every 8-bits hardwire the operand-bit inputs to a bit-pair control state.
If this bit-pair is 00 then carry into bit<9> is 0
if this bit-pair is 01 or 10 carry propagates from bit<7> to bit<9>
if this bit-pair is 11 then carry in to bit<9> is 1
<
So if the bit-pair is {00,00,00,00,00,00,00,00} you have eight 8-bit adders SIMD
.... if the bit-pair is {01,00,01,00,01,00,01,00} you have four 16-bit adders SIMD
.... if the bit-pair is (01,01,01,00,01,01,01,00} you have two 32-bit adders SIMD
.... if the bit-pair is {01,01,01,01,01,01,01,00} you have one 64-bit adder.
<
But more exotic things can be constructed::
.... if the bit pair is {01,00,01,01,00,01,01,00} you have one 16-bit add and 2×24-bit adds.
.
.
.
The reason you use a 2-bit control scheme comes into play when you want the
individual terms to insert their own carries. ADD8C-like. The carry in from the
operand is used to gate the bit-pair.
>
> l.

BGB

unread,

Apr 23, 2023, 9:32:35 PM4/23/23

to

On 4/23/2023 6:59 PM, luke.l...@gmail.com wrote:
> On Sunday, April 23, 2023 at 11:25:20 PM UTC+1, BGB wrote:
>
>> but is also pretty feature rich. Much more of the cost seems to be due
>> to things which are independent of core width (for things like the L1
>> caches,
>
> CAMs in FPGAs are always expensive to implement, unless there is
> a special CAM block (like the DSP block).
>

I am using direct-mapped caches.

But, forwarding the results of a memory store on one clock-cycle into
the next clock cycle adds significant cost, but can make a big
difference in terms of performance for code involving stores to
consecutive locations in memory.

Have experimented with associative caches, but my results have been
mixed. They are more expensive, but any performance gains tend to be a
bit hit or miss (and programs like Doom and similar tended to perform
better with direct-mapped caches than with associative caches in my
testing).

Though, in a strict hit/miss rate sense, a 2-way or 4-way LRU cache
would compare well against a direct-mapped cache (but, somehow, the DM
cache seems to perform better in Doom despite the slightly worse miss
rate in an absolute sense...).

>> FPU and SIMD units, ...),
>
> yep - now you know why i designed that Dynamic Partitioned SIMD
> unit i mentioned a few months ago. estimated 50% extra resources
> you can use 64-bit wide single-arithmetical-ALU for everything: 2x32
> 4x16 8x8 - all dynamically selectable.
>

I handle packed integer SIMD via the ALUs.

The "expensive" SIMD unit is the one for doing 4x Binary32 operations in
3 clock-cycles...

Without this unit, most of the FPU-SIMD operations would require 10
clock cycles. So, it is basically paying a big chunk of LUTs for fully
pipelined Binary32 SIMD ops...

Or, I can save some cost, by having the low-precision unit only handle
Binary16 and an S.E8.F16 format (but, then by default Binary32 SIMD ops
need to be routed through the main FPU, and thus cost 10 cycles).

Well, and extra cost and timing hassles resulting from another
experimental feature where thew SIMD ops can perform an inline vector
shuffle (vs needing to use external vector shuffles).

Maximum performance for the SIMD unit is 200 MFLOP/s at 50 MHz.

However, vector shuffle can eat into this a fair bit in certain cases.
There were some "very experimental" ops, like:
PMULSHX.H Rm, Imm48fv8sh, Rn

Which can combine a vector multiply against an immediate (4x S.E5.F6)
with a 4-element vector shuffle.

Which can allow the SIMD unit to operate "relatively close" to the 200
MFLOP/s hard-limit (and, at the same time, manage to outperform an
otherwise significantly faster early 2000's laptop at running a neural
net via x87 ops...).

Though, it is almost kinda moot though given this operation is otherwise
"extremely niche" (and a slightly newer Vista era laptop will stomp on
both of them...).

Also, it is pretty much entirely moot if one wants to be able to train
the net (which requires keeping the weight vectors and similar in RAM).

Some of this could be relevant to GLSL, but I am not yet sure how I
would get usable performance from code generated by a GLSL compiler
(but, uploading shaders as BJX2 ASM would be a bit non-standard...).

> add is easy: just add extra bits at the partition points, use carry-rollover
> to "bleed" the 7th bit's carry over to the 9th bit (etc.). the rest follow the same
> kind of principle.
>

I have both MMX and SSE style.

Packed Integer:
4x Int16
2x Int32
4x Int32

Packed Float:
4x Binary16
2x Binary32
4x Binary32
2x Binary64 (Main FPU)

With converters to/from some additional formats:
3x 20 bits (S.E5.F10:4);
64-bit storage <-> 4x Binary32
3x 10 bits (S.E5.F4)
32-bit storage <-> 4x Binary16
4x 8 bits:
E4.F4, E4.F3.S (FP8U/FP8S)
S.E3.F4 (A-Law; unit range only)
32-bit storage <-> 4x Binary16
3x 42-bit (S.E8.F23:10)
128-bit storage <-> 4x Binary64.
Packed/Unpacked piecewise from 2x Binary64, multi-step conversion.

The 3x formats mostly set W to either 0.0 or 1.0 on unpack, ignoring W
on pack.

And, some "partial" packed Integer <-> Float SIMD converters (to reduce
cost, the mechanism is slightly wonky).

luke.l...@gmail.com

unread,

Apr 24, 2023, 4:49:55 PM4/24/23

to

On Monday, April 24, 2023 at 1:46:57 AM UTC+1, MitchAlsup wrote:

> So if the bit-pair is {00,00,00,00,00,00,00,00} you have eight 8-bit adders SIMD
> .... if the bit-pair is {01,00,01,00,01,00,01,00} you have four 16-bit adders SIMD
> .... if the bit-pair is (01,01,01,00,01,01,01,00} you have two 32-bit adders SIMD
> .... if the bit-pair is {01,01,01,01,01,01,01,00} you have one 64-bit adder.

ta-daaa. from add you get subtract, from that you get compare.
but equals greater less-than are easy to construct, based on byte-level
analysis. equals is the obvious one, GT and LT require a Carry-Propagation
Cascade.
https://libre-soc.org/3d_gpu/architecture/dynamic_simd/eq/

eq0 = a[0:7] == b[0:7]
eq1 = a[8:15] == b[8:15]
eq2 = a[16:23] == b[16:23]
eq3 = a[24:31] == b[24:31]

and then you combine those based on the partitions.

multiply is also possible by doing 8x8->16-bit result multiply
blocks then performing a Wallace Tree reduction using,
ta-daaa, the above adder scheme, and you now have every
permutation of Dynamic SIMD multiply.

shift was hair-raising and there is a caveat: you can't make the
partitions too small, because if the shift amount is *also*
partitioned you lose the ability to correctly express the
amount of shifting required in a SIMD context.

but, it is all doable:
https://libre-soc.org/3d_gpu/architecture/dynamic_simd/

> But more exotic things can be constructed::
> .... if the bit pair is {01,00,01,01,00,01,01,00} you have one 16-bit add and 2×24-bit adds.

yes, i ended up accidentally implementing that in the Dynamic
Partitioned SIMD HDL Library.

> The reason you use a 2-bit control scheme comes into play when you want the
> individual terms to insert their own carries. ADD8C-like. The carry in from the
> operand is used to gate the bit-pair.

ooo niiice, i missed that. ok i did and i didn't. Power ISA has
carry-in carry-out (XER.CA), and yes that bit needs to go in at one end
and come out at the other.

l.

MitchAlsup

unread,

Apr 24, 2023, 5:15:17 PM4/24/23

to

grasshopper, why does it amaze you so..........

Quadibloc

unread,

Apr 24, 2023, 9:54:19 PM4/24/23

to

On Saturday, April 22, 2023 at 5:11:33 PM UTC-6, MitchAlsup wrote:

> Warning "Will Robinson":: the slope is very slippery; but candidates
> are from the set {AND, OR, XOR, <<1, <<2, >>1, >>2, ROT <<1, ROT >>1}

And here I thought that a barrel shifter was standard equipment on
any processor above the tiniest size these days, and so *all* shifts and
rotates are usually single-cycle operations.

John Savard

MitchAlsup

unread,

Apr 24, 2023, 9:59:19 PM4/24/23

to

As I have repeatedly stated: Nobody has use a barrel shifter (as seen
in Meade Conway) since 1990--we all use multiplexer shifters.
<
A barrel shifter has the property that it is n+n wires wide
A multiplexer shifter has the property that it is n+lnk(n) bits wide.
(Where k is 2 or 4) {IBM circa 1989±}
<
This is similar to people giving Wallace credit for how multiplier trees
are constructed, whereas all multiplier trees use Dadda form these
days (Wallace is a peculiar subset of Dadda.) Wallace wrote the
paper, Dadda wrote the book.
>
> John Savard

Quadibloc

unread,

Apr 24, 2023, 10:26:24 PM4/24/23

to

On Monday, April 24, 2023 at 7:59:19 PM UTC-6, MitchAlsup wrote:

> As I have repeatedly stated: Nobody has use a barrel shifter (as seen
> in Meade Conway) since 1990--we all use multiplexer shifters.
> <
> A barrel shifter has the property that it is n+n wires wide
> A multiplexer shifter has the property that it is n+lnk(n) bits wide.
> (Where k is 2 or 4) {IBM circa 1989±}

Oh, dear. What I was thinking of, then, would be a multiplexer shifter.

John Savard

BGB

unread,

Apr 24, 2023, 11:01:32 PM4/24/23

to

FWIW:

I actually ended up actually using funnel shifters, with some logic to
reconfigure the input values depending on what type of shift/rotate was
being performed.

Shift Left:
High Input: Rs
Low Input: 0
Offset: 64-Ri
Shift Right:
High Input: 0 or SExt(Rs)
Low Input: Rs
Offset: Ri (or (-Ri) / (1+(~Ri)) )
Rotate Left:
High Input: Rs
Low Input: Rs
Offset: 64-Ri ( or (65+(~Ri)) )

Essentially, the shift extracting a 64-bit value from within a sliding
128-bit window (which can in turn handle everything from 0..63).

With the actual shift being essentially a stack of multiplexers driven
by each bit of the offset.

And, with some additional trickery, two such 64-bit shift units can
combine to perform a 128-bit shift or rotate.

Add some extra special case handling to deal with the shift-values being
signed for SHADx/SHLDx:
Ri>0: Shift Left
Ri<0: Shift Right

Well, and 64-Ri => 65+(~Ri), ...

And, add some additional logic to support 32-bit shifts (hint, more
stuff about configuring the values in the input window, ...).

But, yeah, something to this effect...

> John Savard

Quadibloc

unread,

Apr 24, 2023, 11:23:32 PM4/24/23

to

I couldn't find a definition of a "multiplexer shifter" online.

After further searching, I found one reference to what I was thinking of:
something which had a layer that shifted by 1, another layer that shifted by
2, another layer that shifted by 4, and so on.

But the source I saw it in called it a "logarithmic barrel shifter". It noted that
each stage would do a rotate, and a mask would change the rotate into a shift
as desired. Plus the switch from a left shift to a right shift would be done by
optionally reversing all the bits on entry and exit.

John Savard

Quadibloc

unread,

Apr 24, 2023, 11:34:26 PM4/24/23

to

Another paper that I found described both a barrel shifter and a funnel
shifter as having layers that shift by powers of two. With a funnel shifter,
the difference between a shift and a rotate is whether you fill both halves,
or just one, of the double-width input, which is simpler than preparing a
mask.

John Savard

luke.l...@gmail.com

unread,

Apr 25, 2023, 4:35:13 AM4/25/23

to

because relatively speaking this is all entirely new to me! :)
i say "relatively", it's been 4 years...

l.

BGB

unread,

Apr 25, 2023, 3:39:44 PM4/25/23

to

I have been poking around with this stuff for around 7 years now...

I am much older now than when I started it seems...

Elsewhere, an online argument came up, where arguably my graphics
hardware is much crappier than some people would expect for the class of
FPGA I am using.

Eg:
320x200 hi-color
640x400 color-cell (or 16-color RGBI)
800x600 color-cell (or 4-color CGA-like)
1024x768 text-cell (or 1-bpp monochrome)

In text-cell mode, one could still sorta do graphics by loading the
graphics tiles into the "Font RAM".

But...
Higher bpp modes, RAM fetch can't keep up with screen refresh.
Or, if I do Block-RAM... It sorta eats all the Block-RAM.

Still doesn't help that my DDR RAM controller is kinda slow.

But, the "effort curve" for a lot of this is pretty steep.

...

MitchAlsup

unread,

Apr 25, 2023, 4:10:20 PM4/25/23

to

On Tuesday, April 25, 2023 at 2:39:44 PM UTC-5, BGB wrote:
> On 4/25/2023 3:35 AM, luke.l...@gmail.com wrote:
> > On Monday, April 24, 2023 at 10:15:17 PM UTC+1, MitchAlsup wrote:
> >> On Monday, April 24, 2023 at 3:49:55 PM UTC-5, luke.l...@gmail.com wrote:
> >>>> The reason you use a 2-bit control scheme comes into play when you want the
> >>>> individual terms to insert their own carries. ADD8C-like. The carry in from the
> >>>> operand is used to gate the bit-pair.
> >> <
> >>> ooo niiice,
> >> <
> >> grasshopper, why does it amaze you so..........
> >
> > because relatively speaking this is all entirely new to me! :)
> > i say "relatively", it's been 4 years...
> >
> I have been poking around with this stuff for around 7 years now...
<

If I had only been poking around for 7 years I would only be 10% older..........

BGB

unread,

Apr 26, 2023, 12:59:32 AM4/26/23

to

Yeah... More around 18% of my lifespan in my case...

Was near the start of my 30s at the start of this project, now I am near
the end...

And I am using BGBCC as my C compiler, which came into existence (as its
own entity) when I was in my early 20s, based on fork of a VM I wrote
back when I was still a teenager (for a language I would later end up
calling BGBScript); end then shelved for a while, as it turned out C did
not make for a particularly good scripting language (and the VM was a
pain to debug). But then roughly a decade later I had need for a C
compiler, and had a still mostly intact C front-end,...

Well, nevermind if BGBCC can still sorta compile my script language,
which was in turn influenced by JavaScript and ActionScript (and
following after a Scheme interpreter I had written earlier during my
time in high-school).

Partly this was in turn because in my youth I had also crossed paths
with things like Adobe Flash and similar.

...

robf...@gmail.com

unread,

Apr 26, 2023, 9:56:49 AM4/26/23

to

I be in my late 50's now. Into computers for 40+ years and I am still
learning new things. It amazes me the kind of knowledge accumulation
required.

Just on the topic of size of OoO cores, they can be fit into a large Artix
FPGA if one is careful and has not too many varieties of operations. I
have seen a superscalar core fit into as little as 10k LUTs. I have been
experimenting some with scalar OoO CPUs rather than superscalar.
The idea was to hide some of the memory latency. I have found that
memory access seems to be what is limiting the cores a lot of the
time. It does not matter if it is superscalar if it cannot be fed
instructions and data fast enough; it sits idle. OoO memory
operations can really hide latency.

If it takes four clocks to get data from the cache and 30 clocks to get
external data, using an extra clock or two in the CPU core for
processing has a low impact.

MitchAlsup

unread,

Apr 26, 2023, 11:49:19 AM4/26/23

to

Yes, indeed, just about the time one has been exposed to enough of
what computers are made of to be a good computer architect, it is
time to retire.

>
> Just on the topic of size of OoO cores, they can be fit into a large Artix
> FPGA if one is careful and has not too many varieties of operations. I
> have seen a superscalar core fit into as little as 10k LUTs. I have been
> experimenting some with scalar OoO CPUs rather than superscalar.
> The idea was to hide some of the memory latency. I have found that
> memory access seems to be what is limiting the cores a lot of the
> time. It does not matter if it is superscalar if it cannot be fed
> instructions and data fast enough; it sits idle. OoO memory
> operations can really hide latency.
<

I have been watching what happens when one gets 1 constant per
instruction, and where that constant can be of any size. Many
loops can get unrolled, and all of the memory references eliminated,
leaving only a string of FMAC instructions......
<
If you were to compare to a more typical RISC ISA, the more typical RISC
architecture would be executing 2× to 3× more instructions per clock
but take more time to perform the calculation......These extra instructions
are accessing the constants which were directly in the instruction stream
of the atypical ISA.

BGB

unread,

Apr 27, 2023, 12:00:25 AM4/27/23

to

I haven't been around that long, but don't think I am doing too horribly
at least...

As can be noted, a lot of my early code wasn't particularly good, and it
wasn't really until my 20s that my coding skills "stopped sucking
enough" that I really started making all that much progress.

Sometimes, it seems like my "biggest achievements" were in the past, but
a lot of my code and engineering back then was pretty terrible (and some
of my current tech is still in the shadow of past design missteps).

But, yeah, back when I was in high-school, some things were still pretty
hyped:
XML
Adobe Flash (and Flash animations)
The Web / websites / ...
Intel Itanium
...

I guess, in the wake of both "the dot-com crash" and "the whole 9/11
thing", ...

Java was still popular, but its initial hype was starting to fade. In
this case, .NET was still pretty new, and didn't reach "full hype"
status until a few years later (as seemingly most of the "Java fandom"
started to jump ship over to .NET).

AMD having released x86-64 around this time also meant "the writing was
on the wall" for Itanium; and XML quickly went from being praised to
being despised; ...

Can't say I was entirely immune to some of this...

*cough* me at the time proceeding to implement a JavaScript/ActionScript
clone on top of using an XML DOM implementation for the ASTs (and
DOM-tree-walking logic to run for the interpreter).

Yes... Its performance was epic...

Ironically, it was a fork of this that later became BGBCC...

However, given the initially horrible results with my DOM walking script
interpreter, I re-implemented this VM by writing a parser for my JS/AS
variant and then running this on top of the core of the Scheme
interpreter I had written previously (which had initially worked by
using an S-Expression walking design, but was then modified to compile
the cons-lists / S-Expressions into a stack-oriented bytecode).

The above VM survived mostly until the early 2010s, when I wrote a new
VM which has instead used ex-nihilo objects as the main AST building
block in opposition to cons-cell lists; and had otherwise moved to a
"fully static typed" core (but keeping dynamic types as an optional
feature). And, the language had migrated from being "more like
ActionScript" to "more like Java" in the name of better performance.

By this point, I had also moved BGBCC's core over to static typing as
well. Its original type-system was more a mimic of C's type-system built
on top of dynamic type system.

However, BGBCC still retains various "scars" from what it once was (and
is in a way an unbroken like from some of my earliest code).

And, my C variant still retains support for dynamic types, and my ISA
partly ended up being designed around supporting them, because they are
"still useful sometimes".

Sometimes, it seems like I am always living in the shadow of my past self.

I also sit here with an elderly cat who has also been with me through
much of this journey (and nearly half of my lifespan thus far). Sadly,
neither of us will live forever...

> Just on the topic of size of OoO cores, they can be fit into a large Artix
> FPGA if one is careful and has not too many varieties of operations. I
> have seen a superscalar core fit into as little as 10k LUTs. I have been
> experimenting some with scalar OoO CPUs rather than superscalar.
> The idea was to hide some of the memory latency. I have found that
> memory access seems to be what is limiting the cores a lot of the
> time. It does not matter if it is superscalar if it cannot be fed
> instructions and data fast enough; it sits idle. OoO memory
> operations can really hide latency.
>

Yeah. I can currently fit two BJX2 cores into the XC7A200T I am
currently using (and using ~ 67% of the total LUT budget).

But, yeah, by current models (in my emulator) roughly 44% of the
clock-cycles (in Doom) are spent waiting for memory access (the bulk of
this being L2->DRAM).

But, from what I have seen, I suspect my emulator is still a little
optimistic on this front (my Verilog implementations seem to spend more
clock-cycles waiting for RAM).

In both the emulator and partial simulation, if one assumes a 100% L2
hit rate, Doom is pegged at 30fps (its internal frame-rate limiter, *1),
and GLQuake is "mostly playable" (typically high single-digit or
double-digit at 50MHz).

In GLQuake, in the "100% L2 hit" scenario, the majority of CPU time is
also spent in the OpenGL backend rather than in the BSP walk and similar
(seemingly "R_RecursiveWorldNode" and similar causing a *lot* of L2
cache misses).

In the "L2 always hits" scenario, it also seems drawing pixel spans
becomes one of the major "hot spots" in terms of clock-cycle budget.

For my newer FPGA board, it is a partial tradeoff as it can have a
bigger L2 cache (512K vs 256K, due to more block RAM), but has slower
DRAM access (due to the higher minimum CAS latency from the DDR3 chip used).

Doesn't really effect Dhrystone, which already has a pretty much 100% L2
hit rate.

*1: Though one can exclude Doom in the "GUI Mode" from this, as now the
"UpdateWindowStack" code eats a big chunk of CPU time; seemingly L2
cache misses are not the primary cause of its slowness...

By default, the emulator and partial simulation try to model the full
memory hierarchy; mostly to try to keep performance consistent with the
full implementation.

> If it takes four clocks to get data from the cache and 30 clocks to get
> external data, using an extra clock or two in the CPU core for
> processing has a low impact.

OK.

I can note there are reasons my own code is using formats like Binary16
and similar so often...

And, for TKRA-GL using an RGB555 framebuffer, Z16 / Z12.S4 depth buffer,
and typically using texture compression. It isn't just about saving RAM,
so much as saving RAM can also help with performance.