
Floating Point Constants


MitchAlsup

Aug 4, 2022, 4:28:24 PM
Does anyone have a reference where a group of people measured the
percentage of floating point operands that are constant/immediate.

10 minutes of Google turns up billions of useless papers and threads
on everything other than what I am looking for.

BGB

Aug 4, 2022, 10:50:11 PM
( Retry, clicked wrong button at first and unintentionally sent an
email... )
I don't know of any papers or anyone else looking into this.


Gleaning a few stats from my BGBCC output while compiling my port of
GLQuake:
FADD: 541
FSUB: 321
FMUL: 968
FCMP/etc: 2329 (FCMPxx, FNEG, FABS, ...)
Total: 4159
FP Constant Loads:
Binary16: 1146
Binary32: 78
Binary64: ~ 100 (Guesstimate, *1)
Total: ~ 1324

*1: Being more accurate on this estimate would require adding something
to differentiate between Binary64 loads and other types of 64-bit
constants. For the others, it is less ambiguous.
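The Binary16/32/64 breakdown above can be reproduced mechanically. A sketch (in Python, using `struct`'s half/single round-trips; this is not BGBCC's actual method) of classifying a constant by the narrowest IEEE format that holds it exactly:

```python
import struct

def min_fp_width(x: float) -> int:
    """Narrowest IEEE binary format (16, 32, or 64 bits) that
    represents x exactly, found by a pack/unpack round-trip."""
    for fmt, bits in (("e", 16), ("f", 32)):
        try:
            if struct.unpack(fmt, struct.pack(fmt, x))[0] == x:
                return bits
        except OverflowError:
            pass  # magnitude out of range for this format
    return 64
```

Running such a classifier over a compiler's constant pool gives exactly the Binary16/Binary32/Binary64 split quoted above, with no guesswork for the 64-bit case.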

This isn't something I had looked into before, so I don't really have
dedicated stats for it.


From a quick skim of the ASM output dump, it looks like a fair chunk
of the FPU ops use immediate values.

Totals would imply an average case of around 32% of FPU ops needing a
constant. This roughly agrees with my attempts at skimming the ASM dump.

In terms of relative constant-density ranking, high to low:
FCMPGT
FCMPEQ
FMUL
FADD
FSUB
...

For FCMP, constants appear to be very common, slightly less so for FMUL,
and a relative minority for FADD and FSUB.


I guess someone could go and add FP immediate forms.

It looks like these would maybe save around 0.2% to 0.4% of the total
clock-cycle budget in this case.

Would save maybe a little more if dealing with a more FP dominated
use-case. GLQuake spends a decent chunk of time in TKRA-GL, where the
backend rasterizer functions and similar are at present pretty much
entirely built on fixed-point integer stuff.

But, I don't really have many FP dominated workloads in this case.


Dunno if of any real use here...

MitchAlsup

Aug 5, 2022, 10:38:31 AM
VAX did a long time ago
My 66000 did circa 2010 (in this case they fell out for free.)
>
> It looks like these would maybe save around 0.2% to 0.4% of the total
> clock-cycle budget in this case.
>
> Would save maybe a little more if dealing with a more FP dominated
> use-case. GLQuake spends a decent chunk of time in TKRA-GL, where the
> backend rasterizer functions and similar are at present pretty much
> entirely built on fixed-point integer stuff.
<
Blending (LERP) does:: z = x×a + y×(1.0-a)
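As a concrete illustration (not from the post), the blend above in code, with the literal `1.0` being the floating-point constant at issue:

```python
def lerp(x: float, y: float, a: float) -> float:
    """Blend x toward y: z = x*a + y*(1.0 - a).
    The 1.0 is the FP constant an immediate form would absorb."""
    return x * a + y * (1.0 - a)
```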

EricP

Aug 5, 2022, 11:58:09 AM
I looked through the VAX papers on instruction set usage stats
but they don't break out floating point constants.

I searched for a while but couldn't find any stats either.

Some points:

- Many RISC ISAs don't have float immediates at all.
Alpha uses IP-relative addressing to load constants, including floats,
from tables located just prior to the routine entry point.

Superficially these show up as regular loads.

- The stats will be skewed by optimization. I saw various references
to whether compiler optimization did/didn't do constant folding.

These could show up as multiple constant loads where one might have
done with constant folding. But then you get into the language rules
for rearranging floating point expressions.

One issue I have with many of the ISA usage papers is that they
simply scan and count instruction types, but don't look deeper
for idioms to try to infer why some sequence was being done.
E.g. It would be nice to know the difference between a compare
guarding a bounds check trap and a compare for program logic.
And then that frequency in turn depends on the source language.

MitchAlsup

Aug 5, 2022, 1:42:30 PM
Thanks for reminding me that VAX data has int, pointer, and float
in their #immed data.
>
> I searched for a while but couldn't find any stats either.
>
> Some points:
>
> - Many RISC ISA's dont have float immediates at all.
<
Specifically, I have been tasked with comparing RISC-V with My 66000.
One has constants, one does not. I have usefully good data on ints and
pointers but not on FP.
<
I am specifically wondering if FP constants are used often enough to
bring up the topic in my comparison. My experience with big numerics
says no, my experience with GPU says yes--for example:
LERP = x×a + y×(1.0-a)
So, I don't have enough data to form an opinion.
<
> Alpha uses IP-rel addressing to load constants, including floats,
> from tables located just prior to the routine entry point.
<
It is simpler to say Alpha does not have floating point constants.
>
> Superficially these show up as regular loads.
<
If you have to load it, it is not a constant.
constant = setof{ immediate, displacement }
>
> - The stats will be skewed by optimization. I saw various references
> to whether compiler optimization did/didnt do constant folding.
<
My other problem is that I do not have access to the code bases which
cost money to access (SPEC for example) whereas I do have a usefully
good LLVM port to My 66000 with working clang and flang. For example
I can get Linpack, Livermore Loops, certain sections of BLAS, but not
the applications which might use those.
>
> These could show up as multiple constant loads where one might have
> done with constant folding. But then you get into the language rules
> for rearranging floating point expressions.
>
> One issue I have with many of the ISA usage papers is that they
> simply scan and count instruction types, but don't look deeper
> for idioms to try to infer why some sequence was being done.
<
I have this in spades:: whereas RISC-V has compare-and-branch with
11-bit target range, My 66000 has compare to zero and branch, and
a compare instruction CoIssued with the successive branch on condition
instruction. My branches have 18-bit target range (or 28-bit for unconditional)
So, any fair comparison needs to take the instruction count, the execution
cycles, and the number of times fix-ups are required into account.
<
I also have to figure out how to rate Predication versus branching only...
<
RISC-V thinks 123456/0 = 0xffffffffffffffff.
My 66000 thinks 123456/0 = DIVZERO exception, and if the exception is
not enabled, (sign)123456/0 = {signed,unsigned}MAXimum.
<
RISC-V literature says there are 50 integer instructions. The actual
assembly reference manual has a list that I counted to be 150
different instructions. My 66000 has similar "counting" problems
depending on where you are looking {assembler "spellings", actual
OpCodes, expanded OpCodes, ...}
<
My 66000 has a switch instruction (JTT; jump through table) that
performs the range comparison [0..k] and sends out of range to
the k+1th entry in the table, and the table can be bytes, half, words
and is PIC. RISC-V has none of this and has to use AUIPC to get
PICed.
<
RISC-V is proud of their MOV FP<->INT instruction showing that
extracting and debiasing the exponent of an FP number can be
done in 5 instructions. My 66000 has a single EXPON instruction
that does the same work.
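For reference, a sketch of what that extract-and-debias sequence computes (move the raw Binary64 bits to an integer register, then shift, mask, and subtract the bias; the Python is purely illustrative, and the name `expon` just echoes the My 66000 mnemonic):

```python
import struct

def expon(x: float) -> int:
    """Unbiased exponent of a normal Binary64 value: reinterpret
    the bits as an integer, shift out the significand, mask the
    11 exponent bits, and subtract the 1023 bias."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return ((bits >> 52) & 0x7FF) - 1023
```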
<
> E.g. It would be nice to know the difference between a compare
> guarding a bounds check trap and a compare for program logic.
> And then that frequency in turn depends on the source language.
<
If it were easy, anyone (or his brother) could do it.
<
Of all of these, access to suitable code bases is the hardest.....
<
Also note: I am not being paid to do this, and don't have the kind
of cash needed to just spring for the code myself.

BGB

Aug 5, 2022, 1:56:19 PM
Doesn't fall out for free in my case; I would need some new encodings
and to throw some format converters into ID2 (at least, assuming a lack
of 64-bit immediate values; given these only exist as FE-FE-F8
encodings, this is unlikely).

Only option other than ID2 is to put the FP converter into the
instruction decoder, which I guess could be "less bad" than putting it
into the register handling logic (they would then follow the same path
as 64-bit immediate values in ID2, rather than creating a new path).



Some possible encodings:
FFw0_0vii-F0nm_5go8 FADD Rm, Ro, Rn, Imm8 //(Exists) Imm=rounding
FFw0_0Vii-F0nm_5gi8 FADD Rm, Imm16, Rn //(Possible), Adds Immed

Also possible would be putting them in the F2 block:
FFw0_PPjj-F2nm_0gjj OP Rm, Imm18s, Rn

Where, say: PP: 00=ADD, 58=FADD, 59=FSUB, 5A=FMUL


Main limitation to the idea is mostly that, while it seems around 32% of
the FPU operations use constants, FPU instructions are not a big enough
part of the total clock cycles for this to likely make all that big of a
difference in terms of performance.


In general, things like Binary16 and Binary32 constant-load encodings
(*1) are still a lot better than using memory loads here (which most of
the other ISAs seem happy with using).

F88n_iiii FLDCH Imm16, Rn //Load Binary16
FFw0_iiii_F88n_iiii FLDCF Imm32, Rn //Load Binary32



>>
>> It looks like these would maybe save around 0.2% to 0.4% of the total
>> clock-cycle budget in this case.
>>
>> Would save maybe a little more if dealing with a more FP dominated
>> use-case. GLQuake spends a decent chunk of time in TKRA-GL, where the
>> backend rasterizer functions and similar are at present pretty much
>> entirely built on fixed-point integer stuff.
> <
> Blending (LERP) does:: z = x×a + y×(1.0-a)

In TKRA-GL, the bilinear interpolation also used fixed-point...

Pretty much everything past the transformation and projection stage is
fixed point in this case.


This would change for HDR, but HDR would likely use Binary16 SIMD for
pixels and fixed-point for pretty much everything else. LERP would
likely still use fixed point for bilinear, because it still mostly
works OK on floating-point values as long as the exponents are similar
(with larger exponent differences, fixed-point LERP develops an obvious
S-curve effect, but this isn't too much of an issue for texture
interpolation, as a sudden change usually also means an edge-like
feature or similar in the texture).

No idea what would happen if/when it gets a shader compiler; likely
either I would need to rework a few things, or make the shader compiler
have "meta types", where it figures out which vectors are "actually"
using fixed-point and merely pretends that everything is FP-SIMD.

Trying to "naively" compile the shaders (and then put FP->Fixed
conversion in the "texture2D()" calls and similar; or worse, use an
actual function call here), would perform like garbage.

For now, I will ignore this.

BGB

Aug 5, 2022, 2:17:42 PM
I suspect most ISAs don't have these.
Hell, not even x86 has these (at least as of when I last looked into
it; I mostly ignore AVX and newer).

I have a Binary16 immediate load instruction and similar; this is still
more than most of them have (and a lot better than using memory loads IMO).


> - The stats will be skewed by optimization. I saw various references
>   to whether compiler optimization did/didnt do constant folding.
>
>   These could show up as multiple constant loads where one might have
>   done with constant folding. But then you get into the language rules
>   for rearranging floating point expressions.
>

In my case, my C compiler evaluates constant expressions when possible.

Things like:
y=x/(1.0+2.0);
Will get transformed into, say:
y=x*0.333333;

Eliminating FDIV whenever possible also makes sense, since this is
generally significantly slower than the other instructions.
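A hedged illustration (not from the post) of why the folded form is not always equivalent: the reciprocal multiply rounds twice, once for the reciprocal constant and once for the product, so it can land one ULP away from the correctly rounded quotient that a true divide must deliver:

```python
# fl(1/3) is already rounded, so the folded multiply rounds twice.
r = 1.0 / 3.0
# Sometimes the two roundings cancel out...
three_thirds = 3.0 * r        # rounds back to exactly 1.0
# ...and sometimes they land one ULP below the true quotient.
folded = 7.0 * r
divided = 7.0 / 3.0           # correctly rounded in a single step
```

This is the trade behind the transformation: a one-ULP wobble in exchange for avoiding a much slower FDIV.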

FMUL: 6 cycles.
FDIV: 120 cycles.


FDIV is also rare for most things, apart from Quake's software
renderer, where it is kind of a boat anchor (perspective-correct
texture mapping needs fast FDIV).

In my case, TKRA-GL uses affine projection (with dynamic tessellation),
which at least mostly avoids the FDIV issue.


> One issue I have with many of the ISA usage papers is that they
> simply scan and count instruction types, but don't look deeper
> for idioms to try to infer why some sequence was being done.
> E.g. It would be nice to know the difference between a compare
> guarding a bounds check trap and a compare for program logic.
> And then that frequency in turn depends on the source language.
>

Probably.

JimBrakefield

Aug 5, 2022, 2:32:02 PM
On Friday, August 5, 2022 at 10:58:09 AM UTC-5, EricP wrote:
A rough and risky approach is to posit that floating-point constants are as frequent as integer constants, normalized by the relative occurrence rates of integer and floating-point operations. Another possible assumption is that the total number of unique floating-point constants is relatively small for a given program, so the linker could put them in a single memory area referenced by a short offset. I.e., choose between inlined single or double immediate constants or a memory/cache reference. My preference is to give the user that choice.

MitchAlsup

Aug 5, 2022, 3:01:08 PM
Which, BTW, does not achieve 0.5 ULP accuracy, whereas y=x/3.0 does.
>
> Eliminating FDIV whenever possible also makes sense, since this is
> generally significantly slower than the other instructions.
>
> FMUL: 6 cycles.
> FDIV: 120 cycles.
>
FMUL 6 cycles
FDIV 18 cycles

MitchAlsup

Aug 5, 2022, 3:09:30 PM
useable
<
> Another possible assumption
> is the total number of unique floating-point constants is relatively small for
> a given program and have the linker put them in a single memory area
> referenced by a short offset. I.e. chose between in lined single or double
> immediate constants or a memory/cache reference.
<
Here, the use of an FP constant costs 2 instructions, one to LD and one to calculate,
compared to 1 instruction and 1 issue cycle, where the instruction is 64 bits (32-bit
immediate which can be expanded to 64-bits depending on FP calculation size) or
96 bits (with 64-bit immediate). Also note: these immediates come through the
instruction cache and do not need read access to that page, so the data cache
is not polluted, nor is the TLB disturbed.
<
> My preference is to give the user that choice.
<
My 66000 has FP immediates in the instruction itself:
<
FDIV R9,#3.14159265358979,R7
or
FDIV R9,R7,#3.14159265358979
<
and Brian's LLVM port already does this. My problem is comparing one that does
against one that does not.

Marcus

Aug 5, 2022, 3:11:35 PM
I'm not sure if it would help, but I find that ffmpeg (GPL) is a
relatively large, portable code base that is heavy on data processing
(codecs), and it has some FP code too. I have not analyzed it
thoroughly, though, and you may have to tweak the build (configuration
parameters) to get all the interesting codecs built.

It built out-of-the-box for MRISC32 (no OS, just newlib) - so it's
portable alright.

https://ffmpeg.org

>>
>> These could show up as multiple constant loads where one might have
>> done with constant folding. But then you get into the language rules
>> for rearranging floating point expressions.
>>
>> One issue I have with many of the ISA usage papers is that they
>> simply scan and count instruction types, but don't look deeper
>> for idioms to try to infer why some sequence was being done.
> <
> I have this in spades:: whereas RISC-V has compare-and-branch with
> 11-bit target range, My 66000 has compare to zero and branch, and
> a compare instruction CoIssued with the successive branch on condition
> instruction. My branches have 18-bit target range (or 28-bit for unconditional)
> So, any fair comparison needs to take the instruction count, the execution
> cycles, and the number of times fix-ups are required into account.

Don't forget, RISC-V compare-and-branch only works with registers, so if
you want to compare to an immediate you need an extra instruction to
load the immediate into a register.

RISC-V (2 instructions):

li a2, 55
blt a2, a0, foo

MRISC32 (2 instructions):

sle r2, r1, #55
bns r2, foo

...the RISC-V compare-and-branch shines for loops (where the loop count
is preloaded into a register), which is what it was optimized for, I
suppose.

Stephen Fuld

Aug 5, 2022, 3:15:16 PM
On 8/5/2022 11:32 AM, JimBrakefield wrote:
> On Friday, August 5, 2022 at 10:58:09 AM UTC-5, EricP wrote:
>> MitchAlsup wrote:
>>> On Thursday, August 4, 2022 at 9:50:11 PM UTC-5, BGB wrote:
>>>> ( Retry, clicked wrong button at first and unintentionally sent an
>>>> email... )
>>>> On 8/4/2022 3:28 PM, MitchAlsup wrote:
>>>>> Does anyone have a reference where a group of people measured the
>>>>> percentage of floating point operands that are constant/immediate.

> A rough and risky approach is to posit floating point constants are as frequent as integer constants normalized by the relative occurrence rates for integer and floating-point operations. Another possible assumption is the total number of unique floating-point constants is relatively small for a given program

I would guess that is probably true, but I suggest a different approach
than yours. Why not put those few constants in ROM within the CPU and
reference them from there? It would require some ISA modifications, but
it would eliminate the load completely.


> and have the linker put them in a single memory area referenced by a
> short offset. I.e. chose between in lined single or double immediate
> constants or a memory/cache reference. My preference is to give the
> user that choice.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)


MitchAlsup

Aug 5, 2022, 3:17:42 PM
On Friday, August 5, 2022 at 2:11:35 PM UTC-5, Marcus wrote:
My 66000 has the LOOP instruction which does ADD:CMP:branch back to the
top of the loop as 1 instruction. Using this, loops can execute as wide as the
cache access port width.
<
VEC...LOOP are the access means to SIMD functionality.
So a low-end machine can do byte-by-byte copy loops at 32 iterations per
clock (160 instructions per clock).
A higher-end machine could do DGEMM at 4 iterations per cycle (32 IPC).

MitchAlsup

Aug 5, 2022, 3:19:05 PM
On Friday, August 5, 2022 at 2:15:16 PM UTC-5, Stephen Fuld wrote:
> On 8/5/2022 11:32 AM, JimBrakefield wrote:
> > On Friday, August 5, 2022 at 10:58:09 AM UTC-5, EricP wrote:
> >> MitchAlsup wrote:
> >>> On Thursday, August 4, 2022 at 9:50:11 PM UTC-5, BGB wrote:
> >>>> ( Retry, clicked wrong button at first and unintentionally sent an
> >>>> email... )
> >>>> On 8/4/2022 3:28 PM, MitchAlsup wrote:
> >>>>> Does anyone have a reference where a group of people measured the
> >>>>> percentage of floating point operands that are constant/immediate.
> > A rough and risky approach is to posit floating point constants are as frequent as integer constants normalized by the relative occurrence rates for integer and floating-point operations. Another possible assumption is the total number of unique floating-point constants is relatively small for a given program
> I would guess that is probably true, but I suggest a different approach
> than yours. Why not put those few constants in ROM within the CPU and
> reference them from there? It would require some ISA modifications, but
> it would eliminate the load completely.
<
I did this in my Denelcor compiler code generator for a useful subset of
the constants (found in the code).

BGB

Aug 5, 2022, 7:53:46 PM
The 120-cycle HW FDIV gets ~0.5 ULP, whereas the software divide
(Newton-Raphson based) seems limited to somewhere around 3.5 ULP
or so for Binary64.


For most things, "X/C" vs "X*(1.0/C)" doesn't have any real practical
effect on behavior, but if the latter is 20x faster, this is what it is
going to be.

Likewise, 0.5 ULP does not seem to be a requirement of the C standard.
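A minimal sketch of the Newton-Raphson reciprocal iteration in question (the power-of-two seed here is a hypothetical stand-in; a real implementation would use an estimate table or instruction, and the final rounding is what limits ULP accuracy):

```python
import math

def nr_recip(d: float, iters: int = 6) -> float:
    """Newton-Raphson reciprocal: x <- x * (2 - d * x), which
    roughly doubles the number of correct bits per iteration."""
    assert d > 0.0
    x = 2.0 ** -math.floor(math.log2(d))  # crude seed, relative error < 0.5
    for _ in range(iters):
        x = x * (2.0 - d * x)
    return x
```

Because the last multiply rounds an already-approximate value, the result is not guaranteed correctly rounded, which is consistent with the few-ULP figure quoted above.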


>>
>> Eliminating FDIV whenever possible also makes sense, since this is
>> generally significantly slower than the other instructions.
>>
>> FMUL: 6 cycles.
>> FDIV: 120 cycles.
>>
> FMUL 6 cycles
> FDIV 18 cycles

Not going to get 18 cycles from a Shift-Add unit; that would need a
divider that can do multiple bits per clock cycle.


Going and enabling FDIV:
GLQuake FDIV=1.65% (vs FMUL=2.86% and FADD=2.51%)
SWQuake FDIV=2.42% (vs FMUL=8.40% and FADD=4.85%)

Its relative slowness doesn't really seem to matter too much...

MitchAlsup

Aug 5, 2022, 8:24:25 PM
Are these overall time spent executing ?op, or occurrence of ?op ??

robf...@gmail.com

Aug 5, 2022, 9:12:11 PM
It may be interesting to know the kind of precision required for floating-point constants.
VAX had six-bit float immediates, I think. If there are 16 bits available in the instruction for a constant,
there may be a lot of float constants that could be mapped to a higher precision. The whole 64-bit or
128-bit constant may not need to be encoded for constants to be useful.


Andy

Aug 5, 2022, 9:29:08 PM
On 6/08/22 05:56, BGB wrote:

<snip>

>>> Would save maybe a little more if dealing with a more FP dominated
>>> use-case. GLQuake spends a decent chunk of time in TKRA-GL, where the
>>> backend rasterizer functions and similar are at present pretty much
>>> entirely built on fixed-point integer stuff.
>> <
>> Blending (LERP) does:: z = x×a + y×(1.0-a)
>
> In TKRA-GL, the bilinear interpolation also used fixed-point...
>
> Pretty much everything past the transformation and projection stage is
> fixed point in this case.

I'm assuming you've implemented some kind of deferred shading tile
renderer, since block-RAMs would seem to be the perfect fit for tile
buffers if they aren't too oddly sized.

BGB

Aug 5, 2022, 9:41:08 PM
Percentage of clock cycle budget spent executing these instructions,
according to my emulator (which does account for the clock-cycle costs
of the various instructions, including for things like pipeline
interlocks and cache-miss costs and similar).


Granted, I can disable this modeling via the command line (where the
emulator then uses simpler models and assumes constant-cycle costs for
the various instructions) and get a significant speedup.

Of these checks, cache hit/miss modeling is the most significant
(as it is the most dynamic and context dependent), so disabling
cache-miss modeling causes the estimates to be wildly inaccurate.

BGB

Aug 5, 2022, 10:01:23 PM
That is basically the case in my experience here.

The vast majority of constants in a program tend to be things like
100.0, 1.375, 420.0, ..., which can be expressed exactly as Binary16
values (which then show up as if they were the original Binary64
constants).

In general, around 90% of the FP constants can fit into Binary16 and be
represented exactly, without any loss of precision due to the
limitations of the format.

However, going much smaller than Binary16, and this starts dropping off
fairly rapidly (so, while Binary16 can represent most of the constants
exactly, an 8 or 9 bit format can represent relatively few).


This basically means that FPU immediate instructions would mostly need
around 16 bits or so to be particularly useful (maybe 12 could work,
but that is pushing it).

Though, some more statistical modeling could be useful here.

>

MitchAlsup

Aug 5, 2022, 10:10:46 PM
In my case: FP immediates come in 32-bit and 64-bit flavors.
32-bit FP immediates are converted to 64-bit FP immediates during operand delivery
when the calculation is double and left alone when the calculation is 32-bits.
<
Since My 66000 does not currently have FP8 or FP16 the rest of the size question is moot.

Stephen Fuld

Aug 6, 2022, 1:05:06 AM
On 8/5/2022 7:01 PM, BGB wrote:
> On 8/5/2022 8:12 PM, robf...@gmail.com wrote:
>> It may be interesting to know the kind of precision required for
>> float-point constants.
>> VAX had six-bit float immediates I think. If there are 16-bits
>> available in the instruction for a constant,
>> there may be a lot of float-constants that could be mapped to a higher
>> precision. The whole 64-bit or
>> 128-bit constant may not need to be encoded for constants to be useful.
>>
>
> That is basically the case in my experience here.
>
> The vast majority of constants in a program tend to be things like
> 100.0, 1.375, 420.0, ..., which can be expressed exactly as Binary16
> values (which are then show up as-if they were the original Binary64
> constants).
>
> In general, around 90% of the FP constants can fit into Binary16 and be
> represented exactly, without any loss of precision due to the
> limitations of the format.

I am a little surprised by that. Are pi and e, for example, not
frequently used?

Thomas Koenig

Aug 6, 2022, 1:57:02 AM
MitchAlsup <Mitch...@aol.com> schrieb:
> On Friday, August 5, 2022 at 10:58:09 AM UTC-5, EricP wrote:

>> - The stats will be skewed by optimization. I saw various references
>> to whether compiler optimization did/didnt do constant folding.
><
> My other problem is that I do not have access to the code bases which
> cost money to access (SPEC for example) whereas I do have a usefully
> good LLVM port to My 66000 with working clang and flang. For example
> I can get Linpack, Livermore Loops, certain sections of BLAS, but not
> the applications which might use those.

Try the Polyhedron benchmarks at
https://www.fortran.uk/fortran-compiler-comparisons/the-polyhedron-solutions-benchmark-suite/

And I'm not sure why you can only get certain sections of BLAS, the
reference implementation is at https://netlib.org/blas/ .

BGB

Aug 6, 2022, 2:47:01 AM
No, errm, TKRA-GL is a software renderer implementing the OpenGL API on
the front end...


It implements roughly enough of the OpenGL API to render Quake 1/2/3 and
similar (roughly ~ OpenGL 1.3):
Fixed function rendering;
Blending;
Bilinear Interpolation (sorta);
Stencil Buffers (Optional);
Various stuff for managing transform and projection matrices;
...

Includes some other features:
Texture compression;
Half Float;
...

Omits some stuff that Quake/etc don't use:
Display lists;
GL_LIGHT and GL_FOG and similar (*);

*: While Quake3 does fog effects, it does so by effectively drawing a
bunch of translucent fog layers with depth writes disabled (rather than
using OpenGL's built in fog effect).


I had considered possibly adding GL_LIGHT stuff, on the basis that it is
"not entirely useless", and I could potentially support a "poor man's
Phong" extension (mostly by running the Gouraud Shading math after
tessellation rather than before tessellation).


Rendering process is sorta like (per primitive):
Project vertices to screen space;
If primitive is too big, split it up:
Keep splitting until no longer too big;
Draw the primitive (as its subdivided pieces).

This process is implemented via a small stack, where a primitive is
popped, projected, and if it needs to be subdivided, each of the pieces
is pushed back to the stack, else it is drawn. If the stack limit is
reached, the primitive is discarded.
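The push/pop subdivision loop described above can be sketched as follows (the names `too_big`, `split`, and `draw` are illustrative stand-ins, not TKRA-GL's actual interfaces):

```python
def draw_primitive(prim, too_big, split, draw, stack_limit=64):
    """Pop a primitive; if it is too big, push its pieces back onto
    the stack, otherwise draw it. Hitting the stack limit discards
    the piece, as described above."""
    stack = [prim]
    while stack:
        p = stack.pop()
        if too_big(p):
            pieces = split(p)
            if len(stack) + len(pieces) > stack_limit:
                continue  # stack limit reached: discard
            stack.extend(pieces)
        else:
            draw(p)
```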


Primitive Drawing:
Set up function pointers/etc depending on OpenGL settings and the
primitive being drawn;
Walk the edges of the primitive (step down left and right edges);
Call span-drawing function at each scanline.

The span-drawing function is invoked using function pointers, where
properties of the primitive and of the current GL state settings are
used to select which span-drawing function pointers to use.


The edge-walking basically keeps several sets of vectors, for the left
and right sides:
XZ / XZ+Stencil
ST
RGBA
XZ_ystep
ST_ystep
RGBA_ystep

Don't need a vertex normal here, as the Normal vector and similar are
"dead" by this point.

Where for each scanline:
If Y is outside viewport
Add step vectors to current vectors;
Continue.
Calculate step values for span func;
Clip span to viewport;
Call span-drawing function;
Add step vectors to current vectors;
Continue.

For a triangle, it walks from the top vertex to the middle vertex, then
recalculates the step values and goes from the middle to the bottom.

For a quad, currently it draws it as two triangles (0,1,2; 0,2,3).
In earlier stages, triangles and quads are treated as two different
primitive types.

Trying to draw a polygon primitive results in it being decomposed into
quads and triangles.

The preferred interface here is glDrawArrays or glDrawElements, with the
glBegin/glVertex*/glEnd existing as a wrapper on top of an internal set
of vertex arrays.



Span drawing functions are classified according to major features:
Flat, Only draws a flat color;
Tex, use raster texture (no color modulation)
ModTex, color modulated texture, opaque
Utx, UTX2 texture (no color modulation)
ModUtx, UTX2 texture, color modulated, opaque
AlphaModTex, color modulated texture, alpha blend
AlphaModUtx, UTX2 texture, color modulated, alpha blend
BlModTex, color modulated texture, opaque, bilinear
BlModUtx, UTX2 texture, color modulated, opaque, bilinear
AlphaBlModTex, color modulated texture, alpha blend, bilinear
AlphaBlModUtx, UTX2 texture, color modulated, alpha blend, bilinear
Atest*, Alpha-tested variants
LMap*, Lightmap cases
Blend*, Generic Blend cases
...

Then with suffixes:
*Mort*, Uses Morton Order;
-, Does not use Z buffer;
Zb, Check ZBuffer (GL_LEQUAL) and write ZBuffer
Zt, Check ZBuffer (GL_LEQUAL), but no write to ZBuffer.
...


This is an unwieldy mess, but is needed for performance, basically these
features interact as a sort of combinatorial explosion. But this is an
area where trying to solve it in a "clean" way (say, nested "if()" or
"switch()" blocks) results in code that is incredibly slow.

There are a few cases for specific blending modes
Alpha ~= (GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA)
Other blending modes fall back to a slower "Blend" case.
This one calls additional functions to perform the blend.

Likewise, trying to set glDepthFunc to something other than GL_LESS or
GL_LESS_EQUAL/GL_LEQUAL will adversely affect performance.

For example, at present trying to draw stencil shadows would fall into a
large number of slow cases (so, probably can't do anything like Doom 3
on this anytime soon).


LLVMpipe had instead taken a route of generating lots of paths in a
procedural way and then using LLVM to generate machine code (also as the
backend for a shader compiler), but this is likely a bit too heavyweight
for this.

One other possibility would be to use procedural generation of C (with
some overhead for using C rather than ASM), which is sort of a poor
man's solution. Going too far in this direction, though, would lead to
excessive code bloat.

Things would also be easier if there were fewer features one could
glEnable or glDisable (which need to be supported for rendering to work
correctly).



A lot of these span drawing functions are written in ASM as well.


Buffers used (all stored in raster order):
Color Buffer: RGB555A
Depth: 16-bit
Stencil: 8-bit

Internal texture formats:
Generic rectangular, stored as RGB555A (Raster Order);
Square RGB555A (Morton Order);
Square UTX2 (Morton Order).

Morton order allows better performance for span drawing, but only works
on textures that are either square or have a 2:1 aspect ratio (though an
S/T flip mechanism also allows 1:2 textures).

Say:
256x256, Can use Morton;
256x128, Can use Morton;
128x256, Can use Morton (S/T Flip);
256x64, Needs raster
64x256, Needs raster
...

In practice, most textures can use Morton order.
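For illustration, the Morton (Z-order) indexing in question interleaves the S and T bits so that each 2x2 texel quad lands in four adjacent addresses (a generic sketch, not TKRA-GL's code):

```python
def morton2(s: int, t: int) -> int:
    """Z-order (Morton) index for a 16-bit (s, t) pair: interleave
    the bits, s in the even positions and t in the odd ones."""
    idx = 0
    for i in range(16):
        idx |= ((s >> i) & 1) << (2 * i)
        idx |= ((t >> i) & 1) << (2 * i + 1)
    return idx
```

This is also why the scheme only fits square and 2:1 textures: the interleave assumes the two coordinates have (nearly) the same bit width.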


RGB555A is a modified RGB555 variant:
0rrr-rrgg-gggb-bbbb RGB555 (Opaque)
1rrr-ragg-ggab-bbba RGB444+A3 (Translucent)



UTX2 is a block texture compression format:
(63:32): Pixel Selectors (P), 4x4x2 bits;
(31:16): ColorB
(15: 0): ColorA

UTX2 has several sub-modes (Bits 31 and 15):
00: Opaque, Interpolated
Endpoints interpreted as RGB555
P: 00=A, 01=(5/8)*A+(3/8)*B, 10=(3/8)*A+(5/8)*B, 11=B
01: Color+Alpha
Endpoints interpreted as RGB444A3 above;
1 bit selects Color, the other Alpha.
10: Opaque+Transparency
Endpoints interpreted as RGB555
P: 00=A, 01=(A+B)/2, 10=Transparent, 11=B
11: Translucent, Interpolated.
Endpoints interpreted as RGB444A3
Interpolate linearly (as in 00).

Modes 00 and 10 can mimic DXT1, whereas 01 and 11 also allow it to mimic
DXT5 (albeit with fewer bits). Note that the use of explicit mode-bits
(rather than relative comparisons) means this is cheaper to decode than
DXT1 or DXT5 would have been.
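A minimal decoder for the mode-00 (opaque, interpolated) case, following the field layout above (helper names are hypothetical; the other three modes are omitted):

```c
#include <stdint.h>

/* Widen an RGB555 endpoint to 8-bit channels packed as 0x00RRGGBB. */
static uint32_t rgb555_to_888(uint16_t c)
{
    uint32_t r = (c >> 10) & 31, g = (c >> 5) & 31, b = c & 31;
    r = (r << 3) | (r >> 2);
    g = (g << 3) | (g >> 2);
    b = (b << 3) | (b >> 2);
    return (r << 16) | (g << 8) | b;
}

/* Decode texel (x,y) of a 4x4 UTX2 block, assuming mode 00. */
static uint32_t utx2_decode00(uint64_t block, int x, int y)
{
    static const int wa[4] = { 8, 5, 3, 0 };  /* weight of A, in 8ths */
    uint16_t ca = (uint16_t)(block & 0xFFFF);          /* ColorA */
    uint16_t cb = (uint16_t)((block >> 16) & 0xFFFF);  /* ColorB */
    int sel = (int)((block >> (32 + 2 * (y * 4 + x))) & 3);
    uint32_t a = rgb555_to_888(ca), b = rgb555_to_888(cb);
    uint32_t out = 0;
    for (int sh = 0; sh <= 16; sh += 8) {  /* blend each channel */
        uint32_t va = (a >> sh) & 255, vb = (b >> sh) & 255;
        out |= (((va * wa[sel] + vb * (8 - wa[sel])) / 8) & 255) << sh;
    }
    return out;
}
```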


There is also a UTX3L format (128-bit):
(127:96): Alpha Selectors, 4x4x2;
( 95:64): Color Selectors, 4x4x2;
( 63:32): ColorB (RGBA32 / RGBA8888)
( 31: 0): ColorA (RGBA32 / RGBA8888)
And, UTX3H (128-bit):
Basically the same as UTX3L, but treat endpoints as FP8U (E4.F4);
Result unpacked to 4x Binary16.

UTX3 is designed to try to be "mostly comparable" to BC7 and BC6H,
albeit with a significantly cheaper hardware decoder (and skipping the
"partitioned" formats).

Had previously considered some "more flexible" ideas, but they would
have been a lot more expensive to decode. Both UTX2 and UTX3 use
overlapping machinery internally.


At the OpenGL API level, generally the traditional formats (DXT1/DXT5 /
BC1/BC3/BC6H/BC7) are used, with TKRA-GL translating them internally.

Where relevant, a lot of my code is assuming the DDS and BMP file formats.


I could make stuff look a lot better, but this would require:
Using RGBA32 buffers and textures;
Not doing as much corner cutting;
Generally spending a lot more clock cycles on the 3D rendering;
...

It is pretty hard to try to get playable GLQuake on something running at
50MHz without a GPU.

Like, if there is one big advantage that the PlayStation had, it was
that it had a GPU.

Ironically, due to the affine texturing and similar, my GLQuake port
tends to look kinda like it was running on a PlayStation.

Ironically, I think I may not be too far off from something like the
Sega Saturn, given that most of the examples I had seen of Saturn games
had very simplistic geometry and lots of pop-in at relatively short
distances.

Well, contrast with something like Mega-Man Legends (on the
PlayStation), which had very simplistic geometry but often fairly large
draw distances (in comparison).

Like this game was like "Behold, this character has cubes for arms and
hands!" or "behold this house, with its epic 6 polygons! No 3D modeled
roof overhangs or windows here!"


Then, there is GLQuake, which despite having "simple looking" geometry,
a lot of this geometry is already cut into a fair number of small pieces
by the BSP algorithm.

FWIW: The dynamic tessellation actually only splits a relative minority
of primitives, mostly limited to those a short distance from the camera.


...

BGB

Aug 6, 2022, 2:57:42 AM
While M_PI and M_E are cases that don't fit into 16 bits (these would
need a full 64-bit constant), such values (and others derived from them)
are nowhere near being the majority of the floating-point constants in
use, in my experience.

They tend to be vastly outnumbered by other much less noteworthy constants.

Ivan Godard

Aug 6, 2022, 7:59:43 AM
Us too :-(

EricP

Aug 6, 2022, 12:36:56 PM
MitchAlsup wrote:
> On Friday, August 5, 2022 at 10:58:09 AM UTC-5, EricP wrote:
>> MitchAlsup wrote:
>>>> On 8/4/2022 3:28 PM, MitchAlsup wrote:
>>>>> Does anyone have a reference where a group of people measured the
>>>>> percentage of floating point operands that are constant/immediate.
>>>>>
> <
>> Alpha uses IP-rel addressing to load constants, including floats,
>> from tables located just prior to the routine entry point.
> <
> It is simpler to say Alpha does not have floating point constants.

Sure it does. These are constant values stored in read-only memory.
They just are not immediate constants.

>> Superficially these show up as regular loads.
> <
> If you have to load it, it is not a constant.
> constant = setof{ immediate, displacement }

The point I'm making relates to the issue I raise below
about papers' analyses of benchmark ISA usage stats.

>> One issue I have with many of the ISA usage papers is that they
>> simply scan and count instruction types, but don't look deeper
>> for idioms to try to infer why some sequence was being done.
> <
> I have this in spades:: whereas RISC-V has compare-and-branch with
> 11-bit target range, My 66000 has compare to zero and branch, and
> a compare instruction CoIssued with the successive branch on condition
> instruction. My branches have 18-bit target range (or 28-bit for unconditional)
> So, any fair comparison needs to take the instruction count, the execution
> cycles, and the number of times fix-ups are required into account.

In order to get at the stats you seek, one has to look deeper into the code.
Taking Alpha for example, to determine if a load was for an FP constant:

- Alpha uses the idiom Branch And Link BAL+0 to copy the current
incremented IP into a link register Rx. In routines that need
to access constants it does this in the routine prologue.
- Scan the routine from start and note which register Rx it uses for BAL+0
- Later if you encounter a load FLD frd,[Rx+offset] that loads a float
register using the prior Rx then you know that was a FP constant.

The above assumes the ISA has separate register banks for INT and FP and
therefore separate LD and FLD instructions to tell you what it was doing.

If ISA has a unified INT-FP bank then you need to continue scanning
forward following the branch flow until you find an instruction that
reads the register previously loaded.
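The counting idiom described above could be sketched roughly like this, over an already-decoded instruction list (the Insn struct and opcode names are invented for illustration, not a real Alpha decoder):

```c
#include <stddef.h>

typedef enum { OP_BAL0, OP_FLD, OP_OTHER } Op;
typedef struct { Op op; int rd; int base; } Insn;

/* Scan a routine: remember which register the prologue BAL+0 wrote
 * the incremented IP into, then count FP loads based off it. */
static int count_fp_const_loads(const Insn *code, size_t n)
{
    int table_reg = -1;   /* register holding the constant-table base */
    int count = 0;
    for (size_t i = 0; i < n; i++) {
        if (code[i].op == OP_BAL0)
            table_reg = code[i].rd;        /* prologue BAL+0 idiom */
        else if (code[i].op == OP_FLD && code[i].base == table_reg)
            count++;                       /* FP load from the table */
    }
    return count;
}
```

As noted, this only works directly when separate FLD opcodes identify FP loads; a unified register bank needs the extra forward scan for the consumer.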


MitchAlsup

Aug 6, 2022, 3:57:56 PM
not really:: 1/pi, 2/pi, 4/pi are used a lot more than pi, but in many of these
cases, the pi needs 1¼ fractions of width to be sufficiently accurate (64-bits)
<
Realistically: +1.0, -1.0, 0.0, 2.0, 5.0, 10.0 are used a lot.

MitchAlsup

Aug 6, 2022, 4:03:56 PM
I see; you are showing how to determine it was an Alpha FP constant, not
stating Alpha used it as a #immed in an instruction. You are showing how to
count. Thanks.

Quadibloc

Aug 6, 2022, 9:38:21 PM
That's certainly true. But these days, DRAM is so very slow that if one could handle _all_
possible constants instead of just a fraction of them, it would be even better.

Inspired by the "Heads and Tails" scheme, I came up with the idea of dividing programs into
blocks of eight 32-bit instructions - with an indication of how many slots are not used for
instructions. Then immediates could be referenced by pointers within the block.

This had the virtue of letting all instructions be 32 bits long for fast decoding. But it meant
that the instruction stream is artificially divided into blocks, instead of being uniform and
continuous, which complicates compilers. Plus, for a three-bit field that indicates the number
of unused slots, I need to accept 32 bits of overhead!

An obvious solution would be to have eight 31-bit instructions in a block, but the problem is
that the immediates need to be 32 bits and 64 bits long, not 31 bits and 62 bits long! There is
a way around that, too, but I didn't like it because it would strongly tempt people to implement
it with serial decoding - and the whole point of having a uniform instruction length is to facilitate
parallel decoding.

I think I've finally come up with an alternative way of doing this.

Some architectures had branch instructions with a delay - the branch instruction appears in
the code, and is defined as causing a branch... after the next two instructions are executed.

Well, then, why not have instructions with immediates work like this:

A 32-bit long instruction which also has an immediate argument appears in the code.

That instruction will be executed... after seven more instructions following it. Following
the seventh such instruction, the immediate value appears in the instruction stream!

Of course, that means that none of those seven instructions may be a branch target,
because branching into that area would cause the immediate value to be executed
as code.

There you go - 32-bit and 64-bit immediates, uniform 32-bit instructions, eight
instructions can be executed in parallel at a time, since adequate warning is given...
and yet the instruction stream is uniform!

Note that if an instruction with an immediate value occurs within the seven
instructions following such an instruction... each 32 bits of the length of the
immediate called for by the first instruction is counted as an "instruction",
since we only need delay slots in space, not in time.
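Under this proposal, a decoder's slot classification could be sketched as follows (the IMM flag encoding is invented purely for illustration; per the note above the delay is counted in slots, so the immediate always lands eight 32-bit words after its instruction):

```c
#include <stdint.h>
#include <stddef.h>

#define IMM_FLAG 0x80000000u   /* invented: top bit = "has immediate" */

/* Mark each 32-bit slot as instruction (1) or immediate data (0). */
static void classify_slots(const uint32_t *w, size_t n, uint8_t *is_insn)
{
    for (size_t i = 0; i < n; i++)
        is_insn[i] = 1;
    for (size_t i = 0; i < n; i++) {
        if (!is_insn[i])
            continue;           /* immediate words are never decoded */
        if ((w[i] & IMM_FLAG) && i + 8 < n)
            is_insn[i + 8] = 0; /* 7 slots of delay, then the value */
    }
}
```

Since the marking for slot i depends only on slots at or before i, a fetch block can be classified as it arrives, which is the point of the scheme.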

John Savard

MitchAlsup

Aug 6, 2022, 10:19:02 PM
On Saturday, August 6, 2022 at 8:38:21 PM UTC-5, Quadibloc wrote:
> On Friday, August 5, 2022 at 7:12:11 PM UTC-6, robf...@gmail.com wrote:
> > It may be interesting to know the kind of precision required for float-point constants.
> > VAX had six-bit float immediates I think. If there are 16-bits available in the instruction for a constant,
> > there may be a lot of float-constants that could be mapped to a higher precision. The whole 64-bit or
> > 128-bit constant may not need to be encoded for constants to be useful.
> That's certainly true. But these days, DRAM is so very slow that if one could handle _all_
> possible constants instead of just a fraction of them, it would be even better.
<
My 66000 allows for all FP constants (float and double) including NaNs with payloads
>

Andy

Aug 7, 2022, 7:02:51 AM
On 6/08/22 18:46, BGB wrote:
> On 8/5/2022 8:29 PM, Andy wrote:

>> I'm assuming you've implemented some kind of deferred shading tile
>> renderer?, since block-rams would seem to be the perfect fit for tile
>> buffers if they aren't too oddly sized.
>>
>
> No, errm, TKRA-GL is a software renderer implementing the OpenGL API on
> the front end...

So, pretty much like the standard Z-buffer software scanline renderer
one would write on a PC then?

They invented the PowerVR style tile renderer for a reason, so may as
well steal from the best. :-)

I'm guessing it might help boost your frame-rates for pretty much the
same reasons -- the small internal tile buffer negates the need to
read/write to DRAM for every buffer lookup/update.

And two or more scans over the tile buffer lets you figure out exactly
which pixels of the polygons that are visible need to be shaded and
textured, versus the possibly not insignificant overdraw a z-buffer
renderer might waste time and bandwidth on.

Of course ripping up and changing your existing core probably isn't
something you'd happily contemplate, so take my suggestion with a large
grain of salt. ;-)


<snip>

> For a triangle, it walks from the top vertex to the middle vertex, then
> recalculates the step values and goes from the middle to the bottom.
>
> For a quad, currently it draws it as two triangles (0,1,2; 0,2,3).
> In earlier stages, triangles and quads are treated as two different
> primitive types.
>
> Trying to draw a polygon primitive results in it being decomposed into
> quads and triangles.

So why'd you deprecate the quads as primitives?

I've been looking through the source code of Core Design's
un-released/developed game 'TombRaider Anniversary Edition' (not to be
confused with the Crystal Dynamics game which did get released),

there's plenty of quad polygon handling functions to be seen, which
backs up my intuition that tri's and quads should be handled equally
well if at all possible.


<more snip>

>
> I could make stuff look a lot better, but this would require:
>   Using RGBA32 buffers and textures;

16-bit textures shaded into 24-bit frame buffers were the standard for a
while when memories were smaller, weren't they? That seemed acceptable
for the time, and probably not too shabby for the retro-gaming inclined
today either.



>   Not doing as much corner cutting;
>   Generally spending a lot more clock cycles on the 3D rendering;
>   ...
>
> It is pretty hard to try to get playable GLQuake on something running at
> 50MHz without a GPU.
>
> Like, if there is one big advantage that the PlayStation had, it was
> that it had a GPU.

Maybe there's a hint to be had there...

Something like a big-little multi core design,
your large but leaner WEX core handling all the game input, camera &
object updates, then vector style churning though all the floating point
geometry to leave an array of Z-sorted integer polygons that can be fed
to a number of tiny 16/24bit risc/misc like cores to render into however
many spare tile buffers you can fit into your fpga.

And by tiny, I mean like a 16bit 6502 with half the instruction set
thrown out, if an instruction doesn't aid in placing a texture sampled
pixel into the tile buffer --- axe it!

I'm thinking a small quantity of tiny independent cores working
simultaneously might work better over-all than one big complex core
trying to do it all. YMMV



> Ironically, due to the affine texturing and similar, my GLQuake port
> tends to look kinda like it was running on a PlayStation.
>
> Ironically, I think I may not be too far off from something like the
> Sega Saturn, given that most of the examples I had seen of Saturn games
> had very simplistic geometry and lots of pop-in at relatively short
> distances.
>
> Well, contrast with something like Mega-Man Legends (on the
> PlayStation), which had very simplistic geometry but often fairly large
> draw distances (in comparison).
>
> Like this game was like "Behold, this character has cubes for arms and
> hands!" or "behold this house, with its epic 6 polygons! No 3D modeled
> roof overhangs or windows here!"
>

I always end up with the Crystal Dynamics TombRaider games for my
nostalgia trips; Lara has plenty of polygons, in all the right places,
they even, uhh jigg...

Ummm, better not finish that last sentence, lest the WokePC brigade are
watching! ;-)

>
> Then, there is GLQuake, which despite having "simple looking" geometry,
> a lot of this geometry is already cut into a fair number of small pieces
> by the BSP algorithm.
>
> FWIW: The dynamic tessellation actually only splits a relative minority
> of primitives, mostly limited to those a short distance from the camera.
>

I've always been tempted to write a game engine called the REPYES engine
- remember every polygon you've ever seen.

Basically a giant view direction and player position dependent database
that loads and frees polygons and each of their individual texture maps
to Vram from main memory, so that older laptops and such with weakish
GPUs can enjoy near maximum / lush visible poly counts as they work
their way through a game level.

But instead of using BSPs to figure it all out, I'd just brute-force
paint polygon IDs into the frame buffer, then trace over the buffer and
record exactly which polygons were visible; step view direction, step
position, rinse, repeat over all player-accessible regions of the game map.


But it'll probably never happen, cause, urrr, it's possibly quite a
stupid thing to do in practice I guess.

Yeah, best forget I mentioned that. :-)

EricP

Aug 7, 2022, 9:49:12 AM
Yes.
The instruction opcode gives the data type and size.
It is followed by 0 to 5 operand address mode specifiers.
The operand specifier byte can hold a 6 bit literal constant,
either unsigned integer or unsigned float(exp,frac)=(3,3).
Opspec could also be long form immediate data 1,2,4,8,16 bytes.

If a literal it is converted to the opcodes float type (s,e,f)
F (1,8,23), D (1,8,55), G (1,11,52), H (1,15,96)

VAX static instruction stream stats show literal operands used between
10% and 18% of instructions, average 15.2%, was second to register 40%.
Interestingly, the longer immediate data format only occurs 2.4%.
No data types are given.

- A Case Study of VAX-11 Instruction Set Usage For Compiler Execution 1982

- Characterization of Processor Performance in the VAX-11/780 1984

Quadibloc

Aug 7, 2022, 10:20:26 AM
I'm aware of that. But it also has variable-length instructions.

How can one have immediate values while still _also_ having the advantage that all
instructions are the same length, so that the CPU can just fetch instructions 256 bits at
a time, and start decoding all eight instructions in a block in parallel? Unless given
advance notice to ignore certain instruction slots in a block.

First, my Concertina II attempts tried to do this with a complicated scheme of block
headers - that provided other VLIW features as well, to try to make use of the big
overhead this imposed.

Now, I've come up with something that requires "no overhead", and which doesn't
complicate compilation by chopping up the instruction stream into pieces.

Not that it doesn't have disadvantages - by requiring seven delay slots for every
immediate instruction, in a way, unlike the Concertina II scheme, it's forcing each
immediate instruction to involve a pessimal restriction on possible branch targets,
whereas a block scheme usually doesn't restrict branch targets at all.

John Savard

MitchAlsup

Aug 7, 2022, 12:25:45 PM
On Sunday, August 7, 2022 at 9:20:26 AM UTC-5, Quadibloc wrote:
> On Saturday, August 6, 2022 at 8:19:02 PM UTC-6, MitchAlsup wrote:
> > On Saturday, August 6, 2022 at 8:38:21 PM UTC-5, Quadibloc wrote:
>
> > > That's certainly true. But these days, DRAM is so very slow that if one could handle _all_
> > > possible constants instead of just a fraction of them, it would be even better.
>
> > My 66000 allows for all FP constants (float and double) including NaNs with payloads
<
> I'm aware of that. But it also has variable-length instructions.
<
Yes, it has variable length instructions, but it has fixed width instruction specifiers.
<
Fixed width instruction specifier is the key to not screwing up the Decodability of the ISA.
<
All of the registers, all of the sizes, all of the modes, all of the operand routing is
in the instruction-specifier. The only variability is in the amount of constants attached
to the instruction specified.
<
This is a far cry from VAX and x86 and in line with IBM 360. VAX and x86 do serial parsing
of the instruction stream. IBM has a single instruction specifier (the first 16-bits) and
a series of additions based solely on the first 16-bits--this may not be true of system/Z
now, but was circa 1965.
<
It is better than RISC-V, where where the registers are depends on whether you
have a compressed or an uncompressed instruction. My 66000 always has the
register specifiers in the same position. So, the 1-wide machine can always route
inst<4..0> to the Rs2 register port, inst<12..8> to the RS3 register port, and inst<26..22>
to the Rs1 register port. This saves 2-gates* of delay wrt RISC-V minimal implementations
with the compressed extension in getting from "instruction arrives" to bits into the
register file port decoder. (*) there is potential for more wire delay, here, also.
<
In addition, there are encodings where the register specifier is converted into a
signed 5-bit immediate. This enables a single instruction to do:
<
ADD Rd,#1,-Rs2
or
STW #3,[SP+1234]
>
> How can one have immediate values while still _also_ having the advantage that all
> instructions are the same length, so that the CPU can just fetch instructions 256 bits at
> a time, and start decoding all eight instructions in a block in parallel? Unless given
> advance notice to ignore certain instruction slots in a block.
<
A) you can't--mainly because you phrased the question improperly. You are not playing
both ends towards the middle.
<
B) what you can do is to PARSE everything in the instruction buffer as it
arrives, figuring out which containers contain instruction-specifiers and
which containers contain constants. In My 66000 this takes 4 gates of delay
(31 total gates) to come up with instruction length, offset to immediate, and
offset to displacement. At this point (4 gates into the cycle) you can double
your DECODE width with every added gate of delay (up to 16 instructions,
where it starts taking 2 gates to double your DECODE width).
>
> First, my Concertina II attempts tried to do this with a complicated scheme of block
> headers - that provided other VLIW features as well, to try to make use of the big
> overhead this imposed.
>
> Now, I've come up with something that requires "no overhead", and which doesn't
> complicate compilation by chopping up the instruction stream into pieces.
<
My 66000 ISA does not chop the instruction stream into pieces and accomplishes all
of what you desire (with respect to constants).
>
> Not that it doesn't have disadvantages - by requiring seven delay slots for every
> immediate instruction, in a way, unlike the Concertina II scheme, it's forcing each
> immediate instruction to involve a pessimal restriction on possible branch targets,
> whereas a block scheme usually doesn't restrict branch targets at all.
<
My 66000 does not have those disadvantages, either--and is essentially orthogonal
to the compiler.
>
> John Savard

BGB

Aug 7, 2022, 5:58:38 PM
On 8/7/2022 6:02 AM, Andy wrote:
> On 6/08/22 18:46, BGB wrote:
>> On 8/5/2022 8:29 PM, Andy wrote:
>
>>> I'm assuming you've implemented some kind of deferred shading tile
>>> renderer?, since block-rams would seem to be the perfect fit for tile
>>> buffers if they aren't too oddly sized.
>>>
>>
>> No, errm, TKRA-GL is a software renderer implementing the OpenGL API
>> on the front end...
>
> So, pretty much like the standard Z-buffer software scanline renderer
> one would write on a PC then?
>

Pretty much.

The same code can build and run on a PC as well, albeit it has a slight
disadvantage on PC due to it lacking a few special features I have in my
ISA. Despite my PC having 74x the clock speed, it only seems to pull
off around 20x the fill rate (taking roughly 4x as many clock-cycles per
pixel).


> They invented the PowerVR style tile renderer for a reason, so may as
> well steal from the best. :-)
>
> I'm guessing it might help boost your frame-rates for pretty much the
> same reasons, -- the small internal tile buffer negates the need to
> read/write to dram for every buffer lookup/update.
>

It would mostly help if one is throwing multiple cores at the problem.
Some of my past experiments with multi-threaded renderers had split the
screen in half or into quarters, with one thread working on each part of
the screen.

Though, my current rasterizer is single threaded.

Originally, it was intended to be dual threaded, but ran into a problem
when I started exceeding the resource budgets needed for doing dual core
on the FPGA I am using, and the emphasis shifted to trying to make a
single core run fast.



L1/L2 misses from the raster drawing part aren't too bad IME.

Texture related misses were a bigger issue, which is part of why I am
using Morton ordering when possible.

A 256x256 texture has more texels than a 320x200 framebuffer has pixels.


Splitting up geometry and then drawing each tile sequentially is not as
likely to be helpful.

There is a possibility of a slight advantage to drawing geometry
Z-buffer-only first, and then going back and drawing surfaces.

This would matter a lot more for doing a lot of blending or possibly if
running shaders, since in this case per pixel cost is a bigger issue.


I already have some special cases where geometry hidden behind
previously drawn geometry will be culled.

Some of the span drawing loops have "Z-fail sub-loop" special cases:
If the first pixel would be Z-Fail, go into a Z-fail loop;
If we hit a pixel that is Z-Pass, branch back into Z-Pass loop.

The Z-Fail sub-loop simply updates the state variables and checks for
Z-Pass.

The Z-Pass loops generally use predication for Z test, though another
possible (valid) design would be to have the Z-Pass loop branch back
into the Z-Fail loop on Z-Fail.
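The two-loop structure described above might be sketched like this (fixed-point Z stepping only; texturing and the real loop details are omitted, and names are illustrative):

```c
#include <stdint.h>

/* Draw an n-pixel flat-colored span with a 16-bit Z-buffer, using a
 * Z-pass loop that writes pixels and a cheaper Z-fail sub-loop that
 * only steps the interpolated state. */
static void draw_span_z(uint16_t *dst, uint16_t *zbuf, int n,
                        uint32_t z, int32_t zstep, uint16_t color)
{
    int i = 0;
    while (i < n) {
        /* Z-pass loop: draw while the depth test passes */
        while (i < n && (uint16_t)(z >> 16) < zbuf[i]) {
            zbuf[i] = (uint16_t)(z >> 16);
            dst[i] = color;
            z += (uint32_t)zstep; i++;
        }
        /* Z-fail sub-loop: just step state until a pixel passes */
        while (i < n && (uint16_t)(z >> 16) >= zbuf[i]) {
            z += (uint32_t)zstep; i++;
        }
    }
}
```

The payoff is that the fail loop touches no color or texture state, so runs of occluded pixels cost only the stepping.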



> And two or more scans over the tile buffer lets you figure out exactly
> which pixels of the polygons that are visible need to be shaded and
> textured, versus the possibly not insignificant overdraw a z-buffer
> renderer might waste time and bandwidth on.
>
> Of course ripping up and changing your existing core probably isn't
> something you'd happily contemplate, so take my suggestion with a large
> grain of salt. ;-)
>

Yeah.

Also it isn't likely to offer a huge advantage in this case (with a
software renderer).

Also, possibly counter-intuitively, more time is currently going into
the transform stages than into the raster-drawing parts.


So for GLQuake, time budget seems to be, roughly, say:
~ 50%, Quake itself;
~ 38%, transform stages
~ 4%, Edge Walking
~ 12%, Span Drawing

Part of the reason for the transform stage cost is that GLQuake draws a
fair number of primitives that end up being culled.


Say, for example, Quake tries to draw 1000 primitives in a frame;
fragmentation adds another 300, 900 of the resulting 1300 are culled,
and the remaining 400 are drawn.

Typically, the majority are culled due to frustum culling, and also a
fair number due to backface and Z-occlusion checks.



>
> <snip>
>
>> For a triangle, it walks from the top vertex to the middle vertex,
>> then recalculates the step values and goes from the middle to the bottom.
>>
>> For a quad, currently it draws it as two triangles (0,1,2; 0,2,3).
>> In earlier stages, triangles and quads are treated as two different
>> primitive types.
>>
>> Trying to draw a polygon primitive results in it being decomposed into
>> quads and triangles.
>
> So why'd you deprecate the quads as primitives?
>

Originally it only did triangles internally, but then I added quads as a
special case in the transform stages:
Projecting 4 vertices is less than 6;
They tessellate more efficiently;
...

This stops when it gets to the final rasterization stages, mostly
because the general-case logic for walking the edges of a quad is more
complicated than for a triangle (so didn't bother implementing it).

So, the "WalkEdgesQuad" function makes two calls to the
WalkEdgesTriangle function...
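The decomposition itself is just the (0,1,2)/(0,2,3) split; as a checkable stand-in, here the per-triangle work is replaced by a signed-area computation (the real walker steps edges instead, and these names are illustrative):

```c
typedef struct { float x, y; } Vec2;

/* Twice the signed area of triangle (a,b,c). */
static float tri_area2(Vec2 a, Vec2 b, Vec2 c)
{
    return (b.x - a.x) * (c.y - a.y) - (c.x - a.x) * (b.y - a.y);
}

/* Quad handled as two triangle calls, matching the 0,1,2 / 0,2,3
 * split that WalkEdgesQuad-style code performs. */
static float quad_area2(const Vec2 v[4])
{
    return tri_area2(v[0], v[1], v[2]) + tri_area2(v[0], v[2], v[3]);
}
```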

There are a few special cases that could probably be handled without too
much complexity, but they don't really happen that often in 3D scenery.


> I've been looking through the source code of Core Design's
> un-released/developed game 'TombRaider Anniversary Edition' (not to be
> confused with the Crystal Dynamics game which did get released),
>
> there's plenty of quad polygon handling functions to be seen, which
> backs up my intuition that tri's and quads should be handled equally
> well if at all possible.
>

Wasn't aware of any of the code for any of the Tomb Raider games having
been released, but then again I wasn't really much into Tomb Raider.


Back when I was much younger (middle school), my brother had a
PlayStation, a few games I remember on it:
Mega-Man Legends;
Mega-Man X4;
Final Fantasy VII;
Xenogears;
Silent Hill;
Crash Bandicoot;
...

There was also a demo CD with demo versions of games like Spyro the
Dragon and Tomb Raider and similar.

By high-school, he had a Dreamcast and the Sonic Adventure games and
similar, along with a PlayStation 2 with games like Grand Theft Auto 3
and similar.


I had been mostly into PC stuff at the time (Quake 1/2, etc).

I had preferred Quake 1 over Quake 2, as while Quake 2 had a few things
going for it (a more hub-world like structure), Quake 1 was more
interesting. In high-school it was mostly Half-Life. I think by the time
I was taking college classes, was mostly poking around in Half-Life 2
and Garry's Mod, then Portal came out, ...

Well, until I started messing around with Minecraft; haven't really done
much else in gaming past this point.


Contrary to some people, I suspect the HL/HL2 era is when graphics got
"good enough", despite newer advances in terms of rendering technology,
the "better graphics" don't really improve the gameplay experience all
that much.

One of the more interesting recent developments is real-time ray-tracing.


However, it is not so easy to write a ray-tracer that is competitive
with a raster renderer. I had experimented (not too long ago) with
implementing a software ray-tracer in the Quake engine, but it fell well
short of usable performance.

This was based on a modified version of Quake's normal line-tracing
code, which also had an issue that the world (as seen by line-tracing)
is not exactly the same as what is seen when drawing it as geometry
(there are a lot of "overhangs" in the BSP where the line-trace thinks
it has hit something solid, but there is no actual geometry there).

Also, line tracing the BSP is a lot slower than one might expect...

Had at one point also tried using line-traces to try to further cull
down the geometry in the PVS, but this turned out to be slower than just
drawing the geometry directly and letting GL sort it out.


IME, line-tracing over a regular grid structure (or an octree) tends to
be more efficient than doing so via a recursive BSP walk (an octree
based engine likely being more efficient if one wants to implement a
ray-cast or ray-tracing renderer).

But, OTOH, doing a modified version of Quake where I rebuild all of the
maps from the map source (with custom tools) using an octree and similar
rather than a BSP, is probably "not really worth it".


>
> <more snip>
>
>>
>> I could make stuff look a lot better, but this would require:
>>    Using RGBA32 buffers and textures;
>
> 16bit textures shaded into 24bit frame buffers was the standard for a
> while when memories were smaller weren't they? seemed to be acceptable
> for the time and probably not to shabby for the retro gaming inclined
> today either.
>

Probably, I am using RGB555A, which can sorta mimic RGBA32 and (on
average) looks better than RGBA4444 or similar.


RGBA32 for a framebuffer can look better, but using it would be kind of
a waste when the output framebuffer is using RGB555. And, on the FPGA
board I am using, the VGA output only has 4 bits per component, so even
the RGB555 output is effectively using a Bayer dither in this case.

Though, I did come up with a trick to mostly hide the Bayer dither by
rotating the dither mask for each VGA refresh.
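The idea can be sketched as an ordered-dither lookup whose phase is offset each frame (the exact rotation scheme used in the hardware is an assumption):

```c
#include <stdint.h>

/* Standard 4x4 Bayer ordered-dither matrix. */
static const uint8_t bayer4[4][4] = {
    {  0,  8,  2, 10 },
    { 12,  4, 14,  6 },
    {  3, 11,  1,  9 },
    { 15,  7, 13,  5 },
};

/* 5-bit component -> 4-bit DAC value; 'frame' offsets the mask so the
 * dither pattern shifts every VGA refresh and averages out over time. */
static uint8_t dither5to4(uint8_t c5, int x, int y, int frame)
{
    int t = bayer4[(y + frame) & 3][(x + 2 * frame) & 3];
    int v = (c5 + (t >> 3)) >> 1;   /* dither the dropped LSB */
    return (uint8_t)(v > 15 ? 15 : v);
}
```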

Granted, it is possible that RGB888 could still offer some level of
benefit over RGB555 here.


Had considered going over to Z24.S8 buffers, but would have needed to
rewrite a lot of my span-drawing functions to use it (all those written
to assume a 16-bit Z-Buffer), and a 32-bit Z+Stencil buffer would be
kind of a waste if stencil is used infrequently (the Quake games don't
use any stencil effects).

Had also looked into Z12.S4, but this would cause unacceptable levels of
Z-fighting in my tests. This led the stencil cases to use a separate
stencil buffer.


>
>
>>    Not doing as much corner cutting;
>>    Generally spending a lot more clock cycles on the 3D rendering;
>>    ...
>>
>> It is pretty hard to try to get playable GLQuake on something running
>> at 50MHz without a GPU.
>>
>> Like, if there is one big advantage that the PlayStation had, it was
>> that it had a GPU.
>
> Maybe there's a hint to be had there...
>
> Something like a big-little multi core design,
> your large but leaner WEX core handling all the game input, camera &
> object updates, then vector style churning though all the floating point
> geometry to leave an array of Z-sorted integer polygons that can be fed
> to a number of tiny 16/24bit risc/misc like cores to render into however
> many spare tile buffers you can fit into your fpga.
>
> And by tiny, I mean like a 16bit 6502 with half the instruction set
> thrown out, if an instruction doesn't aid in placing a texture sampled
> pixel into the tile buffer --- axe it!
>

It is possible, though if I were to try to fit TKRA-GL to it, it would
likely mean cores that were more like:
2-wide with 64-bit Packed-Integer SIMD;
Probably still needing a 32-bit address space.

Being able to deal with transforms would still likely require
FP-SIMD, but could be reduced to the S.E8.F16 form (possibly with Lanes
1 and 2 operating independently). Could potentially omit support for
Binary64 FPU ops.


Some of my previously considered GPU related features, I had back-ported
to BJX2 as "proof of concept".

It is possible that the "GPU" could be running a more restricted BJX2
subset.


Not too long ago, I had considered another possibility:
BJX2 core is used as the GPU;
I add a secondary "CPU" core, mostly running RV64 or similar.
Say, the GLQuake engine-side logic is run on an RISC-V core.

Though, my attempts at RISC-V cores have thus far ended up more expensive
than ideal (in my attempts, a full RV64G core would end up costing
*more* than another BJX2 core), and even making it single-issue isn't
really enough to compensate for this.


> I'm thinking a small quantity of tiny independent cores working
> simultaneously might work better over-all than one big complex core
> trying to do it all. YMMV
>

Only if the work can be split up to use the cores efficiently.

With the current balance, unless these cores could also handle vertex
transform and similar, they won't save much.


A bigger amount of savings would likely be possible with a redesigned
API design, possibly:
Rendering state configuration is folded into "objects";
Front-end interface mostly uses fixed-point or similar;
...


>
>
>> Ironically, due to the affine texturing and similar, my GLQuake port
>> tends to look kinda like it was running on a PlayStation.
>>
>> Ironically, I think I may not be too far off from something like the
>> Sega Saturn, given that most of the examples I had seen of Saturn
>> games had very simplistic geometry and lots of pop-in at relatively
>> short distances.
>>
>> Well, contrast with something like Mega-Man Legends (on the
>> PlayStation), which had very simplistic geometry but often fairly
>> large draw distances (in comparison).
>>
>> Like this game was like "Behold, this character has cubes for arms and
>> hands!" or "behold this house, with its epic 6 polygons! No 3D modeled
>> roof overhangs or windows here!"
>>
>
> I always end with the Crystal Dynamics TombRaider games for my nostalgia
> trips, Lara has plenty of polygons, in all the right places, they even,
> uhh jigg...
>
> Ummm, better not finish that last sentence, lest the WokePC brigade are
> watching! ;-)
>

Kinda curious that they did this back then, whereas many later games
(Half-Life 2, Doom 3, etc) didn't bother with these sorts of effects
(they probably could have if they wanted as special cases of ragdoll
within their skeletal animation systems).

Many newer games apparently also do things like soft-body physics and
cloth simulation and similar.


I guess I am in the category of having a certain level of sentimentalism
for characters like Tron Bonne and similar (from the Mega-Man Legends
series), though in part because of finding her character relatable.


>>
>> Then, there is GLQuake, which despite having "simple looking"
>> geometry, a lot of this geometry is already cut into a fair number of
>> small pieces by the BSP algorithm.
>>
>> FWIW: The dynamic tessellation actually only splits a relative
>> minority of primitives, mostly limited to those a short distance from
>> the camera.
>>
>
> I've always been tempted to write a game engine called the REPYES engine
> - remember every polygon you've ever seen.
>
> Basically a giant view direction and player position dependent database
> that loads and frees polygons and each of their individual texture maps
> to Vram from main memory, so that older laptops and such with weakish
> GPUs can enjoy near maximum / lush visible poly counts as they work
> their way through a game level.
>
> But instead of using BSPs to figure it all out, I'd just brute force
> paint polygonIDs into the frame buffer then trace over the buffer and
> record exactly which polygons were visible, step view direction, step
> position, rinse, repeat over all player accessible regions of the game map.
>
>
> But it'll probably never happen, cause, urrr, it's possibly quite a
> stupid thing to do in practice I guess.
>
> Yeah, best forget I mentioned that. :-)
>

OK.


In one of my more recent experimental engines, with a Minecraft style
terrain system, I basically had the camera cast out rays in every
direction and then build a list of visible blocks (recorded as their
world coordinates).

The renderer then draws all of the blocks on this list.


This approach is faster and saves memory on BJX2 when compared with the
"build and draw a vertex array for every chunk" approach.

However, it doesn't scale as well with draw distance, and on a PC with a
GPU, it is faster to use per-chunk vertex arrays and occlusion queries
(however, using ray-casts to build block lists still uses less RAM).


Performance on my BJX2 core was comparable to Quake, but it can do
outdoors environments (nevermind the limited draw distance in this case).


Ironically, despite running on a 50MHz CPU core, performance still
somehow manages to be better than Minecraft with a similar draw distance
on a Vista era laptop.



Doing something vaguely similar, but with ray-casting over an octree,
also seems possible (where the oct-tree would keep dividing geometry
until it reaches a certain maximum number of polygons).


Unlike Quake, by using a few Minecraft style tricks, it would also be
possible to extend it to arbitrarily large environments. Say, the
top-level world is split up into a grid of 256 meter cube-shaped
regions, with each cube divided into an octree (each region being
roughly the size of a Quake 2 map).

Quadibloc

Aug 7, 2022, 7:30:15 PM
On Sunday, August 7, 2022 at 10:25:45 AM UTC-6, MitchAlsup wrote:

> My 66000 does not have those disadvantages, either--and is essentially orthogonal
> to the compiler.

That is a respect, then, in which your design is far superior to any of mine. In
order to squeeze more features into my ISA, orthogonality has been almost
always the first thing I threw out the window.

John Savard

luke.l...@gmail.com

Aug 7, 2022, 7:38:45 PM
On Thursday, August 4, 2022 at 9:28:24 PM UTC+1, MitchAlsup wrote:
> Does anyone have a reference where a group of people measured the
> percentage of floating point operands that are constant/immediate.

clearly it'll very much depend on the target workload, so for example
the imdct36 function of ffmpeg for MP3 requires quite a lot of magic
FP constants.

this and 3D was enough without needing details to go for two
fp-const instructions

we decided to propose fmvis - float move immediate - and fishmv
(float immediate second half move) which adds the second half
of an FP32

https://libre-soc.org/openpower/sv/int_fp_mv/#fmvis

the nice thing about fmvis is, the 16-bit immediate is a BF16 and
drops nicely into place as an FP32 or FP64.

the nice thing about fishmv is, the 16-bit immediate is just
the lower 16 bits of a FP32 mantissa.
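If I read the fmvis/fishmv split right, the bit-level trick can be modeled in a few lines of C (a sketch only; the real instructions write FP registers and fmvis also widens to FP64):

```c
#include <stdint.h>
#include <string.h>

/* fmvis-style load: a BF16 immediate is just the top half of an FP32,
   so shifting it left 16 bits yields a valid float with the low
   mantissa bits zero. */
float fmvis(uint16_t bf16) {
    uint32_t bits = (uint32_t)bf16 << 16;
    float f;
    memcpy(&f, &bits, sizeof(f));
    return f;
}

/* fishmv-style fixup: OR the remaining 16 mantissa bits into the low
   half of the FP32 already sitting in the register, completing the
   full-precision constant. */
float fishmv(float prev, uint16_t lo16) {
    uint32_t bits;
    memcpy(&bits, &prev, sizeof(bits));
    bits |= lo16;
    float f;
    memcpy(&f, &bits, sizeof(f));
    return f;
}
```

So a constant that happens to be exact in BF16 costs one instruction, and any FP32 constant costs two.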

no need for variable-length-encoded instructions unless you
already have 48-bit encoding.

l.

BGB

Aug 7, 2022, 7:42:35 PM
On 8/7/2022 11:25 AM, MitchAlsup wrote:
> On Sunday, August 7, 2022 at 9:20:26 AM UTC-5, Quadibloc wrote:
>> On Saturday, August 6, 2022 at 8:19:02 PM UTC-6, MitchAlsup wrote:
>>> On Saturday, August 6, 2022 at 8:38:21 PM UTC-5, Quadibloc wrote:
>>
>>>> That's certainly true. But these days, DRAM is so very slow that if one could handle _all_
>>>> possible constants instead of just a fraction of them, it would be even better.
>>
>>> My 66000 allows for all FP constants (float and double) including NaNs with payloads
> <
>> I'm aware of that. But it also has variable-length instructions.
> <
> Yes, it has variable length instructions, but it has fixed width instruction specifiers.
> <
> Fixed width instruction specifier is the key to not screwing up the Decodability of the ISA.
> <
> All of the registers, all of the sizes, all of the modes, all of the operand routing is
> in the instruction-specifier. The only variability is in the amount of constants attached
> to the instruction specified.

In my case, current scheme still mostly:
(15:12):
E, F: Sz=1;
7, 9: Sz=1 (if XGPR extension exists, UD otherwise);
Else: Sz=0.
(15:8):
EA/EB, EE/EF, F4..F7, FC..FF: Wx=1
Else: Wx=0

Then, say, the 96 bit bundle is split into bits for 3 dwords:
Sz0, Wx0, Sz1, Wx1, Sz2
0zzzz: 16-bit
10zzz: 32-bit
110zz: 48-bit (unused)
1110z: 64-bit
11110: 80-bit (unused)
11111: 96-bit

Though, the WXE and WX2 bits also influence this.
00: Wx is always 0.
01: Sz=((1:0)==11), Wx=0 (RISC-V Mode)
10: Scheme above.
11: Reserved
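As a sketch of how cheap this length decode is (my reading of the description above; purely illustrative, not BGB's actual decoder):

```c
#include <stdint.h>

/* Sz bit: 32-bit (vs 16-bit) encoding, per the (15:12) nibble rules. */
static int dec_sz(uint16_t iw, int has_xgpr) {
    int nib = (iw >> 12) & 15;
    if (nib == 0xE || nib == 0xF) return 1;
    if ((nib == 0x7 || nib == 0x9) && has_xgpr) return 1;
    return 0;
}

/* Wx bit: instruction continues into the next 32-bit word (jumbo
   prefix / WEX), per the (15:8) byte ranges. */
static int dec_wx(uint16_t iw) {
    int b = (iw >> 8) & 255;
    return b == 0xEA || b == 0xEB || b == 0xEE || b == 0xEF ||
           (b >= 0xF4 && b <= 0xF7) || (b >= 0xFC && b <= 0xFF);
}

/* Total fetch length in bits, from the half-word headers of a
   potential 96-bit bundle, matching the Sz0/Wx0/Sz1/Wx1/Sz2 table. */
int bundle_bits(uint16_t h0, uint16_t h1, uint16_t h2, int has_xgpr) {
    int sz0 = dec_sz(h0, has_xgpr), wx0 = dec_wx(h0);
    int sz1 = dec_sz(h1, has_xgpr), wx1 = dec_wx(h1);
    int sz2 = dec_sz(h2, has_xgpr);
    if (!sz0) return 16;
    if (!wx0) return 32;
    if (!sz1) return 48;          /* listed as unused in the scheme */
    if (!wx1) return 64;
    if (!sz2) return 80;          /* likewise unused */
    return 96;
}
```

Each decision only looks at a few high bits of each half-word, so all the lengths can be computed in parallel from the fetch buffer.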


This replaced a scheme used in an earlier form of the ISA:
(15:13)==111: 32+
(12:10)==111: 48+
(9:8)=11: 64

This encoding was dropped, with FC/FD becoming 32-bit (WEX'ed F8/F9
blocks), and FE/FF becoming the Jumbo Prefixes.


> <
> This is a far cry from VAX and x86 and in line with IBM 360. VAX and x86 do serial parsing
> of the instruction stream. IBM has a single instruction specifier (the first 16-bits) and
> a series of additions based solely on the first 16-bits--this may not be true of system/Z
> now, but was circa 1965.

Yeah; x86 and VAX would be a pain here.


> <
> It is better than RISC-V where where the registers are depends on whether you have
> compressed instruction or a uncompressed instruction.

RISC-V C:
Looks straightforward enough on the surface.
Look a little deeper, it is a dog-chewed mess (even worse than Thumb).


> My 66000 always has the
> register specifiers in the same position. So, the 1-wide machine can always route
> inst<4..0> to the Rs2 register port, inst<12..8> to the RS3 register port, and inst<26..22>
> to the Rs1 register port. This saves 2-gates* of delay wrt RISC-V minimal implementations
> with the compressed extension in getting from "instruction arrives" to bits into the
> register file port decoder. (*) there is potential for more wire delay, here, also.
> <
> In addition, there are encodings where the register specifier is converted into a
> signed 5-bit immediate. This enables a single instruction to do:
> <
> ADD Rd,#1,-Rs2
> or
> STW #3,[SP+1234]

Hit or miss in my case.

Registers are "mostly stable", albeit a few ports may move around in
decoding (depending on the instruction), and a few blocks (such as the
F8 block) use different bits for the destination register.

Part of the reason the F8 block is awkward is that I wanted to keep the
Imm16 part contiguous, but there was also no good way to do this while
also keeping other parts of the encoding consistent.

There are a few consistency issues within the 16-bit space as well.

Could be better, could be worse.


RISC-V's 32-bit encodings keep registers more consistent at the cost of
turning immediate values and displacements into a chewed up mess.

>>
>> How can one have immediate values while still _also_ having the advantage that all
>> instructions are the same length, so that the CPU can just fetch instructions 256 bits at
>> a time, and start decoding all eight instructions in a block in parallel? Unless given
>> advance notice to ignore certain instruction slots in a block.
> <
> A) you can't--mainly because you phrased the question improperly. You are not playing
> both ends towards the middle.
> <
> B) what you can do is to PARSE everything in the instruction buffer as it arrives, so that
> figuring out which containers contain instruction-specifiers and which containers
> contain constants. In My 66000 this takes 4 gates of delay (31 total gates) to come up
> with, instruction length, offset to immediate, offset to displacement. At this point (4
> gates into the cycle) you can double your DECODE width every added gate of delay
> (up to 16 instruction where it starts taking 2 gates to double your DECODE width).

In my case, extended constant bits are held in jumbo prefixes, which are
effectively treated as NOP (and mostly special in that the payload bits
are routed into the adjacent decoder).


>>
>> First, my Concertina II attempts tried to do this with a complicated scheme of block
>> headers - that provided other VLIW features as well, to try to make use of the big
>> overhead this imposed.
>>
>> Now, I've come up with something that requires "no overhead", and which doesn't
>> complicate compilation by chopping up the instruction stream into pieces.
> <
> My 66000 ISA does not chop the instruction stream into pieces and accomplishes all
> of what you desire (with respect to constants).
>>
>> Not that it doesn't have disadvantages - by requiring seven delay slots for every
>> immediate instruction, in a way, unlike the Concertina II scheme, it's forcing each
>> immediate instruction to involve a pessimal restriction on possible branch targets,
>> whereas a block scheme usually doesn't restrict branch targets at all.
> <
> My 66000 does not have those disadvantages, either--and is essentially orthogonal
> to the compiler.

Delay slots are a bad idea IMO.


Also better IMO to keep instructions in a format where they at least
"make sense" as a sequentially executed instruction scheme.

Say, well-formed code in BJX2 has as a requirement that one can ignore
WEX and execute stuff sequentially, and it should still produce the same
results as the bundled version (Or, IOW: If bundled and sequential
execution produce different results, the code is broken).

...


>>
>> John Savard

MitchAlsup

Aug 7, 2022, 8:43:32 PM
On Sunday, August 7, 2022 at 6:38:45 PM UTC-5, luke.l...@gmail.com wrote:
> On Thursday, August 4, 2022 at 9:28:24 PM UTC+1, MitchAlsup wrote:
> > Does anyone have a reference where a group of people measured the
> > percentage of floating point operands that are constant/immediate.
<
> clearly it'll very much depend on the target workload, so for example
> the imdct36 function of ffmpeg for MP3 requires quite a lot of magic
> FP constants.
>
> this and 3D was enough without needing details to go for two
> fp-const instructions
>
> we decided to propose fmvis - float move immediate - and fishmv
> (float immediate second half move) which adds the second half
> of an FP32
>
> https://libre-soc.org/openpower/sv/int_fp_mv/#fmvis
>
> the nice thing about fmvis is, the 16-bit immediate is a BF16 and
> drops nicely into place as an FP32 or FP64.
>
> the nice thing about fishmv is, the 16-bit immediate is just
> the lower 16 bits of a FP32 mantissa.
<
Luke, can you take a body of floating point source. compile
it to assembly (or similar) and count the number of FP operands,
and count the number of FMVis + FISHMV so we could get a
percentage of the number of floating point constants in that body
of code ??
>
> no need for variable-length-encoded instructions unless you
> already have 48-bit encoding.
<
The question was not about Variable Length ISAs, but the percentage
of floating point constants that survive all optimizations and are "in"
the object code.
>
> l.
<
Thanks, BTW.

MitchAlsup

Aug 7, 2022, 8:54:48 PM
On Sunday, August 7, 2022 at 6:42:35 PM UTC-5, BGB wrote:
> On 8/7/2022 11:25 AM, MitchAlsup wrote:
> > On Sunday, August 7, 2022 at 9:20:26 AM UTC-5, Quadibloc wrote:
> >> On Saturday, August 6, 2022 at 8:19:02 PM UTC-6, MitchAlsup wrote:
> >>> On Saturday, August 6, 2022 at 8:38:21 PM UTC-5, Quadibloc wrote:
> >>
> >>>> That's certainly true. But these days, DRAM is so very slow that if one could handle _all_
> >>>> possible constants instead of just a fraction of them, it would be even better.
> >>
> >>> My 66000 allows for all FP constants (float and double) including NaNs with payloads
> > <
> >> I'm aware of that. But it also has variable-length instructions.
> > <
> > Yes, it has variable length instructions, but it has fixed width instruction specifiers.
> > <
> > Fixed width instruction specifier is the key to not screwing up the Decodability of the ISA.
> > <
> > All of the registers, all of the sizes, all of the modes, all of the operand routing is
> > in the instruction-specifier. The only variability is in the amount of constants attached
> > to the instruction specified.
<snip>
But the addition of compressed instructions changes where the register
specifiers are found in the compressed versus uncompressed instructions !!!
> >>
<snip>
> > <
> > My 66000 does not have those disadvantages, either--and is essentially orthogonal
> > to the compiler.
> Delay slots are a bad idea IMO.
>
I learned how bad they were on my* Mc 88100 ISA.
(*) yes, mine.
>
> Also better IMO to keep instructions in a format where they at least
> "make sense" as a sequentially executed instruction scheme.
<
In memory, I completely agree. Once fetched into a processor, you are no
longer bound by this criterion, and can exploit position rearrangement
so long as you can annotate register and memory ordering dependencies.
With such relaxation, the code sequence to swap 2 register values:
<
MOV Rt,Rx
MOV Rx,Ry
MOV Ry,Rt
<
can be rearranged into:
<
MOV Rx,Ry; MOV Ry,Rx
<
such that they can be performed at the same time with no remaining sequential
dependencies. and if Rt is overwritten, in the horizon of the peep-hole optimizer,
that MOV can be totally eliminated.
<
Why this works:: The right hand side of both MOV instructions get the current
rename value of Rx and Ry respectively, while the left hand side get new register
names from the renamer.
<
More exotic "decoders" might perform the MOVs in the renamer itself.
Presto, zero cycle swaps.
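A toy model of the renamer trick, in C (a hypothetical structure, just to show why the swap needs no data movement: both sources of a co-issued MOV pair read the pre-group mapping, and the MOVs themselves only touch the map):

```c
/* Rename-table model: each architectural register maps to a physical
   register holding its current value. */
typedef struct { int map[8]; int phys[16]; int next; } Renamer;

void rn_init(Renamer *r) {
    for (int i = 0; i < 8; i++) { r->map[i] = i; r->phys[i] = 0; }
    r->next = 8;   /* no free-list recycling in this sketch */
}
void rn_write(Renamer *r, int rd, int value) {   /* an op producing rd */
    r->map[rd] = r->next;        /* allocate a fresh physical reg */
    r->phys[r->next++] = value;
}
/* Two MOVs issued together, e.g. MOV Rx,Ry ; MOV Ry,Rx: both RHS
   read the current (pre-group) mappings, then the map is updated.
   No value ever moves through the data path. */
void rn_mov2(Renamer *r, int d1, int s1, int d2, int s2) {
    int p1 = r->map[s1], p2 = r->map[s2];
    r->map[d1] = p1;
    r->map[d2] = p2;
}
int rn_read(const Renamer *r, int rs) {
    return r->phys[r->map[rs]];
}
```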

Terje Mathisen

Aug 8, 2022, 2:42:54 AM
BGB wrote:
> FMUL: 6 cycles.
> FDIV: 120 cycles.
>
> FDIV is also rare though for most things; apart from Quake's software
> renderer where it is kind of a boat anchor (perspective correct texture
> mapping needs fast FDIV).

That was a core idea behind Mike Abrash' 3x speedup in asm: He managed
to overlap the FDIV latency with integer code that was finishing up the
previous 16-pixel span. 16 pixels was determined as a nice compromise
between leaving any visual non-perspective artifacts and the number of
cycles available under the FDIV shadow.
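The technique can be sketched in C roughly like this (not Abrash's actual code): the two true perspective divides are hoisted to the span endpoints, and the per-pixel loop is pure affine interpolation. On hardware with a non-blocking FDIV, the next span's divide can be started before this integer loop runs, hiding the divide latency.

```c
#define SPAN 16

/* One 16-pixel span of perspective-correct texture mapping: u/z, v/z,
   and 1/z interpolate linearly in screen space, so only the span
   endpoints need real divides. */
void draw_span(float u_z0, float v_z0, float iz0,
               float u_z1, float v_z1, float iz1,
               float *u_out, float *v_out) {
    float z0 = 1.0f / iz0;            /* the expensive divides ...      */
    float z1 = 1.0f / iz1;            /* ... one per span end, not per pixel */
    float u0 = u_z0 * z0, v0 = v_z0 * z0;
    float u1 = u_z1 * z1, v1 = v_z1 * z1;
    float du = (u1 - u0) / SPAN, dv = (v1 - v0) / SPAN;
    for (int i = 0; i < SPAN; i++) {  /* cheap affine inner loop */
        u_out[i] = u0 + du * i;
        v_out[i] = v0 + dv * i;
    }
}
```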

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Terje Mathisen

Aug 8, 2022, 4:06:47 AM
BGB wrote:
> On 8/5/2022 2:01 PM, MitchAlsup wrote:
>> Which, BTW, does not achieve 0.5 ULP accuracy, whereas y=x/3.0 does.
>
> The 120-cycle HW FDIV gets ~ 0.5, whereas the software divide
> (Newton-Raphson based) is seemingly limited to somewhere around 3.5 ULP
> or so for Binary64.
>
>
> For most things, "X/C" vs "X*(1.0/C)" doesn't have any real practical
> effect on behavior, but if the latter is 20x faster, this is what it is
> going to be.
>
> Likewise, the 0.5 ULP requirement does not seem to be a requirement for
> the C standard.

The C standard is mostly a "lowest common subset", it requires very little.

It is ieee754 which requires perfect rounding, in all modes, for all the
5 "core" operations (FADD/FSUB/FMUL/FDIV/FSQRT).

There are strong suggestions to increase this to require (near-)perfect
results for all other ops which can reasonably support it, i.e. ~0.5 ulp.

We know how to do it for trig (including arbitrary range
reduction)/log/exp, rsqrt can be done perfectly with the same or less
effort than sqrt.
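The gap between a correctly rounded divide and a reciprocal multiply (the "X/C" vs "X*(1.0/C)" trade-off discussed upthread) is easy to demonstrate: the reciprocal is rounded once, the multiply rounds again, and the two roundings do not always land on the correctly rounded quotient.

```c
/* Count, over small integer numerators, how often x*(1/3) differs
   from the correctly rounded x/3.  IEEE 754 requires the division
   itself to be 0.5-ulp exact; the reciprocal multiply is not
   (e.g. x = 7 already differs by one ulp). */
int reciprocal_mismatches(void) {
    const double r = 1.0 / 3.0;      /* already rounded once */
    int n = 0;
    for (int i = 1; i <= 1000; i++) {
        double x = (double)i;
        if (x / 3.0 != x * r) n++;   /* second rounding can land elsewhere */
    }
    return n;
}
```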

BGB

Aug 8, 2022, 4:07:13 AM
On 8/8/2022 1:42 AM, Terje Mathisen wrote:
> BGB wrote:
>> FMUL: 6 cycles.
>> FDIV: 120 cycles.
>>
>> FDIV is also rare though for most things; apart from Quake's software
>> renderer where it is kind of a boat anchor (perspective correct
>> texture mapping needs fast FDIV).
>
> That was a core idea behind Mike Abrash' 3x speedup in asm: He managed
> to overlap the FDIV latency with integer code that was finishing up the
> previous 16-pixel span. 16 pixels was determined as a nice compromise
> between leaving any visual non-perspective artifacts and the number of
> cycles available under the FDIV shadow.
>

Yes, but this assumes that FDIV can operate in parallel with other
instructions. If the FDIV merely stalls the pipeline for 120 cycles when
used, this is a little less useful.


For my Quake port (based on the C version of Quake's SW renderer; just
modified early on from 8-bit indexed color to RGB555 pixels), I used
a cheap approximation with a single N-R stage.

And, as noted elsewhere, TKRA-GL uses affine mapping.

> Terje
>

Terje Mathisen

Aug 8, 2022, 4:13:29 AM
This is why the original 8087 had a "Load Pi" instruction which would
put the hardcoded 80-bit value on the FP stack.

luke.l...@gmail.com

Aug 8, 2022, 5:14:21 AM
On Monday, August 8, 2022 at 9:13:29 AM UTC+1, Terje Mathisen wrote:
> Stephen Fuld wrote:
> > On 8/5/2022 7:01 PM, BGB wrote:
> >> On 8/5/2022 8:12 PM, robf...@gmail.com wrote:
> >> The vast majority of constants in a program tend to be things like
> >> 100.0, 1.375, 420.0, ..., which can be expressed exactly as Binary16
> >> values (which are then show up as-if they were the original Binary64
> >> constants).

hence why we are proposing exactly that for Power ISA
https://libre-soc.org/openpower/sv/int_fp_mv/#fmvis

> This is why the original 8087 had a "Load Pi" instruction which would
> put the hardcoded 80-bit value on the FP stack.

that's in Power ISA PackedSIMD apparently (aka VSX) along
with 15 other commonly-used constants.

l.

luke.l...@gmail.com

Aug 8, 2022, 5:23:53 AM
On Monday, August 8, 2022 at 1:43:32 AM UTC+1, MitchAlsup wrote:

> Luke, can you take a body of floating point source. compile
> it to assembly (or similar) and count the number of FP operands,
> and count the number of FMVis + FISHMV so we could get a
> percentage of the number of floating point constants in that body
> of code ??

we do have to do that, we just have to be careful about when
(i will keep everyone appraised, but it may be a few weeks when
we have the budget).

you can however see at least from this
https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=media/audio/mp3/mp3_1_imdct36_float.s;hb=HEAD

that's *17* magic constants which ended up in the fully loop-unrolled
assembler, but i mean, there's really not a lot of choice on that

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=media/audio/mp3/imdct36_standalone.c;h=9cf347661ff05fefb53b777e74ab5961d55e7d89;hb=HEAD#l57

the loop-unrolling performed by gcc will mask the proportion of
magic constants (17 out of 445 lines)

l.


MitchAlsup

Aug 8, 2022, 12:06:36 PM
In fact, there are many Newton-Raphson iterations that converge more
quickly using RSQRT() than SQRT()....
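For reference, the classic division-free iteration (a sketch; the seed generation from an exponent/table lookup is omitted and passed in as `y0`):

```c
#include <math.h>

/* Newton-Raphson on y ~ 1/sqrt(x):  y' = y*(1.5 - 0.5*x*y*y).
   Each step is multiply/subtract only -- no divide -- which is why
   iterating on RSQRT and finishing with sqrt(x) = x*y is the usual
   fast path; N-R directly on sqrt needs a divide per step. */
double rsqrt_nr(double x, double y0, int steps) {
    double y = y0;
    for (int i = 0; i < steps; i++)
        y = y * (1.5 - 0.5 * x * y * y);
    return y;
}
```

Convergence is quadratic, so even a crude seed reaches full double precision in a handful of steps.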

MitchAlsup

Aug 8, 2022, 12:10:18 PM
On Monday, August 8, 2022 at 4:23:53 AM UTC-5, luke.l...@gmail.com wrote:
> On Monday, August 8, 2022 at 1:43:32 AM UTC+1, MitchAlsup wrote:
>
> > Luke, can you take a body of floating point source. compile
> > it to assembly (or similar) and count the number of FP operands,
> > and count the number of FMVis + FISHMV so we could get a
> > percentage of the number of floating point constants in that body
> > of code ??
> we do have to do that, we just have to be careful about when
> (i will keep everyone appraised, but it may be a few weeks when
> we have the budget).
>
> you can however see at least from this
> https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=media/audio/mp3/mp3_1_imdct36_float.s;hb=HEAD
<
Thanks for the data.

MitchAlsup

Aug 8, 2022, 1:47:03 PM
Still only had 64 bits of fraction. What one needs is pi with fraction+8 bits
typically--to avoid 0.5 ULP error after some calculation.

BGB

Aug 8, 2022, 1:55:36 PM
On 8/8/2022 3:06 AM, Terje Mathisen wrote:
> BGB wrote:
>> On 8/5/2022 2:01 PM, MitchAlsup wrote:
>>> Which, BTW, does not achieve 0.5 ULP accuracy, whereas y=x/3.0 does.
>>
>> The 120-cycle HW FDIV gets ~ 0.5, whereas the software divide
>> (Newton-Raphson based) is seemingly limited to somewhere around 3.5
>> ULP or so for Binary64.
>>
>>
>> For most things, "X/C" vs "X*(1.0/C)" doesn't have any real practical
>> effect on behavior, but if the latter is 20x faster, this is what it
>> is going to be.
>>
>> Likewise, the 0.5 ULP requirement does not seem to be a requirement
>> for the C standard.
>
> The C standard is mostly a "lowest common subset", it requires very little.
>
> It is ieee754 which requires perfect rounding, in all modes, for all the
> 5 "core" operations (FADD/FSUB/FMUL/FDIV/FSQRT).
>

Yeah, but the thing is, to have a conformant C compiler does not require
strict adherence to IEEE-754 rules, more so when one is using an FPU
design which can't really uphold a strict interpretation of IEEE-754 in
the first place (because doing so would add a non-trivial cost increase
over one built on top of a bunch of corner cutting).

So, it uses the traditional formats, and (in most cases) will produce
results accurate to the full width of the mantissa, but this is as far
as it really goes.

Say:
FADD/FSUB/FMUL, work pretty well.
FDIV, now exists, but still kinda slow
It is routed through the integer divider.

For "sqrt()", the software version is still faster than my attempt at
doing it in hardware, so I have stuck with the software version for now.


Rounding is sorta lazy, kind of a "rounds correctly so long as this does
not result in a long carry propagation" approach (will produce
truncate-towards-zero output for cases which result in a long carry
propagation).



Then there is the faster "low precision" FPU, which for a newly
considered "GPU profile" will be all that is required.

Where, say, GPU profile looks like:
Has XGPR (R0..R63);
This is a case where XGPR is useful.
Does not allow instructions in Lane 3;
Infrequently used, mostly disabled to reduce cost.
MMU is demoted to optional / disabled;
Address space reduced to 32 bit in this case.
Only the low-precision FPU is required;
FP-SIMD still required (low precision only).
...

Basically, a core tuned to what is required for 3D rendering, but
disabling most everything else. I am left to wonder how effectively
constant propagation can be used to prune stuff.

The 'GPU profile' is special mostly in that it doesn't really follow the
"tech tree" of the main CPU-like profiles.


Copy/pasting the whole core in this case doesn't seem like an ideal
strategy (had done this in a few past experiments), so considering
possibly trying to use an 'input' and hope that Vivado can effectively
use a constant input for bulk pruning (on smaller scale things, this
strategy appears to work).

This would be opposed to ifdef, which can't really be used to configure
things on a per-instance basis in the same way 'input' can.

Nevermind if Vivado gives warnings whenever it trims or prunes anything
as a result of constant propagation, ...

Basic idea being that if one cuts off the inputs and outputs to
something, that it will "fall into the void and disappear" during synthesis.


> There are strong suggestions to increase this to require (near-)perfect
> results for all other ops which can reasonably support it, i.e. ~0.5 ulp.
>
> We know how to do it for trig (including arbitrary range
> reduction)/log/exp, rsqrt can be done perfectly with the same or less
> effort than sqrt.
>

Most of these are typically managed by the C library in any case (rather
than by the FPU itself). So, this would seem more like a C library issue.


Well, and I had been gradually replacing the C library I was using "ship
of Theseus" style to be less broken.

Not too long ago, replaced the original "atan()" from the old library
with a version I had written for an eventual replacement library. The
old version was using a convoluted loop and would sometimes get stuck in
infinite recursion (its implementation was built on calling itself
recursively on various ranges of inputs).

Replaced it with a faster (and more stable) version based on using the
Taylor series expansion.


> Terje
>
>

MitchAlsup

Aug 8, 2022, 2:23:54 PM
On Monday, August 8, 2022 at 12:55:36 PM UTC-5, BGB wrote:
> On 8/8/2022 3:06 AM, Terje Mathisen wrote:
> > BGB wrote:

> > It is ieee754 which requires perfect rounding, in all modes, for all the
> > 5 "core" operations (FADD/FSUB/FMUL/FDIV/FSQRT).
> >
> Yeah, but the thing is, to have a conformant C compiler does not require
> strict adherence to IEEE-754 rules, more so when one is using an FPU
> design which can't really uphold a strict interpretation of IEEE-754 in
> the first place (because doing so would add a non-trivial cost increase
> over one built on top of a bunch of corner cutting).
<
Yes, but having a IEEE 754-conformant C does require that.
<
And your philosophy violates the principle of least surprise.
>
> So, it uses the traditional formats, and (in most cases) will produce
> results accurate to the full width of the mantissa, but this is as far
> as it really goes.
<
If these are not 0.5ULP then they are not accurate in a IEEE 754 sense.
>
> Say:
> FADD/FSUB/FMUL, work pretty well.
> FDIV, now exists, but still kinda slow
> It is routed through the integer divider.
>
> For "sqrt()", the software version is still faster than my attempt at
> doing it in hardware, so I have stuck with the software version for now.
>
>
> Rounding is sorta lazy, kind of a "rounds correctly so long as this does
> not result in a long carry propagation" approach (will produce
> truncate-towards-zero output for cases which result in a long carry
> propagation).
>
IEEE 754 rounding requires calculation of the intermediate result as if to
an infinite number of digits, and then performing a single rounding.
>
Sort of IEEE 754-like is not IEEE 754-like at all !
>

MitchAlsup

Aug 8, 2022, 2:41:05 PM
On Monday, August 8, 2022 at 12:55:36 PM UTC-5, BGB wrote:
Why is atan() not just some special casing, argument reduction, and a
single polynomial ?
<
x  [ -∞, -1.0]:: ATAN( x ) = -π/2 + ATAN( 1/x );
x  (-1.0, +1.0]:: ATAN( x ) = + ATAN( x );
x  [ 1.0, +∞]:: ATAN( x ) = +π/2 - ATAN( 1/x );
<
ATAN() is::
p(r) = C0 + C1*r + C2*r^2 + C3*r^3 + C4*r^2 + C5*r^5 + C6*r^6 + C7*r^7 + C8*r^8
Where the coefficients are pulled from a 12 entry table of coefficients
based on the HoBs of the reduced fraction (x or 1/x).
<
p(r) is always between -π/2 and +π/2, and in particular:
+atan(1/x) is always between 0 and -π/2 so the result is -π/2 to -π
-atan(1/x) is always between 0 and +π/2 so the result is +π/2 to +π
Thus, avoiding precision loss.
<
Now for atan2() there is a lot more special casing, some argument
reduction, and then use the polynomial of atan().
<
double ATAN2( double y, double x )
{ // IEEE 754-2008 quality ATAN2
// deal with NaNs
if( ISNAN( x ) ) return x;
if( ISNAN( y ) ) return y;
// deal with infinities
if( x == +∞ && |y|== +∞ ) return copysign( π/4, y );
if( x == +∞ ) return copysign( 0.0, y );
if( x == -∞ && |y|== +∞ ) return copysign( 3π/4, y );
if( x == -∞ ) return copysign( π, y );
if( |y|== +∞ ) return copysign( π/2, y );
// deal with signed zeros
if( x == 0.0 && y != 0.0 ) return copysign( π/2, y );
if( x >=+0.0 && y == 0.0 ) return copysign( 0.0, y );
if( x <=-0.0 && y == 0.0 ) return copysign( π, y );
//
// calculate ATAN2 high performance style
// Note: at this point x != y
//
if( x > 0.0 )
{
if( y < 0.0 && |y| < |x| ) return - π/2 - ATAN( x / y );
if( y < 0.0 && |y| > |x| ) return + ATAN( y / x );
if( y > 0.0 && |y| < |x| ) return + ATAN( y / x );
if( y > 0.0 && |y| > |x| ) return + π/2 - ATAN( x / y );
}
if( x < 0.0 )
{
if( y < 0.0 && |y| < |x| ) return + π/2 + ATAN( x / y );
if( y < 0.0 && |y| > |x| ) return + π - ATAN( y / x );
if( y > 0.0 && |y| < |x| ) return + π - ATAN( y / x );
if( y > 0.0 && |y| > |x| ) return +3π/2 + ATAN( x / y );
}
}
<
The algorithm is arranged such that add/sub of the constant
based on pi does not loose precision nor need excess precision.
<
But recursion ?? perish the thought.

BGB

Aug 8, 2022, 3:23:05 PM
On 8/8/2022 1:23 PM, MitchAlsup wrote:
> On Monday, August 8, 2022 at 12:55:36 PM UTC-5, BGB wrote:
>> On 8/8/2022 3:06 AM, Terje Mathisen wrote:
>>> BGB wrote:
>
>>> It is ieee754 which requires perfect rounding, in all modes, for all the
>>> 5 "core" operations (FADD/FSUB/FMUL/FDIV/FSQRT).
>>>
>> Yeah, but the thing is, to have a conformant C compiler does not require
>> strict adherence to IEEE-754 rules, more so when one is using an FPU
>> design which can't really uphold a strict interpretation of IEEE-754 in
>> the first place (because doing so would add a non-trivial cost increase
>> over one built on top of a bunch of corner cutting).
> <
> Yes, but having a IEEE 754-conformant C does require that.
> <
> And your philosophy violates the principle of least surprise.

Things like Quake and similar still work, so mostly good enough...

Likewise "atof()" and similar deal with a decent number of digits and
don't "fly off into space" or similar, so, also, mostly good enough.


>>
>> So, it uses the traditional formats, and (in most cases) will produce
>> results accurate to the full width of the mantissa, but this is as far
>> as it really goes.
> <
> If these are not 0.5ULP then they are not accurate in a IEEE 754 sense.

I think DAZ+FTZ semantics, etc, also break it from being IEEE-754.

I make no claims as to the BJX2 FPU being IEEE-754 conformant.

If high precision is needed, there is still the option of doing the
floating-point math in software.


Doing a fully conformant FPU would be too expensive in this case, and
would be either impractically slow or blow out the FPGA's resource budget.


Whole reason the secondary "low-precision FPU" exists is because the
main FPU was still not fast enough in some cases. You can do Binary64
with it, and it will truncate towards zero, discarding the low 36 bits
of the mantissa... But, in some cases, this is still the preferable
option, say, because 3 cycles is less than 6.
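What that truncation does to a Binary64 value can be sketched directly on the bit pattern (hypothetical naming; assumes the IEEE Binary64 layout):

```c
#include <stdint.h>
#include <string.h>
#include <math.h>

/* Emulate the low-precision FPU's result: clear the low 36 bits of a
   Binary64 mantissa, i.e. truncate toward zero, leaving a 16-bit
   stored mantissa. */
static double trunc36(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);     /* type-pun safely via memcpy */
    bits &= ~((UINT64_C(1) << 36) - 1); /* zero the low 36 mantissa bits */
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

Values exactly representable in the short mantissa pass through unchanged; everything else loses up to one 16-bit-mantissa ULP toward zero.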



>>
>> Say:
>> FADD/FSUB/FMUL, work pretty well.
>> FDIV, now exists, but still kinda slow
>> It is routed through the integer divider.
>>
>> For "sqrt()", the software version is still faster than my attempt at
>> doing it in hardware, so I have stuck with the software version for now.
>>
>>
>> Rounding is sorta lazy, kind of a "rounds correctly so long as this does
>> not result in a long carry propagation" approach (will produce
>> truncate-towards-zero output for cases which result in a long carry
>> propagation).
>>
> IEEE 754 rounding requires calculation of the intermediate result as if to
> an infinite number of digits, and then performing a single rounding.

The implementations of the various operators tend to work by discarding
almost everything which falls below the ULP.

The only reason FADD has as many bits as it does, was because I needed a
few extra bits to be able to pull off full-range Int64<->Double
conversions (requiring a mantissa large enough to hold an Int64).

But, having a few bits below the ULP tends to result in the majority of
cases producing an exact result, and a subset producing incorrectly
rounded results.


But, still better than, say, if all the low-order bits of the mantissa
were truncated or filled with random garbage or similar.


Could have also done an FPU that was hard-wired to only do
truncate-towards-zero operations or similar (vs, say, normal rounding
modes with an ~ 0.1% probability that the result will be inexactly rounded).



>>
> Sort of IEEE 754-like is not IEEE 754-like at all !
>>
>

Hardly any software I have ported thus far really seems to notice or care.

My cores are not meant for scientific computing or data-centers. Intel
and AMD already serve these use cases well enough.


In practice, what tends to matter a lot more is that the bit-patterns
for the floating point values match the expected formats.

MitchAlsup

unread,
Aug 8, 2022, 3:41:00 PM8/8/22
to
On Monday, August 8, 2022 at 2:23:05 PM UTC-5, BGB wrote:
> On 8/8/2022 1:23 PM, MitchAlsup wrote:
> > On Monday, August 8, 2022 at 12:55:36 PM UTC-5, BGB wrote:
> >> On 8/8/2022 3:06 AM, Terje Mathisen wrote:
> >>> BGB wrote:
> >
> >>> It is ieee754 which requires perfect rounding, in all modes, for all the
> >>> 5 "core" operations (FADD/FSUB/FMUL/FDIV/FSQRT).
> >>>
> >> Yeah, but the thing is, to have a conformant C compiler does not require
> >> strict adherence to IEEE-754 rules, more so when one is using an FPU
> >> design which can't really uphold a strict interpretation of IEEE-754 in
> >> the first place (because doing so would add a non-trivial cost increase
> >> over one built on top of a bunch of corner cutting).
> > <
> > Yes, but having a IEEE 754-conformant C does require that.
> > <
> > And your philosophy violates the principle of least surprise.
> Things like Quake and similar still work, so mostly good enough...
<
Quake would run perfectly fine with IBM 360 floating point
Quake would run perfectly fine with CRAY quality floating point
>
> Likewise "atof()" and similar deal with a decent number of digits and
> don't "fly off into space" or similar, so, also, mostly good enough.
> >>
> >> So, it uses the traditional formats, and (in most cases) will produce
> >> results accurate to the full width of the mantissa, but this is as far
> >> as it really goes.
> > <
> > If these are not 0.5ULP then they are not accurate in a IEEE 754 sense.
> I think DAZ+FTZ semantics, etc, also break it from being IEEE-754.
<
Yes, they do. Luckily, once your main computing engine is an FMAC
unit, denorms are essentially free, and FTZ unnecessary.
>
> I make no claims as to the BJX2 FPU being IEEE-754 conformant.
>
> If high precision is needed, there is still the option of doing the
> floating-point math in software.
>
>
> Doing a fully conformant FPU would be too expensive in this case, and
> would be either impractically slow or blow out the FPGA's resource budget.
>
I, personally, got lambasted by none other than William Kahan due to
DAZ and FTZ on Mc 88100.
>
> Whole reason the secondary "low-precision FPU" exists is because the
> main FPU was still not fast enough in some cases. You can do Binary64
> with it, and it will truncate-towards-zero... the low 36 bits of the
> mantissa... But, in some cases, this is still the preferable option,
> say, because 3 cycles is less than 6.
<
My FPU in Samsung GPU did DAD and nFTZ, without breaking a sweat.
<
<snip>
> > IEEE 754 rounding requires calculation of the intermediate result as if to
> > an infinite number of digits, and then performing a single rounding.
> The implementations of the various operators tend to work by discarding
> almost everything which falls below the ULP.
<
Graphics seems to think it can "get away" with this, whereas bad rounding leads
to shimmering.
> >>
> > Sort of IEEE 754-like is not IEEE 754-like at all !
> >>
> >
> Hardly any software I have ported thus far really seems to notice or care.
<
Hardly any art critic seems to notice my forgery of Girl with a Pearl Earring
as a fake, either. That does not make it the real Johannes Vermeer painting.
>
> My cores are not meant for scientific computing or data-centers. Intel
> and AMD already serve these use cases well enough.
>
Yes, there are times and places where one can be lax wrt floating point.
There are not times and places where one can be lax about IEEE 754 floating
point.
>
> In practice, what tends to matter a lot more is that the bit-patterns
> for the floating point values match the expected formats.
<
At some low quality metric, sure.

luke.l...@gmail.com

unread,
Aug 8, 2022, 4:04:48 PM8/8/22
to
On Monday, August 8, 2022 at 8:41:00 PM UTC+1, MitchAlsup wrote:

> Yes, there are times and places where one can be lax wrt floating point.
> There are not times and places where one can be lax about IEEE 754 floating
> point.

fascinatingly - bizarrely - one of those is MP3. the use of DCT
is best done in FP32 and less accuracy results in audio
artefacts that will piss people off.

l.

BGB

unread,
Aug 8, 2022, 5:02:47 PM8/8/22
to
On 8/8/2022 1:41 PM, MitchAlsup wrote:
> On Monday, August 8, 2022 at 12:55:36 PM UTC-5, BGB wrote:
>> On 8/8/2022 3:06 AM, Terje Mathisen wrote:
>>> BGB wrote:
>
>> Not too long ago, replaced the original "atan()" from the old library
>> with a version I had written for an eventual replacement library. The
>> old version was using a convoluted loop and would sometimes get stuck in
>> infinite recursion (its implementation was built on calling itself
>> recursively on various ranges of inputs).
>>
>> Replaced it with a faster (and more stable) version based on using the
>> Taylor series expansion.
> <
> Why is atan() not just some special casing, argument reduction, and a
> single polynomial ?
> <
> x  [ -∞, -1.0]:: ATAN( x ) = -π/2 + ATAN( 1/x );
> x  (-1.0, +1.0]:: ATAN( x ) = + ATAN( x );
> x  [ 1.0, +∞]:: ATAN( x ) = +π/2 - ATAN( 1/x );
> <
> ATAN() is::
> p(r) = C0 + C1*r + C2*r^2 + C3*r^3 + C4*r^4 + C5*r^5 + C6*r^6 + C7*r^7 + C8*r^8
> Where the coefficients are pulled from a 12 entry table of coefficients
> based on the HoBs of the reduced fraction (x or 1/x).
> <
> p(r) is always between -π/2 and +π/2, and in particular:
> +atan(1/x) is always between 0 and -π/2 so the result is -π/2 to -π
> -atan(1/x) is always between 0 and +π/2 so the result is +π/2 to +π
> Thus, avoiding precision loss.

The version I replaced it with splits it up into two functions:
_atan_i: Does a Taylor expansion which works over 0 <= x <= PI/2.
atan: Uses an if/else tree to select between the cases as above, calling
_atan_i with the fixed up value.

The Taylor-expansion does basically reduce to a polynomial of a similar
form to that above.
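A minimal sketch of that kind of series evaluation (hypothetical naming; the naive alternating series converges only for |x| <= 1, and slowly near 1, which is why production code prefers minimax coefficients over a raw Taylor expansion):

```c
#include <math.h>

/* atan(x) = x - x^3/3 + x^5/5 - x^7/7 + ..., usable for |x| <= 1. */
static double atan_series(double x)
{
    double x2 = x * x, term = x, sum = x;
    int k;
    for (k = 3; k <= 99; k += 2) {
        term *= -x2;        /* next odd power, alternating sign */
        sum  += term / k;
    }
    return sum;
}
```

Away from |x| = 1 the terms shrink geometrically, so a fixed iteration count gives full double precision without any recursion.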
Yeah.

In this case, I took note of it mostly because I was stumbling onto edge
cases where it was overflowing the stack (so copied over a different
version to patch over it).


It is a C library where most of this code was "kinda terrible", so I
ended up mostly replacing it piece by piece whenever running into cases
which were broken.

It appears to have been written sometime back in the 90s originally for
some sort of MS-DOS clone running on top of IBM MVS (mostly ANSI syntax
with random bits of K&R syntax floating around). It had ended up being
used in some of my early projects, and has mostly been floating around since
(being reused in my current project).

However, I have already ended up rewriting a fair chunk of the C library
due to "general terribleness".



I think the original reason I ended up using it was because some time
around middle school I had tried writing an OS (for x86), but my early
attempt at a quick-and-dirty C library didn't really work, so I used
someone else's C library.

At some point later, I started trying to rework this into an x86 + Win32
emulator (I think at the time, the idea was that I could use it to
emulate a Win32 userland on top of a RasPi or something). This effort
kinda fizzled out at the time though (noting that it was faster and
easier to just recompile code to run natively on the RasPi).


Early on, my BJX project reused some parts of my old code, so the C
library just sort of came along for the ride. TestKern was itself,
partly new code, and partly some amount of really old code.

...

BGB

unread,
Aug 8, 2022, 6:19:27 PM8/8/22
to
On 8/8/2022 2:40 PM, MitchAlsup wrote:
> On Monday, August 8, 2022 at 2:23:05 PM UTC-5, BGB wrote:
>> On 8/8/2022 1:23 PM, MitchAlsup wrote:
>>> On Monday, August 8, 2022 at 12:55:36 PM UTC-5, BGB wrote:
>>>> On 8/8/2022 3:06 AM, Terje Mathisen wrote:
>>>>> BGB wrote:
>>>
>>>>> It is ieee754 which requires perfect rounding, in all modes, for all the
>>>>> 5 "core" operations (FADD/FSUB/FMUL/FDIV/FSQRT).
>>>>>
>>>> Yeah, but the thing is, to have a conformant C compiler does not require
>>>> strict adherence to IEEE-754 rules, more so when one is using an FPU
>>>> design which can't really uphold a strict interpretation of IEEE-754 in
>>>> the first place (because doing so would add a non-trivial cost increase
>>>> over one built on top of a bunch of corner cutting).
>>> <
>>> Yes, but having a IEEE 754-conformant C does require that.
>>> <
>>> And your philosophy violates the principle of least surprise.
>> Things like Quake and similar still work, so mostly good enough...
> <
> Quake would run perfectly fine with IBM 360 floating point
> Quake would run perfectly fine with CRAY quality floating point

It almost runs fine with all the floating point truncated to a 16-bit
mantissa as well, apart from the player getting stuck on things and the
physics glitching out with platforms.


>>
>> Likewise "atof()" and similar deal with a decent number of digits and
>> don't "fly off into space" or similar, so, also, mostly good enough.
>>>>
>>>> So, it uses the traditional formats, and (in most cases) will produce
>>>> results accurate to the full width of the mantissa, but this is as far
>>>> as it really goes.
>>> <
>>> If these are not 0.5ULP then they are not accurate in a IEEE 754 sense.
>> I think DAZ+FTZ semantics, etc, also break it from being IEEE-754.
> <
> Yes, they do. Luckily, once your main computing engine is an FMAC
> unit, denorms are essentially free, and FTZ unnecessary.

But, FMAC would increase the required instruction latency...


>>
>> I make no claims as to the BJX2 FPU being IEEE-754 conformant.
>>
>> If high precision is needed, there is still the option of doing the
>> floating-point math in software.
>>
>>
>> Doing a fully conformant FPU would be too expensive in this case, and
>> would be either impractically slow or blow out the FPGA's resource budget.
>>
> I, personally, got lambasted by none other than William Kahan due to
> DAZ and FTZ on Mc 88100.


I guess it is possible there could be a "slow and expensive but more
accurate" FPU.


For my own uses, "cheap and goes fast" seem like bigger priorities.

Though, as noted, I ended up with two FPUs, mostly because the "goes
fast" wasn't fast enough, and also while ULP rounding issues are mostly
invisible, discarding the low 36 bits or similar off a Binary64 value is
not...

Nevermind if the latter is basically the plan for the "GPU Profile" cores:
S.E11.F16.Z36
(Z=Ignored/Zeroed)


>>
>> Whole reason the secondary "low-precision FPU" exists is because the
>> main FPU was still not fast enough in some cases. You can do Binary64
>> with it, and it will truncate-towards-zero... the low 36 bits of the
>> mantissa... But, in some cases, this is still the preferable option,
>> say, because 3 cycles is less than 6.
> <
> My FPU in Samsung GPU did DAD and nFTZ, without breaking a sweat.
> <
> <snip>

Getting 3-cycle latency is harder (but in this case is fully pipelined).

Getting 1-cycle or 2-cycle latency for SIMD-FPU ops is mostly still out
of reach though.


>>> IEEE 754 rounding requires calculation of the intermediate result as if to
>>> an infinite number of digits, and then performing a single rounding.
>> The implementations of the various operators tend to work by discarding
>> almost everything which falls below the ULP.
> <
> Graphics seems to think it can "get away" with this, whereas bad rounding leads
> to shimmering.

Graphics isn't really being done with Binary64.
I think "industry standard" here is doing everything as Binary32.

I am mostly using a mix of Packed-Word fixed-point and Binary16 for
intermediate values for pixel data.


Transforms and similar still need to use Binary32, mostly because stuff
comes out looking wonky and distorted if one tries to use Binary16 for
the transform-and-projection matrices and similar.

It is hit or miss for texture coordinates. For textures much larger than
around 256x256, Binary16 is insufficient.

16-bit mantissa should be sufficient for textures up to around
16384x16384 (very large), so the low precision FPU still also works for
this.


>>>>
>>> Sort of IEEE 754-like is not IEEE 754-like at all !
>>>>
>>>
>> Hardly any software I have ported thus far really seems to notice or care.
> <
> Hardly any art critic seems to notice my forgery of Girl with a Pearl Earring as a fake,
> either. That does not make it the real Johannes Vermeer painting.

As long as one isn't claiming full conformance, it shouldn't matter.


>>
>> My cores are not meant for scientific computing or data-centers. Intel
>> and AMD already serve these use cases well enough.
>>
> Yes, there are times and places where one can be lax wrt floating point.
> There are not times and places where one can be lax about IEEE 754 floating
> point.

Yeah, If someone wants something like a Xeon or similar, they are
probably better off using a Xeon or similar.

This isn't really the area I am aiming at though.

>>
>> In practice, what tends to matter a lot more is that the bit-patterns
>> for the floating point values match the expected formats.
> <
> At some low quality metric, sure.

...

MitchAlsup

unread,
Aug 8, 2022, 8:26:05 PM8/8/22
to
On Monday, August 8, 2022 at 5:19:27 PM UTC-5, BGB wrote:
> On 8/8/2022 2:40 PM, MitchAlsup wrote:
> > My FPU in Samsung GPU did DAD and nFTZ, without breaking a sweat.
> > <
> > <snip>
> Getting 3-cycle latency is harder (but in this case is fully pipelined).
<
We used 4 cycle latency--but then again, to execute a WARP of a single
instruction always took 4 cycles {int, logical, FP (other than FDIV)} so
back to back operations always used the forwarding logic.
>
> Getting 1-cycle or 2-cycle latency for SIMD-FPU ops is mostly still out
> of reach though.
<
Can't be done today: FADD no faster than 3 cycles, FMUL no faster than 4
{all double}; FMAC is typically 4 cycles.
>
<snip>
> > Hardy any art critic seems to notice my forgery of Girl with Ear Ring as a fake,
> > either. That does not make it the real Johannes Veermer painting.
> As long as one isn't claiming full conformance, it shouldn't matter.
<
So, you can't claim IEEE 754 conformance.
<

MitchAlsup

unread,
Aug 8, 2022, 11:09:43 PM8/8/22
to
On Monday, August 8, 2022 at 4:23:53 AM UTC-5, luke.l...@gmail.com wrote:
> On Monday, August 8, 2022 at 1:43:32 AM UTC+1, MitchAlsup wrote:
>
> > Luke, can you take a body of floating point source. compile
> > it to assembly (or similar) and count the number of FP operands,
> > and count the number of FMVis + FISHMV so we could get a
> > percentage of the number of floating point constants in that body
> > of code ??
> we do have to do that, we just have to be careful about when
> (i will keep everyone appraised, but it may be a few weeks when
> we have the budget).
>
> you can however see at least from this
> https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=media/audio/mp3/mp3_1_imdct36_float.s;hb=HEAD
>
Report on Luke's data::

There are 64 FMULs
There are 133 FADD/FSUBs
There are 19 constants used 24 times

Constants are used 1.26 times each
Constants are used in 12.2% of FP instructions

My overall guess is that this substantiates that FP constants are used "about as often"
as constants in/from source code. Assembly code uses a lot of hidden constants
{displacements, switch tables, strength reduction, ...}

Thanks again, Luke.

Terje Mathisen

unread,
Aug 9, 2022, 3:25:06 AM8/9/22
to
Yeah, that's why I wrote "the same or less effort". :-)

See also the InvSqrt saga:

https://www.beyond3d.com/content/articles/8/

Terje Mathisen

unread,
Aug 9, 2022, 3:32:57 AM8/9/22
to
They did try: Their internal pi value had 67 bits which they used for
argument reduction.

Using a ~1100 bit reciprocal and 2-3 MULs is of course both faster and
better, but they would have needed ~8250 bits for their extended/long
double format.

fp128 uses the same exponent size so it would only need 51 more
reciprocal bits to cover the wider mantissa.

Let's call it 8300 bits/1038 bytes!

Terje Mathisen

unread,
Aug 9, 2022, 3:40:18 AM8/9/22
to
This is much easier to notice visually, i.e. graphics:

If you try to get away with limited bit depth per pixel/color and do a
series of adjustments, it is extremely easy to end up with blue banding
in what was a very nicely graduated sky.

16x3 bits/pixel mostly fixes it, but for more ops you probably want FP.
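The banding is easy to reproduce at 8 bits per channel (a sketch, hypothetical naming): darken a channel to 75% with an 8-bit intermediate and brighten it back, and adjacent levels collapse together:

```c
/* Round-trip an 8-bit channel through a 75% darken and restore.
   The darkened value is quantized back to an integer, so only ~192
   distinct levels survive; restoring cannot recover the lost ones,
   which shows up as banding in smooth gradients. */
static int round_trip8(int v)
{
    int dark = (v * 3 + 2) / 4;     /* 8-bit store quantizes here */
    int back = (dark * 4 + 1) / 3;
    return back > 255 ? 255 : back;
}
```

Widening the intermediate to 16 bits per channel (or FP) makes the round trip lossless, which is the fix described above.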

luke.l...@gmail.com

unread,
Aug 9, 2022, 6:38:20 AM8/9/22
to
On Tuesday, August 9, 2022 at 4:09:43 AM UTC+1, MitchAlsup wrote:
> On Monday, August 8, 2022 at 4:23:53 AM UTC-5, luke.l...@gmail.com wrote:
> > https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=media/audio/mp3/mp3_1_imdct36_float.s;hb=HEAD
> >
> Report on Luke's data::
>
> There are 64 FMULs
> There are 133 FADD/FSUBs
> There are 19 constants used 24 times
>
> Constants are used 1.26 times each
> Constants are used in 12.2% of FP instructions

you may also be interested in this WIP:
https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=media/audio/mp3/mp3_1_imdct36_float_basicsv.s;hb=HEAD

it doesn't reduce the number of operations, but it reduces
the number of instructions. an inline batch of 17 FADDs
is one 32-bit setvl instruction and one 64-bit sv.fadds.
QTY 3x32-bit words vs QTY 17x32-bit words, easy trade.

the reduced instruction count, however, does change the percentage
ratio of "constant-loading to executable size", just
not "constant-loading to num-ops-executed".

> My overall guess is that this substantiates that FP constants are used "about as often"
> as constants in/from source code. Assembly code uses a lot of hidden constants
> {displacements, switch tables, strength reduction, ...}

* fmvis and fishmv would be 64-bit to load-immediate 32-bit constant.

* flds would be calculation of a base-address followed by a 32-bit flds
but the base-address calculation is shared with other flds.

loading-immediates entirely in I-Cache vs loading-immediates
partly I-Cache partly D-Cache, hmmm...

ironically a Vectorised-fmvis and Vectorised-fishmv can only do
a broadcast-splat of the one constant. a vector-of-constants
into a vector-of-registers is for another revision of SVP64.

(perhaps with the Data-Pointer idea raised here a couple of
years ago)

l.

luke.l...@gmail.com

unread,
Aug 9, 2022, 6:46:46 AM8/9/22
to
On Tuesday, August 9, 2022 at 8:40:18 AM UTC+1, Terje Mathisen wrote:

> This is much easier to notice visually, i.e. graphics:
>
> If you try to get away with limited bit depth per pixel/color and do a
> series of adjustments, it is extremely easy to end up with blue banding
> in what was a very nicely graduated sky.
>
> 16x3 bits/pixel mostly fixes it, but for more ops you probably want FP.

the Khronos Group has amassed all of the cumulative expertise
of the top 3D GPU companies as to what you can and cannot get away with
let me see if i can find it...

https://libre-soc.org/openpower/transcendentals/
https://registry.khronos.org/SPIR-V/specs/unified1/OpenCL.ExtendedInstructionSet.100.html

fma

Compute the correctly rounded floating-point representation of the
sum of c with the infinitely precise product of a and b. Rounding of
intermediate products shall not occur. Edge case results are per the
IEEE 754-2008 standard.

native_cos

Compute cosine of x radians over an implementation-defined range.
The maximum error is implementation-defined.

half_cos

Compute cosine of x radians. The resulting value is undefined if
x is not in the range -2^16 … +2^16.

Result Type and x must be float or vector(2,3,4,8,16) of float values.

you get the general idea.

l.

Marcus

unread,
Aug 9, 2022, 8:10:32 AM8/9/22
to
Neither can I. And honestly I don't care right now, as it's not going to
affect any of the applications that I care about (and I only need
*some* FP in HW for proving my pipeline etc).

This is an endless debate, it seems. I know that IEEE 754 conformance is
the right thing (TM), but on the other hand it adds very little value
in reality for the majority of FP applications. What's more, if you can
squeeze out a few more FLOPS/W by slightly violating the standard (e.g.
by skipping subnormals), I tend to prefer that - even as a SW developer
or computer user.

IMO the IEEE 754 standard has failed here. There is clearly a need in
the market for a simplified FP standard that can be used for all the
exaflops of I-don't-care-about-subnorms-or-eccentric-rounding-modes
computations being performed every second every hour every day of the
year on this planet.

Just my two cents.

/Marcus

BGB

unread,
Aug 9, 2022, 1:27:24 PM8/9/22
to
On 8/9/2022 2:40 AM, Terje Mathisen wrote:
> luke.l...@gmail.com wrote:
>> On Monday, August 8, 2022 at 8:41:00 PM UTC+1, MitchAlsup wrote:
>>
>>> Yes, there are times and places where one can be lax wrt floating point.
>>> There are not times and places where on can be lax about IEEE 754
>>> floating
>>> point.
>>
>> fascinatingly - bizarrely - one of those is MP3. the use of DCT
>> is best done in FP32 and less accuracy results in audio
>> artefacts that will piss people off.
>
> This is much easier to notice visually, i.e. graphics:
>
> If you try to get away with limited bit depth per pixel/color and do a
> series of adjustments, it is extremely easy to end up with blue banding
> in what was a very nicely graduated sky.
>
> 16x3 bits/pixel mostly fixes it, but for more ops you probably want FP.
>

As noted, in my case if I needed "looks nice" I would probably want a
bit more than RGB555.

Likely options would be more along the lines of:
RGBA32 / RGBA8888 (typical for LDR)
RGB30F (3x E5.F5)
Etc.

A 4xFP8 pixel format works (E4.F4), but it can be noted that its quality
(in the typical LDR range), is pretty similar to that of RGB555.

To match RGBA32 one would basically need a 7 or 8 bit mantissa.



If going for quality, would probably do:
XYZ: Binary32 or Binary64 (*)
ST: Binary32
RGBA: Binary16
Normals: Binary16
...


Though, going much beyond 8-bits per component on the final output side
probably doesn't gain anything:
Humans probably couldn't really see much difference even if monitors did
support it more reliably.



*: Typically Binary32 is the limit for graphics cards, but 3D engines
with large scenes typically need to do trickery, such as having a local
origin which moves around and keeps near the camera.

Otherwise, with a fixed origin (assuming a first-person style 3D
engine), one will typically start seeing jitter at around 2km from the
origin, which gets progressively worse the further one goes.

One of the first things to be affected by this tends to be things like
texture mapping, where the textures will start jittering and warping on
the geometry.


Then again, given that these effects tend to start happening at the 2km
mark, I wouldn't exactly be surprised if the GPUs were also using a
16-bit mantissa or something...

Don't know if it is still this way on newer cards, the one I am running
is a model that was released roughly 8 years ago.


In my own Software GL attempts (using full-precision Binary32), I had
not noticed significant jitter until around 256km from the origin.

But, then again, it is also possible that perspective-correct rendering
is more sensitive to this than affine rendering, would need to look more
into the math here...


Then again, if the OpenGL style vertex projection stages and similar
were performed using Binary64, there would probably be no more issues
with jitter within any "reasonable" world size.

Well, or even an intermediate option, like say:
S.E11.F32.Z20

AKA: Double with the low 20 bits cut off; Could be handled with a
similar number of DSP48's to a dedicated Binary32 unit (eg: 3 for FMUL),
but would still have higher precision than Binary32.


...

BGB

unread,
Aug 9, 2022, 2:06:58 PM8/9/22
to
Agreed.


Like, the reason I still don't have FMAC:
Because most designs I can come up with for an FMAC would have 30-40%
higher latency for FMUL than independent FMUL and FADD units, and one
can get a "usefully similar" result by just sorta plugging the units
together (with an ~ 60% higher latency), but otherwise allowing them to
be used independently.

You don't get free support for subnormals for multiply this way, but it
is possible one "could" have a special "multiply with subnormals"
instruction, which gives this feature but has a higher latency (say, if
normal FMUL is 6 cycles, but say, FMUL_DN is 10 cycles).


Similar for Binary64 rounding:
If one wants to be able to round the full width of the mantissa, this
would cost an extra 1 or 2 clock cycles.

It is at least a little easier to impose full-width rounding on Binary32
though, and for Binary16 it makes sense as a default (because the
partial-width rounding in the larger formats is basically already almost
the full width of the Binary16 mantissa anyways).
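For reference, the rounding in question is just round-to-nearest-even applied to the bits below the target ULP (a sketch, hypothetical naming); the cost comes from the increment, which can carry through the entire mantissa:

```c
#include <stdint.h>

/* Drop the low 'drop' bits of mantissa m, rounding to nearest with
   ties-to-even. The +1 can propagate a carry through the whole
   mantissa, which is the long carry chain that costs extra cycles. */
static uint64_t round_ne(uint64_t m, int drop)
{
    uint64_t half = UINT64_C(1) << (drop - 1);
    uint64_t low  = m & ((half << 1) - 1);
    m >>= drop;
    if (low > half || (low == half && (m & 1)))
        m += 1;                     /* round up (or break tie to even) */
    return m;
}
```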


> IMO the IEEE 754 standard has failed here. There is clearly a need in
> the market for a simplified FP standard that can be used for all the
> exaflops of I-don't-care-about-subnorms-or-eccentric-rounding-modes
> computations being performed every second every hour every day of the
> year on this planet.
>

I have gradually been ending up with some of my own rules here (which,
in terms of "subtleties" have diverged in various ways from IEEE-754).

Namely, how to have "usefully good" floating-point in a more
cost-optimized way, as opposed to IEEE-754 which is precision-optimized
rather than cost-optimized.


Then there is the low-precision FPU, which has its own funky take on things:
Truncate only (no rounding);
Mostly ignore anything that falls below ULP.
Negates mantissas via bitwise NOT (*1);
...


*1: Some testing also showed this to be "more accurate on average",
though it does sorta fail for doing integer math with Binary16, but this
isn't really a big use-case for Binary16 SIMD ops.


However, one thing that I will note probably is a requirement, is that
doing integer math using scalar floating point values should produce
exact integer results (with the primary exception here being FDIV).

5.0-2.0 should give 3.0 exactly, and not, say, 2.9999, because there is
a non-zero amount of code out there that will notice if stuff like this
happens (which basically means using proper twos complement negate
semantics).
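That requirement comes for free with IEEE-style binary formats: every integer with magnitude up to 2^53 is exact in Binary64, so integer-valued add/sub/mul can never produce 2.9999-style drift (a quick check):

```c
/* Integer-valued arithmetic in Binary64 (other than division) is
   exact as long as every value fits in the 53-bit significand. */
static int fp_int_exact(void)
{
    double a;
    for (a = 0.0; a < 1000.0; a += 1.0)
        if ((a + 5.0) - 2.0 != a + 3.0)
            return 0;
    return 1;
}
```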

Quadibloc

unread,
Aug 9, 2022, 8:30:37 PM8/9/22
to
On Tuesday, August 9, 2022 at 12:06:58 PM UTC-6, BGB wrote:

> Namely, how to have "usefully good" floating-point in a more
> cost-optimized way, as opposed to IEEE-754 which is precision-optimized
> rather than cost-optimized.

I tend to agree, but with one important modification. I think that floating-point
should be designed first and foremost with the needs of scientific computation
in mind.

And those needs are best met not so much by cost-optimized compromises
as by speed-optimized compromises.

John Savard

Terje Mathisen

unread,
Aug 10, 2022, 3:39:45 AM8/10/22
to
We'll just have to agree to disagreee then. :-)

Mitch has the exact numbers, but as we stated a number of times, any
modern FPU _must_ support FMA, and at that point you already have all
the hardware you need for subnormal, with zero cycle cost and
single-percent extra gates. I would guess the power differential to be
even less.

If you told me that you wanted a special case for fp16 for AI training I
would be willing to believe you, but afaik those chips use a ganged form
of FMA?

Quadibloc

unread,
Aug 10, 2022, 12:54:12 PM8/10/22
to
On Wednesday, August 10, 2022 at 1:39:45 AM UTC-6, Terje Mathisen wrote:

> Mitch have the exact numbers, but as we stated a number of times, any
> modern FPU _must_ support FMA, and at that point you already have all
> the hardware you need for subnormal, with zero cycle cost and
> single-percent extra gates. I would guess the power differential to be
> even less.

I agree with you about subnormal. But the best possible result from division
has an extra cost if you're using the fastest division algorithms, so I still
agree with the basic thrust of his post, that there's a need for a watered-down
version of IEEE 754, even if I agree with you about this particular detail.

John Savard

MitchAlsup

unread,
Aug 10, 2022, 1:44:53 PM8/10/22
to
In the Mc 88120 design, we used the Goldschmidt division algorithm and we
had a 1 ULP answer at 12 cycles, so we delivered it. We then spent another 5
cycles (17 total) to develop the 0.5 ULP result, and if it disagreed with the
1ULP result, we re-sent the result, replaying those operations that consumed
the FDIV. This happened about 1/64 of the time.
<
So, we got both the speed, and the precision ! at some cost in hassle.
<
>
> John Savard
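For readers unfamiliar with it, the Goldschmidt iteration Mitch describes converges quadratically, so a handful of multiplies reaches full precision. A minimal floating-point sketch (the scaling choice and iteration count here are illustrative, not the Mc 88120's):

```python
import math

def goldschmidt_div(n, d, iters=5):
    # Scale so the divisor lies in [0.5, 1); scaling n identically
    # leaves the quotient n/d unchanged.
    e = math.frexp(d)[1]
    n, d = n / 2.0 ** e, d / 2.0 ** e
    for _ in range(iters):
        f = 2.0 - d        # correction factor; drives d toward 1.0
        n *= f             # n converges to the quotient
        d *= f             # the error in d roughly squares each pass
    return n

print(goldschmidt_div(355.0, 113.0))   # ~3.14159292, i.e. 355/113
```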

luke.l...@gmail.com

unread,
Aug 10, 2022, 3:05:53 PM8/10/22
to
On Wednesday, August 10, 2022 at 6:44:53 PM UTC+1, MitchAlsup wrote:

> In the Mc 88120 design, we used Goldschmidt division algorithm and we
> had a 1 ULP answer at 12 cycles, so we delivered it. We then spent another 5
> cycles (17 total) to develop the 0.5 ULP result, and if it disagreed with the
> 1ULP result, we re-sent the result, replaying those operations that consumed
> the FDIV. This happened about 1/64 of the time.

https://git.libre-soc.org/?p=soc.git;a=tree;f=src/soc/fu/div/experiment;hb=HEAD

you may be interested to know that Jacob Lifshay implemented
an academic paper's work which predicts, accurately, how many
iterations will be needed, in advance, based on analysing the
inputs.

he also implemented Goldschmidt sqrt at the same time because
it's easy enough to do once the infrastructure is in place.

l.

Terje Mathisen

unread,
Aug 11, 2022, 3:55:39 AM8/11/22
to
You basically had an FDIV predictor that delivered a 0.55 ulp result, so
even with the replay overhead you would average ~13 cycles,
significantly faster than the 17 required for the exact result.
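Back-of-the-envelope version of that average (the replay penalty below is a guessed parameter, not a Mc 88120 figure):

```python
def avg_fdiv_cycles(fast=12, exact=17, mispredict=1 / 64, replay_penalty=45):
    # The 1-ULP result ships at `fast` cycles; in `mispredict` of cases
    # the 0.5-ULP check fails and the consumers replay after the exact
    # result, paying an extra pipeline-refill penalty.
    return fast + mispredict * (exact - fast + replay_penalty)

print(avg_fdiv_cycles())   # 12.78125
```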

Nice!

Terje Mathisen

unread,
Aug 11, 2022, 7:32:34 AM8/11/22
to
BTW, this does of course mean that you can never do an FDIV as part of
any security processing, since the timing differences would turn into a
back channel! OTOH, I have _never_ seen any use of FP in a security
setting except for bignum math, and there you never have any FDIVs.

Paul A. Clayton

unread,
Aug 14, 2022, 11:50:52 AM8/14/22
to
BGB wrote:
[snip]
> RISC-V C:
>   Looks straightforward enough on the surface.
>   Look a little deeper, it is a dog-chewed mess (even worse than
> Thumb).

I think this is largely a consequence of the RISC-V methodology;
encoding is reserved/provided for extension but little or no
planning is done for how extensions will fit together even when it
was known ahead of time that certain extensions would be desired.
(I do think the instruction length encoding supports excessive
flexibility; 16-bit parcels (length granularity) is unlikely to be
very useful above 64-bit length.) I think this is part of the
complaint "everything is an extension"; one difference between
subset and extension is that for a subset the superset is assumed
to be already defined. (The extension management for RISC-V is
also problematic. The original letter designations have been shown
to be inadequate for the breadth of custom designs. Defining
system profiles that specify a list of extensions may be an
'adequate' fix but seems another indication of inadequate
forethought.)

(For the C extension, the ABI was changed to simplify register
mapping compared to Andrew Waterman's original (academic) proposal
for a compressed instruction extension. While admitting to a
mistake and fixing it is good — and more difficult in a commercial
environment, not that it is easy in academia — I think a little
more forethought would have avoided this. It was known from the
start that 16-bit instructions would be used and obvious that
smaller register specifiers would be desirable for such.)

With respect to consistent placement of register specifiers,
16-bit instructions make such more difficult. I do think one could
encode less timing critical opcode information in the extra bit.
With two fixed-placement bits determining whether an instruction's
length is greater than 16 bits, one could perform a single logical
operation to generate the mask. It seems that the register address
could be partially decoded while this is done so that perhaps no
latency would be added to register access.
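The masking idea above can be sketched in a few lines. The low-two-bits length rule is RISC-V's actual encoding; the register-field position and the 4-bit/5-bit split are hypothetical, chosen only to illustrate the single-operation mask:

```python
def insn_is_32bit(insn):
    # RISC-V rule: an instruction whose two lowest bits are 0b11 is
    # 32 bits (or longer); anything else is a 16-bit parcel.
    return (insn & 0b11) == 0b11

def src_reg(insn, field_pos=7):
    # Hypothetical layout: the register field sits at the same bit
    # position in both lengths, but the 16-bit form carries only
    # 4 bits -- one AND of the two length bits generates the mask.
    mask = 0b11111 if insn_is_32bit(insn) else 0b01111
    return (insn >> field_pos) & mask

# The same field decodes as r31 in a 32-bit insn, r15 in a 16-bit one:
print(src_reg((0b11111 << 7) | 0b11))   # 31
print(src_reg(0b11111 << 7))            # 15
```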

RISC-V's three 3-bit specifiers for some compressed instructions —
providing non-destructive ops is desirable — would make such
impractical (though one could gain one bit by having a register
specifier field with one bit in the next parcel and masking out
the 'low' bit but one bit is not enough — having the
software-named registers different than the instruction encoding
would be unattractive). Slower parsing of the destination
register specifier would not be as problematic, but one would need
to determine quickly that two bits of the source register
specifiers are masked — or not mask those bits, making special
purpose register sets. (With substantial immediate
stack-pointer-relative loads and stores, one might reserve the SP
specifier for such opcodes, provide a third masking option, or
combine masking and special purpose registers.)

(One complaint I have with the C extension is that it is
"compressed", i.e., every C-extension instruction has to have — by
design principle not architectural mandate — a larger
non-compressed variant. This seems wasteful of the encoding space.
This is a consequence of the extension philosophy and the
rejection of special-purpose register sets. If the C extension was
for "code density", it might also have included 48-bit
instructions. (I also think complex instructions are attractive
for code density.) While a complete 32-bit-only ISA would allow
two bits to be ignored and not cached in the instruction cache and
managing the register specifier masking would add complexity
(though one might be able to get away with the masked-bit-as-zero
for the non-redundant opcodes, maybe), such simplicity seems minor
if one provides a reference decode implementation (for tiny cores).)

[snip]
> RISC-V's 32-bit encodings keep registers more consistent at the
> cost of turning immediate values and displacements into a chewed
> up mess.

At least immediates are generally less critical. Only control flow
instructions might *really* like to have the immediate available
early in the pipeline.
[snip]
> In my case, extended constant bits are held in jumbo prefixes,
> which are effectively treated as NOP (and mostly special in that
> the payload bits are routed into the adjacent decoder).

Having instruction-extending nops is a mechanism I kind of like.
It is less dense than a specialized encoding and 'requires' the
fields to be either consistent (e.g., using the 'same' fields in
the "nop" for adding register specifiers) or less critical. In a
sense it is a variation on the 'stop/start-bit' mechanism (which
has been used for x86 predecode and in VLIW). Using a set of nop
opcodes wastes less encoding space — a few opcodes or even only
one opcode versus half of the primary opcodes.
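A toy decode loop for the prefix mechanism (the encoding is entirely made up — opcode 0xFE as the jumbo prefix and a 12-bit base immediate are illustrative values, not BJX2's actual format):

```python
PREFIX_OP = 0xFE          # hypothetical major opcode for the jumbo prefix

def decode(words):
    # Each prefix word banks 24 payload bits; the next ordinary
    # instruction widens its 12-bit immediate with them.
    ext = 0
    for w in words:
        if (w >> 24) == PREFIX_OP:
            ext = (ext << 24) | (w & 0xFFFFFF)
        else:
            yield (w >> 28, (ext << 12) | (w & 0xFFF))
            ext = 0

# One prefix turns a 12-bit immediate into a 36-bit one:
ops = list(decode([(PREFIX_OP << 24) | 0xABCDE, 0x10000123]))
print(ops)   # opcode 1, immediate 0xABCDE123
```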

[snip]
> Also better IMO to keep instructions in a format where they at
> least "make sense" as a sequentially executed instruction scheme.
>
> Say, well-formed code in BJX2 has as a requirement that one can
> ignore WEX and execute stuff sequentially, and it should still
> produce the same results as the bundled version (Or, IOW: If
> bundled and sequential execution produce different results, the
> code is broken).

I am not sure about that. Itanium's always execute as if serial
seemed to be a mistake to me in that it prohibited an instruction
bundle from encoding a swap without temporary. Limiting such
parallelism to bundles does make sense, but renaming two
temporaries seems a small burden. Given Itanium's target uses,
less than single bundle at a time execution seems unlikely.

(However, I am biased. I feel that if one is going with static
scheduling one should not expect executable format compatibility.
Itanium's bundle mechanism seems an awkward provision for binary
compatibility.)

Paul A. Clayton

unread,
Aug 14, 2022, 11:50:59 AM8/14/22
to
MitchAlsup wrote:
[snip]
> Realistically: +1.0, -1.0, 0.0, 2.0, 5.0 10.0 are used a lot.

Itanium had FP registers hardwired to 0.0 and 1.0. This was used
to make FMADD also be FADD and FMUL.

MitchAlsup

unread,
Aug 14, 2022, 12:29:53 PM8/14/22
to
On Sunday, August 14, 2022 at 10:50:52 AM UTC-5, Paul A. Clayton wrote:
> BGB wrote:
> [snip]
> > RISC-V C:
> > Looks straightforward enough on the surface.
> > Look a little deeper, it is a dog-chewed mess (even worse than
> > Thumb).
>
> I think this is largely a consequence of the RISC-V methodology;
<
Methodology: I do not think methodology was used. Sure there
were a couple of papers illustrating the starting point, but it turned
into a free-for-all way too early to call it methodological.
<
> encoding is reserved/provided for extension but little or no
> planning is done for how extensions will fit together even when it
> was known ahead of time that certain extensions would be desired.
> (I do think the instruction length encoding supports excessive
> flexibility; 16-bit parcels (length granularity) is unlikely to be
> very useful above 64-bit length.) I think this is part of the
> complaint "everything is an extension"; one difference between
> subset and extension is that for a subset the superset is assumed
> to be already defined. (The extension management for RISC-V is
> also problematic. The original letter designations has been shown
> to be inadequate for the breadth of custom designs. Defining
> system profiles that specify a list of extensions may be an
> 'adequate' fix but seems another indication of inadequate
> forethought.)
>
> (For the C extension, the ABI was changed to simplify register
> mapping compared to Andrew Waterman's original (academic) proposal
> for a compressed instruction extension. While admitting to a
> mistake and fixing it is good — and more difficult in a commercial
> environment, not that it is easy in academia — I think a little
> more forethought would have avoided this. It was known from the
> start that 16-bit instructions would be used and obvious that
> smaller register specifiers would be desirable for such.)
<
Nevertheless: having 2 places from which a register specifier is sourced
causes a multiplexer to exist between the arriving instruction and the
register file decoder port. This multiplexer is dependent on the bottom
2 bits of the instruction.
<
And then there are the compressed immediates--good luck reading them
if you don't have a tool to parse them and display them.
>
> With respect to consistent placement of register specifiers,
> 16-bit instructions make such more difficult. I do think one could
<
Not more difficult--impossible:: there is a vast difference.
<
> encode less timing critical opcode information in the extra bit.
> With two fixed-placement bits determining whether an instruction's
> length is greater than 16 bits, one could perform a single logical
> operation to generate the mask. It seems that the register address
> could be partially decoded while this is done so that perhaps no
> latency would be added to register access.
<
Putting the OpCode down at the bottom also prevents detecting
that you have just transferred control into data.
>
> RISC-V's three 3-bit specifiers for some compressed instructions —
> providing non-destructive ops is desirable — would make such
> impractical (though one could gain one bit by having a register
> specifier field with one bit in the next parcel and masking out
> the 'low' bit but one bit is not enough — having the
> software-named registers different than the instruction encoding
> would be unattractive). Slower parsing of the destination
> register specifier would not be as problematic, but one would need
> to determine quickly that two bits of the source register
> specifiers are masked — or not mask those bits, making special
> purpose register sets. (With substantial immediate
> stack-pointer-relative loads and stores, one might reserve the SP
> specifier for such opcodes, provide a third masking option, or
> combine masking and special purpose registers.)
>
> (One complaint I have with the C extension is that it is
> "compressed", i.e., every C-extension instruction has to have — by
> design principle not architectural mandate — a larger
> non-compressed variant. This seems wasteful of the encoding space.
<
One would think that since RISC-V-C has these compressed instructions
and their papers indicate it saves ~30%± of code size that they would
have a more competitive code density than they achieved.
<
> This is a consequence of the extension philosophy and the
> rejection of special-purpose register sets. If the C extension was
<
Rejection of Special Purpose register sets in RISC-V ?!? They have
Floating Point, and Vector Registers.
<
> for "code density", it might also have included 48-bit
> instructions. (I also think complex instructions are attractive
> for code density.) While a complete 32-bit-only ISA would allow
> two bits to be ignored and not cached in the instruction cache and
> managing the register specifier masking would add complexity
> (though one might be able to get away with the masked-bit-as-zero
> for the non-redundant opcodes, maybe), such simplicity seems minor
> if one provides a reference decode implementation (for tiny cores).)
<
It might surprise you to notice that My 66000 ISA uses only ~2400
instructions to encode the semantic content of CoreMark while RISC-V
uses ~3400 instructions.
<
Real Code Density is the measure of how few instructions it takes
to encode and how few it takes to perform.
>
> [snip]
> > RISC-V's 32-bit encodings keep registers more consistent at the
> > cost of turning immediate values and displacements into a chewed
> > up mess.
>
> At least immediates are generally less critical. Only control flow
> instructions might *really* like to have the immediate available
> early in the pipeline.
<
Constants {immediates and displacements} are ALWAYS cheaper to
deliver as operands than any other way of delivering an operand.
<
> [snip]
> > In my case, extended constant bits are held in jumbo prefixes,
> > which are effectively treated as NOP (and mostly special in that
> > the payload bits are routed into the adjacent decoder).
>
> Having instruction-extending nops is a mechanism I kind of like.
<
Why does an ISA even need NoOps today ?

Anton Ertl

unread,
Aug 14, 2022, 1:12:51 PM8/14/22
to
"Paul A. Clayton" <paaron...@gmail.com> writes:
>Itanium's always execute as if serial
>seemed to be a mistake to me in that it prohibited an instruction
>bundle from encoding a swap without temporary. Limiting such
>parallelism to bundles does make sense, but renaming two
>temporaries seems a small burden. Given Itanium's target uses,
>less than single bundle at a time execution seems unlikely.

I think that every IA-64 implementation needs to be able to execute an
instruction at a time, for exception handling and debugging.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Paul A. Clayton

unread,
Aug 14, 2022, 2:27:22 PM8/14/22
to
MitchAlsup wrote:
> On Sunday, August 14, 2022 at 10:50:52 AM UTC-5, Paul A. Clayton wrote:
>> BGB wrote:
>> [snip]
>>> RISC-V C:
>>> Looks straightforward enough on the surface.
>>> Look a little deeper, it is a dog-chewed mess (even worse than
>>> Thumb).
>>
>> I think this is largely a consequence of the RISC-V methodology;
>
> Methodology: I do not think methodology was used. Sure there
> were a couple of papers illustrating the starting point, but it turned
> into a free-for-all way too early to call it methodological.

I think it is more than just the rush from academic proposal to
commercial use. RISC-V seems to have some design philosophy
issues. (I do not have the time or mental focus to comment at
length on this.)

[snip]

>> With respect to consistent placement of register specifiers,
>> 16-bit instructions make such more difficult. I do think one could
>
> Not more difficult--impossible:: there is a vast difference.

As I wrote, with two 4-bit register names, a 16-bit instruction is
large enough to allow placement consistent with two 5-bit source
register names in a 32-bit instruction. Masking one bit based on
an AND (or NAND) of two fixed-position bits in the instruction
does not seem "impossible".

I do not know how much code density benefit one can get within
this constraint. The loss of two bits of major opcode space is
also significant, so a fair comparison would have to consider
alternatives.

[snip]
>> (One complaint I have with the C extension is that it is
>> "compressed", i.e., every C-extension instruction has to have — by
>> design principle not architectural mandate — a larger
>> non-compressed variant. This seems wasteful of the encoding space.
>
> One would think that since RISC-V-C has these compressed instructions
> and their papers indicate it saves ~30%± of code size that they would
> have a more competitive code density than they achieved.

Unlike most code density oriented designs, load/store multiple
(mainly for function entry/exit) is not provided — it is not a
compressed instruction and is not a 3-or-fewer-source, 1-destination
operation (i.e., "RISC"). (My 66000's ENTER/EXIT is nice for this
purpose — complex instructions can also provide semantic
information which can be exploited by the implementation.)

Not including 48-bit instructions (also a consequence of
"compressed" philosophy) may also hurt code density. As you have
noted, pasting together constants from multiple instructions is
not efficient. Good code density is perhaps most important in the
32-bit microcontroller space, so 32-bit immediates might be "good
enough".
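The pasting has a wrinkle worth seeing: RISC-V's low 12-bit immediate is sign-extended, so the upper LUI part must compensate. This split rule is standard RV constant materialization (e.g. `lui a1, 1; addiw a1, a1, -1948` produces 4096 - 1948 = 2148, exactly the sequence in the CoreMark listings downthread):

```python
def rv_split32(imm):
    # Low 12 bits, sign-extended -- what the ADDIW will contribute ...
    lo = ((imm & 0xFFF) ^ 0x800) - 0x800
    # ... and the LUI part is whatever remains after compensating.
    hi = (imm - lo) & 0xFFFFF000
    return hi, lo

print(rv_split32(2148))   # (4096, -1948): lui 1; addiw -1948
```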

I think the code size comparisons are also limited to simple
conversion rather than biasing register allocation to use smaller
instructions.

>> This is a consequence of the extension philosophy and the
>> rejection of special-purpose register sets. If the C extension was
>
> Rejection of Special Purpose register sets in RISC-V ?!? They have
> Floating Point, and Vector Registers.

I meant special purpose beyond earlier RISCs; MIPS (which actually
had multiply/divide destination SPRs), Alpha, SPARC (condition
codes), POWER (multiple condition codes and two "front-end"
registers for link/count and indirect jumps) all had FPRs. Your
influence on M88k was insufficient to keep FPRs out before the ISA
died. (I have not looked at RISC-V's vector extension or its SIMD
extension.) A large set of registers for special use has fewer
register allocation issues, which seems to be the main complaint
about special purpose registers.

[snip]
> It might surprise you to notice that My 66000 ISA uses only ~2400
> instructions to encode the semantic content of CoreMark while RISC-V
> uses ~3400 instructions.

How much of that is just from ENTER/EXIT? My 66000's single
register set may have some instruction count advantages, but I do
not know what else would account for the difference. (Why/how is
often more interesting than what or even when.)

I forgot about My 66000's more extensive addressing modes; this
probably accounts for a substantial number of instructions.
(Others have urged for an extension to provide such, so this is a
recognized issue even if it has been "solved" by suggesting
instruction fusion.)


> Real Code Density is the measure of how few instructions it takes
> to encode and how few it takes to perform.

Not for microcontrollers (and some other embedded uses).

[snip]
>> Having instruction-extending nops is a mechanism I kind of like.
>
> Why does an ISA even need NoOps today ?

Should an ISA have hints? If so, one would probably want to
provide encodings for future hint use that do not generate
unimplemented instruction exceptions.

Nops can also be used for padding to provide alignment. With
caches having alignment biases (other features may also),
alignment might be significant.

(Presumably for function end padding, always excepting all-zero
instructions would be better, but a single nop is probably better
than a jump zero instructions forward instruction. For larger
padding or more sophisticated implementations, jumps may be
preferred.)

There may also be cases where one would like to use padding to
support limited in-place code modification.

"Quality of implementation" guarantees (as well as availability of
some features such as something like ESM) would presumably impact
how useful nops would be.

With respect to extending an instruction, the conceptual nop
(really a prefix [or postfix??]) is a different concept.

MitchAlsup

unread,
Aug 14, 2022, 3:31:07 PM8/14/22
to
Yes, you are touching on the difference:
<
"Do nothing that harms Decoding" paraphrased from Mark Horowitz
circa 1985-ish
<
"Everything should be as simple as possible, but no simpler" quote
from Albert Einstein.
<
It is my considered belief that RISC has gone too far on the reduced side
of things and left out too many good and useful instructions. CISC, on the
other hand, put too many things in; some CISCs from long ago were based
on a µcode master sequencer and led to abominations such as VAX and
x86. Other CISCs got the major ISA reasonably correct, and then went
overboard on the instructions they put in (IBM 360 for example).
>
> Not including 48-bit instructions (also a consequence of
> "compressed" philosophy) may also hurt code density. As you have
> noted, pasting together constants from multiple instructions is
> not efficient. Good code density is perhaps most important in the
> 32-bit microcontroller space, so 32-bit immediates might be "good
> enough".
>
> I think the code size comparisons are also limited to simple
> conversion rather than biasing register allocation to use smaller
> instructions.
<
> >> This is a consequence of the extension philosophy and the
> >> rejection of special-purpose register sets. If the C extension was
> >
> > Rejection of Special Purpose register sets in RISC-V ?!? They have
> > Floating Point, and Vector Registers.
<
> I meant special purpose beyond earlier RISCs; MIPS (which actually
> had multiply/divide destination SPRs), Alpha, SPARC (condition
> codes), POWER (multiple condition codes and two "front-end"
> registers for link/count and indirect jumps) all had FPRs. Your
> influence on M88k was insufficient to keep FPRs out before the ISA
> died. (I have not looked at RISC-V's vector extension or its SIMD
> extension.) A large set of registers for special use has less
> register allocation issues, which seems to be the main complaint
> about special purpose registers.
<
Yes, RISC-V has 32 double precision FP registers, then they went over
and added 32 vector registers each containing 8×32-bit containers.
RISC-V added 154 vector instructions.
<
Whereas: My 66000 has 32 registers of 64-bits each, which includes
floating point and vector data. My 66000 has 2 vector instructions.
My 66000 vectorizes C's str*() library and mem*() libraries so everyone
benefits from vectorization.
>
> [snip]
> > It might surprise you to notice that My 66000 ISA uses only ~2400
> > instructions to encode the semantic content of CoreMark while RISC-V
> > uses ~3400 instructions.
<
> How much of that is just from ENTER/EXIT? My 66000's single
> register set may have some instruction count advantages, but I do
> not know what else would account for the difference. (Why/how is
> often more interesting than what or even when.)
<
ENTER and EXIT contribute to code density, but overall not that many
subroutines save lots of registers on the stack. But, for example::
<
CoreMark: file core_main:: subroutine:: main::
RISC-V
<
main: # @main
# %bb.0:
addi sp, sp, -2032
sd ra, 2024(sp) # 8-byte Folded Spill
sd s0, 2016(sp) # 8-byte Folded Spill
sd s1, 2008(sp) # 8-byte Folded Spill
sd s2, 2000(sp) # 8-byte Folded Spill
sd s3, 1992(sp) # 8-byte Folded Spill
sd s4, 1984(sp) # 8-byte Folded Spill
sd s5, 1976(sp) # 8-byte Folded Spill
sd s6, 1968(sp) # 8-byte Folded Spill
sd s7, 1960(sp) # 8-byte Folded Spill
sd s8, 1952(sp) # 8-byte Folded Spill
sd s9, 1944(sp) # 8-byte Folded Spill
sd s10, 1936(sp) # 8-byte Folded Spill
sd s11, 1928(sp) # 8-byte Folded Spill
addi sp, sp, -224
mv a2, a1
lui a1, 1
addiw a1, a1, -1948
add a1, a1, sp
sw a0, 0(a1)
lui a0, 1
addiw a0, a0, -1958
add a0, a0, sp
mv s5, a0
lui a0, 1
addiw a0, a0, -1948
add a0, a0, sp
mv a1, a0
mv a0, s5
call portable_init
<
My 66000
>
main: ; @main
; %bb.0:
enter r19,r0,0,2120
mov r3,r2
stw r1,[sp,2116]
add r29,sp,#2000
add r30,r29,#106
add r2,sp,#2116
mov r1,r30
call portable_init
>
This little snippet of code illustrates 3 RISC-V weaknesses::
<
Prologue and epilogue sequences (4 registers per cycle)
16-bit displacements (compared to 12-bit)
16-bit immediates (compared to 12-bit)
<
But taken all-together, I am finding more compression bang for the
buck with the 16-bit constants than from ENTER and EXIT.
<
> I forgot about My 66000's more extensive addressing modes; this
> probably accounts for a substantial number of instructions.
<
RISC-V
# %bb.18:
addi s0, zero, 10
lui a0, 1
addiw a0, a0, -2020
add a0, a0, sp
sw s0, 0(a0)
call start_time
addi a0, sp, 2032
call iterate
call stop_time
call get_time
call time_in_secs
lui a1, %hi(.LCPI1_0)
fld ft1, %lo(.LCPI1_0)(a1)
fmv.d.x ft0, a0
flt.d a0, ft0, ft1
beqz a0, .LBB1_21
My 66000
.LBB1_18:
stw #10,[sp,2044]
call start_time
add r1,sp,#2000
call iterate
call stop_time
call get_time
call time_in_secs
fcmp r2,r1,#0x3FF0000000000000
bnge r2,.LBB1_22
<
16-bit constants
and floating point immediates
<
RISC-V
.LBB1_21: # %.loopexit16
fcvt.wu.d a1, ft0, rtz
addi a0, zero, 1
beqz a1, .LBB1_23
# %bb.22: # %.loopexit16
fcvt.lu.d a0, ft0, rtz
.LBB1_23: # %.loopexit16
lui a1, 1
addiw a1, a1, -2020
add a1, a1, sp
lw a1, 0(a1)
addi a2, zero, 10
divu a0, a2, a0
addi a0, a0, 1
mul a0, a1, a0
lui a1, 1
addiw a1, a1, -2020
add a1, a1, sp
sw a0, 0(a1)
My 66000
.LBB1_22: ; %.loopexit16
cvtdu r1,r1
mov r2,#1
cmov r1,r1,r1,r2
div r1,#10,r1
add r1,r1,#1
lduw r2,[sp,2044]
mul r1,r2,r1
stw r1,[sp,2044]
<
The small immediate in the div instruction helps
CMOV helps
16-bit constants help
<
RISC-V
.LBB1_24:
call start_time
addi a0, sp, 2032
call iterate
call stop_time
call get_time
lh a1, 2032(sp)
mv s6, a0
mv a0, a1
mv a1, zero
call crc16
lh a1, 2034(sp)
mv a2, a0
mv a0, a1
mv a1, a2
call crc16
lh a1, 2036(sp)
mv a2, a0
mv a0, a1
mv a1, a2
call crc16
lui a1, 1
addiw a1, a1, -2024
add a1, a1, sp
lh a1, 0(a1)
mv a2, a0
mv a0, a1
mv a1, a2
call crc16
mv s4, a0
lui a0, 16
lui a1, 8
addiw a1, a1, -1276
addiw s0, a0, -1
bge a1, s4, .LBB1_31
My 66000
.LBB1_23:
call start_time
mov r1,r29
call iterate
call stop_time
call get_time
mov r29,r1
ldsh r1,[sp,2000]
mov r28,#0
mov r2,r28
call crc16
mov r2,r1
ldsh r1,[r27]
call crc16
mov r2,r1
ldsh r1,[r26]
call crc16
mov r2,r1
ldsh r1,[sp,2040]
call crc16
mov r27,r1
mov r22,#65535
cmp r1,r27,#31492
ble r1,.LBB1_28
<
Here the ABI takes fewer moves from one subroutine CALL to the next.
16-bit immediates
32-bit immediates
<
RISC-V
# %bb.72:
lui a0, %hi(default_num_contexts)
lw a0, %lo(default_num_contexts)(a0)
lui a1, 1
addiw a1, a1, -2020
add a1, a1, sp
lw a1, 0(a1)
mul a0, a1, a0
fcvt.d.wu ft0, a0
fsd ft0, 24(sp) # 8-byte Folded Spill
mv a0, s6
call time_in_secs
fmv.d.x ft0, a0
fld ft1, 24(sp) # 8-byte Folded Reload
fdiv.d ft0, ft1, ft0
fmv.x.d a1, ft0
lui a0, %hi(.L.str.29)
addi a0, a0, %lo(.L.str.29)
lui a2, %hi(.L.str.18)
addi a2, a2, %lo(.L.str.18)
lui a3, %hi(.L.str.20)
addi a3, a3, %lo(.L.str.20)
call ee_printf
lui a0, %hi(.L.str.30)
addi a0, a0, %lo(.L.str.30)
lui a1, %hi(.L.str.22)
addi a1, a1, %lo(.L.str.22)
call ee_printf
lui a0, %hi(.L.str.31)
addi a0, a0, %lo(.L.str.31)
My 66000
; %bb.76:
lduw r1,[ip,default_num_contexts]
lduw r2,[sp,2044]
mul r1,r2,r1
srl r1,r1,<32:0>
cvtud r28,r1
mov r1,r29
call time_in_secs
fdiv r2,r28,r1
la r1,[ip,.L.str.29]
la r3,[ip,.L.str.18]
la r4,[ip,.L.str.20]
call ee_printf
la r1,[ip,.L.str.30]
la r2,[ip,.L.str.22]
call ee_printf
la r1,[ip,.L.str.31]
<
IP-relative memory references.
<
RISC-V
get_seed_32: # @get_seed_32
# %bb.0:
addiw a1, a0, -1
addi a2, zero, 4
bltu a2, a1, .LBB0_2 ; 3
# %bb.1:
addi a0, a0, -1
lui a1, %hi(.Lswitch.table.get_seed_32)
addi a1, a1, %lo(.Lswitch.table.get_seed_32)
slli a0, a0, 32
srli a0, a0, 29
add a0, a0, a1
ld a0, 0(a0)
lwu a0, 0(a0)
sext.w a0, a0
ret ; 13
.LBB0_2:
mv a0, zero
ret
My 66000
get_seed_32: ; @get_seed_32
; %bb.0:
add r1,r1,#-1
srl r2,r1,<32:0>
mov r1,#0
cmp r3,r2,#4
phi r3,0,FF
ldd r1,[ip,r2<<3,.Lswitch.table.get_seed_32]
lduw r1,[r1]
ret
<
Here, Predication saves the day.
AND finally::
RISC-V
.LBB2_12: # %.preheader9
# Parent Loop BB2_6 Depth=1
# => This Inner Loop Header: Depth=2
addi a1, a1, 1
lbu a0, 0(a1)
addi a2, a0, -32
bltu s1, a2, .LBB2_17 ; 58
# %bb.13: # %.preheader9
# in Loop: Header=BB2_12 Depth=2
slli a2, a2, 3
add a2, a2, s8
ld a3, 0(a2)
addi a2, zero, 4
jr a3
<elsewhere>
.LJTI2_0:
.quad .LBB2_14, .LBB2_17, .LBB2_17, .LBB2_15,
.LBB2_17, .LBB2_17, .LBB2_17, .LBB2_17,
.LBB2_17, .LBB2_17, .LBB2_17, .LBB2_11,
.LBB2_17, .LBB2_10, .LBB2_17, .LBB2_17,
.LBB2_16
My 66000
.LBB2_12: ; %.preheader8
; Parent Loop BB2_6 Depth=1
; => This Inner Loop Header: Depth=2
ldsb r2,[r1]
srl r3,r2,<32:0>
add r4,r3,#-32
cmp r3,r4,#16
bhi r3,.LBB2_17 ; 42
; %bb.13: ; %.preheader8
; in Loop: Header=BB2_12 Depth=2
mov r3,#4
jtth r4,#17
.jt16 .LBB2_14,.LBB2_17,.LBB2_17,.LBB2_15,
.LBB2_17,.LBB2_17,.LBB2_17,.LBB2_17,
.LBB2_17,.LBB2_17,.LBB2_17,.LBB2_11,
.LBB2_17,.LBB2_10,.LBB2_17,.LBB2_17,
.LBB2_16
<
Built-in switch instruction with a table of 16-bit displacements (PIC)
rather than a table of 64-bit absolutes.
>
> (Others have urged for an extension to provide such, so this is a
> recognized issue even if it has been "solved" by suggesting
> instruction fusion.)
<
It is not "just one thing" that RISC-V left out, from this benchmark
their displacement size is limited, their immediate size is limited,
they are missing several important instructions (PRED, CMOV),
IP-relative memory addressing is expensive, they are missing
placement of constants, floating point constants, and their ABI is
not as clean as My 66000.
<
Leaving out predication means RISC-V takes a lot more branches
than My 66000, thus wasting precious FETCH BW and power.
<
> > Real Code Density is the measure of how few instructions it takes
> > to encode and how few it takes to perform.
<
> Not for microcontrollers (and some other embedded uses).
<
It always takes less power to execute fewer instructions containing the
same total semantic content.
<
It always takes less power to deliver constants directly into execution
than to paste constants together with instructions.
>
> [snip]
> >> Having instruction-extending nops is a mechanism I kind of like.
> >
> > Why does an ISA even need NoOps today ?
<
> Should an ISA have hints? If so, one would probably want to
> provide encodings for future hint use that do not generate
> unimplemented instruction exceptions.
<
I take the other viewpoint: if you do not know what this encoding
is supposed to do on this current machine, then it should raise
an exception. Forwards compatibility is vastly harder than
backwards compatibility.
<
I have yet to see a hint last more than 2 generations (where it remains
both viable and expressive). Right now, I don't have any.
>
> Nops can also be used for padding to provide alignment. With
> caches having alignment biases (other features may also),
> alignment might be significant.
<
With everything on 32-bit boundaries, there is no such need.
>
> (Presumably for function end padding, always excepting all-zero
> instructions would be better, but a single nop is probably better
> than a jump zero instructions forward instruction. For larger
> padding or more sophisticated implementations, jumps may be
> preferred.)
<
I went out of my way to reserve in perpetuity 6 Major OpCodes
so if you find yourself executing integer or floating point data
(containing typical integer or floating point data), you are likely
to raise an OPERATION exception (UNIMPlemented) even without
the MMU/TLB protecting your code space.
>
> There may also be cases where one would like to use padding to
> support limited in-place code modification.
<
I do not even want a PTE with RWE = x11. If you can write to the
page, you cannot execute from it at the same time. This opens up
a rich source of race conditions--all of which need to be expressed
in the architecture manual, and a likely source of side channels.
>
> "Quality of implementation" guarantees (as well as availability of
> some features such as something like ESM) would presumably impact
> how useful nops would be.
<
ESM did not need NoOps.
>
> With respect to extending an instruction, the conceptual nop
> (really a prefix [or postfix??]) is a different concept.
<
My 66000 does have instruction-modifiers:: multi-instruction prefix
if you will.

Paul A. Clayton

Aug 14, 2022, 3:54:23 PM
Anton Ertl wrote:
> "Paul A. Clayton" <paaron...@gmail.com> writes:
>> Itanium's always execute as if serial
>> seemed to be a mistake to me in that it prohibited an instruction
>> bundle from encoding a swap without temporary. Limiting such
>> parallelism to bundles does make sense, but renaming two
>> temporaries seems a small burden. Given Itanium's target uses,
>> less than single bundle at a time execution seems unlikely.
>
> I think that every IA-64 implementation needs to be able to execute an
> instruction at a time, for exception handling and debugging.

Yep. That is the specification. I do not believe such is a
necessary feature of such a bundle-based ISA. Exception handling
and debugging would be more complex, but I *suspect* not horrible.
(One already has a distinction between source code
statement/expression and instruction both with a single expression
being compiled to multiple instructions and multiple source code
particles being compiled into a single
atomic-with-respect-to-interrupts instruction.) Traditional VLIWs
naturally have this kind of imprecise exceptions.

This design choice may not have been a mistake; I know I
undervalue exception handling and debugging and probably just
assume that developing excellent debugging tools is a completely
solved problem.☹ (This is one problem with being an armchair
computer architect without noticeable programming experience or
even computer science education. On the other hand, I am even more
ignorant on the electrical engineering side!)

BGB

Aug 14, 2022, 5:13:10 PM
On 8/14/2022 10:50 AM, Paul A. Clayton wrote:
> BGB wrote:
> [snip]
>> RISC-V C:
>>    Looks straightforward enough on the surface.
>>    Look a little deeper, it is a dog-chewed mess (even worse than Thumb).
>
> I think this is largely a consequence of the RISC-V methodology;
> encoding is reserved/provided for extension but little or no planning is
> done for how extensions will fit together even when it was known ahead
> of time that certain extensions would be desired. (I do think the
> instruction length encoding supports excessive flexibility; 16-bit
> parcels (length granularity) is unlikely to be very useful above 64-bit
> length.) I think this is part of the complaint "everything is an
> extension"; one difference between subset and extension is that for a
> subset the superset is assumed to be already defined. (The extension
> management for RISC-V is also problematic. The original letter
> designations has been shown to be inadequate for the breadth of custom
> designs. Defining system profiles that specify a list of extensions may
> be an 'adequate' fix but seems another indication of inadequate
> forethought.)
>

Yeah.

Take an overly minimalist core ISA design.
Then haphazardly bolt a bunch of random crap on it.
All while not really addressing design deficiencies in the original part
of the ISA (on philosophical grounds), and bolting on crap that is
significantly more expensive yet offers less benefit in practice.


> (For the C extension, the ABI was changed to simplify register mapping
> compared to Andrew Waterman's original (academic) proposal for a
> compressed instruction extension. While admitting to a mistake and
> fixing it is good — and more difficult in a commercial environment, not
> that it is easy in academia — I think a little more forethought would
> have avoided this. It was known from the start that 16-bit instructions
> would be used and obvious that smaller register specifiers would be
> desirable for such.)
>

Yes.


> With respect to consistent placement of register specifiers, 16-bit
> instructions make such more difficult. I do think one could encode less
> timing critical opcode information in the extra bit.
> With two fixed-placement bits determining whether an instruction's
> length is greater than 16 bits, one could perform a single logical
> operation to generate the mask. It seems that the register address could
> be partially decoded while this is done so that perhaps no latency would
> be added to register access.
>
> RISC-V's three 3-bit specifiers for some compressed instructions —
> providing non-destructive ops is desirable — would make such impractical
> (though one could gain one bit by having a register specifier field with
> one bit in the next parcel and masking out the 'low' bit but one bit is
> not enough — having the software-named registers different than the
> instruction encoding would be unattractive.). Slower parsing of the
> destination register specifier would not be as problematic, but one
> would need to determine quickly that two bits of the source register
> specifiers are masked — or not mask those bits, making special purpose
> register sets. (With substantial immediate stack-pointer-relative loads
> and stores, one might reserve the SP specifier for such opcodes, provide
> a third masking option, or combine masking and special purpose registers.)
>

In my case, I skipped out on 3-bit register fields for the 16-bit
instructions, as I was at least trying to keep the 16-bit encoding
semi-consistent.


Instead, the 4-bit registers are able to address a "useful subset":
6 scratch registers, including the first 4 function arguments;
7 preserved registers;
The SP register.


There are still some inconsistencies though, such as a few 16-bit
instructions which can use 5-bit register fields, and several different
sub-encodings (mostly differing in the location of the high bit).

A few instructions also encode 128-bit operations but fold the 5-bit
register down into 4 bits:
0=R0*, 1=R16, 2=R2, 3=R18, ..., C=R12, D=R28, E=R14*, F=R30
*: R0 and R14 are not allowed with 128-bit instructions.
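Decoding that folded field is cheap; here is a C sketch of the mapping exactly as listed above (even field values select R0..R14, odd values select R16..R30):

```c
/* Fold-down register field for 128-bit ops, per the table above:
   even field values f map to R[f], odd values map to R[f+15]
   (so 1 -> R16, 3 -> R18, ..., D -> R28, F -> R30). */
int fold4_to_reg(int f) {
    return (f & 1) ? (f + 15) : f;
}
```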


Meanwhile, in RISC-V, the register assignments are a bit haphazard.


> (One complaint I have with the C extension is that it is "compressed",
> i.e., every C-extension instruction has to have — by design principle
> not architectural mandate — a larger non-compressed variant. This seems
> wasteful of the encoding space. This is a consequence of the extension
> philosophy and the rejection of special-purpose register sets. If the C
> extension was for "code density", it might also have included 48-bit
> instructions. (I also think complex instructions are attractive for code
> density.) While a complete 32-bit-only ISA would allow two bits to be
> ignored and not cached in the instruction cache and managing the
> register specifier masking would add complexity (though one might be
> able to get away with the masked-bit-as-zero for the non-redundant
> opcodes, maybe), such simplicity seems minor if one provides a reference
> decode implementation (for tiny cores).)
>

As I see it, for a "compressed" encoding to be validly called such,
there should be a simple and consistent scheme to "uncompress" it into a
corresponding full-length instruction. RV-C does not follow this pattern
IMO (it will effectively need its own decoder, rather than an unpacker).
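As a sketch of the distinction, this is what a true "unpacker" could look like in C. The field layouts here are hypothetical, invented purely for illustration; they are not BJX2's or RISC-V's real encodings:

```c
#include <stdint.h>

/* Hypothetical encodings, for illustration only:
   16-bit:  [15:8] opcode, [7:4] rd (also rs1), [3:0] rs2  (destructive ADD)
   32-bit:  [31:24] opcode, [23:19] rd, [18:14] rs1, [13:9] rs2
   "Uncompressing" is then pure field widening and relocation:
   no separate decode tables, just wire routing. */
uint32_t uncompress_add(uint16_t insn16) {
    uint32_t rd  = (insn16 >> 4) & 0xF;   /* 4-bit field widens to 5 bits */
    uint32_t rs2 = insn16 & 0xF;
    uint32_t op32 = 0x33;                  /* hypothetical 32-bit ADD opcode */
    return (op32 << 24) | (rd << 19) | (rd << 14) | (rs2 << 9);
}
```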

Alternately, if one is designing a 16-bit encoding, it would make at
least some sense to try to keep the encoding fairly consistent from one
instruction to another:
Register fields in the same places;
Immediate values in the same places;
Immediate bits in a "sane" order;
And, the layout of immediate bits should be consistent from one
instruction to the next.

If one has two instructions with their immediate fields in the same
place, with the same size, and the relative ordering of the bits within
the immediate field is entirely different, this is a design fail IMO.



> [snip]
>> RISC-V's 32-bit encodings keep registers more consistent at the cost
>> of turning immediate values and displacements into a chewed up mess.
>
> At least immediates are generally less critical. Only control flow
> instructions might *really* like to have the immediate available early
> in the pipeline.

Except that displacement encodings are typically dog-chewed as well...
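For a concrete case, reassembling a RISC-V B-type branch displacement in software means gathering bits from four scattered fields (layout per the RISC-V spec); a sketch in C:

```c
#include <stdint.h>

/* RISC-V B-type branch displacement: imm[12] is inst[31], imm[11] is
   inst[7], imm[10:5] is inst[30:25], imm[4:1] is inst[11:8], imm[0]=0. */
int32_t btype_disp(uint32_t inst) {
    uint32_t imm = ((inst >> 31) & 0x1)  << 12
                 | ((inst >> 7)  & 0x1)  << 11
                 | ((inst >> 25) & 0x3F) << 5
                 | ((inst >> 8)  & 0xF)  << 1;
    return (int32_t)(imm << 19) >> 19;   /* sign-extend the 13-bit result */
}
```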


Granted, this area is kinda difficult in my case as well, partly when it
comes to the 64 bit encodings.

It also looks like some paths for unpacking immediate fields eat up a
decent chunk of LUTs in my instruction decoder (I suspect a lot also
related to sign and zero/one extension cases).


It seems the RISC-V decoder also represents a decent chunk of LUTs.

However, the 3-wide decoder still costs less than the L1 D$ cache or
Lane-1 ALU, so, probably not a huge issue.



Sadly, my recent attempts at shaving LUTs off my CPU core have resulted
in a storm of bugs, still trying to sort all this out.

I had been a little tempted at a few points to be like "screw it" and
revert back to a version of the core from around 2 weeks ago.

But, OTOH:
Total LUT cost has dropped from 82% -> 68%
(Shaved several k LUTs off the main CPU core).
Timing has gotten better;
Estimated power use has dropped from 1.1W to 0.8W;
...

And, otherwise, I would need to try to recreate some of these savings in
the restored version, and possibly still need to hunt down some of the
same bugs as a result (one of the bigger cost-saving features being
partly reworking how the interrupt-handling mechanism works, and
eliminating a bunch of the register side-channels as a result).

Doesn't help sometimes that trying to do much of anything non-trivial in
the Verilog code is often analogous to poking a bees nest with a stick.


It isn't the ones which cause immediate crashes which are the issue, but
rather the ones which cause bugs that don't manifest until after
several hours or more of simulation time... (Eg: "Why is Quake not
rendering correctly here in the simulation?"...).


Also annoying when an instruction seems to work fine in the unit tests
and initial sanity checks, but then misbehaves for some unknown reason
once it is being used by the program, ...

This would seem to be the case for the low-precision SIMD operations (in
particular, the FADD operation), which for whatever reason doesn't seem
to be working correctly when used in TKRA-GL (it seemed like it was
working before, so dunno...).

Also Doom currently seems to be misbehaving as well in the simulation,
so more debugging is likely needed here.



Another minor saving was due to merging a few of the aliased ports
within the pipeline:
The Rx and Ry ports are shared between Lanes 1/2 and Lane 3;
However, these lanes had copies of the values, rather than sharing them
directly (sharing via 'wire' and 'assign' rather than value duplicated
registers can save some LUTs here it seems).

Well, along with misc things, like doing a staged left shift in
small-to-large order seemingly using fewer LUTs than doing it in
large-to-small order; ...


That and also making the observation that:
Rx <= 0;
Ry <= Rx;

Appears to only apply across a single clock edge in terms of constant
propagation (mostly affects using "parameter" to prune stuff).

So, say, while stuff connected to Rx might get pruned away from the 0,
something connected to Ry will not.

Or, in effect, the constant propagation appears mostly local to
combinatorial sections and the inputs to those sections (partly noted
based on LUT cost behavior, and also implicitly that Vivado apparently
generates warnings whenever it trims something or does constant
propagation, at least until it hits the maximum limit for that
particular warning).

One of the rare "useful" warnings in this category being "inferring
latch for ..." which usually means that one has forgotten to assign a
value here (and leaving this one unfixed basically throws a grenade into
LUT cost and timing; as well as often causing Verilator to start
freaking out about perceived circular logic, ...).

Well, and it also warns about using features which don't exist in
Verilog-95, ...


> [snip]
>> In my case, extended constant bits are held in jumbo prefixes, which
>> are effectively treated as NOP (and mostly special in that the payload
>> bits are routed into the adjacent decoder).
>
> Having instruction-extending nops is a mechanism I kind of like. It is
> less dense that a specialized encoding and 'requires' the fields to be
> either consistent (e.g., using the 'same' fields in the "nop" for adding
> register specifiers) or less critical. In a sense it is a variation on
> the 'stop/start-bit' mechanism (which has been used for x86 predecode
> and in VLIW). Using a set of nop opcodes wastes less encoding space — a
> few opcodes or even only one opcode versus half of the primary opcodes.
>

In this case, the Jumbo prefixes are the bundled version of the "LDIz
Imm24, R0" instruction.

Executed as standalone instructions, they load a 24-bit (zero or one
extended) value into a fixed register;
Used in a bundle, they extend another instruction.

Combine them with themselves, and one gets either a 48-bit constant load
or an Abs48 branch instruction.


Recently, I have also been dealing with a potentially "kinda tacky" way of doing
SIMD Int<->FP conversion, namely mapping the integer values to fractions
between 1.0 and 2.0 (allows for a cheaper conversion mechanism; and
doesn't require the use of an FADD unit to perform the conversion).

This mechanism requires doing SIMD Int<->FP conversions via multiple
instructions, but probably still better than needing to do it via scalar
instructions.

Doing the traditional conversion is generally more like:
Int->FP: Synthesize inputs to FADD which will result in the desired FP
output value (via hard-wired logic);
FP->Int: Add value to a synthesized value (representing a value larger
than the largest possible integer) and then extract the final integer
value from the mantissa (implicitly requires FADD's mantissa to be
larger than the size of the integer in question).
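Both directions of that trick can be sketched for scalar binary64 in C. This is only an illustration of the mechanism (not BJX2's actual datapath); the magic constants assume round-to-nearest and values within 32-bit integer range:

```c
#include <stdint.h>
#include <string.h>

/* FP->Int: add a magic bias so the integer lands in the low mantissa
   bits of the FADD result. Assumes |v| fits in int32 range. */
static int32_t fp_to_int(double v) {
    double biased = v + 6755399441055744.0;       /* 1.5 * 2^52 */
    uint64_t bits;
    memcpy(&bits, &biased, sizeof bits);
    return (int32_t)bits;                          /* low 32 bits = result */
}

/* Int->FP: plant the (biased) integer into the mantissa of a
   synthesized 2^52 value, then let one FADD normalize it. */
static double int_to_fp(int32_t i) {
    uint64_t bits = 0x4330000000000000ULL          /* 2^52 */
                  | (uint32_t)(i ^ INT32_MIN);     /* bias by 2^31 */
    double d;
    memcpy(&d, &bits, sizeof d);
    return d - (4503599627370496.0 + 2147483648.0); /* 2^52 + 2^31 */
}
```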

But, for SIMD operations, it is necessary to do these parts manually
(mostly cost reasons).


> [snip]
>> Also better IMO to keep instructions in a format where they at least
>> "make sense" as a sequentially executed instruction scheme.
>>
>> Say, well-formed code in BJX2 has as a requirement that one can ignore
>> WEX and execute stuff sequentially, and it should still produce the
>> same results as the bundled version (Or, IOW: If bundled and
>> sequential execution produce different results, the code is broken).
>
> I am not sure about that. Itanium's always execute as if serial seemed
> to be a mistake to me in that it prohibited an instruction bundle from
> encoding a swap without temporary. Limiting such parallelism to bundles
> does make sense, but renaming two temporaries seems a small burden.
> Given Itanium's target uses, less than single bundle at a time execution
> seems unlikely.
>
> (However, I am biased. I feel that if one is going with static
> scheduling one should not expect executable format compatibility.
> Itanium's bundle mechanism seems an awkward provision for binary
> compatibility.)
>

It makes semantics easier for emulation and for linting / detecting if
the compiler is messing up here.

It is possible that dropping this restriction could allow for slightly
more ILP, but:
Would break binary compatibility with sequential execution;
Would require the emulator to also emulate the register-update semantics
of bundled instructions;
Would likely be a big hassle if an OoO core for the ISA were ever created;
...

luke.l...@gmail.com

Aug 14, 2022, 5:38:56 PM
On Sunday, August 14, 2022 at 8:31:07 PM UTC+1, MitchAlsup wrote:

> It is my considered belief that RISC has gone too far on the reduced side
> of things and left out too many good and useful instructions.

this ycombinator post (and the associated Alibaba paper being
discussed) explain it well
https://news.ycombinator.com/item?id=24459314

> Yes, RISC-V has 32 double precision FP registers, then they went over
> and added 32 vector registers each containing 8×32-bit containers.
> RISC-V added 154 vector instructions.

last time i counted RVV 1.0 i got a number "190".

the deep flaw in RVV that i missed until recently, on analysing
ARM SVE/2, is not so much the number of vector registers but
the fact that implementations may choose *any* value of MAXVL.

this in turn requires *100%* of Vector implementations of
algorithms - all algorithms - to contain a branch-compare-loop!
you *have* to do this:

loop:
x = setvl(n) # x = min(n, MAXVL)
...
n -= x
bnz loop
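In scalar C, that strip-mining pattern looks like the following sketch, where MAXVL stands in for whatever vector length the hardware happens to implement:

```c
#include <stddef.h>

enum { MAXVL = 8 };   /* hypothetical hardware vector length */

/* Returns how many times the loop body runs for n elements. */
size_t strip_mine(size_t n) {
    size_t iterations = 0;
    while (n > 0) {
        size_t vl = (n < MAXVL) ? n : MAXVL;  /* x = setvl(n) */
        /* ... vector body processes vl elements here ... */
        n -= vl;                               /* decrement remaining count */
        iterations++;                          /* bnz loop */
    }
    return iterations;
}
```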

whereas in Simple-V the MAXVL is *required* to be a fixed
specification-defined quantity (127) and with all implementations
required to have MAXVL=127 there is no need to have loops,
if, at static-compile-time, you want to do an operation of length
1<=N<=127:

# guaranteed to set VL equal to 17 on *all* hardware
setvl 17
# guaranteed to load 17 elements into RT - RT+17
sv.ld *RT, 0(RA)

> <
> Whereas: My 66000 has 32 registers of 64-bits each, which includes
> floating point and vector data. My 66000 has 2 vector instructions.
> My 66000 vectorizes C's str*() library and mem*() libraries so everyone
> benefits from vectorization.

66000 recognises that loops are an important construct
and provides instructions for reducing the overhead of them.

it does not however have *single* instruction looping-capability
(what i term "Horizontal-First" Vectorisation) i.e. no built-in
equivalent of Zilog Z80 "CPIR".
http://z80-heaven.wikidot.com/instructions-set:cpir

> ENTER and EXIT contribute to code density, but overall not that many
> subroutines save lots of registers on the stack. But, for example::
> <
> CoreMark: file core_main:: subroutine:: main::
> RISC-V
> <
> main: # @main
> # %bb.0:
> addi sp, sp, -2032
> sd ra, 2024(sp) # 8-byte Folded Spill
> sd s0, 2016(sp) # 8-byte Folded Spill
> ....

in SVP64 that would be:
setvl 12 # 12 regs to save - then set #regs to 12!
sv.st *s0, 2016(sp)

> But taken all-together, I am finding more compression bang for the
> buck with the 16-bit constants than from ENTER and EXIT.

iiinterestiiing.

> Leaving out predication means RISC-V takes a lot more branches
> than My 66000, thus wasting precious FETCH BW and power.

that's pretty much what adrian_b said (and then some)
https://news.ycombinator.com/item?id=24459314

i missed the point at which you mention the lack of
condition-codes in Branches, but i am reminded of the
China ICT team on the Loongson MIPS64 architecture
doing a JIT-translation of x86 and managing to get 70%
of native performance by adding 200 custom instructions
to help with JIT.

what they said in the paper was that because MIPS does
not have Condition Codes, it took an astounding *ten*
instructions to JIT-emulate/translate an x86 branch operation.

basically that same fact applies to any ISA without Condition
Codes. and you can't do a retro-fit: the opcodes of RISC-V
were *specifically* designed on the principle of discarding CCs.
it's basically a full ISA redesign - you'd have to start again and
call it RISC-6

l.

MitchAlsup

Aug 14, 2022, 6:49:59 PM
On Sunday, August 14, 2022 at 4:38:56 PM UTC-5, luke.l...@gmail.com wrote:
> On Sunday, August 14, 2022 at 8:31:07 PM UTC+1, MitchAlsup wrote:
>
> > It is my considered belief that RISC has gone too far on the reduced side
> > of things and left out too many good and useful instructions.
> this ycombinator post (and the associated Alibaba paper being
> discussed) explain it well
> https://news.ycombinator.com/item?id=24459314
> > Yes, RISC-V has 32 double precision FP registers, then they went over
> > and added 32 vector registers each containing 8×32-bit containers.
> > RISC-V added 154 vector instructions.
> last time i counted RVV 1.0 i got a number "190".
<
I must have missed some........
>
> the deep flaw in RVV that i missed until recently, on analysing
> ARM SVE/2, is not so much the number of vector registers but
> the fact that implementations may choose *any* value of MAXVL.
>
> this in turn requires *100%* of Vector implementations of
> algorithms - all algorithms - to contain a branch-compare-loop!
> you *have* to do this:
>
> loop:
> x = setvl(n) # x = min(n, MAXVL)
> ...
> n -= x
> bnz loop
>
> whereas in Simple-V the MAXVL is *required* to be a fixed
> specification-defined quantity (127) and with all implementations
> required to have MAXVL=127 there is no need to have loops,
> if, at static-compile-time, you want to do an operation of length
> 1<=N<=127:
<
Conversely, all vectorization in My 66000 uses the LOOP instruction
without ANY knowledge of vector register length necessary.
>
> # guaranteed to set VL equal to 17 on *all* hardware
> setvl 17
> # guaranteed to load 17 elements into RT - RT+17
> sv.ld *RT, 0(RA)
> > <
> > Whereas: My 66000 has 32 registers of 64-bits each, which includes
> > floating point and vector data. My 66000 has 2 vector instructions.
> > My 66000 vectorizes C's str*() library and mem*() libraries so everyone
> > benefits from vectorization.
<
> 66000 recognises that loops are an important construct
> and provides instructions for reducing the overhead of them.
<
My 66000 recognizes that having 14 flavors of integer ADD is not
beneficial to the architecture and certainly not for the implementations.
>
> it does not however have *single* instruction looping-capability
> (what i term "Horizontal-First" Vectorisation) i.e. no built-in
> equivalent of Zilog Z80 "CPIR".
> http://z80-heaven.wikidot.com/instructions-set:cpir
<
> > ENTER and EXIT contribute to code density, but overall not that many
> > subroutines save lots of registers on the stack. But, for example::
> > <
> > CoreMark: file core_main:: subroutine:: main::
> > RISC-V
> > <
> > main: # @main
> > # %bb.0:
> > addi sp, sp, -2032
> > sd ra, 2024(sp) # 8-byte Folded Spill
> > sd s0, 2016(sp) # 8-byte Folded Spill
> > ....
>
> in SVP64 that would be:
> setvl 12 # 12 regs to save - then set #regs to 12!
> sv.st *s0, 2016(sp)
<
Clever, but ENTER and EXIT also do the SP setup and FP setup when desired.
All this arithmetic is performed while registers are exchanged with memory.
<
> > But taken all-together, I am finding more compression bang for the
> > buck with the 16-bit constants than from ENTER and EXIT.
<
> iiinterestiiing.
<
Yes, it was illuminating to find the patterns where RISC-V needed 3
(sometimes 4) instructions to do something My 66000 ISA can do in 1
instruction and either 1 word or 2 words of code space. This surprised me.
<
> > Leaving out predication means RISC-V takes a lot more branches
> > then My 66000, thus wasting precious FETCH BW and power.
<
> that's pretty much what adrian_b said (and then some)
> https://news.ycombinator.com/item?id=24459314
>
> i missed the point at which you mention the lack of
> condition-codes in Branches, but i am reminded of the
> China ICT team on the Loongson MIPS64 architecture
> doing a JIT-translation of x86 and managing to get 70%
> of native performance by adding 200 custom instructions
> to help with JIT.
<
MIPS and RISC-V chose to have the compare and branch permanently
fused together. Lack of access to immediates wastes a significant
portion of this design decision::
RISC-V:
lui a0, 16
lui a1, 8
addiw a1, a1, -1276
addiw s0, a0, -1
bge a1, s4, .LBB1_31
My 66000:
mov r22,#65535
cmp r1,r27,#31492
ble r1,.LBB1_28
<
My 66000 chose to integrate compares, branches, and bit fields
so that there is no need to have SET instructions, nor is there a
need to choose between SETs that deliver {0,1} and SETs that
deliver {0, -1}.
<
CMP-BB can be CoIssued together taking no more pipeline
cycles than the MIPS, RISC-V architectures.
<
Over on the FP side of the ISA, they have no choice::
RISC-V:
call time_in_secs
lui a1, %hi(.LCPI1_0)
fld ft1, %lo(.LCPI1_0)(a1)
fmv.d.x ft0, a0
flt.d a0, ft0, ft1
beqz a0, .LBB1_21
My 66000:
call time_in_secs
fcmp r2,r1,#0x3FF0000000000000
bnge r2,.LBB1_22
>
> what they said in the paper was that because MIPS does
> not have Condition Codes, it took an astounding *ten*
> instructions to JIT-emulate/translate an x86 branch operation.
<
My guess is that everyone is going to have trouble emulating
x86 when x86 code accesses the parity bit in the CCs.
<
But, surprisingly, My 66000 seems to use fewer instructions around
branching even with the fact that MIPS and RISC-V have an instruction
that SHOULD give them a direct advantage.
>
> basically that same fact applies to any ISA without Condition
> Codes. and you can't do a retro-fit: the opcodes of RISC-V
> were *specifically* designed on the principle of discarding CCs.
<
So were My 66000 (and Mc 88100 before that)
<
> it's basically a full ISA redesign - you'd have to start again and
> call it RISC-6
<
I agree, if you change the fundamental notion of how branch
directions are chosen, it's a "start all over" event.
<
Me, I spent 5 years writing compilers and wondering why those
ISAs had condition codes--and this was prior to my doing the
ISA of Mc 88100. {I still believe that the budding computer architect
must spend a few years writing compilers to get that "touchy
feely" relationship with how one directs flow control.}
>
> l.
<
And then there is the issue of placing constants where they do
the most good::
RISC-V:
lui a2, 16
addiw a5, a2, -1
and a0, a0, a5
addi a2, zero, 2000
divu a0, a2, a0
lui a2, 1
addiw a2, a2, -2024
add a2, a2, sp
sw a0, 0(a2)
My 66000:
srl r1,r6,<16:0>
div r1,#2000,r1
stw r1,[sp,2040]
<
The DIV with the constant numerator is quite useful.
<
RISC-V:
core_list_init: # @core_list_init
# %bb.0:
<snip>
mv s3, a2
mv s2, a1
slli a0, a0, 32
srli a0, a0, 32
lui a1, 1035469
addiw a1, a1, -819
slli a1, a1, 12
addi a1, a1, -819
slli a1, a1, 12
addi a1, a1, -819
slli a1, a1, 12
addi a1, a1, -819
mulhu a0, a0, a1
srli a0, a0, 4
addi s5, a0, -2
slli s4, s5, 32
srli a0, s4, 28
add s7, s2, a0
srli a0, s4, 30
add s6, s7, a0
sd zero, 0(s2)
sd s7, 8(s2)
sh zero, 2(s7)
lui a0, 8
addiw a0, a0, 128
sh a0, 0(s7)
addi a0, s2, 16
sd a0, 16(sp)
addi a0, s7, 4
sd a0, 8(sp)
lui a0, 524288
addiw a0, a0, -1
sw a0, 0(sp)
mv a1, sp
addi a2, sp, 16
addi a3, sp, 8
mv a0, s2
mv a4, s7
mv a5, s6
call core_list_insert_new
My 66000:
core_list_init: ; @core_list_init
; %bb.0:
<snip>
mov r29,r3
mov r30,r2
srl r1,r1,<32:0>
div r1,r1,#20
add r26,r1,#-2
srl r23,r26,<32:0>
sll r1,r23,<0:4>
add r28,r30,r1
la r27,[r28,r23<<2,0]
std #0,[r30]
std r28,[r30,8]
sth #0,[r30,r1,2]
sth #32896,[r30,r1,0]
add r1,r30,#16
std r1,[sp,16]
add r1,r28,#4
std r1,[sp,8]
stw #2147483647,[sp]
add r2,sp,#0
add r3,sp,#16
add r4,sp,#8
mov r1,r30
mov r5,r28
mov r6,r27
call core_list_insert_new
<
Stores with immediates are very useful here! Not just stores of 0.
<
My 66000 can put the constant in the Rd location (STs) in the Rs1
location, and in the RS2 location, and even the Rs3 location for FMACs.

John Dallman

Aug 14, 2022, 7:14:34 PM
In article <tdbjtc$38fo9$1...@dont-email.me>, paaron...@gmail.com (Paul
A. Clayton) wrote:

> Anton Ertl wrote:
> > I think that every IA-64 implementation needs to be able to
> > execute an instruction at a time, for exception handling and
> > debugging.
>
> Yep. That is the specification. I do not believe such is a
> necessary feature of a such a bundle-based ISA. Exception handling
> and debugging would be more complex, but I *suspect* not horrible.

It was horrible. Debuggers for Itanium in 2001-2005 made no attempt to
help you with the complexity of the architecture, and of compilation.
Trying to work out which disassembled instructions corresponded to which
source statements could take hours in reasonably simple functions; in
some of the more complex ones, I had to give up. Remember that in a new
and hard-to-compile-for architecture, there will be significant compiler
bugs.

After 2005, it was already clear that Itanium was a commercial failure,
and I spent no more time on it, apart from trying to get it dropped by
parts of the company who hadn't quite caught on.

> (One already has a distinction between source code
> statement/expression and instruction both with a single expression
> being compiled to multiple instructions and multiple source code
> particles being compiled into a single
> atomic-with-respect-to-interrupts instruction.)

Yes, you do. And figuring out the resulting bugs is /harder/ than in more
conventional architectures.

> This design choice may not have been a mistake;

I doubt they'd have got the thing working at all without it.

> I know I undervalue exception handling and debugging and probably
> just assume that developing excellent debugging tools is a completely
> solved problem.

It isn't. Not by a very long way.

John

Thomas Koenig

Aug 15, 2022, 11:18:17 AM
luke.l...@gmail.com <luke.l...@gmail.com> schrieb:
> On Sunday, August 14, 2022 at 8:31:07 PM UTC+1, MitchAlsup wrote:
>
>> It is my considered belief that RISC has gone too far on the reduced side
>> of things and left out too many good and useful instructions.
>
> this ycombinator post (and the associated Alibaba paper being
> discussed) explain it well
> https://news.ycombinator.com/item?id=24459314
>
>> Yes, RISC-V has 32 double precision FP registers, then they went over
>> and added 32 vector registers each containing 8×32-bit containers.
>> RISC-V added 154 vector instructions.
>
> last time i counted RVV 1.0 i got a number "190".
>
> the deep flaw in RVV that i missed until recently, on analysing
> ARM SVE/2, is not so much the number of vector registers but
> the fact that implementations may choose *any* value of MAXVL.
>
> this in turn requires *100%* of Vector implementations of
> algorithms - all algorithms - to contain a branch-compare-loop!
> you *have* to do this:
>
> loop:
> x = setvl(n) # x = min(n, MAXVL)
> ...
>> n -= x
> bnz loop

... which means that it is not possible to migrate processes between
cores with different vector lengths, correct?

>
> whereas in Simple-V the MAXVL is *required* to be a fixed
> specification-defined quantity (127) and with all implementations
> required to have MAXVL=127 therre is no need to have loops,
> if, at static-compile-time, you want to do an operation of length
> 1<=N<=127:
>
> # guaranteed to set VL equal to 17 on *all* hardware
> setvl 17
> # guaranteed to load 17 elements into RT - RT+17
> sv.ld *RT, 0(RA)
>
>> <
>> Whereas: My 66000 has 32 registers of 64-bits each, which includes
>> floating point and vector data. My 66000 has 2 vector instructions.
>> My 66000 vectorizes C's str*() library and mem*() libraries so everyone
>> benefits from vectorization.
>
> 66000 recognises that loops are an important construct
> and provides instructions for reducing the overhead of them.
>
> it does not however have *single* instruction looping-capability
> (what i term "Horizontal-First" Vectorisation) i.e. no built-in
> equivalent of Zilog Z80 "CPIR".

It has a memory move instruction, at least.

Stefan Monnier

Aug 15, 2022, 12:33:06 PM
Thomas Koenig [2022-08-15 15:18:14] wrote:
>> loop:
>> x = setvl(n) # x = min(n, MAXVL)
>> ...
>> n -= x
>> bnz loop
>
> ... which means that it is not possible to migrate processes between
> cores with different vector lengths, correct?

Not quite: it means you can only move to a CPU with larger-or-equal
vector length (IOW, "from a small CPU to a large CPU").

Just like on x86 where you can migrate from a CPU with SSE to a CPU
with AES but not the reverse since the program may have checked the
presence of AES support to select the AES version of the code.


Stefan

Stefan Monnier

Aug 15, 2022, 12:35:23 PM
> Just like on x86 where you can migrate from a CPU with SSE to a CPU
> with AES but not the reverse since the program may have checked the
> presence of AES support to select the AES version of the code.

Hmm... s/AES/AVX/g


Stefan

luke.l...@gmail.com

Aug 15, 2022, 1:09:46 PM
several answers to this [Stefan clarifies in a way that is particularly relevant to ARM SVE/2 which btw is *not* Scalable Vectors, it is Predicated SIMD: they ran out of space in 32-bit opcodes to always add a predicate mask and didn't think to use register "tagging"]

RISC-V Vectors:

*without* that loop... correct, you cannot migrate to hardware with a different MAXVL (number of Vector "Lanes"). proponents of RISC-V Vectors try to gloss over this rather important fact. as Stefan points out in a later reply, you *could* attempt to write code that explicitly sets VL to a particular value but if attempted to run on hardware with not enough Lanes, it *will* fail (silently unless you do a pre-check).

last time i checked: with embedded systems allowing to go down to MAXVL=1, this is why you *have* to use loop-constructs. this has i believe been discussed and i heard that there might be a decision to force implementors to do at least MAXVL >= 4

NEC SX-Aurora (and the original Cray):

all hardware released had - has - the *exact* same MAXVL (Aurora: 256, Cray: 64) so you could in fact rely on setvl setting the VL to a deterministic amount.

ARM SVE/2

despite being called "Scalable" that is to the *silicon partners* not to programmers.
https://developer.arm.com/-/media/Arm%20Developer%20Community/PDF/102340_0001_00_en_introduction-to-sve2.pdf?revision=aae96dd2-5334-4ad3-9a47-393086a20fea

in SVE/2 there *is* no setvl instruction *at all* but i believe they advise you to still use some sort of looping constructs, using predicate masks. the only problem being: *not all instructions have predicate masks* [it is appropriate to undergo a face-palm moment at this point]

best in-depth tech review i could find:
https://gist.github.com/zingaburga/805669eb891c820bd220418ee3f0d6bd#file-sve2-md

with ARM SVE/2 although it is Predicated SIMD the fact that there are so many NEON programmers out there means that everyone is trying desperately to ignore the Scalability [which isn't Programmer-Scalable at all anyway] and basically turn usage of SVE/2 into a glorified version of NEON. of course... that is going down like a lead balloon, with Silicon Partners going "oh shit" and trying to agree amongst themselves to at least all do the exact same width in order to at least get binary compatibility. i heard on the grapevine that they're debating whether to do 2x128 multi-issue SVE/2 units rather than do 1x256. of course, the moment any Silicon Partner ignores that convention...

l.

MitchAlsup

unread,
Aug 15, 2022, 3:07:24 PM8/15/22
to
You seem to be replying to Luke; but here, you are correct, My 66000
does have a MM instruction. The MM instruction is classified as a
memory instruction not as a Vector instruction. Only {VEC and LOOP}
are vector instructions. In My 66000, loops get vectorized not instructions.

MitchAlsup

unread,
Aug 15, 2022, 3:10:09 PM8/15/22
to
More bad architecture.......
>
> last time i checked: with embedded systems allowing to go down to MAXVL=1, this is why you *have* to use loop-constructs. this has i believe been discussed and i heard that there might be a decision to force implementors to do at least MAXVL >= 4
>
> NEC SX-Aurora (and the original Cray):
>
> all hardware released had - has - the *exact* same MAXVL (Aurora: 256, Cray: 64) so you could in fact rely on setvl setting the VL to a deterministic amount.
>
> ARM SVE/2
>
> despite being called "Scalable" that is to the *silicon partners* not to programmers.
> https://developer.arm.com/-/media/Arm%20Developer%20Community/PDF/102340_0001_00_en_introduction-to-sve2.pdf?revision=aae96dd2-5334-4ad3-9a47-393086a20fea
<
It is not the first time programmers have been setup........

BGB

unread,
Aug 15, 2022, 4:11:04 PM8/15/22
to
They could have always gone the SuperH route:
"We want to cram an FPU into FnmZ..."
"We want Single, Double, and some 2 element SIMD operations, ..."
But, that won't fit... (only 4 bits of opcode).
Proceeds to add bits to the FPU control register, such that what the
instructions do depends on the state of these bits at the time the
instruction executes.


Then I come along later and try to implement this FPU, getting
frustrated with the timing issues of trying to make this work.

Of course, I apparently missed out on detail that was likely relevant at
the time:
The key would have been to have effectively 3 FPUs (1 double, 2 single),
and always fetch 2x 32-bit from the register file, then on the output
stage one selects the relevant outputs to write back to the register file.

Well, as opposed to trying to put converters on the register ports and
then run everything internally with a single FPU.


One other thing I didn't realize at the time, was that some of the other
quirks in the ISA make more sense if one realizes that it many of its
features were designed around the assumption of a 2-wide superscalar (in
which case, random instructions which operate on 64-bit data, or which
update 2 registers at the same time, etc, are less of an issue).

...


Ironically, I didn't really understand how the original SH FPU was
supposed to work until I went and added 128-bit operations to BJX2.


This is also likely an area where variable-length encodings make sense:
Running low on space in 32-bit land?
Well, there are still 64-bit encodings, ...


> RISC-V Vectors:
>
> *without* that loop... correct, you cannot migrate to hardware with a different MAXVL (number of Vector "Lanes"). proponents of RISC-V Vectors try to gloss over this rather important fact. as Stefan points out in a later reply, you *could* attempt to write code that explicitly sets VL to a particular value but if attempted to run on hardware with not enough Lanes, it *will* fail (silently unless you do a pre-check).
>
> last time i checked: with embedded systems allowing to go down to MAXVL=1, this is why you *have* to use loop-constructs. this has i believe been discussed and i heard that there might be a decision to force implementors to do at least MAXVL >= 4
>


Seems like it might have been simpler to treat the vectors simply as a
variable-length SIMD scheme.

Say, V0..V31: SIMD vectors, could be 128/256/512/... bits, who knows.
You can fetch the maximum supported value;
A control register somewhere sets the desired operating size (mostly
effects Load/Store and similar).

Main difficult case would be shuffles, and in practice one would likely
still need to build different versions of the code depending on vector
length, and/or build code in terms of the minimum vector length, ...


If one has 64-bit encodings, maybe they could set aside a few bits for
operation size, say:
000: 128-bit
001: 256-bit
010: 512-bit
011: 1024-bit
...

Then, say, some instruction like:
SHUF.L/256 V4, R0, V6
At least has well defined behavior (say, using a 24-bit shuffle mask
from R0).

Meanwhile:
SHUF.L/512 V4, R0, V6
Would require a 64-bit shuffle mask.

With combinations exceeding 16 elements needing to use vector-registers
for the shuffle masks.


In my case, for now, I am fine with 128-bit vectors (via GPR pairs) in
my ISA, as:
They map OK to most stuff I want to do with vectors;
There isn't really any good way to support larger vectors in a
cost-effective way (I went with 128-bit because I could support it
cheaply on top of a 3-wide 64-bit pipeline design).

With the newer "low-precision" FPU, I could almost allow co-issuing
independent 64-bit FP-SIMD operations on both lanes, except this would
create a case which works with the LP-FPU but not on the main FPU (and I
don't really want it to be a required part of the architecture).



Also, nevermind my newer "cheap" but "kinda crap" Packed Int<->FP
converters, could have done "better", but they would also have exposed
(at the ISA level) the existence of the low-precision FPU.

These cheap converters are awkward, but have an ~ 2-cycle latency (1
cycle latency would have cost more).

Eg:
MOV 0x3C003C003C003C00, R3 //4x 1.0
PCVTUB2HL R4, R5 //Load 4x Packed Unsigned Byte to 4x FP16 (1-2)
PSUB.H R5, R3, R5 //Get values into unit range
(Cost: ~ 6 cycles)

Or, Signed Byte:
MOV 0x4200420042004200, R3 //4x 3.0
PCVTSB2HL R4, R5 //Load 4x Packed Signed Byte to 4x FP16 (2-4)
PSUB.H R5, R3, R5 //Get values into -1.0 .. 1.0 range
(Cost: ~ 6 cycles)
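The unsigned case works because 0x3C00 is the binary16 encoding of 1.0, and depositing a byte b into the low mantissa bits yields exactly 1.0 + b/1024; subtracting 1.0 then leaves b/1024 with no rounding. Assuming that bit-deposit is what PCVTUB2HL does (my reading of the trick, not stated explicitly above), it can be checked exactly with Python's binary16 struct support:

```python
import struct

def f16_from_bits(bits):
    # reinterpret a 16-bit integer as an IEEE 754 binary16 value
    return struct.unpack('<e', struct.pack('<H', bits))[0]

ONE_F16 = 0x3C00  # binary16 1.0 (exponent field 15, mantissa 0)

def cvt_ub_to_f16(b):
    # hypothetical PCVTUB2HL-style step: the byte lands in the
    # low 8 of the 10 mantissa bits, giving 1.0 + b/1024
    biased = f16_from_bits(ONE_F16 | b)
    return biased - 1.0  # exactly b/1024 for every byte value

for b in range(256):
    assert cvt_ub_to_f16(b) == b / 1024
```

The subtraction is exact because b/1024 needs at most 8 significant bits, well within binary16's 11-bit significand. The signed variant biased around 3.0 (0x4200) would follow the same pattern, but the exact bit placement isn't spelled out above, so it is omitted here.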

There are also cases for:
Int16 <-> Binary16, and Int16 <-> Binary32
For now, Int32 cases will still need to go through the main FPU.


A "real" converter would basically glue the bit-twiddly parts onto an FADD.
Whether or not this approach makes sense later on is yet to be seen.

Kinda sucks, but should mostly work...


> NEC SX-Aurora (and the original Cray):
>
> all hardware released had - has - the *exact* same MAXVL (Aurora: 256, Cray: 64) so you could in fact rely on setvl setting the VL to a deterministic amount.
>
> ARM SVE/2
>
> despite being called "Scalable" that is to the *silicon partners* not to programmers.
> https://developer.arm.com/-/media/Arm%20Developer%20Community/PDF/102340_0001_00_en_introduction-to-sve2.pdf?revision=aae96dd2-5334-4ad3-9a47-393086a20fea
>
> in SVE/2 there *is* no setvl instruction *at all* but i believe they advise you to still use some sort of looping constructs, using predicate masks. the only problem being: *not all instructions have predicate masks* [it is appropriate to undergo a face-palm moment at this point]
>
> best in-depth tech review i could find:
> https://gist.github.com/zingaburga/805669eb891c820bd220418ee3f0d6bd#file-sve2-md
>
> with ARM SVE/2 although it is Predicated SIMD the fact that there are so many NEON programmers out there means that everyone is trying desperately to ignore the Scalability [which isn't Programmer-Scalable at all anyway] and basically turn usage of SVE/2 into a glorified version of NEON. of course... that is going down like a lead balloon, with Silicon Partners going "oh shit" and trying to agree amongst themselves to at least all do the exact same width in order to at least get binary compatibility. i heard on the grapevine that they're debating whether to do 2x128 multi-issue SVE/2 units rather than do 1x256. of course, the moment any Silicon Partner ignores that convention...
>

Yep.

EricP

unread,
Aug 15, 2022, 4:22:50 PM8/15/22
to
MitchAlsup wrote:
> <
> You seem to be replying to Luke; but here, you are correct, My 66000
> does have a MM instruction. The MM instruction is classified as a
> memory instruction not as a Vector instruction. Only {VEC and LOOP}
> are vector instructions. In My 66000, loops get vectorized not instructions.

Can a MM be in a VEC LOOP, or does that cause universe-ending paradoxes?

luke.l...@gmail.com

unread,
Aug 15, 2022, 4:39:12 PM8/15/22
to
On Monday, August 15, 2022 at 9:11:04 PM UTC+1, BGB wrote:

> One other thing I didn't realize at the time, was that some of the other
> quirks in the ISA make more sense if one realizes that it many of its
> features were designed around the assumption of a 2-wide superscalar (in
> which case, random instructions which operate on 64-bit data, or which
> update 2 registers at the same time, etc, are less of an issue).

you could always have the regfile split along the granularity
of the smallest ops and for larger ones read/write multiple...
this is what we will be doing in Simple-V: byte-level write-enable
lines. to write 64-bit, o deer, enable 8 write-enables, wot a
hardship :)

> Ironically, I didn't really understand how the original SH FPU was
> supposed to work until I went and added 128-bit operations to BJX2.

like i had no idea 66000 was a Vertical-First Vector system
until i accidentally went "hang on how about enumerating
instructions first and register-incrementing second"...

> > RISC-V Vectors:
> > last time i checked: with embedded systems allowing to go down to MAXVL=1, this is why you *have* to use loop-constructs. this has i believe been discussed and i heard that there might be a decision to force implementors to do at least MAXVL >= 4
> >
> Seems like it might have been simpler to treat the vectors simply as a
> variable-length SIMD scheme.

at the back-end this is what you have to do anyway.

no, all they actually had to do was say:

"MAXVL is a specification-defined hard-defined quantity, X,
and *all* implementations MUST set VL to Programmer-
Deterministic values, and in hardware perform virtual
loops to make it look like there are actually MAXVL Lanes"

this is done in the Broadcom VideoCore-IV [but with a SIMD
FPU]. it appears to a programmer that there is a 16-wide
SIMD unit but actually it is a 4-wide SIMD FPU where the
ISA pipeline-schedules 4 sequential "batches" at a time.
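that batching scheme can be modelled in a few lines (lane counts mirror the VideoCore-IV description, but the widths are otherwise illustrative):

```python
# toy model of "virtual" vector lanes: a 16-lane architectural vector
# add executed on 4-lane hardware as 4 sequential batches -- the
# programmer only ever observes the full 16-wide result
HW_LANES = 4     # the real SIMD width
ARCH_LANES = 16  # what the ISA promises

def vadd(a, b):
    out = [0.0] * ARCH_LANES
    for base in range(0, ARCH_LANES, HW_LANES):  # one batch per pass
        for lane in range(HW_LANES):
            out[base + lane] = a[base + lane] + b[base + lane]
    return out

assert vadd(list(range(16)), [1] * 16) == [i + 1 for i in range(16)]
```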

> Main difficult case would be shuffles,

yes, shuffle instructions become pretty meaningless if
you have bugger-all idea what the bloody Vector Length
is going to be from hardware-to-hardware.

> and in practice one would likely
> still need to build different versions of the code depending on vector
> length, and/or build code in terms of the minimum vector length, ...

wark-wark, failllll...

> In my case, for now, I am fine with 128-bit vectors (via GPR pairs) in
> my ISA, as:

hey, as you're doing this for fun (i assume!) rather than for
a goal of designing an ubiquitous ISA for use in 100 million
units and above, you can pretty much do what you like :)
[such freedom... i am so jealous....]

> They map OK to most stuff I want to do with vectors;
> There isn't really any good way to support larger vectors in a
> cost-effective way (I went with 128-bit because I could support it
> cheaply on top of a 3-wide 64-bit pipeline design).

well, like the Broadcom VideoCore-IV you could always go
virtual. you don't *have* to expose the internals of the
hardware directly to the programmer level, you *can*
think of the ISA as more of an "API". honestly i feel
this is the biggest mistake that SIMD ISAs have inflicted
on programmers in the world.

> With the newer "low-precision" FPU, I could almost allow co-issuing
> independent 64-bit FP-SIMD operations on both lanes, except this would
> create a case which works with the LP-FPU but not on the main FPU (and I
> don't really want it to be a required part of the architecture).

you could always try the dynamic-partitioned SIMD thing.
although personally i wouldn't like to attempt it in anything
other than an OO programming language.

l.

MitchAlsup

unread,
Aug 15, 2022, 5:12:01 PM8/15/22
to
You don't need shuffles if each container is associated with its own
register ! This is, in effect, what VVM does. The calculation loop is
defined as a bunch of mem refs and calculations each in singularity.
Then, much of the time, the HW figures out how to perform multiple
iterations per cycle........No need for vector lengths, shuffles, or exotic
compiler analysis to determine that this loop can be vectorized. No
special masking,...
<
One expression, each implementation figures out how to run each
loop as fast as practicable without programmer involvement.
Preserving the value of the Software investment.
<
> > and in practice one would likely
> > still need to build different versions of the code depending on vector
> > length, and/or build code in terms of the minimum vector length, ...
> wark-wark, failllll...
> > In my case, for now, I am fine with 128-bit vectors (via GPR pairs) in
> > my ISA, as:
> hey, as you're doing this for fun (i assume!) rather than for
> a goal of designing an ubiquitous ISA for use in 100 million
> units and above, you can pretty much do what you like :)
> [such freedom... i am so jealous....]
> > They map OK to most stuff I want to do with vectors;
> > There isn't really any good way to support larger vectors in a
> > cost-effective way (I went with 128-bit because I could support it
> > cheaply on top of a 3-wide 64-bit pipeline design).
> well, like the Broadcom VideoCore-IV you could always go
> virtual. you don't *have* to expose the internals of the
> hardware directly to the programmer level, you *can*
> think of the ISA as more of an "API". honestly i feel
> this is the biggest mistake that SIMD ISAs has inflicted
> on programmers in the world.
<
You mean other than existing at all ?!?
<
> > With the newer "low-precision" FPU, I could almost allow co-issuing
> > independent 64-bit FP-SIMD operations on both lanes, except this would
> > create a case which works with the LP-FPU but not on the main FPU (and I
> > don't really want it to be a required part of the architecture).
<
Just a note: The definition of CoIssue is the selection of instructions
that get issued in pairs* based on counting of registers such that the
pair of instructions CoIssued does not exceed the register ports. This
is different from operation-fusing in that Op-Fu are serially dependent
instructions combined into a single <effective> instruction. (*) or larger.
<
My 66150 CoIssues STs with most of the integer and FP calculation*
OpCodes. ST instructions do not read Rd (the value to be stored)
until late in the pipeline, and then only when a register file read port
(in DECODE) is not being used by the issuing instruction. (*) and flow
control instructions.
<
My 66150 CoIssues { LD & calculation} with {{branch & PRED} on bit
and {branch & PRED} on condition} when the value generating instruction
writes the register consumed by the flow control instruction.
<
> you could always try the dynamic-partitioned SIMD thing.
<
You could always go back to the notion:: "SIMD considered Harmful"
which is similar to "Condition Codes Considered Harmful".......

luke.l...@gmail.com

unread,
Aug 15, 2022, 5:36:20 PM8/15/22
to
On Monday, August 15, 2022 at 10:12:01 PM UTC+1, MitchAlsup wrote:
> On Monday, August 15, 2022 at 3:39:12 PM UTC-5, luke.l...@gmail.com wrote:
> > yes, shuffle instructions become pretty meaningless if
> > you have bugger-all idea what the bloody Vector Length
> > is going to be from hardware-to-hardware.
> <
> You don't need shuffles if each container is associated with its own
> register ! This is, in effect, what VVM does. The calculation loop is
> defined as a bunch of mem refs and calculations each in singularity.

the issue there is that it relies, primarily, on memory. part of Jeff Bush's
research into Nyuzi was to measure the power consumption in a GPU
if you relied on memory rather than doing LD-COMPUTE-STORE
(where compute is in deliberately big f'ing regfiles). he found it was
enormous, even just L1 cache power consumption.

if we (Libre-SOC) didn't have the goal of doing a hybrid CPU-GPU-VPU
then Vertical-First-on-Memory would be perfect because it is as you
say so much more ridiculously easy: it is in effect auto-vectorisation,
in hardware.

> > this is the biggest mistake that SIMD ISAs has inflicted
> > on programmers in the world.
> <
> You mean other than existing at all ?!?

uh-huhn :)

> > you could always try the dynamic-partitioned SIMD thing.
> <
> You could always go back to the notion:: "SIMD considered Harmful"

i meant at the back-end (behind a decent deterministic ISA)
not in the actual ISA.

> which is similar to "Condition Codes Considered Harmful".......

i like condition codes! but only on top of a compiler that supports
them, and uses them as predicate masks, in batches.

given that 3D GPUs have predicate masks, and given that
3D GPU compilers these days *have* to turn every if/else
into a parallel-predicated-thingy (you can't do a parallel
branch, duh), it's not such a nightmare as it would be in a
scalar environment.
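the if/else-to-predication transform mentioned here, as a toy sketch (the condition and both arms are made up for illustration):

```python
# sketch of if-conversion: both arms of an if/else are computed for
# every lane, then merged under a predicate mask -- the transform GPU
# compilers apply because lanes cannot branch independently
def predicated_select(xs):
    mask = [x > 0 for x in xs]          # predicate from the condition
    then_vals = [x * 2 for x in xs]     # "then" arm, all lanes
    else_vals = [-x for x in xs]        # "else" arm, all lanes
    return [t if m else e               # merge under the mask
            for m, t, e in zip(mask, then_vals, else_vals)]

assert predicated_select([3, -1, 0, 5]) == [6, 1, 0, 10]
```

Both arms always execute, so the cost is the sum of the two paths rather than the longer one — the price paid for keeping all lanes in lockstep.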

l.

MitchAlsup

unread,
Aug 15, 2022, 5:58:40 PM8/15/22
to
On Monday, August 15, 2022 at 4:36:20 PM UTC-5, luke.l...@gmail.com wrote:
> On Monday, August 15, 2022 at 10:12:01 PM UTC+1, MitchAlsup wrote:
> > On Monday, August 15, 2022 at 3:39:12 PM UTC-5, luke.l...@gmail.com wrote:
> > > yes, shuffle instructions become pretty meaningless if
> > > you have bugger-all idea what the bloody Vector Length
> > > is going to be from hardware-to-hardware.
> > <
> > You don't need shuffles if each container is associated with its own
> > register ! This is, in effect, what VVM does. The calculation loop is
> > defined as a bunch of mem refs and calculations each in singularity.
> the issue there is that it relies, primarily, on memory. part of Jeff Bush's
> research into Nyuzi was to measure the power consumption in a GPU
> if you relied on memory rather than doing LD-COMPUTE-STORE
> (where compute is in deliberately big f'ing regfiles). he found it was
> enormous, even just L1 cache power consumption.
<
Yes, just reading and writing the register files in a GPU consume big
amounts of power; SRAM power, 1024 bits per read or write.
<
>
> if we (Libre-SOC) didn't have the goal of doing a hybrid CPU-GPU-VPU
> then Vertical-First-on-Memory would be perfect because it is as you
> say so much more ridiculously easy: it is in effect auto-vectorisation,
> in hardware.
<
Thank you
<
> > > this is the biggest mistake that SIMD ISAs has inflicted
> > > on programmers in the world.
> > <
> > You mean other than existing at all ?!?
> uh-huhn :)
> > > you could always try the dynamic-partitioned SIMD thing.
> > <
> > You could always go back to the notion:: "SIMD considered Harmful"
<
> i meant at the back-end (behind a decent deterministic ISA)
> not in the actual ISA.
<
As I read it, in Flynn's taxonomy, which gave rise to the SIMD notation,
SIMD is single instruction, multiple data. My 66000 VVM operates
as if it were SIMD, but it really is MIMD because each quantum of
the computing being performed is a unique instruction (in ASM).
<
So, while VVM definitely is MD, it definitely is not SI.
VVM is unlikely to be what Flynn defined to be MIMD, either.
<
> > which is similar to "Condition Codes Considered Harmful".......
<
> i like condition codes! but only on top of a compiler that supports
> them, and uses them as predicate masks, in batches.
<
How do you access non-operand data to make control flow decisions?
for example: Has memory interference been observed ? in ATOMIC
sequences ? if your control flow architecture only has condition
codes to decide control flow--how do you reach out and do other
kinds of decision making ? In the example given, one has to consume
the information (has memory interference happened) and make the
control flow decision in the same cycle ! in order to provide the
illusion of ATOMICity !!
<
There are certain things one wants to be able to make control flow
decisions on that are not related to operands and result of the
instructions in execution (more in system activities than in user
application codes).
>
> given that 3D GPUs have predicate masks, and given that
> 3D GPU compilers these days *have* to turn every if/else
> into a parallel-predicated-thingy (you can't do a parallel
> branch, duh), it's not such a nightmare as it would be in a
> scalar environment.
<
Continue along with your illusion..........
>
> l.

BGB

unread,
Aug 15, 2022, 6:54:23 PM8/15/22
to
On 8/15/2022 3:39 PM, luke.l...@gmail.com wrote:
> On Monday, August 15, 2022 at 9:11:04 PM UTC+1, BGB wrote:
>
>> One other thing I didn't realize at the time, was that some of the other
>> quirks in the ISA make more sense if one realizes that it many of its
>> features were designed around the assumption of a 2-wide superscalar (in
>> which case, random instructions which operate on 64-bit data, or which
>> update 2 registers at the same time, etc, are less of an issue).
>
> you could always have the regfile split along the granularity
> of the smallest ops and for larger ones read/write multiple...
> this is what we will be doing in Simple-V: byte-level write-enable
> lines. to write 64-bit, o deer, enable 8 write-enables, wot a
> hardship :)
>

Possibly.

The SH-4 was mostly a 32-bit ISA, with 16-bit instructions.
And a lot of funky edge cases.

For a few of the instructions, my earlier self was like "burn it with
fire". Another line of the SuperH family (SH-2A) had added different
sets of funky instructions.

And, there was another dead-end branch (SH-3-DSP) which did a bunch of
different stuff (which didn't exist in either SH-2 or SH-4).


For added fun, these ISA variants would not be binary compatible with
each other, meaning that for a CPU or emulator, there would need to be
some way to select which variant of the ISA is being used to be able to
decode instructions and similar.

Basically, 16-bit encoding space was tight, so would be reused in
incompatible ways between ISA variants, and the ISA effectively split
along several different trajectories (the SH-2 and SH-4 lines).



>> Ironically, I didn't really understand how the original SH FPU was
>> supposed to work until I went and added 128-bit operations to BJX2.
>
> like i had no idea 66000 was a Vertical-First Vector system
> until i accidentally went "hang on how about enumerating
> instructions first and register-incrementing second"...
>

OK.


>>> RISC-V Vectors:
>>> last time i checked: with embedded systems allowing to go down to MAXVL=1, this is why you *have* to use loop-constructs. this has i believe been discussed and i heard that there might be a decision to force implementors to do at least MAXVL >= 4
>>>
>> Seems like it might have been simpler to treat the vectors simply as a
>> variable-length SIMD scheme.
>
> at the back-end this is what you have to do anyway.
>
> no all they actually had to do was say:
>
> "MAXVL is a specification-defined hard-defined quantity, X,
> and *all* implementations MUST set VL to Programmer-
> Deterministic values, and in hardware perform virtual
> loops to make it look like there are actually MAXVL Lanes"
>
> this is done in the Broadcom VideoCore-IV [but with a SIMD
> FPU]. it appears to a programmer that there is a 16-wide
> SIMD unit but actually it is a 4-wide SIMD FPU where the
> ISA pipeline-schedules 4 sequential "batches" at a time.
>

Would make more sense.


>> Main difficult case would be shuffles,
>
> yes, shuffle instructions become pretty meaningless if
> you have bugger-all idea what the bloody Vector Length
> is going to be from hardware-to-hardware.
>

Pretty much.

One also has a lot of algorithms where the structure would vary
considerably based on how it maps to a given vector length (and often
shuffle operations and similar are a pretty big part of this).


Something that can only do parallel operations over arrays, and
maybe a matrix multiply or similar, isn't super useful in general.

Granted, nevermind if this is mostly about the only things "FLOPS"
benchmarks tend to care about. They are basically bent on doing lots of
FMAC or similar and little else.

To win this game, "Yeah, go ahead and add FMA, an FMA unit is the
fastest way to do FMAC! Huge vectors capable of doing lots of FMAC! Even
better!"


>> and in practice one would likely
>> still need to build different versions of the code depending on vector
>> length, and/or build code in terms of the minimum vector length, ...
>
> wark-wark, failllll...
>
>> In my case, for now, I am fine with 128-bit vectors (via GPR pairs) in
>> my ISA, as:
>
> hey, as you're doing this for fun (i assume!) rather than for
> a goal of designing an ubiquitous ISA for use in 100 million
> units and above, you can pretty much do what you like :)
> [such freedom... i am so jealous....]
>

Yeah, mine is more a hobby ISA.

Torn between mostly random fiddling, and something I could potentially use
in robot projects.

But, it seems like using GPR pairs does have some advantage over
dedicated SIMD registers in terms of reducing complexity. Can also
sidestep some issues which exist with SSE and similar, where if one
lacks an instruction to do the needed operation using SSE, one is
basically hosed (shuffling data between XMM and GPRs costing more than
one would gain by having used SSE).


>> They map OK to most stuff I want to do with vectors;
>> There isn't really any good way to support larger vectors in a
>> cost-effective way (I went with 128-bit because I could support it
>> cheaply on top of a 3-wide 64-bit pipeline design).
>
> well, like the Broadcom VideoCore-IV you could always go
> virtual. you don't *have* to expose the internals of the
> hardware directly to the programmer level, you *can*
> think of the ISA as more of an "API". honestly i feel
> this is the biggest mistake that SIMD ISAs has inflicted
> on programmers in the world.
>

Possibly, though this assumes that one has some sort of microcode layer
or similar.

In my case, there is no microcode, so the amount that the hardware can
gloss over is pretty limited.


>> With the newer "low-precision" FPU, I could almost allow co-issuing
>> independent 64-bit FP-SIMD operations on both lanes, except this would
>> create a case which works with the LP-FPU but not on the main FPU (and I
>> don't really want it to be a required part of the architecture).
>
> you could always try the dynamic-partitioned SIMD thing.
> although personally i wouldn't like to attempt it in anything
> other than an OO programming language.
>

The main issue in this case with the existing 128-bit vector ops for 64b
vectors:
Can only operate on even pairs (rather than an arbitrary set of registers);
Both would be required to perform the same operation.


It is possible that the decoder could implement a hack:
PADD.F R5, R18, R11 | PADD.F R23, R17, R12
Drives both pipeline lanes, but then the decoder quietly turns
this into a single 128-bit SIMD operation as far as the FPU is concerned.

The LP-FPU doesn't necessarily need the same restriction (it effectively
has multiple parallel units working independently), so in premise could
be made to execute these directly, or to execute different SIMD
operations on each lane with Binary32 operations (with the main
restriction at present being that it can't co-issue with Binary16
vectors). It could potentially also be used to co-issue scalar FPU
operations, ...

The main "costs" in the latter case would be mostly that it would
effectively make the existence of the LP-FPU visible as an architectural
feature (whereas, if its behavioral restrictions remain the same as the
main FPU; it can be enabled or disabled, where one can save or spend ~
2k LUT depending on whether they want 3-cycle or 10-cycle FP-SIMD
operators).


Well, and the terrible Packed Int<->FP conversion operators were mostly
so that I could route these through the ALU's CONV path or similar
(which already deals with a lot of the FP conversion cases), but can't
deal with "proper" Int<->FP conversion (which necessarily requires a
trip through an FADD unit or similar).

The LP-FPU doesn't currently implement this logic; and the main FPU
currently only supports it for the Int64<->Binary64 case, ...

Either way, ~6c < ~50c...


> l.

BGB

unread,
Aug 15, 2022, 8:29:20 PM8/15/22
to
On 8/15/2022 4:36 PM, luke.l...@gmail.com wrote:
> On Monday, August 15, 2022 at 10:12:01 PM UTC+1, MitchAlsup wrote:
>> On Monday, August 15, 2022 at 3:39:12 PM UTC-5, luke.l...@gmail.com wrote:
>>> yes, shuffle instructions become pretty meaningless if
>>> you have bugger-all idea what the bloody Vector Length
>>> is going to be from hardware-to-hardware.
>> <
>> You don't need shuffles if each container is associated with its own
>> register ! This is, in effect, what VVM does. The calculation loop is
>> defined as a bunch of mem refs and calculations each in singularity.
>
> the issue there is that it relies, primarily, on memory. part of Jeff Bush's
> research into Nyuzi was to measure the power consumption in a GPU
> if you relied on memory rather than doing LD-COMPUTE-STORE
> (where compute is in deliberately big f'ing regfiles). he found it was
> enormous, even just L1 cache power consumption.
>

I am using a pretty mundane L1 cache.
Can Load/Store 8/16/32/64/128 bits at a time;
Stalls the pipeline on miss or similar;
Loaded values may be freely aligned, and sign or zero extended.


Did experiment with a few LoadOp/StoreOp cases:
The CPU can at least survive...

Though, as long as things stay mostly in registers, the gains of LoadOp
and StoreOp are likely to be small (they would make more sense on a more
register-starved target where values are more frequently evicted).


> if we (Libre-SOC) didn't have the goal of doing a hybrid CPU-GPU-VPU
> then Vertical-First-on-Memory would be perfect because it is as you
> say so much more ridiculously easy: it is in effect auto-vectorisation,
> in hardware.
>

Yeah, I am currently trying to do all this on the CPU, via a software
renderer.


Interestingly, there is a non-linearity in the framerate in my GLQuake
port and similar (numbers from the emulator):
50MHz: ~ 5-10 fps (single digits);
100MHz: ~ 15-25 fps;
200MHz: ~ 35-60 fps.

Though, implicitly, this is also assuming that RAM speed scales linearly
with MHz (and as one scales up MHz, GLQuake spends an increasing
percentage of its CPU time in 3D rendering tasks).

If RAM stays the same speed (and thus requires more clock cycles per
cache miss), any speed gains are minimal (at 200MHz in this scenario,
one would just end up with around 90% or so of the clock cycles being
spent waiting for RAM).


And the relative amount of CPU time spent in things like
"PR_ExecuteProgram" and similar rapidly drops off.



My plans for a GPU were one of either:
Dedicated GPU with a more specialized ISA;
Tweaked CPU core with a more specialized feature-set.

Had considered the former, but it would basically require writing a new
compiler backend. The latter is easier, as it mostly amounts to "feature
tweaks" (which may or may not retain binary compatibility with the main
core).

My current ideas for the latter would be "subset compatible", as in,
code compiled for a common subset would probably run on both.

Either way, still difficult to fit two cores of the "current weight
class" onto an XC7A100T (even if the CPU core omits most "non essential"
features for the use-case).

And it wouldn't be worthwhile if it ended up making things slower than
they are at present with the existing software rasterizer.



>>> this is the biggest mistake that SIMD ISAs has inflicted
>>> on programmers in the world.
>> <
>> You mean other than existing at all ?!?
>
> uh-huhn :)
>
>>> you could always try the dynamic-partitioned SIMD thing.
>> <
>> You could always go back to the notion:: "SIMD considered Harmful"
>
> i meant at the back-end (behind a decent deterministic ISA)
> not in the actual ISA.
>
>> which is similar to "Condition Codes Considered Harmful".......
>
> i like condition codes! but only on top of a compiler that supports
> them, and uses them as predicate masks, in batches.
>

I severely limited the use of condition codes.
Only a select few instructions actually use flag bits;
For most purposes, only a single bit is used (known as SR.T).

Some SIMD ops can use per-element bits (P,Q,R,O), which function similar
to SR.T, ... this is used for operations like Packed-Compare and
Packed-Select.


> given that 3D GPUs have predicate masks, and given that
> 3D GPU compilers these days *have* to turn every if/else
> into a parallel-predicated-thingy (you can't do a parallel
> branch, duh), it's not such a nightmare as it would be in a
> scalar environment.
>

I had at one point considered using multiple bits, but:
Not entirely free (in terms of LUTs);
For 3-wide, much more than 1 predicate bit is not usually needed.

In my ISA design, it is possible to schedule the true and false branches
of an "if" in parallel.

There was an experiment that allowed also using another bit (SR.S) for
predication, but the encoding basically required using a larger
instruction encoding.

In my case, an encoding case for this exists within the Op64 and Op40x2
encodings.

Where Op40x2 encodes a 2-wide bundle in 96 bits, allowing access to a
slightly expanded feature-set while doing so; it is hardly ever used in
practice (the name coming from the encoding effectively expanding each
instruction within the bundle to 40 bits).



Had at one point also experimented with an "SR.T Twiddle" feature, which
could in premise be used to help with complex if/else blocks and
operations like && and ||, basically by treating predicate bits as a
stack with the ability to push/pop and modify the top of the stack.

I ended up leaving this disabled, partly because:
My compiler can't really use it;
Most cases where it "could" apply are better served by using a branch
instruction;
It would not allow interleaving instructions from the if/else branches.


Dealing "in general" with interleaving multiple sets of predicated
branches would likely require the use of explicit 1-bit "predicate
registers", more like Itanium or something.

However, to pull this off, one is pretty much required to move on to a
larger instruction format (there being a steep limit on how much is
practical to fit into a 32-bit instruction word).

An ISA built primarily around 64-bit instruction words or similar could
also work in theory, but it wouldn't exactly be winning any prizes for
code density.


The 1-bit predicate scheme works mostly OK, but does effectively limit
things to only predicating a single if/else branch at a time.

Interleaving and bundling instructions from the if/else branches does
have an advantage that it can reduce the latency of handling a
predicated branch (at least on a naive in-order machine).

...