Floating Point Constants


MitchAlsup

Aug 4, 2022, 4:28:24 PM
Does anyone have a reference where a group of people measured the
percentage of floating point operands that are constant/immediate.

10 minutes of Google turns up billions of useless papers and threads
on everything other than what I am looking for.

BGB

Aug 4, 2022, 10:50:11 PM
( Retry, clicked wrong button at first and unintentionally sent an
email... )
I don't know of any papers or anyone else looking into this.


Gleaning a few stats from my BGBCC output while compiling my port of
GLQuake:
FADD: 541
FSUB: 321
FMUL: 968
FCMP/etc: 2329 (FCMPxx, FNEG, FABS, ...)
Total: 4159
FP Constant Loads:
Binary16: 1146
Binary32: 78
Binary64: ~ 100 (Guesstimate, *1)
Total: ~ 1324

*1: Being more accurate on this estimate would require adding something
to differentiate between Binary64 loads and other types of 64-bit
constants. For the others, it is less ambiguous.

This isn't something I had looked into before, so I don't really have
dedicated stats for it.


From a quick skim of the ASM output dump, it looks like a fair chunk
of the FPU ops use immediate values.

Totals would imply an average case of around 32% of FPU ops needing a
constant. This roughly agrees with my attempts at skimming the ASM dump.

In terms of relative constant-density ranking, high to low:
FCMPGT
FCMPEQ
FMUL
FADD
FSUB
...

For FCMP, constants appear to be very common, slightly less so for FMUL,
and a relative minority for FADD and FSUB.


I guess someone could go and add FP immediate forms.

It looks like these would maybe save around 0.2% to 0.4% of the total
clock-cycle budget in this case.

Would save maybe a little more if dealing with a more FP dominated
use-case. GLQuake spends a decent chunk of time in TKRA-GL, where the
backend rasterizer functions and similar are at present pretty much
entirely built on fixed-point integer stuff.

But, I don't really have many FP dominated workloads in this case.


Dunno if of any real use here...

MitchAlsup

Aug 5, 2022, 10:38:31 AM
VAX did a long time ago
My 66000 did circa 2010 (in this case they fell out for free.)
>
> It looks like these would maybe save around 0.2% to 0.4% of the total
> clock-cycle budget in this case.
>
> Would save maybe a little more if dealing with a more FP dominated
> use-case. GLQuake spends a decent chunk of time in TKRA-GL, where the
> backend rasterizer functions and similar are at present pretty much
> entirely built on fixed-point integer stuff.
<
Blending (LERP) does:: z = x×a + y×(1.0-a)
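As a minimal C sketch (generic code, not tied to any particular ISA): the 1.0 here is exactly the sort of operand an FP-immediate form absorbs, while most RISCs must materialize it from a constant pool or register:

```c
/* Classic blend (LERP): z = x*a + y*(1.0 - a).
 * The literal 1.0 has to come from somewhere: a constant-pool load on
 * most RISCs, or an FP immediate on ISAs (VAX, My 66000) that have them. */
static double lerp(double x, double y, double a) {
    return x * a + y * (1.0 - a);
}
```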

EricP

Aug 5, 2022, 11:58:09 AM
I looked through the VAX papers on instruction set usage stats
but they don't break out floating point constants.

I searched for a while but couldn't find any stats either.

Some points:

- Many RISC ISAs don't have float immediates at all.
Alpha uses IP-rel addressing to load constants, including floats,
from tables located just prior to the routine entry point.

Superficially these show up as regular loads.

- The stats will be skewed by optimization. I saw various references
to whether compiler optimization did/didn't do constant folding.

These could show up as multiple constant loads where one might have
done with constant folding. But then you get into the language rules
for rearranging floating point expressions.

One issue I have with many of the ISA usage papers is that they
simply scan and count instruction types, but don't look deeper
for idioms to try to infer why some sequence was being done.
E.g. It would be nice to know the difference between a compare
guarding a bounds check trap and a compare for program logic.
And then that frequency in turn depends on the source language.

MitchAlsup

Aug 5, 2022, 1:42:30 PM
Thanks for reminding me that VAX data has int, pointer, and float
in their #immed data.
>
> I searched for a while but couldn't find any stats either.
>
> Some points:
>
> - Many RISC ISA's dont have float immediates at all.
<
Specifically, I have been tasked with comparing RISC-V with My 66000.
One has constants, one does not. I have usefully good data on ints and
pointers but not on FP.
<
I am specifically wondering if FP constants are used often enough to
bring up the topic in my comparison. My experience with big numerics
says no, my experience with GPU says yes--for example:
LERP = x×a + y×(1.0-a)
So, I don't have enough data to form an opinion.
<
> Alpha uses IP-rel addressing to load constants, including floats,
> from tables located just prior to the routine entry point.
<
It is simpler to say Alpha does not have floating point constants.
>
> Superficially these show up as regular loads.
<
If you have to load it, it is not a constant.
constant = setof{ immediate, displacement }
>
> - The stats will be skewed by optimization. I saw various references
> to whether compiler optimization did/didnt do constant folding.
<
My other problem is that I do not have access to the code bases which
cost money to access (SPEC for example) whereas I do have a usefully
good LLVM port to My 66000 with working clang and flang. For example
I can get Linpack, Livermore Loops, certain sections of BLAS, but not
the applications which might use those.
>
> These could show up as multiple constant loads where one might have
> done with constant folding. But then you get into the language rules
> for rearranging floating point expressions.
>
> One issue I have with many of the ISA usage papers is that they
> simply scan and count instruction types, but don't look deeper
> for idioms to try to infer why some sequence was being done.
<
I have this in spades:: whereas RISC-V has compare-and-branch with
11-bit target range, My 66000 has compare to zero and branch, and
a compare instruction CoIssued with the successive branch on condition
instruction. My branches have 18-bit target range (or 28-bit for unconditional)
So, any fair comparison needs to take the instruction count, the execution
cycles, and the number of times fix-ups are required into account.
<
I also have to figure out how to rate Predication versus branching only...
<
RISC-V thinks 123456/0 = 0xffffffffffffffff
My 66000 thinks 123456/0 = DIVZERO exception and if the exception is
not enabled (sign)123456/0 = {signed,unsigned}MAXimum.
<
RISC-V literature says there are 50 integer instructions. The actual
assembly reference manual has a list that I counted to be 150
different instructions. My 66000 has similar "counting" problems
depending on where you are looking {assembler "spellings", actual
OpCodes, expanded OpCodes, ...}
<
My 66000 has a switch instruction (JTT; jump through table) that
performs the range comparison [0..k] and sends out of range to
the k+1th entry in the table, and the table can be bytes, half, words
and is PIC. RISC-V has none of this and has to use AUIPC to get
PICed.
<
RISC-V is proud of their MOV FP<->INT instruction showing that
extracting and debiasing the exponent of an FP number can be
done in 5 instructions. My 66000 has a single EXPON instruction
that does the same work.
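The multi-instruction sequence being compared boils down to a move, shift, mask, and subtract; a C sketch of the operation itself (assuming IEEE-754 binary64 and a finite, nonzero input):

```c
#include <stdint.h>
#include <string.h>

/* Extract and debias the exponent of a binary64 value using integer ops,
 * mirroring the MOV FP->INT + shift + mask + subtract idiom. */
static int64_t fp_expon(double d) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);                /* the FP->INT move */
    return (int64_t)((bits >> 52) & 0x7FF) - 1023; /* shift, mask, debias */
}
```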
<
> E.g. It would be nice to know the difference between a compare
> guarding a bounds check trap and a compare for program logic.
> And then that frequency in turn depends on the source language.
<
If it were easy, anyone (or his brother) could do it.
<
Of all of these, access to suitable code bases is the hardest.....
<
Also note: I am not being paid to do this, and don't have the kind
of cash needed to just spring for the code myself.

BGB

Aug 5, 2022, 1:56:19 PM
Doesn't fall out for free in my case; I would need some new encodings
and to throw some format converters into ID2 (at least, assuming a lack
of 64-bit immediate values; given these only exist as FE-FE-F8
encodings, this is unlikely).

The only option other than ID2 is to put the FP converter into the
instruction decoder, which I guess could be "less bad" than putting it
into the register handling logic (they would then follow the same path
as 64-bit immediate values in ID2, rather than creating a new path).



Some possible encodings:
FFw0_0vii-F0nm_5go8 FADD Rm, Ro, Rn, Imm8 //(Exists) Imm=rounding
FFw0_0Vii-F0nm_5gi8 FADD Rm, Imm16, Rn //(Possible), Adds Immed

Also possible would be putting them in the F2 block:
FFw0_PPjj-F2nm_0gjj OP Rm, Imm18s, Rn

Where, say: PP: 00=ADD, 58=FADD, 59=FSUB, 5A=FMUL


Main limitation to the idea is mostly that, while it seems around 32% of
the FPU operations use constants, FPU instructions are not a big enough
part of the total clock cycles for this to likely make all that big of a
difference in terms of performance.


In general, things like Binary16 and Binary32 constant-load encodings
(*1) are still a lot better than using memory loads here (which most of
the other ISAs seem happy with using).

F88n_iiii FLDCH Imm16, Rn //Load Binary16
FFw0_iiii_F88n_iiii FLDCF Imm32, Rn //Load Binary32



>>
>> It looks like these would maybe save around 0.2% to 0.4% of the total
>> clock-cycle budget in this case.
>>
>> Would save maybe a little more if dealing with a more FP dominated
>> use-case. GLQuake spends a decent chunk of time in TKRA-GL, where the
>> backend rasterizer functions and similar are at present pretty much
>> entirely built on fixed-point integer stuff.
> <
> Blending (LERP) does:: z = x×a + y×(1.0-a)

In TKRA-GL, the bilinear interpolation also used fixed-point...

Pretty much everything past the transformation and projection stage is
fixed point in this case.


This would change for HDR, but HDR would likely use Binary16 SIMD for
pixels and fixed-point for pretty much everything else. LERP would
likely still use fixed point for bilinear, because it still mostly
works OK on floating point values as long as the exponent is similar
(with larger exponent differences, fixed point LERP develops an obvious
S-Curve effect, but this isn't too much of an issue for texture
interpolation as a sudden change usually also means an edge-like feature
or similar in the texture).
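A minimal sketch of the fixed-point LERP being described, in a hypothetical 16.16 format (the actual TKRA-GL formats and bit widths may differ):

```c
#include <stdint.h>

/* Hypothetical 16.16 fixed-point LERP: a in [0, 65536] stands for [0.0, 1.0].
 * Applied blindly to raw FP bit patterns this only approximates a true FP
 * lerp while both endpoints share a similar exponent -- hence the S-curve
 * artifact described above when the exponents differ. */
static int32_t fx_lerp(int32_t x, int32_t y, int32_t a) {
    return x + (int32_t)(((int64_t)(y - x) * a) >> 16);
}
```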

No idea what would happen if/when it gets a shader compiler; it is likely
either I would need to rework a few things, or make the shader compiler
have "meta types", where it figures out where the vectors are "actually"
using fixed-point and merely pretends like everything is FP-SIMD.

Trying to "naively" compile the shaders (and then put FP->Fixed
conversion in the "texture2D()" calls and similar; or worse, use an
actual function call here), would perform like garbage.

For now, I will ignore this.

BGB

Aug 5, 2022, 2:17:42 PM
I suspect most ISAs don't have these.
Hell, not even x86 has these (well, at least as far as I last looked
into this; I mostly ignore AVX and newer).

I have a Binary16 immediate load instruction and similar; this is still
more than most of them (and a lot better than using memory loads IMO).


> - The stats will be skewed by optimization. I saw various references
>   to whether compiler optimization did/didnt do constant folding.
>
>   These could show up as multiple constant loads where one might have
>   done with constant folding. But then you get into the language rules
>   for rearranging floating point expressions.
>

In my case, my C compiler evaluates constant expressions when possible.

Things like:
y=x/(1.0+2.0);
Will get transformed into, say:
y=x*0.333333;

Eliminating FDIV whenever possible also makes sense, since this is
generally significantly slower than the other instructions.

FMUL: 6 cycles.
FDIV: 120 cycles.
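The trade-off can be sketched in plain C (generic code, not BGBCC output): the folded form avoids the slow FDIV but rounds twice, once for the reciprocal constant and once for the multiply, which is why it can miss 0.5 ULP:

```c
/* x/3.0: one slow but correctly-rounded (0.5 ULP) FDIV.
 * x*(1.0/3.0): the reciprocal is rounded at compile time, and the multiply
 * rounds again, so the result may be off by about 1 ULP -- usually an
 * acceptable price for skipping a ~120-cycle divide. */
static double div_by_3(double x)    { return x / 3.0; }
static double mul_recip_3(double x) { return x * (1.0 / 3.0); }
```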


FDIV is also rare though for most things; apart from Quake's software
renderer where it is kind of a boat anchor (perspective correct texture
mapping needs fast FDIV).

In my case, TKRA-GL uses affine projection (with dynamic tessellation),
which at least mostly avoids the FDIV issue.


> One issue I have with many of the ISA usage papers is that they
> simply scan and count instruction types, but don't look deeper
> for idioms to try to infer why some sequence was being done.
> E.g. It would be nice to know the difference between a compare
> guarding a bounds check trap and a compare for program logic.
> And then that frequency in turn depends on the source language.
>

Probably.

JimBrakefield

Aug 5, 2022, 2:32:02 PM
On Friday, August 5, 2022 at 10:58:09 AM UTC-5, EricP wrote:
A rough and risky approach is to posit that floating-point constants are as frequent as integer constants, normalized by the relative occurrence rates of integer and floating-point operations. Another possible assumption is that the total number of unique floating-point constants is relatively small for a given program, and have the linker put them in a single memory area referenced by a short offset. I.e. choose between inlined single or double immediate constants or a memory/cache reference. My preference is to give the user that choice.

MitchAlsup

Aug 5, 2022, 3:01:08 PM
Which, BTW, does not achieve 0.5 ULP accuracy, whereas y=x/3.0 does.
>
> Eliminating FDIV whenever possible also makes sense, since this is
> generally significantly slower than the other instructions.
>
> FMUL: 6 cycles.
> FDIV: 120 cycles.
>
FMUL 6 cycles
FDIV 18 cycles

MitchAlsup

Aug 5, 2022, 3:09:30 PM
useable
<
> Another possible assumption
> is the total number of unique floating-point constants is relatively small for
> a given program and have the linker put them in a single memory area
> referenced by a short offset. I.e. chose between in lined single or double
> immediate constants or a memory/cache reference.
<
Here, the use of an FP constant costs 2 instructions, one to LD and one to calculate,
compared to 1 instruction and 1 issue cycle, where the instruction is 64 bits (32-bit
immediate which can be expanded to 64-bits depending on FP calculation size) or
96 bits (with 64-bit immediate). Also note: these immediates come through the
instruction cache and do not need read access to that page, so the data cache
is not polluted, nor is the TLB disturbed.
<
> My preference is to give the user that choice.
<
My 66000 has FP immediates in the instruction itself:
<
FDIV R9,#3.1415926535863,R7
or
FDIV R9,R7,#3.1415926535863
<
and Brian's LLVM port already does this. My problem is comparing one that does
against one that does not.

Marcus

Aug 5, 2022, 3:11:35 PM
I'm not sure if it would help, but I find that ffmpeg (GPL) is a
relatively large, portable code base that is heavy on data processing
(codecs), and it has some FP code too. I have not analyzed it
thoroughly, though, and you may have to tweak the build (configuration
parameters) to get all the interesting codecs built.

It built out-of-the-box for MRISC32 (no OS, just newlib) - so it's
portable alright.

https://ffmpeg.org

>>
>> These could show up as multiple constant loads where one might have
>> done with constant folding. But then you get into the language rules
>> for rearranging floating point expressions.
>>
>> One issue I have with many of the ISA usage papers is that they
>> simply scan and count instruction types, but don't look deeper
>> for idioms to try to infer why some sequence was being done.
> <
> I have this is spades:: whereas RISC-V has compare-and-branch with
> 11-bit target range, My 66000 has compare to zero and branch, and
> a compare instruction CoIssued with the successive branch on condition
> instruction. My branches have 18-bit target range (or 28-bit for unconditional)
> So, any fair comparison needs to take the instruction count, the execution
> cycles, and the number of times fix-ups are required into account.

Don't forget, RISC-V compare-and-branch only works with registers, so if
you want to compare to an immediate you need an extra instruction to
load the immediate into a register.

RISC-V (2 instructions):

li a2, 55
blt a2, a0, foo

MRISC32 (2 instructions):

sle r2, r1, #55
bns r2, foo

...the RISC-V compare-and-branch shines for loops (where the loop count
is preloaded into a register), which is what it was optimized for, I
suppose.

Stephen Fuld

Aug 5, 2022, 3:15:16 PM
On 8/5/2022 11:32 AM, JimBrakefield wrote:
> On Friday, August 5, 2022 at 10:58:09 AM UTC-5, EricP wrote:
>> MitchAlsup wrote:
>>> On Thursday, August 4, 2022 at 9:50:11 PM UTC-5, BGB wrote:
>>>> ( Retry, clicked wrong button at first and unintentionally sent an
>>>> email... )
>>>> On 8/4/2022 3:28 PM, MitchAlsup wrote:
>>>>> Does anyone have a reference where a group of people measured the
>>>>> percentage of floating point operands that are constant/immediate.

> A rough and risky approach is to posit floating point constants are as frequent as integer constants normalized by the relative occurrence rates for integer and floating-point operations. Another possible assumption is the total number of unique floating-point constants is relatively small for a given program

I would guess that is probably true, but I suggest a different approach
than yours. Why not put those few constants in ROM within the CPU and
reference them from there? It would require some ISA modifications, but
it would eliminate the load completely.


> and have the linker put them in a single memory area referenced by a
> short offset. I.e. chose between in lined single or double immediate
> constants or a memory/cache reference. My preference is to give the
> user that choice.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)


MitchAlsup

Aug 5, 2022, 3:17:42 PM
On Friday, August 5, 2022 at 2:11:35 PM UTC-5, Marcus wrote:
My 66000 has the LOOP instruction which does ADD:CMP:branch back to the
top of the loop as 1 instruction. Using this, loops can execute as wide as the
cache access port width.
<
VEC...LOOP are the access means to SIMD functionality.
So a low end machine can do byte-by-byte copy loops at 32 iterations per
clock (160 instructions per clock).
A higher end machine could do DGEMM at 4 iterations per cycle (32 IPC).

MitchAlsup

Aug 5, 2022, 3:19:05 PM
On Friday, August 5, 2022 at 2:15:16 PM UTC-5, Stephen Fuld wrote:
> On 8/5/2022 11:32 AM, JimBrakefield wrote:
> > On Friday, August 5, 2022 at 10:58:09 AM UTC-5, EricP wrote:
> >> MitchAlsup wrote:
> >>> On Thursday, August 4, 2022 at 9:50:11 PM UTC-5, BGB wrote:
> >>>> ( Retry, clicked wrong button at first and unintentionally sent an
> >>>> email... )
> >>>> On 8/4/2022 3:28 PM, MitchAlsup wrote:
> >>>>> Does anyone have a reference where a group of people measured the
> >>>>> percentage of floating point operands that are constant/immediate.
> > A rough and risky approach is to posit floating point constants are as frequent as integer constants normalized by the relative occurrence rates for integer and floating-point operations. Another possible assumption is the total number of unique floating-point constants is relatively small for a given program
> I would guess that is probably true, but I suggest a different approach
> than yours. Why not put those few constants in ROM within the CPU and
> reference them from there? It would require some ISA modifications, but
> it would eliminate the load completely.
<
I did this in my Denelcor compiler code generator for a useful subset of
the constants (found in the code).

BGB

Aug 5, 2022, 7:53:46 PM
The 120-cycle HW FDIV gets ~0.5 ULP, whereas the software divide
(Newton-Raphson based) is seemingly limited to somewhere around 3.5 ULP
or so for Binary64.


For most things, "X/C" vs "X*(1.0/C)" doesn't have any real practical
effect on behavior, but if the latter is 20x faster, this is what it is
going to be.

Likewise, 0.5 ULP does not seem to be a requirement of the C standard.


>>
>> Eliminating FDIV whenever possible also makes sense, since this is
>> generally significantly slower than the other instructions.
>>
>> FMUL: 6 cycles.
>> FDIV: 120 cycles.
>>
> FMUL 6 cycles
> FDIV 18 cycles

Not going to get 18 cycles from a Shift-Add unit; that would need a
divider that can do multiple bits per clock cycle.


Going and enabling FDIV:
GLQuake FDIV=1.65% (vs FMUL=2.86% and FADD=2.51%)
SWQuake FDIV=2.42% (vs FMUL=8.40% and FADD=4.85%)

Its relative slowness doesn't really seem to matter too much...

MitchAlsup

Aug 5, 2022, 8:24:25 PM
Are these overall time spent executing ?op, or occurrence of ?op ??

robf...@gmail.com

Aug 5, 2022, 9:12:11 PM
It may be interesting to know the kind of precision required for floating-point constants.
VAX had six-bit float immediates I think. If there are 16-bits available in the instruction for a constant,
there may be a lot of float-constants that could be mapped to a higher precision. The whole 64-bit or
128-bit constant may not need to be encoded for constants to be useful.


Andy

Aug 5, 2022, 9:29:08 PM
On 6/08/22 05:56, BGB wrote:

<snip>

>>> Would save maybe a little more if dealing with a more FP dominated
>>> use-case. GLQuake spends a decent chunk of time in TKRA-GL, where the
>>> backend rasterizer functions and similar are at present pretty much
>>> entirely built on fixed-point integer stuff.
>> <
>> Blending (LERP) does:: z = x×a + y×(1.0-a)
>
> In TKRA-GL, the bilinear interpolation also used fixed-point...
>
> Pretty much everything past the transformation and projection stage is
> fixed point in this case.

I'm assuming you've implemented some kind of deferred shading tile
renderer?, since block-rams would seem to be the perfect fit for tile
buffers if they aren't too oddly sized.

BGB

Aug 5, 2022, 9:41:08 PM
Percentage of clock cycle budget spent executing these instructions,
according to my emulator (which does account for the clock-cycle costs
of the various instructions, including for things like pipeline
interlocks and cache-miss costs and similar).


Granted, I can disable this modeling via the command line (where the
emulator then uses simpler models and assumes constant-cycle costs for
the various instructions) and get a significant speedup.

Of these checks, though, cache hit/miss modeling is the most significant
(as it is the most dynamic and context dependent), so disabling
cache-miss modeling causes the estimates to be wildly inaccurate.

BGB

Aug 5, 2022, 10:01:23 PM
That is basically the case in my experience here.

The vast majority of constants in a program tend to be things like
100.0, 1.375, 420.0, ..., which can be expressed exactly as Binary16
values (which then show up as if they were the original Binary64
constants).

In general, around 90% of the FP constants can fit into Binary16 and be
represented exactly, without any loss of precision due to the
limitations of the format.

However, going much smaller than Binary16, and this starts dropping off
fairly rapidly (so, while Binary16 can represent most of the constants
exactly, an 8 or 9 bit format can represent relatively few).


This basically means that FPU Immediate instructions would mostly need
around 16 bits or so to be particularly useful (maybe 12 could work, but
that is pushing it).

Though, some more statistical modeling could be useful here.

>

MitchAlsup

Aug 5, 2022, 10:10:46 PM
In my case: FP immediates come in 32-bit and 64-bit flavors.
32-bit FP immediates are converted to 64-bit FP immediates during operand delivery
when the calculation is double and left alone when the calculation is 32-bits.
<
Since My 66000 does not currently have FP8 or FP16 the rest of the size question is moot.

Stephen Fuld

Aug 6, 2022, 1:05:06 AM
On 8/5/2022 7:01 PM, BGB wrote:
> On 8/5/2022 8:12 PM, robf...@gmail.com wrote:
>> It may be interesting to know the kind of precision required for
>> float-point constants.
>> VAX had six-bit float immediates I think. If there are 16-bits
>> available in the instruction for a constant,
>> there may be a lot of float-constants that could be mapped to a higher
>> precision. The whole 64-bit or
>> 128-bit constant may not need to be encoded for constants to be useful.
>>
>
> That is basically the case in my experience here.
>
> The vast majority of constants in a program tend to be things like
> 100.0, 1.375, 420.0, ..., which can be expressed exactly as Binary16
> values (which are then show up as-if they were the original Binary64
> constants).
>
> In general, around 90% of the FP constants can fit into Binary16 and be
> represented exactly, without any loss of precision due to the
> limitations of the format.

I am a little surprised by that. Are pi and e, for example, not
frequently used?

Thomas Koenig

Aug 6, 2022, 1:57:02 AM
MitchAlsup <Mitch...@aol.com> schrieb:
> On Friday, August 5, 2022 at 10:58:09 AM UTC-5, EricP wrote:

>> - The stats will be skewed by optimization. I saw various references
>> to whether compiler optimization did/didnt do constant folding.
><
> My other problem is that I do not have access to the code bases which
> cost money to access (SPEC for example) whereas I do have a usefully
> good LLVM port to My 66000 with working clang and flang. For example
> I can get Linpack, Livermore Loops, certain sections of BLAS, but not
> the applications which might use those.

Try the Polyhedron benchmarks at
https://www.fortran.uk/fortran-compiler-comparisons/the-polyhedron-solutions-benchmark-suite/

And I'm not sure why you can only get certain sections of BLAS, the
reference implementation is at https://netlib.org/blas/ .

BGB

Aug 6, 2022, 2:47:01 AM
No, errm, TKRA-GL is a software renderer implementing the OpenGL API on
the front end...


It implements roughly enough of the OpenGL API to render Quake 1/2/3 and
similar (roughly ~ OpenGL 1.3):
Fixed function rendering;
Blending;
Bilinear Interpolation (sorta);
Stencil Buffers (Optional);
Various stuff for managing transform and projection matrices;
...

Includes some other features:
Texture compression;
Half Float;
...

Omits some stuff that Quake/etc don't use:
Display lists;
GL_LIGHT and GL_FOG and similar (*);

*: While Quake3 does fog effects, it does so by effectively drawing a
bunch of translucent fog layers with depth writes disabled (rather than
using OpenGL's built in fog effect).


I had considered possibly adding GL_LIGHT stuff, on the basis that it is
"not entirely useless", and I could potentially support a "poor man's
Phong" extension (mostly by running the Gouraud Shading math after
tessellation rather than before tessellation).


Rendering process is sorta like (per primitive):
Project vertices to screen space;
If primitive is too big, split it up:
Keep splitting until no longer too big;
Draw the primitive (as its subdivided pieces).

This process is implemented via a small stack, where a primitive is
popped, projected, and if it needs to be subdivided, each of the pieces
is pushed back to the stack, else it is drawn. If the stack limit is
reached, the primitive is discarded.
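The control flow above can be sketched with a toy 1-D stand-in, where a "primitive" is just a size and splitting halves it (hypothetical structure, not the actual TKRA-GL code):

```c
#include <stddef.h>

#define STACK_MAX 64

/* Toy model of the subdivision loop: pop a primitive, split it if too big
 * (pushing the halves back), otherwise "draw" it. Primitives that would
 * overflow the stack are discarded, as in the renderer described above. */
static int subdivide_and_draw(int size, int max_size) {
    int stack[STACK_MAX];
    size_t sp = 0;
    int drawn = 0;
    stack[sp++] = size;
    while (sp > 0) {
        int p = stack[--sp];
        if (p > max_size) {
            if (sp + 2 <= STACK_MAX) {  /* split into two halves */
                stack[sp++] = p / 2;
                stack[sp++] = p - p / 2;
            }                           /* else: stack limit hit, discard */
        } else {
            drawn++;                    /* draw the (sub)primitive */
        }
    }
    return drawn;
}
```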


Primitive Drawing:
Set up function pointers/etc depending on OpenGL settings and the
primitive being drawn;
Walk the edges of the primitive (step down left and right edges);
Call span-drawing function at each scanline.

The span-drawing function is invoked using function pointers, where
properties of the primitive and of the current GL state settings are
used to select which span-drawing function pointers to use.


The edge-walking basically keeps several sets of vectors, for the left
and right sides:
XZ / XZ+Stencil
ST
RGBA
XZ_ystep
ST_ystep
RGBA_ystep

Don't need a vertex normal here, as the Normal vector and similar are
"dead" by this point.

Where for each scanline:
If Y is outside viewport
Add step vectors to current vectors;
Continue.
Calculate step values for span func;
Clip span to viewport;
Call span-drawing function;
Add step vectors to current vectors;
Continue.

For a triangle, it walks from the top vertex to the middle vertex, then
recalculates the step values and goes from the middle to the bottom.

For a quad, currently it draws it as two triangles (0,1,2; 0,2,3).
In earlier stages, triangles and quads are treated as two different
primitive types.

Trying to draw a polygon primitive results in it being decomposed into
quads and triangles.

The preferred interface here is glDrawArrays or glDrawElements, with the
glBegin/glVertex*/glEnd existing as a wrapper on top of an internal set
of vertex arrays.



Span drawing functions are classified according to major features:
Flat, Only draws a flat color;
Tex, use raster texture (no color modulation)
ModTex, color modulated texture, opaque
Utx, UTX2 texture (no color modulation)
ModUtx, UTX2 texture, color modulated, opaque
AlphaModTex, color modulated texture, alpha blend
AlphaModUtx, UTX2 texture, color modulated, alpha blend
BlModTex, color modulated texture, opaque, bilinear
BlModUtx, UTX2 texture, color modulated, opaque, bilinear
AlphaBlModTex, color modulated texture, alpha blend, bilinear
AlphaBlModUtx, UTX2 texture, color modulated, alpha blend, bilinear
Atest*, Alpha-tested variants
LMap*, Lightmap cases
Blend*, Generic Blend cases
...

Then with suffixes:
*Mort*, Uses Morton Order;
-, Does not use Z buffer;
Zb, Check ZBuffer (GL_LEQUAL) and write ZBuffer
Zt, Check ZBuffer (GL_LEQUAL), but no write to ZBuffer.
...


This is an unwieldy mess, but is needed for performance; basically these
features interact as a sort of combinatorial explosion. But this is an
area where trying to solve it in a "clean" way (say, nested "if()" or
"switch()" blocks) results in code that is incredibly slow.

There are a few cases for specific blending modes
Alpha ~= (GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA)
Other blending modes fall back to a slower "Blend" case.
This one calls additional functions to perform the blend.

Likewise, trying to set glDepthFunc to something other than GL_LESS or
GL_LESS_EQUAL/GL_LEQUAL will adversely affect performance.

For example, at present trying to draw stencil shadows would fall into a
large number of slow cases (so, probably can't do anything like Doom 3
on this anytime soon).


LLVMpipe had instead taken a route of generating lots of paths in a
procedural way and then using LLVM to generate machine code (also as the
backend for a shader compiler), but this is likely a bit too heavyweight
for this.

One other possibility would be to use procedural generation for C (with
some overhead for using C rather than ASM), which is sort of a poor-man's
solution. Going too far in this direction though would lead to excessive
code bloat.

Things would also be easier if there were fewer features one could
glEnable or glDisable (which need to be supported for rendering to work
correctly).



A lot of these span drawing functions are written in ASM as well.


Buffers used (all stored in raster order):
Color Buffer: RGB555A
Depth: 16-bit
Stencil: 8-bit

Internal texture formats:
Generic rectangular, stored as RGB555A (Raster Order);
Square RGB555A (Morton Order);
Square UTX2 (Morton Order).

Morton order allows better performance for span drawing, but only works
on textures that are either square or have a 2:1 aspect ratio (though an
S/T flip mechanism also allows 1:2 textures).

Say:
256x256, Can use Morton;
256x128, Can use Morton;
128x256, Can use Morton (S/T Flip);
256x64, Needs raster
64x256, Needs raster
...

In practice, most textures can use Morton order.
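For reference, Morton (Z-order) indexing just interleaves the S/T coordinate bits, which is why it wants square or 2:1 power-of-two textures. A generic sketch (assuming X lands in the even bit positions; the actual TKRA-GL bit convention may differ):

```c
#include <stdint.h>

/* Spread the low 16 bits of v so they occupy the even bit positions. */
static uint32_t part1by1(uint32_t v) {
    v &= 0x0000FFFF;
    v = (v | (v << 8)) & 0x00FF00FF;
    v = (v | (v << 4)) & 0x0F0F0F0F;
    v = (v | (v << 2)) & 0x33333333;
    v = (v | (v << 1)) & 0x55555555;
    return v;
}

/* Morton index for texel (x, y): x bits even, y bits odd. */
static uint32_t morton2(uint32_t x, uint32_t y) {
    return part1by1(x) | (part1by1(y) << 1);
}
```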


RGB555A is a modified RGB555 variant:
0rrr-rrgg-gggb-bbbb RGB555 (Opaque)
1rrr-ragg-ggab-bbba RGB444+A3 (Translucent)



UTX2 is a block texture compression format:
(63:32): Pixel Selectors (P), 4x4x2 bits;
(31:16): ColorB
(15: 0): ColorA

UTX2 has several sub-modes (Bits 31 and 15):
00: Opaque, Interpolated
Endpoints interpreted as RGB555
P: 00=A, 01=(5/8)*A+(3/8)*B, 10=(3/8)*A+(5/8)*B, 11=B
01: Color+Alpha
Endpoints interpreted as RGB444A3 above;
1 bit selects Color, the other Alpha.
10: Opaque+Transparency
Endpoints interpreted as RGB555
00=A, 01=(A+B)/2, 10=Transparent, 11=B
11: Translucent, Interpolated.
Endpoints interpreted as RGB444A3
Interpolate linearly (as in 00).

Modes 00 and 10 can mimic DXT1, whereas 01 and 11 also allow it to mimic
DXT5 (albeit with fewer bits). Note that the use of explicit mode-bits
(rather than relative comparisons) means this is cheaper to decode than
DXT1 or DXT5 would have been.
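As a sketch of why the explicit mode bits make decoding cheap: mode 00 for a single texel works out to a table lookup and a per-channel weighted sum (illustrative Python, not the hardware decoder):

```python
def unpack_rgb555(c):
    # 0rrr-rrgg-gggb-bbbb -> (r, g, b), each 0..31
    return ((c >> 10) & 31, (c >> 5) & 31, c & 31)

# Selector -> (weight of A, weight of B), in eighths, per the mode-00 table.
WEIGHTS = {0: (8, 0), 1: (5, 3), 2: (3, 5), 3: (0, 8)}

def decode_utx2_mode00_texel(color_a, color_b, sel):
    # Interpolate each channel between the two RGB555 endpoints.
    a = unpack_rgb555(color_a)
    b = unpack_rgb555(color_b)
    wa, wb = WEIGHTS[sel]
    return tuple((wa * ca + wb * cb) // 8 for ca, cb in zip(a, b))
```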


There is also a UTX3L format (128-bit):
(127:96): Alpha Selectors, 4x4x2;
( 95:64): Color Selectors, 4x4x2;
( 63:32): ColorB (RGBA32 / RGBA8888)
( 31: 0): ColorA (RGBA32 / RGBA8888)
And, UTX3H (128-bit):
Basically the same as UTX3L, but treat endpoints as FP8U (E4.F4);
Result unpacked to 4x Binary16.

UTX3 is designed to try to be "mostly comparable" to BC7 and BC6H,
albeit with a significantly cheaper hardware decoder (and skipping the
"partitioned" formats).
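For illustration, unpacking an FP8U (E4.F4) endpoint might look like the following. The bias of 7 and the subnormal handling for a zero exponent are my assumptions, since the exact encoding isn't spelled out here:

```python
def fp8u_to_float(x, bias=7):
    # Assumed layout: eeee.ffff, unsigned, implicit leading 1.
    e = (x >> 4) & 0xF
    f = x & 0xF
    if e == 0:
        # Treat exponent 0 as subnormal (assumption).
        return f / 16.0 * 2.0 ** (1 - bias)
    return (1.0 + f / 16.0) * 2.0 ** (e - bias)
```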

Had previously considered some "more flexible" ideas, but they would
have been a lot more expensive to decode. Both UTX2 and UTX3 use
overlapping machinery internally.


At the OpenGL API level, generally the traditional formats (DXT1/DXT5 /
BC1/BC3/BC6H/BC7) are used, with TKRA-GL translating them internally.

Where relevant, a lot of my code is assuming the DDS and BMP file formats.


I could make stuff look a lot better, but this would require:
Using RGBA32 buffers and textures;
Not doing as much corner cutting;
Generally spending a lot more clock cycles on the 3D rendering;
...

It is pretty hard to try to get playable GLQuake on something running at
50MHz without a GPU.

Like, if there is one big advantage that the PlayStation had, it was
that it had a GPU.

Ironically, due to the affine texturing and similar, my GLQuake port
tends to look kinda like it was running on a PlayStation.

Ironically, I think I may not be too far off from something like the
Sega Saturn, given that most of the examples I had seen of Saturn games
had very simplistic geometry and lots of pop-in at relatively short
distances.

Well, contrast with something like Mega-Man Legends (on the
PlayStation), which had very simplistic geometry but often fairly large
draw distances (in comparison).

Like this game was like "Behold, this character has cubes for arms and
hands!" or "behold this house, with its epic 6 polygons! No 3D modeled
roof overhangs or windows here!"


Then, there is GLQuake, which despite having "simple looking" geometry,
a lot of this geometry is already cut into a fair number of small pieces
by the BSP algorithm.

FWIW: The dynamic tessellation actually only splits a relative minority
of primitives, mostly limited to those a short distance from the camera.


...

BGB

Aug 6, 2022, 2:57:42 AM
While M_PI and M_E don't fit into 16 bits (they would need a full 64-bit
constant), in my experience such values (and other derived constants)
are nowhere near being the majority of the floating point constants in
use.

They tend to be vastly outnumbered by other much less noteworthy constants.

Ivan Godard

Aug 6, 2022, 7:59:43 AM
Us too :-(

EricP

Aug 6, 2022, 12:36:56 PM
MitchAlsup wrote:
> On Friday, August 5, 2022 at 10:58:09 AM UTC-5, EricP wrote:
>> MitchAlsup wrote:
>>>> On 8/4/2022 3:28 PM, MitchAlsup wrote:
>>>>> Does anyone have a reference where a group of people measured the
>>>>> percentage of floating point operands that are constant/immediate.
>>>>>
> <
>> Alpha uses IP-rel addressing to load constants, including floats,
>> from tables located just prior to the routine entry point.
> <
> It is simpler to say Alpha does not have floating point constants.

Sure it does. These are constant values stored in read-only memory.
They just are not immediate constants.

>> Superficially these show up as regular loads.
> <
> If you have to load it, it is not a constant.
> constant = setof{ immediate, displacement }

The point I'm making relates to the issue I raise below
about how papers analyze ISA usage stats in benchmarks.

>> One issue I have with many of the ISA usage papers is that they
>> simply scan and count instruction types, but don't look deeper
>> for idioms to try to infer why some sequence was being done.
> <
> I have this in spades:: whereas RISC-V has compare-and-branch with
> 11-bit target range, My 66000 has compare to zero and branch, and
> a compare instruction CoIssued with the successive branch on condition
> instruction. My branches have 18-bit target range (or 28-bit for unconditional)
> So, any fair comparison needs to take the instruction count, the execution
> cycles, and the number of times fix-ups are required into account.

In order to get at the stats you seek one has to look deeper into the code.
Taking Alpha for example. To determine if a load was for a FP constant

- Alpha uses the idiom Branch And Link BAL+0 to copy the current
incremented IP into a link register Rx. In routines that need
to access constants it does this in the routine prologue.
- Scan the routine from start and note which register Rx it uses for BAL+0
- Later if you encounter a load FLD frd,[Rx+offset] that loads a float
register using the prior Rx then you know that was a FP constant.

The above assumes the ISA has separate register banks for INT and FP and
therefore separate LD and FLD instructions to tell you what it was doing.

If ISA has a unified INT-FP bank then you need to continue scanning
forward following the branch flow until you find an instruction that
reads the register previously loaded.
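That scan is easy to mechanize. A toy sketch over a simplified instruction trace (the tuple encoding here is hypothetical, just to show the bookkeeping):

```python
def count_fp_constant_loads(instrs):
    """Count FLDs off a register set up by the BAL+0 idiom.

    `instrs` is a simplified trace: ('BAL0', rx) for the
    branch-and-link-to-next-instruction idiom in the prologue, or
    ('FLD', frd, base, offset) for a float load.
    """
    table_regs = set()
    count = 0
    for ins in instrs:
        if ins[0] == 'BAL0':
            # Rx now holds the incremented IP; constants sit nearby.
            table_regs.add(ins[1])
        elif ins[0] == 'FLD' and ins[2] in table_regs:
            # Float load relative to the table register: an FP constant.
            count += 1
    return count
```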


MitchAlsup

Aug 6, 2022, 3:57:56 PM
not really:: 1/pi, 2/pi, 4/pi are used a lot more than pi, but in many of these
cases, the pi needs 1¼ fractions of width to be sufficiently accurate (64-bits)
<
Realistically: +1.0, -1.0, 0.0, 2.0, 5.0, 10.0 are used a lot.

MitchAlsup

Aug 6, 2022, 4:03:56 PM
I see; you are showing how to determine it was an Alpha FP constant, not
stating Alpha used it as a #immed in an instruction. You are showing how to
count. Thanks.

Quadibloc

Aug 6, 2022, 9:38:21 PM
That's certainly true. But these days, DRAM is so very slow that if one could handle _all_
possible constants instead of just a fraction of them, it would be even better.

Inspired by the "Heads and Tails" scheme, I came up with the idea of dividing programs into
blocks of eight 32-bit instructions - with an indication of how many slots are not used for
instructions. Then immediates could be referenced by pointers within the block.

This had the virtue of letting all instructions be 32 bits long for fast decoding. But it meant
that the instruction stream is artificially divided into blocks, instead of being uniform and
continuous, which complicates compilers. Plus, for a three-bit field that indicates the number
of unused slots, I need to accept 32 bits of overhead!

An obvious solution would be to have eight 31-bit instructions in a block, but the problem is
that the immediates need to be 32 bits and 64 bits long, not 31 bits and 62 bits long! There is
a way around that, too, but I didn't like it because it would strongly tempt people to implement
it with serial decoding - and the whole point of having a uniform instruction length is to facilitate
parallel decoding.

I think I've finally come up with an alternative way of doing this.

Some architectures had branch instructions with a delay - the branch instruction appears in
the code, and is defined as causing a branch... after the next two instructions are executed.

Well, then, why not have instructions with immediates work like this:

A 32-bit long instruction which also has an immediate argument appears in the code.

That instruction will be executed... after seven more instructions following it. Following
the seventh such instruction, the immediate value appears in the instruction stream!

Of course, that means that none of those seven instructions may be a branch target,
because branching into that area would cause the immediate value to be executed
as code.

There you go - 32-bit and 64-bit immediates, uniform 32-bit instructions, eight
instructions can be executed in parallel at a time, since adequate warning is given...
and yet the instruction stream is uniform!

Note that if an instruction with an immediate value occurs within the seven
instructions following such an instruction... each 32 bits of the length of the
immediate called for by the first instruction is counted as an "instruction",
since we only need delay slots in space, not in time.
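As I read the scheme, a decoder can classify slots in a single forward pass: an instruction at word i claims words starting at i+8 for its immediate, and since immediate words of earlier instructions count toward the seven, the offset can be taken in raw words. A toy Python model (my reading of the scheme, not anything official):

```python
def classify_slots(imm_words):
    """Classify each 32-bit slot as 'insn' or 'imm'.

    imm_words[i]: for an instruction slot, how many 32-bit immediate
    words it calls for (0 if none); the value is ignored for slots
    that turn out to hold immediate data.
    """
    n = len(imm_words)
    kind = ['insn'] * n
    for i in range(n):
        if kind[i] != 'insn':
            continue  # this slot is immediate data, not an instruction
        for k in range(imm_words[i]):
            # Seven slots of delay, then the immediate words appear.
            j = i + 8 + k
            if j < n:
                kind[j] = 'imm'
    return kind
```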

John Savard

MitchAlsup

Aug 6, 2022, 10:19:02 PM
On Saturday, August 6, 2022 at 8:38:21 PM UTC-5, Quadibloc wrote:
> On Friday, August 5, 2022 at 7:12:11 PM UTC-6, robf...@gmail.com wrote:
> > It may be interesting to know the kind of precision required for float-point constants.
> > VAX had six-bit float immediates I think. If there are 16-bits available in the instruction for a constant,
> > there may be a lot of float-constants that could be mapped to a higher precision. The whole 64-bit or
> > 128-bit constant may not need to be encoded for constants to be useful.
> That's certainly true. But these days, DRAM is so very slow that if one could handle _all_
> possible constants instead of just a fraction of them, it would be even better.
<
My 66000 allows for all FP constants (float and double) including NaNs with payloads
>

Andy

Aug 7, 2022, 7:02:51 AM
On 6/08/22 18:46, BGB wrote:
> On 8/5/2022 8:29 PM, Andy wrote:

>> I'm assuming you've implemented some kind of deferred shading tile
>> renderer?, since block-rams would seem to be the perfect fit for tile
>> buffers if they aren't too oddly sized.
>>
>
> No, errm, TKRA-GL is a software renderer implementing the OpenGL API on
> the front end...

So, pretty much like the standard Z-buffer software scanline renderer
one would write on a PC then?

They invented the PowerVR style tile renderer for a reason, so may as
well steal from the best. :-)

I'm guessing it might help boost your frame-rates for pretty much the
same reasons, -- the small internal tile buffer negates the need to
read/write to dram for every buffer lookup/update.

And two or more scans over the tile buffer lets you figure out exactly
which pixels of the polygons that are visible need to be shaded and
textured, versus the possibly not insignificant overdraw a z-buffer
renderer might waste time and bandwidth on.

Of course ripping up and changing your existing core probably isn't
something you'd happily contemplate, so take my suggestion with a large
grain of salt. ;-)


<snip>

> For a triangle, it walks from the top vertex to the middle vertex, then
> recalculates the step values and goes from the middle to the bottom.
>
> For a quad, currently it draws it as two triangles (0,1,2; 0,2,3).
> In earlier stages, triangles and quads are treated as two different
> primitive types.
>
> Trying to draw a polygon primitive results in it being decomposed into
> quads and triangles.

So why'd you deprecate the quads as primitives?

I've been looking through the source code of Core Design's
un-released/developed game 'TombRaider Anniversary Edition' (not to be
confused with the Crystal Dynamics game which did get released),

there's plenty of quad polygon handling functions to be seen, which
backs up my intuition that tri's and quads should be handled equally
well if at all possible.


<more snip>

>
> I could make stuff look a lot better, but this would require:
>   Using RGBA32 buffers and textures;

16-bit textures shaded into 24-bit frame buffers were the standard for a
while when memories were smaller, weren't they? Seemed to be acceptable
for the time, and probably not too shabby for the retro-gaming inclined
today either.



>   Not doing as much corner cutting;
>   Generally spending a lot more clock cycles on the 3D rendering;
>   ...
>
> It is pretty hard to try to get playable GLQuake on something running at
> 50MHz without a GPU.
>
> Like, if there is one big advantage that the PlayStation had, it was
> that it had a GPU.

Maybe there's a hint to be had there...

Something like a big-little multi core design,
your large but leaner WEX core handling all the game input, camera &
object updates, then vector style churning though all the floating point
geometry to leave an array of Z-sorted integer polygons that can be fed
to a number of tiny 16/24bit risc/misc like cores to render into however
many spare tile buffers you can fit into your fpga.

And by tiny, I mean like a 16bit 6502 with half the instruction set
thrown out, if an instruction doesn't aid in placing a texture sampled
pixel into the tile buffer --- axe it!

I'm thinking a small quantity of tiny independent cores working
simultaneously might work better over-all than one big complex core
trying to do it all. YMMV



> Ironically, due to the affine texturing and similar, my GLQuake port
> tends to look kinda like it was running on a PlayStation.
>
> Ironically, I think I may not be too far off from something like the
> Sega Saturn, given that most of the examples I had seen of Saturn games
> had very simplistic geometry and lots of pop-in at relatively short
> distances.
>
> Well, contrast with something like Mega-Man Legends (on the
> PlayStation), which had very simplistic geometry but often fairly large
> draw distances (in comparison).
>
> Like this game was like "Behold, this character has cubes for arms and
> hands!" or "behold this house, with its epic 6 polygons! No 3D modeled
> roof overhangs or windows here!"
>

I always end with the Crystal Dynamics TombRaider games for my nostalgia
trips, Lara has plenty of polgons, in all the right places, they even,
uhh jigg...

Ummm, better not finish that last sentence, lest the WokePC brigade are
watching! ;-)

>
> Then, there is GLQuake, which despite having "simple looking" geometry,
> a lot of this geometry is already cut into a fair number of small pieces
> by the BSP algorithm.
>
> FWIW: The dynamic tessellation actually only splits a relative minority
> of primitives, mostly limited to those a short distance from the camera.
>

I've always been tempted to write a game engine called the REPYES engine
- remember every polygon you've ever seen.

Basically a giant view direction and player position dependent database
that loads and frees polygons and each of their individual texture maps
to Vram from main memory, so that older laptops and such with weakish
GPUs can enjoy near maximum / lush visible poly counts as they work
their way through a game level.

But instead of using BSPs to figure it all out, I'd just brute-force
paint polygonIDs into the frame buffer, then trace over the buffer and
record exactly which polygons were visible; step view direction, step
position, rinse, repeat over all player-accessible regions of the game map.


But it'll probably never happen, cause, urrr, it's possibly quite a
stupid thing to do in practice I guess.

Yeah, best forget I mentioned that. :-)

EricP

Aug 7, 2022, 9:49:12 AM
Yes.
The instruction opcode gives the data type and size.
It is followed by 0 to 5 operand address mode specifiers.
The operand specifier byte can hold a 6 bit literal constant,
either unsigned integer or unsigned float(exp,frac)=(3,3).
Opspec could also be long form immediate data 1,2,4,8,16 bytes.

If a literal, it is converted to the opcode's float type (s,e,f):
F (1,8,23), D (1,8,55), G (1,11,52), H (1,15,96)
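If I remember the tables right, the VAX 6-bit short literal expands as (0.5 + f/16) * 2^e, giving 64 representable values from 0.5 up to 120. A sketch:

```python
def vax_short_literal(bits6):
    # 6-bit literal: high 3 bits exponent e, low 3 bits fraction f.
    # Expands to (0.5 + f/16) * 2**e, i.e. values 0.5 .. 120.
    e = (bits6 >> 3) & 7
    f = bits6 & 7
    return (0.5 + f / 16.0) * 2.0 ** e
```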

VAX static instruction stream stats show literal operands used between
10% and 18% of instructions, average 15.2%, was second to register 40%.
Interestingly, the longer immediate data format only occurs 2.4% of the time.
No data types are given.

- A Case Study of VAX-11 Instruction Set Usage For Compiler Execution 1982

- Characterization of Processor Performance in the VAX-11/780 1984

Quadibloc

Aug 7, 2022, 10:20:26 AM
I'm aware of that. But it also has variable-length instructions.

How can one have immediate values while still _also_ having the advantage that all
instructions are the same length, so that the CPU can just fetch instructions 256 bits at
a time, and start decoding all eight instructions in a block in parallel? Unless given
advance notice to ignore certain instruction slots in a block.

First, my Concertina II attempts tried to do this with a complicated scheme of block
headers - that provided other VLIW features as well, to try to make use of the big
overhead this imposed.

Now, I've come up with something that requires "no overhead", and which doesn't
complicate compilation by chopping up the instruction stream into pieces.

Not that it doesn't have disadvantages - by requiring seven delay slots for every
immediate instruction, in a way, unlike the Concertina II scheme, it's forcing each
immediate instruction to involve a pessimal restriction on possible branch targets,
whereas a block scheme usually doesn't restrict branch targets at all.

John Savard

MitchAlsup

Aug 7, 2022, 12:25:45 PM
On Sunday, August 7, 2022 at 9:20:26 AM UTC-5, Quadibloc wrote:
> On Saturday, August 6, 2022 at 8:19:02 PM UTC-6, MitchAlsup wrote:
> > On Saturday, August 6, 2022 at 8:38:21 PM UTC-5, Quadibloc wrote:
>
> > > That's certainly true. But these days, DRAM is so very slow that if one could handle _all_
> > > possible constants instead of just a fraction of them, it would be even better.
>
> > My 66000 allows for all FP constants (float and double) including NaNs with payloads
<
> I'm aware of that. But it also has variable-length instructions.
<
Yes, it has variable length instructions, but it has fixed width instruction specifiers.
<
Fixed width instruction specifier is the key to not screwing up the Decodability of the ISA.
<
All of the registers, all of the sizes, all of the modes, all of the operand routing is
in the instruction-specifier. The only variability is in the amount of constants attached
to the instruction specified.
<
This is a far cry from VAX and x86 and in line with IBM 360. VAX and x86 do serial parsing
of the instruction stream. IBM has a single instruction specifier (the first 16-bits) and
a series of additions based solely on the first 16-bits--this may not be true of system/Z
now, but was circa 1965.
<
It is better than RISC-V, where the position of the register specifiers
depends on whether you have a compressed or an uncompressed
instruction. My 66000 always has the
register specifiers in the same position. So, the 1-wide machine can always route
inst<4..0> to the Rs2 register port, inst<12..8> to the RS3 register port, and inst<26..22>
to the Rs1 register port. This saves 2-gates* of delay wrt RISC-V minimal implementations
with the compressed extension in getting from "instruction arrives" to bits into the
register file port decoder. (*) there is potential for more wire delay, here, also.
<
In addition, there are encodings where the register specifier is converted into a
signed 5-bit immediate. This enables a single instruction to do:
<
ADD Rd,#1,-Rs2
or
STW #3,[SP+1234]
>
> How can one have immediate values while still _also_ having the advantage that all
> instructions are the same length, so that the CPU can just fetch instructions 256 bits at
> a time, and start decoding all eight instructions in a block in parallel? Unless given
> advance notice to ignore certain instruction slots in a block.
<
A) you can't--mainly because you phrased the question improperly. You are not playing
both ends towards the middle.
<
B) what you can do is to PARSE everything in the instruction buffer as it arrives,
figuring out which containers contain instruction-specifiers and which
contain constants. In My 66000 this takes 4 gates of delay (31 total gates) to come up
with, instruction length, offset to immediate, offset to displacement. At this point (4
gates into the cycle) you can double your DECODE width every added gate of delay
(up to 16 instruction where it starts taking 2 gates to double your DECODE width).
>
> First, my Concertina II attempts tried to do this with a complicated scheme of block
> headers - that provided other VLIW features as well, to try to make use of the big
> overhead this imposed.
>
> Now, I've come up with something that requires "no overhead", and which doesn't
> complicate compilation by chopping up the instruction stream into pieces.
<
My 66000 ISA does not chop the instruction stream into pieces and accomplishes all
of what you desire (with respect to constants).
>
> Not that it doesn't have disadvantages - by requiring seven delay slots for every
> immediate instruction, in a way, unlike the Concertina II scheme, it's forcing each
> immediate instruction to involve a pessimal restriction on possible branch targets,
> whereas a block scheme usually doesn't restrict branch targets at all.
<
My 66000 does not have those disadvantages, either--and is essentially orthogonal
to the compiler.
>
> John Savard

BGB

Aug 7, 2022, 5:58:38 PM
On 8/7/2022 6:02 AM, Andy wrote:
> On 6/08/22 18:46, BGB wrote:
>> On 8/5/2022 8:29 PM, Andy wrote:
>
>>> I'm assuming you've implemented some kind of deferred shading tile
>>> renderer?, since block-rams would seem to be the perfect fit for tile
>>> buffers if they aren't too oddly sized.
>>>
>>
>> No, errm, TKRA-GL is a software renderer implementing the OpenGL API
>> on the front end...
>
> So, pretty much like the standard Z-buffer software scanline renderer
> one would write on a PC then?
>

Pretty much.

The same code can build and run on a PC as well, albeit with a slight
disadvantage there, since it lacks a few special features I have in my
ISA. Despite my PC having 74x the clock speed, it only seems to pull
off around 20x the fill rate (taking roughly 4x as many clock-cycles per
pixel).


> They invented the PowerVR style tile renderer for a reason, so may as
> well steal from the best. :-)
>
> I'm guessing it might help boost your frame-rates for pretty much the
> same reasons, -- the small internal tile buffer negates the need to
> read/write to dram for every buffer lookup/update.
>

It would mostly help if one is throwing multiple cores at the problem.
Some of my past experiments with multi-threaded renders had split the
screen half or into quarters, with one thread working on each part of
the screen.

Though, my current rasterizer is single threaded.

Originally, it was intended to be dual threaded, but ran into a problem
when I started exceeding the resource budgets needed for doing dual core
on the FPGA I am using, and the emphasis shifted to trying to make a
single core run fast.



L1/L2 misses from the raster drawing part aren't too bad IME.

Texture related misses were a bigger issue, which is part of why I am
using Morton ordering when possible.

A 256x256 texture has more texels than a 320x200 framebuffer has pixels.


Splitting up geometry and then drawing each tile sequentially is not as
likely to be helpful.

There is a possibility of a slight advantage to drawing geometry
Z-buffer-only first, and then going back and drawing surfaces.

This would matter a lot more for doing a lot of blending or possibly if
running shaders, since in this case per pixel cost is a bigger issue.


I already have some special cases where geometry hidden behind
previously drawn geometry will be culled.

Some of the span drawing loops have "Z-fail sub-loop" special cases:
If the first pixel would be Z-Fail, go into a Z-fail loop;
If we hit a pixel that is Z-Pass, branch back into Z-Pass loop.

The Z-Fail sub-loop simply updates the state variables and checks for
Z-Pass.

The Z-Pass loops generally use predication for Z test, though another
possible (valid) design would be to have the Z-Pass loop branch back
into the Z-Fail loop on Z-Fail.
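In scalar pseudocode form (Python here for clarity; the real loops are unrolled ASM with predication), the structure is roughly:

```python
def draw_span(zbuf, cbuf, x0, n, z, dz, color):
    """Span loop with a cheap Z-fail sub-loop (illustrative sketch).

    The Z-fail sub-loop only steps the interpolators and tests for
    Z-pass; it does no buffer writes.
    """
    x, end = x0, x0 + n
    while x < end:
        if z < zbuf[x]:
            # Z-pass loop: write depth and color while the test passes.
            while x < end and z < zbuf[x]:
                zbuf[x] = z
                cbuf[x] = color
                x += 1
                z += dz
        else:
            # Z-fail sub-loop: just advance state until Z-pass again.
            while x < end and z >= zbuf[x]:
                x += 1
                z += dz
```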



> And two or more scans over the tile buffer lets you figure out exactly
> which pixels of the polygons that are visible need to be shaded and
> textured, versus the possibly not insignificant overdraw a z-buffer
> renderer might waste time and bandwidth on.
>
> Of course ripping up and changing your existing core probably isn't
> something you'd happily contemplate, so take my suggestion with a large
> grain of salt. ;-)
>

Yeah.

Also it isn't likely to offer a huge advantage in this case (with a
software renderer).

Also, possibly counter-intuitively, more time currently goes into the
transform stages than into the raster-drawing parts.


So for GLQuake, time budget seems to be, roughly, say:
~ 50%, Quake itself;
~ 38%, transform stages
~ 4%, Edge Walking
~ 12%, Span Drawing

Part of the reason for the transform stage cost is that GLQuake draws a
fair number of primitives that end up being culled.


Say, for example, Quake tries to draw 1000 primitives in a frame;
fragmentation adds another 300, 900 get culled, and the remaining 400
are drawn.

Typically, the majority are culled due to frustum culling, and also a
fair number due to backface and Z-occlusion checks.



>
> <snip>
>
>> For a triangle, it walks from the top vertex to the middle vertex,
>> then recalculates the step values and goes from the middle to the bottom.
>>
>> For a quad, currently it draws it as two triangles (0,1,2; 0,2,3).
>> In earlier stages, triangles and quads are treated as two different
>> primitive types.
>>
>> Trying to draw a polygon primitive results in it being decomposed into
>> quads and triangles.
>
> So why'd you deprecate the quads as primitives?
>

Originally it only did triangles internally, but then I added quads as a
special case in the transform stages:
Projecting 4 vertices costs less than projecting 6;
They tessellate more efficiently;
...

This stops when it gets to the final rasterization stages, mostly
because the general-case logic for walking the edges of a quad is more
complicated than for a triangle (so didn't bother implementing it).

So, the "WalkEdgesQuad" function makes two calls to the
WalkEdgesTriangle function...

There are a few special cases that could probably be handled without too
much complexity, but they don't really happen that often in 3D scenery.


> I've been looking through the source code of Core Design's
> un-released/developed game 'TombRaider Anniversary Edition' (not to be
> confused with the Crystal Dynamics game which did get released),
>
> there's plenty of quad polygon handling functions to be seen, which
> backs up my intuition that tri's and quads should be handled equally
> well if at all possible.
>

Wasn't aware of any of the code for any of the Tomb Raider games having
been released, but then again I wasn't really much into Tomb Raider.


Back when I was much younger (middle school), my brother had a
PlayStation, a few games I remember on it:
Mega-Man Legends;
Mega-Man X4;
Final Fantasy VII;
Xenogears;
Silent Hill;
Crash Bandicoot;
...

There was also a demo CD with demo versions of games like Spyro the
Dragon and Tomb Raider and similar.

By high-school, he had a Dreamcast and the Sonic Adventure games and
similar, along with a PlayStation 2 with games like Grand Theft Auto 3
and similar.


I had been mostly into PC stuff at the time (Quake 1/2, etc).

I had preferred Quake 1 over Quake 2, as while Quake 2 had a few things
going for it (a more hub-world like structure), Quake 1 was more
interesting. In high-school it was mostly Half-Life. I think by the time
I was taking college classes, was mostly poking around in Half-Life 2
and Garry's Mod, then Portal came out, ...

Well, until I started messing around with Minecraft, not really done
much else in gaming much past this point.


Contrary to some people, I suspect the HL/HL2 era is when graphics got
"good enough", despite newer advances in terms of rendering technology,
the "better graphics" don't really improve the gameplay experience all
that much.

One of the more interesting recent developments is real-time ray-tracing.


However, it is not so easy to write a ray-tracer that is competitive
with a raster renderer. I had experimented (not too long ago) with
implementing a software ray-tracer in the Quake engine, but it fell well
short of usable performance.

This was based on a modified version of Quake's normal line-tracing
code, which also had an issue that the world (as seen by line-tracing)
is not exactly the same as what is seen when drawing it as geometry
(there are a lot of "overhangs" in the BSP where the line-trace thinks
it has hit something solid, but there is no actual geometry there).

Also, line tracing the BSP is a lot slower than one might expect...

Had at one point also tried using line-traces to try to further cull
down the geometry in the PVS, but this turned out to be slower than just
drawing the geometry directly and letting GL sort it out.


IME, line-tracing over a regular grid structure (or an octree) tends to
be more efficient than doing so via a recursive BSP walk (an octree
based engine likely being more efficient if one wants to implement a
ray-cast or ray-tracing renderer).

But, OTOH, doing a modified version of Quake where I rebuild all of the
maps from the map source (with custom tools) using an octree and similar
rather than a BSP, is probably "not really worth it".


>
> <more snip>
>
>>
>> I could make stuff look a lot better, but this would require:
>>    Using RGBA32 buffers and textures;
>
> 16bit textures shaded into 24bit frame buffers was the standard for a
> while when memories were smaller weren't they? seemed to be acceptable
> for the time and probably not to shabby for the retro gaming inclined
> today either.
>

Probably, I am using RGB555A, which can sorta mimic RGBA32 and (on
average) looks better than RGBA4444 or similar.


RGBA32 for a framebuffer can look better, but using it would be kind of
a waste when the output framebuffer is using RGB555. And, on the FPGA
board I am using, the VGA output only has 4 bits per component, so even
the RGB555 output is effectively using a Bayer dither in this case.

Though, I did come up with a trick to mostly hide the Bayer dither by
rotating the dither mask for each VGA refresh.

Granted, it is possible that RGB888 could still offer some level of
benefit over RGB555 here.


Had considered going over to Z24.S8 buffers, but would have needed to
rewrite a lot of my span-drawing functions to use it (all those written
to assume a 16-bit Z-Buffer), and a 32-bit Z+Stencil buffer would be
kind of a waste if stencil is used infrequently (the Quake games don't
use any stencil effects).

Had also looked into Z12.S4, but this would cause unacceptable levels of
Z-fighting in my tests. This led to the stencil cases to use a separate
stencil buffer.


>
>
>>    Not doing as much corner cutting;
>>    Generally spending a lot more clock cycles on the 3D rendering;
>>    ...
>>
>> It is pretty hard to try to get playable GLQuake on something running
>> at 50MHz without a GPU.
>>
>> Like, if there is one big advantage that the PlayStation had, it was
>> that it had a GPU.
>
> Maybe there's a hint to be had there...
>
> Something like a big-little multi core design,
> your large but leaner WEX core handling all the game input, camera &
> object updates, then vector style churning though all the floating point
> geometry to leave an array of Z-sorted integer polygons that can be fed
> to a number of tiny 16/24bit risc/misc like cores to render into however
> many spare tile buffers you can fit into your fpga.
>
> And by tiny, I mean like a 16bit 6502 with half the instruction set
> thrown out, if an instruction doesn't aid in placing a texture sampled
> pixel into the tile buffer --- axe it!
>

It is possible, though if I were to try to fit TKRA-GL to it, it would
likely mean cores that were more like:
2-wide with 64-bit Packed-Integer SIMD;
Probably still needing a 32-bit address space.

Being able to deal with transforms would still likely require
FP-SIMD, but could be reduced to the S.E8.F16 form (possibly with Lanes
1 and 2 operating independently). Could potentially omit support for
Binary64 FPU ops.


Some of my previously considered GPU related features, I had back-ported
to BJX2 as "proof of concept".

It is possible that the "GPU" could be running a more restricted BJX2
subset.


Not too long ago, I had considered another possibility:
BJX2 core is used as the GPU;
I add a secondary "CPU" core, mostly running RV64 or similar.
Say, the GLQuake engine-side logic is run on an RISC-V core.

Though, my attempts at RISC-V cores have thus far ended up more
expensive than ideal (a full RV64G core would end up costing *more*
than another BJX2 core), and even making it single-issue isn't
really enough to compensate for this.


> I'm thinking a small quantity of tiny independent cores working
> simultaneously might work better over-all than one big complex core
> trying to do it all. YMMV
>

Only if the work can be split up to use the cores efficiently.

With the current balance, unless these cores could also handle vertex
transform and similar, they won't save much.


A bigger savings would likely be possible with a redesigned API,
possibly:
Rendering state configuration is folded into "objects";
Front-end interface mostly uses fixed-point or similar;
...


>
>
>> Ironically, due to the affine texturing and similar, my GLQuake port
>> tends to look kinda like it was running on a PlayStation.
>>
>> Ironically, I think I may not be too far off from something like the
>> Sega Saturn, given that most of the examples I had seen of Saturn
>> games had very simplistic geometry and lots of pop-in at relatively
>> short distances.
>>
>> Well, contrast with something like Mega-Man Legends (on the
>> PlayStation), which had very simplistic geometry but often fairly
>> large draw distances (in comparison).
>>
>> Like this game was like "Behold, this character has cubes for arms and
>> hands!" or "behold this house, with its epic 6 polygons! No 3D modeled
>> roof overhangs or windows here!"
>>
>
> I always end with the Crystal Dynamics TombRaider games for my nostalgia
> trips, Lara has plenty of polygons, in all the right places, they even,
> uhh jigg...
>
> Ummm, better not finish that last sentence, lest the WokePC brigade are
> watching! ;-)
>

Kinda curious that they did this back then, whereas many later games
(Half-Life 2, Doom 3, etc) didn't bother with these sorts of effects
(they probably could have if they wanted as special cases of ragdoll
within their skeletal animation systems).

Many newer games apparently also do things like soft-body physics and
cloth simulation and similar.


I guess I am in the category of having a certain level of sentimentalism
for characters like Tron Bonne and similar (from the Mega-Man Legends
series), though in part because of finding her character relatable.


>>
>> Then, there is GLQuake, which despite having "simple looking"
>> geometry, a lot of this geometry is already cut into a fair number of
>> small pieces by the BSP algorithm.
>>
>> FWIW: The dynamic tessellation actually only splits a relative
>> minority of primitives, mostly limited to those a short distance from
>> the camera.
>>
>
> I've always been tempted to write a game engine called the REPYES engine
> - remember every polygon you've ever seen.
>
> Basically a giant view direction and player position dependent database
> that loads and frees polygons and each of their individual texture maps
> to Vram from main memory, so that older laptops and such with weakish
> GPUs can enjoy near maximum / lush visible poly counts as they work
> their way through a game level.
>
> But instead of using BSPs to figure it all out, I'd just brute force
> paint polygonIDs into the frame buffer then trace over the buffer and
> record exactly which polygons were visible, step view direction, step
> position, rinse, repeat over all player accessible regions of the game map.
>
>
> But it'll probably never happen, cause, urrr, it's possibly quite a
> stupid thing to do in practice I guess.
>
> Yeah, best forget I mentioned that. :-)
>

OK.


In one of my more recent experimental engines (with a Minecraft-style
terrain system), I basically had the camera cast out rays in every
direction and then build a list of visible blocks (recorded as their
world coordinates).

The renderer then draws all of the blocks on this list.


This approach is faster and saves memory on BJX2 when compared with the
"build and draw a vertex array for every chunk" approach.

However, it doesn't scale as well with draw distance, and on a PC with a
GPU, it is faster to use per-chunk vertex arrays and occlusion queries
(however, using ray-casts to build block lists still uses less RAM).


Performance on my BJX2 core was comparable to Quake, but it can do
outdoors environments (nevermind the limited draw distance in this case).


Ironically, despite running on a 50MHz CPU core, performance still
somehow manages to be better than Minecraft with a similar draw distance
on a Vista era laptop.



Doing something vaguely similar, but with ray-casting over an octree,
also seems possible (where the octree would keep dividing geometry
until it reaches a certain maximum number of polygons).


Unlike Quake, by using a few Minecraft style tricks, it would also be
possible to extend it to arbitrarily large environments. Say, the
top-level world is split up into a grid of 256 meter cube-shaped
regions, with each cube divided into an octree (each region being
roughly the size of a Quake 2 map).

Quadibloc

unread,
Aug 7, 2022, 7:30:15 PMAug 7
to
On Sunday, August 7, 2022 at 10:25:45 AM UTC-6, MitchAlsup wrote:

> My 66000 does not have those disadvantages, either--and is essentially orthogonal
> to the compiler.

That is a respect, then, in which your design is far superior to any of mine. In
order to squeeze more features into my ISA, orthogonality has been almost
always the first thing I threw out the window.

John Savard

luke.l...@gmail.com

unread,
Aug 7, 2022, 7:38:45 PMAug 7
to
On Thursday, August 4, 2022 at 9:28:24 PM UTC+1, MitchAlsup wrote:
> Does anyone have a reference where a group of people measured the
> percentage of floating point operands that are constant/immediate.

clearly it'll very much depend on the target workload, so for example
the imdct36 function of ffmpeg for MP3 requires quite a lot of magic
FP constants.

this and 3D was enough without needing details to go for two
fp-const instructions

we decided to propose fmvis - float move immediate - and fishmv
(float immediate second half move) which adds the second half
of an FP32

https://libre-soc.org/openpower/sv/int_fp_mv/#fmvis

the nice thing about fmvis is, the 16-bit immediate is a BF16 and
drops nicely into place as an FP32 or FP64.

the nice thing about fishmv is, the 16-bit immediate is just
the lower 16 bits of a FP32 mantissa.

no need for variable-length-encoded instructions unless you
already have 48-bit encoding.

l.

BGB

unread,
Aug 7, 2022, 7:42:35 PMAug 7
to
On 8/7/2022 11:25 AM, MitchAlsup wrote:
> On Sunday, August 7, 2022 at 9:20:26 AM UTC-5, Quadibloc wrote:
>> On Saturday, August 6, 2022 at 8:19:02 PM UTC-6, MitchAlsup wrote:
>>> On Saturday, August 6, 2022 at 8:38:21 PM UTC-5, Quadibloc wrote:
>>
>>>> That's certainly true. But these days, DRAM is so very slow that if one could handle _all_
>>>> possible constants instead of just a fraction of them, it would be even better.
>>
>>> My 66000 allows for all FP constants (float and double) including NaNs with payloads
> <
>> I'm aware of that. But it also has variable-length instructions.
> <
> Yes, it has variable length instructions, but it has fixed width instruction specifiers.
> <
> Fixed width instruction specifier is the key to not screwing up the Decodability of the ISA.
> <
> All of the registers, all of the sizes, all of the modes, all of the operand routing is
> in the instruction-specifier. The only variability is in the amount of constants attached
> to the instruction specified.

In my case, current scheme still mostly:
(15:12):
E, F: Sz=1;
7, 9: Sz=1 (if XGPR extension exists, UD otherwise);
Else: Sz=0.
(15:8):
EA/EB, EE/EF, F4..F7, FC..FF: Wx=1
Else: Wx=0

Then, say, the 96 bit bundle is split into bits for 3 dwords:
Sz0, Wx0, Sz1, Wx1, Sz2
0zzzz: 16-bit
10zzz: 32-bit
110zz: 48-bit (unused)
1110z: 64-bit
11110: 80-bit (unused)
11111: 96-bit

Though, the WXE and WX2 bits also influence this.
00: Wx is always 0.
01: Sz=((1:0)==11), Wx=0 (RISC-V Mode)
10: Scheme above.
11: Reserved


This replaced a scheme used in an earlier form of the ISA:
(15:13)==111: 32+
(12:10)==111: 48+
(9:8)=11: 64

This encoding was dropped, with FC/FD becoming 32-bit (WEX'ed F8/F9
blocks), and FE/FF becoming the Jumbo Prefixes.


> <
> This is a far cry from VAX and x86 and in line with IBM 360. VAX and x86 do serial parsing
> of the instruction stream. IBM has a single instruction specifier (the first 16-bits) and
> a series of additions based solely on the first 16-bits--this may not be true of system/Z
> now, but was circa 1965.

Yeah; x86 and VAX would be a pain here.


> <
> It is better than RISC-V where where the registers are depends on whether you have
> compressed instruction or a uncompressed instruction.

RISC-V C:
Looks straightforward enough on the surface.
Look a little deeper, it is a dog-chewed mess (even worse than Thumb).


> My 66000 always has the
> register specifiers in the same position. So, the 1-wide machine can always route
> inst<4..0> to the Rs2 register port, inst<12..8> to the RS3 register port, and inst<26..22>
> to the Rs1 register port. This saves 2-gates* of delay wrt RISC-V minimal implementations
> with the compressed extension in getting from "instruction arrives" to bits into the
> register file port decoder. (*) there is potential for more wire delay, here, also.
> <
> In addition, there are encodings where the register specifier is converted into a
> signed 5-bit immediate. This enables a single instruction to do:
> <
> ADD Rd,#1,-Rs2
> or
> STW #3,[SP+1234]

Hit or miss in my case.

Registers are "mostly stable", albeit a few ports may move around in
decoding (depending on the instruction), and a few blocks (such as the
F8 block) use different bits for the destination register.

Part of the reason the F8 block is awkward is that I wanted to keep the
Imm16 part contiguous, but there was also no good way to do this while
also keeping other parts of the encoding consistent.

There are a few consistency issues within the 16-bit space as well.

Could be better, could be worse.


RISC-V's 32-bit encodings keep registers more consistent at the cost of
turning immediate values and displacements into a chewed up mess.

>>
>> How can one have immediate values while still _also_ having the advantage that all
>> instructions are the same length, so that the CPU can just fetch instructions 256 bits at
>> a time, and start decoding all eight instructions in a block in parallel? Unless given
>> advance notice to ignore certain instruction slots in a block.
> <
> A) you can't--mainly because you phrased the question improperly. You are not playing
> both ends towards the middle.
> <
> B) what you can do is to PARSE everything in the instruction buffer as it arrives, so that
> figuring out which containers contain instruction-specifiers and which containers
> contain constants. In My 66000 this takes 4 gates of delay (31 total gates) to come up
> with, instruction length, offset to immediate, offset to displacement. At this point (4
> gates into the cycle) you can double your DECODE width every added gate of delay
> (up to 16 instruction where it starts taking 2 gates to double your DECODE width).

In my case, extended constant bits are held in jumbo prefixes, which are
effectively treated as NOP (and mostly special in that the payload bits
are routed into the adjacent decoder).


>>
>> First, my Concertina II attempts tried to do this with a complicated scheme of block
>> headers - that provided other VLIW features as well, to try to make use of the big
>> overhead this imposed.
>>
>> Now, I've come up with something that requires "no overhead", and which doesn't
>> complicate compilation by chopping up the instruction stream into pieces.
> <
> My 66000 ISA does not chop the instruction stream into pieces and accomplishes all
> of what you desire (with respect to constants).
>>
>> Not that it doesn't have disadvantages - by requiring seven delay slots for every
>> immediate instruction, in a way, unlike the Concertina II scheme, it's forcing each
>> immediate instruction to involve a pessimal restriction on possible branch targets,
>> whereas a block scheme usually doesn't restrict branch targets at all.
> <
> My 66000 does not have those disadvantages, either--and is essentially orthogonal
> to the compiler.

Delay slots are a bad idea IMO.


Also better IMO to keep instructions in a format where they at least
"make sense" as a sequentially executed instruction scheme.

Say, well-formed code in BJX2 has as a requirement that one can ignore
WEX and execute stuff sequentially, and it should still produce the same
results as the bundled version (Or, IOW: If bundled and sequential
execution produce different results, the code is broken).

...


>>
>> John Savard

MitchAlsup

unread,
Aug 7, 2022, 8:43:32 PMAug 7
to
On Sunday, August 7, 2022 at 6:38:45 PM UTC-5, luke.l...@gmail.com wrote:
> On Thursday, August 4, 2022 at 9:28:24 PM UTC+1, MitchAlsup wrote:
> > Does anyone have a reference where a group of people measured the
> > percentage of floating point operands that are constant/immediate.
<
> clearly it'll very much depend on the target workload, so for example
> the imdct36 function of ffmpeg for MP3 requires quite a lot of magic
> FP constants.
>
> this and 3D was enough without needing details to go for two
> fp-const instructions
>
> we decided to propose fmvis - float move immediate - and fishmv
> (float immediate second half move) which adds the second half
> of an FP32
>
> https://libre-soc.org/openpower/sv/int_fp_mv/#fmvis
>
> the nice thing about fmvis is, the 16-bit immediate is a BF16 and
> drops nicely into place as an FP32 or FP64.
>
> the nice thing about fishmv is, the 16-bit immediate is just
> the lower 16 bits of a FP32 mantissa.
<
Luke, can you take a body of floating point source, compile
it to assembly (or similar), and count the number of FP operands,
and count the number of FMVis + FISHMV, so we could get a
percentage of the number of floating point constants in that body
of code ??
>
> no need for variable-length-encoded instructions unless you
> already have 48-bit encoding.
<
The question was not about Variable Length ISAs, but the percentage
of floating point constants that survive all optimizations and are "in"
the object code.
>
> l.
<
Thanks, BTW.

MitchAlsup

unread,
Aug 7, 2022, 8:54:48 PMAug 7
to
On Sunday, August 7, 2022 at 6:42:35 PM UTC-5, BGB wrote:
> On 8/7/2022 11:25 AM, MitchAlsup wrote:
> > On Sunday, August 7, 2022 at 9:20:26 AM UTC-5, Quadibloc wrote:
> >> On Saturday, August 6, 2022 at 8:19:02 PM UTC-6, MitchAlsup wrote:
> >>> On Saturday, August 6, 2022 at 8:38:21 PM UTC-5, Quadibloc wrote:
> >>
> >>>> That's certainly true. But these days, DRAM is so very slow that if one could handle _all_
> >>>> possible constants instead of just a fraction of them, it would be even better.
> >>
> >>> My 66000 allows for all FP constants (float and double) including NaNs with payloads
> > <
> >> I'm aware of that. But it also has variable-length instructions.
> > <
> > Yes, it has variable length instructions, but it has fixed width instruction specifiers.
> > <
> > Fixed width instruction specifier is the key to not screwing up the Decodability of the ISA.
> > <
> > All of the registers, all of the sizes, all of the modes, all of the operand routing is
> > in the instruction-specifier. The only variability is in the amount of constants attached
> > to the instruction specified.
<snip>
But the addition of compressed instructions changes where the register
specifiers are found in the compressed versus uncompressed instructions !!!
> >>
<snip>
> > <
> > My 66000 does not have those disadvantages, either--and is essentially orthogonal
> > to the compiler.
> Delay slots are a bad idea IMO.
>
I learned how bad they were on my* Mc 88100 ISA.
(*) yes, mine.
>
> Also better IMO to keep instructions in a format where they at least
> "make sense" as a sequentially executed instruction scheme.
<
In memory, I completely agree. Once fetched into a processor, you are no
longer bound by this criterion, and can exploit position rearrangement
so long as you can annotate register and memory ordering dependencies.
With such relaxation, the code sequence to swap 2 register values:
<
MOV Rt,Rx
MOV Rx,Ry
MOV Ry,Rt
<
can be rearranged into:
<
MOV Rx,Ry; MOV Ry,Rx
<
such that they can be performed at the same time with no remaining sequential
dependencies. And if Rt is overwritten within the horizon of the peep-hole
optimizer, that MOV can be totally eliminated.
<
Why this works:: The right-hand sides of both MOV instructions get the current
rename values of Rx and Ry respectively, while the left-hand sides get new register
names from the renamer.
<
More exotic "decoders" might perform the MOVs in the renamer itself.
Presto, zero cycle swaps.