Reconsidering Variable-Length Operands


Quadibloc

Jun 5, 2022, 6:57:07 PM
I had been of the opinion that a computer which addressed 12-bit units of storage could offer optimum sizes of floating-point numbers...

36 bits for single precision, since 32 bits was too short;
60 bits for double precision, since double precision is excessively and artificially long, and the Control Data 6600 proved that 60 bits was enough;
and 48 bits to give ten digits, which is shown to be the most common desired
scientific precision both by the HP 35 and by many old mathematical tables.

However, while many 24-bit computers used two words both for single
precision and double precision, just leaving some bits unused for single
precision, I knew that one 24-bit computer, the British ICL 1900 computer,
used 48 bits for single precision, and 96 bits for double precision.

Ah, well, one strange outlier. I thought nothing of it.

But recently I took another look at the Control Data 1604A, and the Control
Data 6600. And I found that the 1604A used 96 bits for double precision,
and the 6600 used 120 bits for double precision!

That pretty much explodes my thinking that 60 bits is enough for anyone;
apparently longer precisions are indeed of some valid use - it could also
explain why IBM brought in 128-bit Extended Precision reals with the
360/85 as well; 120-bit double precision on the 6600 could have proven to
be *so* useful and valuable that IBM needed to compete with it... it wasn't
just for show and bragging rights!

This inspires me to consider another direction, if I still want to have some
flexibility in lengths beyond powers of two.

The 1604 had a 48-bit word length, but its instructions were 24 bits long.

If only 24 bits are needed for instructions, and 36 bits is a reasonable
starting size for floating-point numbers...

Let us consider... a computer with a 72-bit bus to memory, which
addresses 36-bit halfwords in order to access the smallest size of
floating-point number... but which packs three 24-bit instructions in a 72-
bit word. So you can only jump to every third instruction.

The PDP-8 implemented hardware floating-point by means of the Floating
Point Processor, which could be started going by an I/O instruction from
the PDP-8... and which had its _own_ program counter. It could share
cycles with the PDP-8 or it could take over complete control of the
memory bus.

Recently, I saw an ad for one of the old 24-bit machines - I forget which
one - that bragged that it did its hardware floating-point option in the
PDP-8 way, instead of by adding floating-point instructions to its own
instruction set!

While this seemed odd to me, since it seems like bragging that one's
computer is *less* powerful, it suggested something. Since auxiliary
processors are more widespread than I thought... why not make an
auxiliary processor for decimal and string instructions that could be
tacked on to a scientific computer, sharing the same memory with it?

John Savard

Quadibloc

Jun 5, 2022, 7:40:28 PM
On Sunday, June 5, 2022 at 4:57:07 PM UTC-6, Quadibloc wrote:

> Recently, I saw an ad for one of the old 24-bit machines - I forget which
> one - that bragged that it did its hardware floating-point option in the
> PDP-8 way, instead of by adding floating-point instructions to its own
> instruction set!

It was the SEL 840 that did that.

John Savard

Ivan Godard

Jun 5, 2022, 7:43:59 PM
On 6/5/2022 3:57 PM, Quadibloc wrote:
> I had been of the opinion that a computer which addressed 12-bit units of storage could offer optimum sizes of floating-point numbers...
>
> 36 bits for single precision, since 32 bits was too short;
> 60 bits for double precision, since double precision is excessively and artificially long, and the Control Data 6600 proved that 60 bits was enough;
> and 48 bits to give ten digits, which is shown to be the most common desired
> scientific precision both by the HP 35 and by many old mathematical tables.
>
> However, while many 24-bit computers used two words both for single
> precision and double precision, just leaving some bits unused for single
> precision, I knew that one 24-bit computer, the British ICL 1900 computer,
> used 48 bits for single precision, and 96 bits for double precision.

So do the Burroughs (Unisys) mainframes. Memory path was 51 bits in the
B6500, 48 data and 3 tag.

Quadibloc

Jun 5, 2022, 7:50:07 PM
It turns out that EAU instructions on the SEL 840 were part of the SEL 840
instruction set; an EAU instruction could overlap with the next non-EAU
instruction on the SEL 840; that was the extent of its independence. So it
wasn't an auxiliary computer the way the FPP was.

John Savard

Bill Findlay

Jun 5, 2022, 8:28:43 PM
On 5 Jun 2022, Quadibloc wrote
(in article<b0b53c16-1229-449c...@googlegroups.com>):

> I had been of the opinion that a computer which addressed 12-bit units of
> storage could offer optimum sizes of floating-point numbers...
>
> 36 bits for single precision, since 32 bits was too short;
> 60 bits for double precision, since double precision is excessively and
> artificially long, and the Control Data 6600 proved that 60 bits was enough;
> and 48 bits to give ten digits, which is shown to be the most common desired
> scientific precision both by the HP 35 and by many old mathematical tables.
>
> However, while many 24-bit computers used two words both for single
> precision and double precision, just leaving some bits unused for single
> precision, I knew that one 24-bit computer, the British ICL 1900 computer,
> used 48 bits for single precision, and 96 bits for double precision.

KDF9 had 48 bit reals and 96 bit doubles.

> While this seemed odd to me, since it seems like bragging that one's
> computer is *less* powerful, it suggested something. Since auxiliary
> processors are more widespread than I thought... why not make an
> auxiliary processor for decimal and string instructions that could be
> tacked on to a scientific computer, sharing the same memory with it?

Have you reinvented the IBM 8000?

--
Bill Findlay

MitchAlsup

Jun 5, 2022, 9:17:54 PM
On Sunday, June 5, 2022 at 5:57:07 PM UTC-5, Quadibloc wrote:
> I had been of the opinion that a computer which addressed 12-bit units of storage could offer optimum sizes of floating-point numbers...
>
> 36 bits for single precision, since 32 bits was too short;
> 60 bits for double precision, since double precision is excessively and artificially long, and the Control Data 6600 proved that 60 bits was enough;
> and 48 bits to give ten digits, which is shown to be the most common desired
> scientific precision both by the HP 35 and by many old mathematical tables.
>
> However, while many 24-bit computers used two words both for single
> precision and double precision, just leaving some bits unused for single
> precision, I knew that one 24-bit computer, the British ICL 1900 computer,
> used 48 bits for single precision, and 96 bits for double precision.
>
> Ah, well, one strange outlier. I thought nothing of it.
>
> But recently I took another look at the Control Data 1604A, and the Control
> Data 6600. And I found that the 1604A used 96 bits for double precision,
> and the 6600 used 120 bits for double precision!
>
> That pretty much explodes my thinking that 60 bits is enough for anyone;
> apparently longer precisions are indeed of some valid use - it could also
> explain why IBM brought in 128-bit Extended Precision reals with the
> 360/85 as well; 120-bit double precision on the 6600 could have proven to
> be *so* useful and valuable that IBM needed to compete with it... it wasn't
> just for show and bragging rights!
<
Yes, it was well known (in the time of Stretch) that there were some kinds
of calculations where one needed more than 60-64 bits to have any reasonable
kind of resulting precision. What is surprising to me is that the early efforts
(without the ever-growing exponent of IEEE) were seen as completely
successful.
>
> This inspires me to consider another direction, if I still want to have some
> flexibility in lengths beyond powers of two.
>
> The 1604 had a 48-bit word length, but its instructions were 24 bits long.
>
> If only 24 bits are needed for instructions, and 36 bits is a reasonable
> starting size for floating-point numbers...
<
One of your constant points was to provide an ISA that had all those data types
and instructions to manipulate them. You will find that your added criterion
will not allow for 24-bit instructions--except in the most simple sense.
>
> Let us consider... a computer with a 72-bit bus to memory, which
> addresses 36-bit halfwords in order to access the smallest size of
> floating-point number... but which packs three 24-bit instructions in a 72-
> bit word. So you can only jump to every third instruction.
<
In today's fabrication technology, I am more in the mood for a 512-bit
"bus" to memory (cache line in a single beat). Heck, once you get 16-32
cores on a die you are going to need this kind of BW.
>
> The PDP-8 implemented hardware floating-point by means of the Floating
> Point Processor, which could be started going by an I/O instruction from
> the PDP-8... and which had its _own_ program counter. It could share
> cycles with the PDP-8 or it could take over complete control of the
> memory bus.
>
> Recently, I saw an ad for one of the old 24-bit machines - I forget which
> one - that bragged that it did its hardware floating-point option in the
> PDP-8 way, instead of by adding floating-point instructions to its own
> instruction set!
<
You have just discovered "attached processors" (a coprocessor is fed
from the CPU instruction stream, an attached processor is fed the starting
address of a Kernel). In some ways a vector processor is an attached
processor, but the kernel is 1 instruction long, consisting of 64 units
of work.
>
> While this seemed odd to me, since it seems like bragging that one's
> computer is *less* powerful, it suggested something. Since auxiliary
> processors are more widespread than I thought... why not make an
> auxiliary processor for decimal and string instructions that could be
> tacked on to a scientific computer, sharing the same memory with it?
<
In some ways VVM is like making something akin to a vector attached
processor (except for the precise interrupts--which is always a snag at
the programming language level.)
>
> John Savard
<
I still don't think you have progressed to the point where you realize
architecture is just about as much about what to leave out as what
to put in.

BGB

Jun 6, 2022, 4:38:29 AM
On 6/5/2022 5:57 PM, Quadibloc wrote:
> I had been of the opinion that a computer which addressed 12-bit units of storage could offer optimum sizes of floating-point numbers...
>
> 36 bits for single precision, since 32 bits was too short;
> 60 bits for double precision, since double precision is excessively and artificially long, and the Control Data 6600 proved that 60 bits was enough;
> and 48 bits to give ten digits, which is shown to be the most common desired
> scientific precision both by the HP 35 and by many old mathematical tables.
>
> However, while many 24-bit computers used two words both for single
> precision and double precision, just leaving some bits unused for single
> precision, I knew that one 24-bit computer, the British ICL 1900 computer,
> used 48 bits for single precision, and 96 bits for double precision.
>
> Ah, well, one strange outlier. I thought nothing of it.
>

My current list of primary floating point types is:
Half: S.E5.F10 / Binary16
Single: S.E8.F23 / Binary32
Double: S.E11.F52 / Binary64

Secondary:
FP8: E4.F4, S.E4.F3, E4.F3.S (RGB30A)
FP24: S.E8.F15[.Z8] (32-bit native storage, semantic edge case)
FP96: S.E15.F80[.Z32] (128-bit native storage)
Quad: S.E15.F112 / Binary128

The primary formats have dedicated ISA-level operators (Scalar or
Vector), whereas the secondary formats only exist via converter ops or
edge cases.

Though, "proper" scalar ops only exist for Binary64, with both Binary32
and Binary16 only having SIMD instructions.



For 3D rendering, FP24 appears to be mostly sufficient.
For general use, Binary16 falls short of being particularly usable.

Binary32 is "standard", but possibly a little overkill for Quake-sized
maps. It can deal with spaces up to around 2..8 km before significant
issues start to appear.

If the mantissa were extended to 31 bits (e.g., S.E8.F31), one could extend
the usable limit to around 1024 or 2048 km.


For pixel data and audio, FP8 sort of falls in this ambiguous area
between "good" and "kinda awful". It falls short of being "useful
enough" to be promoted to a primary type. Main path for operating on FP8
data would be to internally unpack to Binary16, and then convert back
when done.

There is RGB30A, which is more specialized for HDR color data.
High 2 bits of 32-bit word:
00: 3x E5.F5 (R/G/B, A=1.0)
01: 3x E5.F4.S (R/G/B, A=1.0)
10: 2x E4.F4 (G/A) / 2x E4.F3 (R/B)
11: 2x E4.F3.S (G/A) / 2x E4.F2.S (R/B)

The vector format would be selected based on the vector being encoded
(and unpacked to 4x Binary16).


> But recently I took another look at the Control Data 1604A, and the Control
> Data 6600. And I found that the 1604A used 96 bits for double precision,
> and the 6600 used 120 bits for double precision!
>

Except maybe for scientific computing or similar, these are likely
overkill for most general-purpose uses.

I suspect for most purposes, people have gone with 64-bit Binary64:
It is sufficient for most stuff one needs to do;
It is not so large as to be impractical.


Binary128 makes sense as a software-emulated format, but it is fairly
expensive and not used enough to justify doing it in hardware in many cases.

And, for something that is used infrequently, one can justify the cost
of it being "kinda slow" (partly more so when one can leverage 128-bit
integer operations for the task of "slightly more efficient" software
emulation).


> That pretty much explodes my thinking that 60 bits is enough for anyone;
> apparently longer precisions are indeed of some valid use - it could also
> explain why IBM brought in 128-bit Extended Precision reals with the
> 360/85 as well; 120-bit double precision on the 6600 could have proven to
> be *so* useful and valuable that IBM needed to compete with it... it wasn't
> just for show and bragging rights!
>

I partly disagree on the basis that, if there was a strong use-case for
very large floating-point types, they likely would have come into more
common use already.


> This inspires me to consider another direction, if I still want to have some
> flexibility in lengths beyond powers of two.
>
> The 1604 had a 48-bit word length, but its instructions were 24 bits long.
>
> If only 24 bits are needed for instructions, and 36 bits is a reasonable
> starting size for floating-point numbers...
>
> Let us consider... a computer with a 72-bit bus to memory, which
> addresses 36-bit halfwords in order to access the smallest size of
> floating-point number... but which packs three 24-bit instructions in a 72-
> bit word. So you can only jump to every third instruction.
>
> The PDP-8 implemented hardware floating-point by means of the Floating
> Point Processor, which could be started going by an I/O instruction from
> the PDP-8... and which had its _own_ program counter. It could share
> cycles with the PDP-8 or it could take over complete control of the
> memory bus.
>
> Recently, I saw an ad for one of the old 24-bit machines - I forget which
> one - that bragged that it did its hardware floating-point option in the
> PDP-8 way, instead of by adding floating-point instructions to its own
> instruction set!
>
> While this seemed odd to me, since it seems like bragging that one's
> computer is *less* powerful, it suggested something. Since auxiliary
> processors are more widespread than I thought... why not make an
> auxiliary processor for decimal and string instructions that could be
> tacked on to a scientific computer, sharing the same memory with it?
>

FWIW: MSP430 did pretty much everything beyond the core integer ISA by
mapping stuff to MMIO devices.


> John Savard

Quadibloc

Jun 6, 2022, 4:41:16 AM
On Sunday, June 5, 2022 at 6:28:43 PM UTC-6, Bill Findlay wrote:

> Have you reinvented the IBM 8000?

That design looks like it's based on STRETCH.

John Savard

Quadibloc

Jun 6, 2022, 4:44:42 AM
On Sunday, June 5, 2022 at 7:17:54 PM UTC-6, MitchAlsup wrote:

> One of your constant points was to provide an ISA that had all those data types
> and instructions to manipulate them. You will find that your added criterion
> will not allow for 24-bit instructions0--except in the most simple sense.

That is definitely true; 24-bit instructions similar to those of the classic 24-bit
machines have room only for a 6-bit opcode.

There are solutions. One would be to give up on a full 15-bit address field.

Another would be to go to a load-store architecture, and only have the full
opcode for register-to-register instructions.

Also, the post was about giving up on having more than three or four floating-point
types.

John Savard

Quadibloc

Jun 6, 2022, 4:48:59 AM
On Monday, June 6, 2022 at 2:38:29 AM UTC-6, BGB wrote:

> Except maybe for scientific computing or similar, these are likely
> overkill for most general-purpose uses.

Since scientific computing is what floating-point is *for*,
this doesn't seem to be an argument against 120-bit
floats; what I'm not sure of yet, despite having encountered
some evidence they may be useful, is if they're really that
important even for scientific computing.

John Savard

MitchAlsup

Jun 6, 2022, 12:24:14 PM
I agree with 98% of the above, except that I think it should be over in the GPU
(rather than decorating the CPU with stuff better done elsewhere). I especially
agree with the A=1.0 case. I disagree with the FP16 in a CPU case.
<
> > But recently I took another look at the Control Data 1604A, and the Control
> > Data 6600. And I found that the 1604A used 96 bits for double precision,
> > and the 6600 used 120 bits for double precision!
> >
> Except maybe for scientific computing or similar, these are likely
> overkill for most general-purpose uses.
<
But straightforward to provide--whereas SW is replete with special casing.
<
But 1 thing that can be pointed out:: FP8 can be implemented in ROM
at essentially zero HW cost.

Stefan Monnier

Jun 6, 2022, 12:54:06 PM
> I had been of the opinion that a computer which addressed 12-bit units of
> storage could offer optimum sizes of floating-point numbers...

I don't think there can be anything magical about 12bit as opposed to
any other size. I think what's going on here instead is that you don't
want the progression of sizes to be limited to powers of 2 because you
inevitably have an "almost 100% overhead" in those cases where you need
a tiny bit more than 2^N (and are hence forced to use 2^(N+1)).

If you could have more options, such as supporting both sizes of the form

16bit * 2^N

and sizes of the form

16bit * 2^N * 1.5

then the worst-case overhead would be reduced by half (i.e. down to
50%), making it presumably more tolerable.
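
As a rough C illustration of that 16bit*2^N / 16bit*2^N*1.5 progression
(my own sketch, with a hypothetical helper name, not from anyone's
actual code):

/* Round a requested bit-width up to the nearest size of the form
   16*2^N or 16*2^N*1.5, i.e. 16, 24, 32, 48, 64, 96, 128, ...  */
static unsigned round_up_size(unsigned bits)
{
    unsigned p = 16;                  /* smallest supported size */
    for (;;) {
        if (bits <= p)
            return p;                 /* 16 * 2^N        */
        if (bits <= p + p / 2)
            return p + p / 2;         /* 16 * 2^N * 1.5  */
        p *= 2;
    }
}

For example, a 40-bit need rounds up to 48 (20% overhead) instead of to
64 (60% overhead) under a powers-of-two-only progression.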

This said, a 100% overhead is pretty standard in all kinds of places
(e.g. buddy allocator, stop&copy GC, most competitive analysis, ...) so
I'm not sure it's that terribly important.


Stefan

BGB

Jun 6, 2022, 7:11:29 PM
Side note:
The funkiness of putting the sign bit in the low-order bit for RGB30A
came from noting that this can result in a noticeable cost saving vs
putting it in the high-order bit in this case (it effectively changes
whether the low-order bit is interpreted as a sign or a zero, without
shifting all the other bits around).

It is a similar rationale to how I ended up with RGB555A the way it is:
0rrrrrgggggbbbbb
1rrrraggggabbbba
as this was basically the "cheapest" option I could come up with to deal
with this.



Yeah, a GPU would probably be "better", apart from the concern of how to
do a "good" GPU in a budget of ~ 15-20 kLUT.

If I could somehow fit 3 cores into the CPU:
CPU core;
Geometry / Transform Core;
Rasterizer core.

Assuming each stage had roughly the same performance as on my CPU core,
with minimal synchronization overhead, the math implies that I could
potentially get around 8x the fill-rate.


But, to pull something like this off, would likely need a bigger FPGA
(such as the XC7A200T in a Nexys Video or similar). But, alas, don't
have the money to burn on this right now, so the current option has been
to throw some GPU like features at the CPU core in an attempt to make it
fast enough for 3D rendering.



How much this could translate to "faster GLQuake" is uncertain; as-is
roughly half of the CPU time goes into stuff related to my GL backend,
with much of the rest going into the Quake engine itself.


As-is, there does appear to be a non-linearity in Quake, where doubling
the clock-speed seemingly gets ~ 3x the framerate (so, as-is, GLQuake
would apparently be pulling off ~ 20-30 fps at 100MHz, despite still
being mostly limited to single-digit territory at 50MHz).

Did recently drop the audio mixing sample rate to 8 kHz, which while not
sounding as good (as 16 kHz), does at least free up some CPU time.


However, some recent "optimizing stuff" efforts got GLQuake mostly into
the "upper single digits" at 50MHz (or ~ 6-9 fps typically), so it is
now pulling off around twice the framerate as software Quake (I would
generally consider staying "more or less consistently" above 10 fps as a
"playability threshold").


Would be easier if I were dealing with something "graphically simpler"
than GLQuake.

Like, if I were dealing with something that was able to stay under
around 200-300 polygons per frame (triangles/quads, *), could be a lot
easier.

GLQuake pretty often tries to push around 700 .. 1000 polygons per frame
(and in some cases may hit 1200 .. 1600 polygons/frame), which my GL
renderer has problems dealing with at 50 MHz.


*: I have special cases for dealing with both triangles and quads as
"native" primitive types, whereas more complex polygons will need to be
decomposed into triangles and/or quads:
TRIANGLE_FAN: Decomposed into triangles;
POLYGON:
3 vertices -> triangle
4 vertices -> quad
5 vertices -> quad + triangle
6 vertices -> 2 quads
7+ -> triangles (generic fallback)
...
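
A rough C sketch of that decomposition rule (emit_tri / emit_quad are
hypothetical stand-ins for whatever the back-end actually accepts, and
the polygon is assumed convex, given as vertex indices):

/* Split an n-vertex convex polygon into quads and triangles,
   following the table above. */
static void split_polygon(const int *idx, int n,
                          void (*emit_tri)(int, int, int),
                          void (*emit_quad)(int, int, int, int))
{
    switch (n) {
    case 3: emit_tri(idx[0], idx[1], idx[2]); break;
    case 4: emit_quad(idx[0], idx[1], idx[2], idx[3]); break;
    case 5: emit_quad(idx[0], idx[1], idx[2], idx[3]);
            emit_tri(idx[0], idx[3], idx[4]); break;
    case 6: emit_quad(idx[0], idx[1], idx[2], idx[3]);
            emit_quad(idx[0], idx[3], idx[4], idx[5]); break;
    default:                    /* 7+ vertices: fan into triangles */
        for (int i = 1; i + 1 < n; i++)
            emit_tri(idx[0], idx[i], idx[i + 1]);
        break;
    }
}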

Would also be easier maybe if rendering something which wasn't trying to
texture map and color-modulate pretty much everything on screen, ...
(Like, you know, maybe some big flat-shaded polygons or something).


While arguably fragmenting geometry to reduce affine warping does
increase the amount of drawn primitives:
It doesn't really have much effect on framerate;
It is basically necessary to avoid stuff looking like broken garbage;
The number of primitives due to fragmenting is small relative to the
"baseline load" drawn by GLQuake itself (only a relative minority of the
input primitives end up being fragmented).

Despite Quake's scenery being "visually simplistic":
There is often a fair bit of overdraw due to the PVS algorithm being
"fairly conservative";
A lot of the geometry has already been "good and cut up" by the BSP
building algorithm (so, a lot of the walls and floors were already cut
up into a fair number of smaller polygons by QBSP);
...

So, one might observe (after I added a wireframe view) that the engine
will often draw things which are around corners or nearby adjacent rooms
which are on the other side of a wall, ...

Starts thinking about how I might have done some things differently,
then realized I was essentially reinventing the Quake 3 BSP format, hmm.
I guess if I really wanted, could possibly rebuild the Quake 1 BSPs from
the map source via a modified QBSP3 or something.



Have also observed that at higher clock-speeds, the relative amount of
time spent in the span-drawing loop increases (say, from 8-12% at 50MHz
up to around 20% at 100MHz).


Did experimentally try seeing what would happen if I dropped the
clock-speed to 25MHz:
GLQuake drops to 1-2 fps;
The 1kHz timer interrupt starts eating a fair chunk of CPU time;
CPU time spent in span-drawing drops to around 4-6% of the total;
A majority of the CPU time goes into the Quake engine.


The CPU cost of the timer interrupt handler seems somewhat sensitive to
whether XGPR is enabled (affects whether it saves/restores R0..R31 or
R0..R63).


Other than this, a lot of time goes into "other stuff":
Walking the BSP and drawing stuff;
Drawing alias models (1);
Stuff like line-tracing and similar;
Copying the rendered frame to VRAM;
...



1: Though I had added a hack where models further away from the camera
are drawn as low-res sprites rather than an actual 3D model (Quake 1
predates things like LODs, and there isn't really a good way to "auto
LOD" the Quake 1 models). General strategy was basically to render each
alias model frame from various angles, and basically generate a sprite
sheet for each frame with the model rendered at each of these angles.

For rendering, one can select the faces pointing in the general
direction of the player, and then render these faces (with a certain
amount of rotation applied; as opposed to traditional "billboard
sprites" which are always aligned to face the camera directly).


This can give sort of a mockery that the 3D model is still there (but
still kinda looks like crap up close). The billboard sprite approach
looks better up-close, but from a distance it is more obvious that the
model has been replaced with a sprite than if it is rendered via several
overlapping faces (aligned with the same orientation as the original 3D
model).

For "better effect", one could alpha-blend the faces based on the
relative vector to the camera, but this would be more computationally
expensive.

Noted a nifty image:
https://en.wikipedia.org/wiki/File:Icosahedron-golden-rectangles.svg

So, say, the 3D model is rendered in terms of the rectangles within the
icosahedron, projected as it would be seen from one of the faces of an
icosahedron. Then one builds sprite-sheets for the various alias-model
frames.

At the moment, not really aware of anyone else doing it this way
(usually more just traditional LOD and/or billboard sprites).



In this case, things like Text / UI / HUD drawing are also fairly expensive.

Could potentially move these parts from GL to drawing them in software:
Uploading and drawing them via a texture wouldn't be particularly efficient;
Using glDrawPixels for this could work, though support would need to be
added for glDrawPixels with alpha testing;
...


Well, that and other recent thoughts for a "TKGDI" interface, where the
idea would be moving from accessing the graphics hardware directly, over
to an API along vaguely similar lines to the Windows GDI.

However, unlike Windows GDI, it would lack any concept of GUI widgets
(my concept here being that programs can either draw their own widgets,
use OpenGL, or use a widget toolkit which is separate from the
underlying GDI layer).

In effect, one would just have handles to Windows or Screens that they
can draw onto (sort of more like X11 in this sense). Unlike GDI, would
assume that any "windows" (in a GUI sense) would likely have their own
backing memory, so the program will draw to the window once and can
assume that whatever it draws will remain (as opposed to being expected
to be told to redraw parts of the window if another window gets moved
over the top of it or similar).

Though, it is also quite possible that the backing memory for each
window would be in a block-compressed form, with windows likely keeping
their drawable areas and window decorations at a multiple of 4 pixels (a
lot of this stuff can be done much more efficiently in my case if one
assumes a 4 pixel alignment rather than a 1 pixel alignment).


Still TBD if trying to do a GUI would "make sense" at this point in
time, and more so is arguably "kind of moot" in that my FPGA board
doesn't allow connecting a keyboard and mouse at the same time (unless I
go and buy a PS2 Mouse/Keyboard PMOD or similar that hopefully allows
both to be plugged in at the same time).

GUI would be kinda moot if limited to keyboard-only.

...



>>> But recently I took another look at the Control Data 1604A, and the Control
>>> Data 6600. And I found that the 1604A used 96 bits for double precision,
>>> and the 6600 used 120 bits for double precision!
>>>
>> Except maybe for scientific computing or similar, these are likely
>> overkill for most general-purpose uses.
> <
> But straightforward to provide--whereas SW is replete with special casing.
> <

Possibly, it mostly comes down to whether it is used enough in a
performance-sensitive context to where it is worthwhile to spend the
costs of doing it in hardware.

So, for example, even if C's "long double" type is "slow as molasses",
it doesn't matter that much if hardly anyone ever uses it in a context
where this matters.


> But 1 thing that can be pointed out:: FP8 can be implemented in ROM
> at essentially zero HW cost.

FP8 is in a gray area as to whether it can be supported effectively
primarily via lookup tables.

If used "only occasionally", lookup tables would be sufficient. If used
for working with pixel data in a real-time context or similar, one is
going to need some converter ops and similar at least.


In my case, there isn't any direct hardware support for operating on FP8
directly, but there are converter ops for Packed FP8 to/from Packed
Binary16.

FP8 wouldn't be too expensive to support directly, just lacks a strong
use-case for doing so ATM.

They could maybe be done as 1/1 ops (single cycle latency), as opposed
to the 3/1 (3 cycle latency) needed for Binary16 ops.

However, FP8 is also near the lower-end of "acceptable quality" even for
pixel data, and performing calculations directly in this format would
not likely help matters here.

So, it typically makes more sense as a "storage format", with any
intermediate calculations being done in a higher precision format (such
as Binary16).

Similar is also typically the case for LDR RGBA, where one might use
RGBA32 or RGB555 for storage, but then do "the math" on packed Int16 or
similar (0.16, 2.14, or 4.12, or similar).




The value fields in FP8 are still a little large to map efficiently to
doing things directly as LUTs though.

A format with 3-bit exponents would likely be easier to pull off in
LUTs, though a 3-bit exponent doesn't allow for all that much dynamic range.


Say, with a 4-bit exponent:
F: 256.0 .. xx | Inf (*)
E: 128.0 .. 256.0
D: 64.0 .. 128.0
C: 32.0 .. 64.0
B: 16.0 .. 32.0
A: 8.0 .. 16.0
9: 4.0 .. 8.0
8: 2.0 .. 4.0
7: 1.0 .. 2.0
6: 0.5 .. 1.0
5: 0.25 .. 0.5
4: 0.125 .. 0.25
3: 0.0625 .. 0.125
2: 0.03125 .. 0.0625
1: 0.015625 .. 0.03125
0: Zero


*: Semantics differ slightly from IEEE FP here:
The format mostly behaves as-if Inf and NaN do not exist.
However, 0x7F and 0xFF may be treated as-if they were a "soft Inf".
The handling of Zero is also subject to interpretation:
0x00: May be a "Magic Zero"
0x01: May be either Zero or 0.0083 / 0.0088, ...

The FP8 decoder logic has a flag parameter (typically a constant) which
affects the interpretation of Zero here.
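
As a rough C sketch (not the actual decoder logic), decoding an unsigned
E4.F4 byte according to the table above, ignoring the "soft Inf" /
"magic zero" cases:

#include <math.h>

/* Exponent code 7 covers [1.0, 2.0); code 0 means zero. */
static float fp8_e4f4_to_float(unsigned char v)
{
    unsigned e = (v >> 4) & 0x0F;     /* 4-bit exponent code */
    unsigned f = v & 0x0F;            /* 4-bit fraction      */
    if (e == 0)
        return 0.0f;
    return (1.0f + f / 16.0f) * ldexpf(1.0f, (int)e - 7);
}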


There are other potential "non-graphics" use-cases, but they
haven't really come up in my case, and most use-cases I can think up
still end up (to some extent) related to image processing tasks.

BGB

Jun 6, 2022, 7:23:56 PM
Floating point is pretty much "general purpose" at this point.

I suspect there is probably far more floating point math going on at any
given moment in the context of 3D gaming and similar than in the entire
history of scientific computing combined.

...


And, most of it is going on at the lower end of the precision scale,
more important that it be fast and cheap than "good".

A lot of it is stuff that could arguably also be done with fixed-point
arithmetic, apart from pesky issues like dynamic range and similar.


> John Savard

BGB

Jun 6, 2022, 9:24:37 PM
On 6/6/2022 11:54 AM, Stefan Monnier wrote:
>> I had been of the opinion that a computer which addressed 12-bit units of
>> storage could offer optimum sizes of floating-point numbers...
>
> I don't think there can be anything magical about 12bit as opposed to
> any other size. I think what's going on here instead is that you don't
> want the progression of sizes to be limited to powers of 2 because you
> inevitably have an "almost 100% overhead" in those cases where you need
> a tiny bit more than 2^N (and are hence forced to use 2^N+1).
>

Agreed, I see no reason to see multiples of 12 as any more special than
multiples of 8 or 16 in a numerical sense (well, and also don't see
numerology as particularly relevant in the context of computing).


Well, except maybe in the "mystical" properties of power-of-2 sizes to
map conveniently to bit shifting, going so far as to make them faster
and cheaper than pretty much any other possible alternative.

Well, and maybe other "nifty" properties one can get from prime numbers
and Mersenne primes and similar, ...



> If you could have more options, such as supporting both sizes of the form
>
> 16bit * 2^N
>
> and sizes of the form
>
> 16bit * 2^N * 1.5
>
> then the worst-case overhead would be reduced by half (i.e. down to
> 50%), making it presumably more tolerable.
>

Yeah, one could have:
16, 24, 32, 48, 64, 96, 128

This is almost the pattern I am ending up with in some cases, usually
with the intermediate formats being a truncated version of the next
larger format.


I suspect 1.5 is basically like computing's equivalent of the Golden
Ratio...

Well, also it can be found when one, starting from a power of 2, does:
x+(x>>1)


> This said, a 100% overhead is pretty standard in all kinds of places
> (e.g. buddy allocator, stop&copy GC, most competitive analysis, ...) so
> I'm not sure it's that terribly important.
>

Yep.

Can note that in a 3D engine I did not too long ago, most of the memory
allocations, region sizes, ... ended up quantized along the (2^n) /
1.5*(2^n) curve, ...


Partly this was because while this quantization does introduce some
up-front storage overhead, for a use-case involving near continuous
memory allocation and freeing (such as when running around the world in
a Minecraft like terrain system), over time it tends to save more memory
than it loses, due to reducing memory fragmentation.


In the TestKern memory allocator, stuff is more generally quantized to a
factor of 1.25x (medium objects), or a multiple of 16 (small objects).


The 1.25x scale system also allows essentially mapping every object size
to an 8-bit microfloat (E6.F2), though with a partial break for small
object sizes:
00: 0
01: 16
02: 32
03: 48
04: 64
...
1E: 480
1F: 496
20..23: -
24: 512
25: 640
26: 768
27: 896
28: 1024
29: 1280
...
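
A rough C sketch of this mapping (hypothetical helper name; sizes up to
496 use the linear 16-byte cells, then the E6.F2-style scale takes over
at code 0x24, rounding the requested size up to the next representable
step):

#include <stddef.h>

static unsigned size_to_code(size_t sz)
{
    if (sz == 0)
        return 0x00;
    if (sz <= 496)                        /* linear region, 16-byte cells */
        return (unsigned)((sz + 15) / 16);

    unsigned e = 9;                       /* 0x24 is E=9, F=0 -> 512 */
    for (;;) {
        for (unsigned f = 0; f < 4; f++) {
            size_t step = (size_t)(4 + f) << (e - 2);
            if (sz <= step)
                return (e << 2) | f;      /* E6.F2 packing */
        }
        e++;
    }
}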

Though, one could also use multiples of 16 as the base unit rather than
1 byte, with a single unified scale for all normal object sizes:
00: -
01: 16
02: 32
03: 48
04: 64
05: 80
06: 96
07: 112
08: 128
09: 160
0A: 192
0B: 224
0C: 256
0D: 320
0E: 384
0F: 448
10: 512
...

Though, the scale might (for practical reasons) still end up
breaking once it gets large enough that one switches to using the page
allocator.


I have typically ended up not doing block split/merge, as I had usually
found that this approach seems to hurt fragmentation more than it helps
(at least for medium and large objects).

Basically, split/merge tends to gradually lose memory over time due to
ever-increasing fragmentation, whereas quantized sizes (with no
split/merge) tend to reach a "steady state" after a certain amount of time.



For small objects, I had often used a "cell and bitmap" allocator, where
allocation strategy was usually:
Check free list for desired object size;
If found, use this;
Scan for a free span of cells using a bitmap allocator;
If found, mark in use in the bitmap, use this span;
Free all the objects in the free list
Mark bitmap cells as free, discard objects.
Scan again for a free run of cells that is sufficiently large.
If found, mark in use in the bitmap, use this span;
Try to allocate more memory for the cell allocator heap;
Try again to allocate a span.

The scanning steps typically use a rover, and one knows that allocation
hasn't found anything once the rover reaches its starting point.
However, the "cell and bitmap" strategy is still not entirely immune to
ever-increasing fragmentation (but this can be reduced by limiting it
mostly to fairly small objects).


Though, a simpler strategy for a small object allocator:
Check free list for desired object size;
If found, use this;
Check for a small object block that "isn't completely full yet";
If found, allocate object from this block;
Block needs to have enough space left for the object in question.
Advances an internal offset based on object size;
When we hit the end of the block, the block is full.
Intra-block memory will never be "freed" in this sense.
Expand the small object heap with a new block.
Heap is basically a linked list of fixed-size blocks;
Each block has an offset for how much of the block is used up.

Freeing an object adds it to the free list for its size.

This approach eats more memory up-front, but tends to reach a steady state.
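
A rough C sketch of this strategy (the block size, the 16-byte size
classes, and the sobj_* names are assumptions for illustration, not the
actual TestKern code):

#include <stdlib.h>

#define SOBJ_BLOCK_SIZE 4096
#define SOBJ_MAX_SIZE   256     /* "small objects" only */
#define SOBJ_ALIGN      16

struct sobj_block { struct sobj_block *next; size_t used;
                    unsigned char mem[SOBJ_BLOCK_SIZE]; };
struct sobj_free  { struct sobj_free *next; };

static struct sobj_block *blocks;   /* linked list of fixed-size blocks */
static struct sobj_free  *freelist[SOBJ_MAX_SIZE / SOBJ_ALIGN + 1];

static void *sobj_alloc(size_t sz)
{
    sz = (sz + SOBJ_ALIGN - 1) & ~(size_t)(SOBJ_ALIGN - 1);
    size_t cls = sz / SOBJ_ALIGN;

    if (freelist[cls]) {                        /* 1: reuse a freed object */
        void *p = freelist[cls];
        freelist[cls] = freelist[cls]->next;
        return p;
    }
    for (struct sobj_block *b = blocks; b; b = b->next)
        if (b->used + sz <= SOBJ_BLOCK_SIZE) {  /* 2: bump-allocate */
            void *p = b->mem + b->used;
            b->used += sz;
            return p;
        }
    struct sobj_block *b = malloc(sizeof *b);   /* 3: grow by one block */
    if (!b) return NULL;
    b->next = blocks; b->used = sz; blocks = b;
    return b->mem;
}

static void sobj_free_obj(void *p, size_t sz)   /* freeing only feeds   */
{                                               /* the per-size list    */
    sz = (sz + SOBJ_ALIGN - 1) & ~(size_t)(SOBJ_ALIGN - 1);
    struct sobj_free *f = p;
    f->next = freelist[sz / SOBJ_ALIGN];
    freelist[sz / SOBJ_ALIGN] = f;
}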


One way to test an allocator for this property is to create a loop which
basically allocates and frees objects at random, with sizes generated by
a random number generator (within a bounded range and size distribution).

If the allocator is "stable" here, the heap will reach a maximum size
and then stop growing (IOW: "steady state").

If it is not, the heap will keep growing indefinitely, albeit slowly,
and may run out of memory if it can't expand further.
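
A minimal sketch of such a test loop (reusing the hypothetical sobj_*
sketch above in the same file; the footprint statistic would be whatever
the allocator under test actually exposes):

#include <stdio.h>

static size_t heap_total_size(void)     /* footprint of the sketch heap */
{
    size_t t = 0;
    for (struct sobj_block *b = blocks; b; b = b->next)
        t += sizeof *b;
    return t;
}

#define NSLOTS 4096

int main(void)
{
    static void  *slot[NSLOTS];
    static size_t slotsz[NSLOTS];

    for (long iter = 0; iter < 100000000L; iter++) {
        int i = rand() % NSLOTS;
        if (slot[i]) {                              /* free at random   */
            sobj_free_obj(slot[i], slotsz[i]);
            slot[i] = NULL;
        } else {                                    /* alloc at random  */
            slotsz[i] = 16 + (size_t)(rand() % 240);
            slot[i]   = sobj_alloc(slotsz[i]);
        }
        if (iter % 1000000L == 0)                   /* watch for growth */
            printf("iter=%ld heap=%zu\n", iter, heap_total_size());
    }
    return 0;
}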



Though, Doom's Z_Malloc system seems to get good results with a
split/merge approach, and does not seem to have too much of a problem
with running out of memory due to fragmentation, so it may require
more investigation.

Then again, Doom primarily operates on an "allocate everything when
entering a level, discard it all again when the level exits" strategy,
and Quake replaced this with the "Hunk" system (essentially a LIFO stack
allocator), implying that things were maybe "not exactly perfect" with
Z_Malloc.

...


Quadibloc

Jun 7, 2022, 1:33:55 AM
On Sunday, June 5, 2022 at 7:17:54 PM UTC-6, MitchAlsup wrote:

> I still don't think you have progressed to the point where you realize
> architecture is just about as much about what to leave out as what
> to put in.

I'm _aware_ of the principle, even if I don't practice it much.

Also, my inclination is to leave the "leaving out" to the _implementation_
stage; the ISA is designed to have room for everything one might want
in a computer... and then any given implementation leaves out what is
not useful to the intended customer.

John Savard

Quadibloc

Jun 7, 2022, 1:36:58 AM
On Monday, June 6, 2022 at 10:54:06 AM UTC-6, Stefan Monnier wrote:

> If you could have more options, such as supporting both sizes of the form
>
> 16bit * 2^N
>
> and sizes of the form
>
> 16bit * 2^N * 1.5
>
> then the worst-case overhead would be reduced by half (i.e. down to
> 50%), making it presumably more tolerable.

Yes, that _is_ one of the things motivating me. So I have a whole section
of my web site devoted to schemes whereby this could somehow be
made technically feasible:

http://www.quadibloc.com/arch/perint.htm

John Savard

Thomas Koenig

Jun 7, 2022, 1:39:14 AM
Quadibloc <jsa...@ecn.ab.ca> schrieb:
What variant of your ISA should software vendors assume?

Marcus

Jun 7, 2022, 2:58:33 AM
Most ISAs have some mechanism for handling "extensions", which can be
exposed to software in the form of flags in a system register, for
instance.

I like the approach of trying to design a complete ISA, with all the
parts following a common philosophy, rather than bolting on extensions
later (which usually leads to inconsistencies and poorly balanced
compromises). You can still group functionality into logical "modules"
(e.g. floating-point, vector processing, string functions, etc).

/Marcus

Brett

Jun 7, 2022, 3:10:11 AM
The PlayStation 1 had fixed-point arithmetic, which was the only part that
sucked hard. Real floats would have made the hardware more expensive, so
this was the correct decision.

Marcus

Jun 7, 2022, 3:27:39 AM
I think that you'll find that most floating-point operations that are
done worldwide during one day are not related to scientific computing.
Graphics, gaming, video decoding etc are all commonplace, and all use
massive amounts of floating-point calculations.

When you're dealing with "real world data" (measured or simulated
signals etc), you're usually fine with binary32 (or smaller). There are
a few cases where you benefit from binary64, but that's the exception
rather than the norm.

...and bandwidth matters.

/Marcus

BGB

Jun 7, 2022, 4:38:49 AM
I had been going with a 2-layer approach:
  Profiles, which define major configurations.
    This would include major architectural features:
      Number of GPRs;
      Size of virtual address space;
      ...
    Binary compatibility is not required between profiles (1).
  Extensions, which deal with smaller / narrower-scope features.


This allows a certain range of scale, for example, between simpler


1: This came up not too long ago when I tried making a minicore, which
had implemented a variant of the 'D' profile, whereas my main core
currently implements the 'A' profile (with some 'G' profile features),
and then I ran into a roadblock that the two profiles are different
enough that they effectively can't share the same Boot ROM code...

I actually had to make special effort such that the initial entry of the
Boot ROM was written (in ASM) in a special "common subset" of both ISA
profiles (the minicore, if present, just goes and spins in a loop
waiting to be signaled to start).


In effect, the profiles are different enough that the only way to write
code that works on both is via a limited ASM subset with assembler
directives to tell the compiler "what's up".

Ironically, it isn't that much different at present to trying to run
RISC-V (RV64IM) code on my BJX2 core (which is still at the stage of
being "too limited to really be all that useful for much of anything").


Then I was left to doubt the utility of having a second core which still
"barely fits" and implements a profile that is not binary compatible
with any of the code running on the main core (or vice versa).

Would have made more sense to have gone with a different profile where
the common subset was "more substantial". Though, the initial thinking
was partly that the 'D' profile was close enough to RV64I to be able to mostly
minimize the amount of hardware functionality which the RV64I mode could
not use (or, at least short of implementing a custom RISC-V core which
operates on my ringbus).


Another recent line of thinking would be to try to instead design a GPU,
possibly built around a modified version of the BJX2 core (and/or code
derived from Minicore). At least for a GPU, the binary compatibility
issues "stand to reason".

But, then, I don't know if I could give a useful level of performance in
a sufficiently low LUT cost.


> /Marcus

BGB

Jun 7, 2022, 5:15:15 AM
Yeah.

In my case, I am using floating-point on the front-end, but mostly
fixed-point on the backend. Ironically, also because of the affine
texturing (with dynamic subdivision), and me tending to often render
stuff with nearest filtering, stuff tends to sort of look like it was
rendered on a PS1.

General performance stats for 3D rasterization seem to fall a little
short of the numbers I could find for the PS1 though.



My core does at least run at a slightly higher clock speed on an FPGA
than the MIPS core in the PS1 though (or the SH-2 cores in the Sega Saturn).

Albeit, both of these consoles had the advantage of dedicated GPU
hardware (vs running stuff using software rendering; or a software
renderer implemented behind the OpenGL 1.x API).

...

Stefan Monnier

Jun 7, 2022, 8:54:10 AM
>> 16bit * 2^N * 1.5
>>
>> then the worst-case overhead would be reduced by half (i.e. down to
>> 50%), making it presumably more tolerable.
>
> Yes, that _is_ one of the things motivating me. So I have a whole section
> of my web site devoted to schemes whereby this could somehow be
> made technically feasible:

I'd look at it a different way: separate the "data-processing" and "data
storage" sides. E.g. if you want to support 48bit or 24bit integers,
you can already get most of it fairly efficiently today.

- "Load NNbit integer at address X" can be done with a normal
(unaligned) 64bit load and then shift/mask out the excess.
- "add/sub/mul/... on NNbit integer" can be done with 64bit operations
followed by some overflow handling.
- "Store NNbit integer at address X" is a bit more problematic.
You'll probably have to load the 64-NN missing bits at the
destination, combine them with the NN bits of data and then write out
the resulting 64bit integer.

AFAICT the first two points can be done fairly efficiently without very
much work. So only the store part seems potentially expensive, but
since it's a store it's fairly easy to hide the latency of the operation
so the only real worry is to implement it in a way that offers
good throughput.
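
As a rough C illustration of this for 48-bit integers (assuming
little-endian layout, and that the two bytes beyond the field are
readable/writable, which is exactly the "load the missing bits" point
above):

#include <stdint.h>
#include <string.h>

static int64_t load_i48(const void *p)
{
    uint64_t v = 0;
    memcpy(&v, p, 8);                       /* unaligned 64-bit load  */
    v &= 0x0000FFFFFFFFFFFFull;             /* keep the low 48 bits   */
    if (v & 0x0000800000000000ull)          /* sign-extend bit 47     */
        v |= 0xFFFF000000000000ull;
    return (int64_t)v;
}

static void store_i48(void *p, int64_t x)
{
    uint64_t old = 0, v = (uint64_t)x & 0x0000FFFFFFFFFFFFull;
    memcpy(&old, p, 8);                     /* fetch the 16 bits to keep  */
    old = (old & 0xFFFF000000000000ull) | v;
    memcpy(p, &old, 8);                     /* write back the merged word */
}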


Stefan

MitchAlsup

Jun 7, 2022, 12:01:12 PM
Which instructions is the compiler universally allowed to use ?
<
In my case, there are only a few instructions that are optional:: BMM
(Bit Matrix Multiply) is the classic one.

MitchAlsup

Jun 7, 2022, 12:08:59 PM
On Tuesday, June 7, 2022 at 1:58:33 AM UTC-5, Marcus wrote:
> On 2022-06-07, Thomas Koenig wrote:
> > Quadibloc <jsa...@ecn.ab.ca> schrieb:
> >> On Sunday, June 5, 2022 at 7:17:54 PM UTC-6, MitchAlsup wrote:
> >>
> >>> I still don't think you have progressed to the point where you realize
> >>> architecture is just about as much about what to leave out as what
> >>> to put in.
> >>
> >> I'm _aware_ of the principle, even if I don't practice it much.
> >>
> >> Also, my inclination is to leave the "leaving out" to the _implementation_
> >> stage; the ISA is designed to have room for everything one might want
> >> in a computer... and then any given implementation leaves out what is
> >> not useful to the intended customer.
> >
> > What variant of your ISA should software vendors assume?
> Most ISA:s have some mechanism for handling "extensions", which can be
> exposed to software in the form of flags in a system register, for
> instance.
<
For example: the decode of an invalid instruction causes exception transfer
to handler where the instruction can be evaluated in SW. Make these transfers
of control fast enough and you don't need the flag.

Ivan Godard

Jun 7, 2022, 12:28:00 PM
All quad width arithmetic. Decimal FP. Binary FP. div and rem. sqrt.

Thomas Koenig

Jun 7, 2022, 1:50:21 PM
Marcus <m.de...@this.bitsnbites.eu> schrieb:
> On 2022-06-07, Thomas Koenig wrote:
>> Quadibloc <jsa...@ecn.ab.ca> schrieb:
>>> On Sunday, June 5, 2022 at 7:17:54 PM UTC-6, MitchAlsup wrote:
>>>
>>>> I still don't think you have progressed to the point where you realize
>>>> architecture is just about as much about what to leave out as what
>>>> to put in.
>>>
>>> I'm _aware_ of the principle, even if I don't practice it much.
>>>
>>> Also, my inclination is to leave the "leaving out" to the _implementation_
>>> stage; the ISA is designed to have room for everything one might want
>>> in a computer... and then any given implementation leaves out what is
>>> not useful to the intended customer.
>>
>> What variant of your ISA should software vendors assume?
>
>
> Most ISA:s have some mechanism for handling "extensions", which can be
> exposed to software in the form of flags in a system register, for
> instance.

I know, and a PITA it is if you are trying to distribute software
in binary format, or in a library.

You will then see horrible code like

if (matmul_fn == NULL)
  {
    matmul_fn = matmul_r4_vanilla;
    if (__builtin_cpu_is ("intel"))
      {
        /* Run down the available processors in order of preference. */
#ifdef HAVE_AVX512F
        if (__builtin_cpu_supports ("avx512f"))
          {
            matmul_fn = matmul_r4_avx512f;
            goto store;
          }

#endif /* HAVE_AVX512F */

#ifdef HAVE_AVX2
        if (__builtin_cpu_supports ("avx2")
            && __builtin_cpu_supports ("fma"))
          {
            matmul_fn = matmul_r4_avx2;
            goto store;
          }

#endif

[...]

to select the right version of a library routine, either for
efficiency reasons or if it outright doesn't run. (I mostly
wrote the code above, so I can justly claim it's horrible).

Having different versions of an ISA works for people who compile
everything themselves. Fine for embedded systems, not so fine
for computers or mobile phones.

Thomas Koenig

Jun 7, 2022, 1:53:22 PM
MitchAlsup <Mitch...@aol.com> schrieb:
> On Tuesday, June 7, 2022 at 1:58:33 AM UTC-5, Marcus wrote:
>> On 2022-06-07, Thomas Koenig wrote:
>> > Quadibloc <jsa...@ecn.ab.ca> schrieb:
>> >> On Sunday, June 5, 2022 at 7:17:54 PM UTC-6, MitchAlsup wrote:
>> >>
>> >>> I still don't think you have progressed to the point where you realize
>> >>> architecture is just about as much about what to leave out as what
>> >>> to put in.
>> >>
>> >> I'm _aware_ of the principle, even if I don't practice it much.
>> >>
>> >> Also, my inclination is to leave the "leaving out" to the _implementation_
>> >> stage; the ISA is designed to have room for everything one might want
>> >> in a computer... and then any given implementation leaves out what is
>> >> not useful to the intended customer.
>> >
>> > What variant of your ISA should software vendors assume?
>> Most ISA:s have some mechanism for handling "extensions", which can be
>> exposed to software in the form of flags in a system register, for
>> instance.
><
> For example: the decode of an invalid instruction causes exception transfer
> to handler where the instruction can be evaluated in SW. Make these transfers
> of control fast enough and you don't need the flag.

Looking at the matmul code I just posted in reply to Marcus... it is
part of

https://gcc.gnu.org/git/?p=gcc.git;a=blob_plain;f=libgfortran/generated/matmul_r4.c;hb=HEAD

How, exactly, would you do this in libgfortran? Clearly, trapping
on and emulating every AVX2 instruction in a tightly written
matmul would be ridiculous from a speed perspective.

Do you have other options in mind?

BGB

Jun 7, 2022, 2:59:09 PM
This is why I don't bother with emulated instructions:
If the ISR will cost more than using a runtime call (pretty much
inevitable), it makes more sense to just use a runtime call instead.

Or, to generate alternate versions of the ASM blobs depending on which
features are available to the compiler.

It could be possible to dynamically configure stuff based on CPUID
flags, which basically means:
Use a function call, but it turns into a double-jump;
We need to patch up these calls at runtime based on what features are
present.


This approach still has a non-zero overhead (it seems there will
typically be 10 or 20 cycles of overhead, mostly related to caller-side
costs for performing the function call).

Though, it seems like this approach could potentially be offloaded to
the PEL loader, mostly by hacking the DLL import mechanism (already
hacked some vs the original PE/COFF version).


Say:
__sdivsq:
BRA48 __sdivsq_sw //software version

Then the PEL IAT points at __sdivsq (say, as a "BJX2!_SDIVSQ" import),
which is flagged to be updated as a BRA48 instruction rather than a raw
address (the BRA48 instruction being a 64-bit encoding which encodes a
branch as a 48-bit absolute address; 1).

If there is a CPU instruction, it is patched to an alternate version, say:
__sdivsq_native:
SDIVS.Q R4, R5, R2
RTS

...



1: In 96-bit addressing mode, this can't jump outside of a given 48-bit
quadrant; there isn't currently any way to jump between quadrants apart
from using an ISR handler.

There would have been JMPX and JSRX, which could do Abs96 jumps, but
these had "not good" effects on LUT cost...

BGB

Jun 7, 2022, 3:25:17 PM
This is where a "good" portable IR could make sense.
Then, say, we can AOT compile binaries based on the specific ISA
feature-set of the target machine.

Sadly, the world has yet to see such a "good" IR.


One of the better examples I think is .NET CIL, but it still has its
issues, and isn't really ideally designed for C or C++ programs.

Though, one thing that helps is to remap certain types of ifdef's to an
alternate form that is evaluated much later in the compiler pipeline.

I don't really get the obsession a lot of the people who design these
sorts of VMs have with trying to "replace" C and C++...



Then again, another possible route is to use some other ISA as a
makeshift IR (say, use x86-64 or RISC-V or similar as an IR).

I had considered before trying to write an x86-64 to BJX2 transpiler
(possibly using PE32+ binaries as a distribution format), but admittedly
still have not gotten around to this...

John Dallman

Jun 7, 2022, 3:59:13 PM
In article <b0b53c16-1229-449c...@googlegroups.com>,
jsa...@ecn.ab.ca (Quadibloc) wrote:

> That pretty much explodes my thinking that 60 bits is enough for
> anyone; apparently longer precisions are indeed of some valid use

Welcome to reality.

For the field of mathematical modelling I work in, modellers are quite
carefully designed to get the most coordinate space out of 64-bit real
numbers with acceptable accuracy. They could doubtless be modified to
work with 60-bit reals, but the coordinate space would inevitably be
smaller.

That would mean that models were portable between platforms with 64-bit
reals, as they are at present, and models made with 60-bit reals could be
imported into software running with 64-bit reals. But models that tried
to use most of the coordinate space possible with 64-bit reals could not
work correctly on platforms with 60-bit reals. It's not a compelling
prospect for a software vendor, is it?

We'd use 128-bit reals if they were widely available with good
performance. That's enough extra potential to do serious work for. 80-bit
reals didn't add enough to be worth it.

John

EricP

Jun 7, 2022, 4:18:20 PM
What about double-doubles?
Too slow?
More trouble than they are worth?

Michael S

Jun 7, 2022, 4:18:55 PM
What is "good performance" ?
Is FPU bandwidth/latency a bottleneck now, or other parts (logic? memory access?) more dominant?
If the later, by how much?

Thomas Koenig

Jun 7, 2022, 4:47:17 PM
John Dallman <j...@cix.co.uk> schrieb:

> We'd use 128-bit reals if they were widely available with good
> performance. That's enough extra potential to do serious work for. 80-bit
> reals didn't add enough to be worth it.

A colleague bought a https://www.raptorcs.com/TALOSII/ a couple
of years ago for CFD work; at the time, POWER9 was the fastest
processor for memory bandwidth around, and it had some impressive
OpenFOAM benchmarks. After some trouble, it ran Ubuntu fairly well.
Make sure you buy a sound-proofed version, though, or people will
be able to find your office without opening their eyes.

It has hardware qp float with IEEE 128-bit format, and since gcc12,
this is also supported in Fortran at least (and, I think, C++).
You'll need IBM's "advance toolchain" in a recent enough version
to work with it, so it is not so easy to set up (I had some trouble
with it).

As for performance: The little program

program main
  implicit none
  integer, parameter :: wp = selected_real_kind(30)
  integer, parameter :: n=401, p=401, m=667
  real (kind=wp), allocatable :: c(:,:), a(:,:), b(:, :)
  character(len=80) :: line
  real (kind=wp) :: fl = 2.d0*n*m*p
  integer :: i,j
  real :: t1, t2

  allocate (c(n,p), a(n,m), b(m, p))

  print *,wp

  line = '10 10'
  call random_number(a)
  call random_number(b)
  call cpu_time (t1)
  c = matmul(a,b)
  call cpu_time (t2)
  print '(A,F6.1)',"MFlops: ", fl/(t2-t1)*1e-6
  read (unit=line,fmt=*) i,j
  write (unit=line,fmt=*) c(i,j)
end program main

yields on my home box (the 16 is just the KIND number used
by the compiler), using libquadmath:

16
MFlops: 50.8

and on a POWER using -mabi=ieeelongdouble (IEEE qp):

16
MFlops: 250.8

which may or may not be fast enough for you.

Thomas Koenig

Jun 7, 2022, 4:48:45 PM
EricP <ThatWould...@thevillage.com> schrieb:

> What about double-doubles?
> Too slow?

On the same machine which gave 250 MFlops for a matrix
multiplication for IEEE qp, I also got around 50 MFlops for double
double (similar to libquadmath on my home box).
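
For reference, "double double" here means carrying a value as an
unevaluated sum of two doubles - about twice the significand bits of a
double, but with a double's exponent range. A minimal C sketch of the
standard building blocks (Knuth's TwoSum plus a simple, not fully
accurate, add) - not the code actually benchmarked above:

#include <stdio.h>

typedef struct { double hi, lo; } dd;    /* value = hi + lo */

/* Knuth's TwoSum: s = fl(a+b), and s + e == a + b exactly. */
static dd two_sum(double a, double b)
{
    double s  = a + b;
    double bv = s - a;                   /* the part of b absorbed into s */
    double e  = (a - (s - bv)) + (b - bv);
    return (dd){ s, e };
}

/* "Sloppy" double-double addition; full accuracy takes a bit more work. */
static dd dd_add(dd x, dd y)
{
    dd s = two_sum(x.hi, y.hi);
    return two_sum(s.hi, s.lo + x.lo + y.lo);
}

int main(void)
{
    dd one_plus = two_sum(1.0, 1e-20);         /* 1 + 1e-20, held exactly */
    dd diff     = dd_add(one_plus, (dd){ -1.0, 0.0 });
    printf("%.3e\n", diff.hi + diff.lo);       /* prints ~1e-20 */
    return 0;
}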

Ivan Godard

unread,
Jun 7, 2022, 5:02:18 PMJun 7
to
Kinda tough to be good for arbitrary ISA; not hard to be good for an ISA
family, selecting for the implemented optional features. See Mill genAsm.

Ivan Godard

unread,
Jun 7, 2022, 5:03:55 PMJun 7
to
Mill Gold (and bigger) has quad real. We're working on "widely
available" :-)

John Dallman

unread,
Jun 7, 2022, 5:37:52 PMJun 7
to
In article <t7odgi$t4u$1...@newsreader4.netcologne.de>,
tko...@netcologne.de (Thomas Koenig) wrote:

> A colleague bought a https://www.raptorcs.com/TALOSII/ a couple
> of years ago for CFD work; at the time, POWER9 was the fastest
> processor for memory bandwidth around, and it had some impressive
> OpenFOAM benchmarks.

Yup, we know of these. The customers aren't wanting quad-precision enough
to justify this kind of kit yet.

John

John Dallman

unread,
Jun 7, 2022, 5:37:53 PMJun 7
to
In article <doOnK.13820$_T.1...@fx40.iad>,
ThatWould...@thevillage.com (EricP) wrote:

> What about double-doubles? Too slow?

Used in just a few places where we don't have to do lots of crunching and
the precision is needed.

> More trouble than they are worth?

There's no single algorithm to re-write using them. There are /lots/ of
algorithms, somewhere between 20 and 400, depending on how you count
special cases.

John

John Dallman

unread,
Jun 7, 2022, 5:37:53 PMJun 7
to
In article <b1138a21-218c-4a4c...@googlegroups.com>,
already...@yahoo.com (Michael S) wrote:

> What is "good performance" ?
> Is FPU bandwidth/latency a bottleneck now, or other parts (logic?
> memory access?) more dominant? If the later, by how much?

The primary bottleneck is memory bandwidth. Integer and branch
performance comes next, and then FPU. The floating-point work tends to
come in brief bursts. It isn't big matrix crunches: our kind of modelling
does not work that way.

We'd want quad-precision float running at something close to current
double-precision speed, on a widely-used architecture, before we started
a serious project.

John

John Dallman

unread,
Jun 7, 2022, 5:39:47 PMJun 7
to
In article <t7oefo$d8t$2...@dont-email.me>, iv...@millcomputing.com (Ivan
Godard) wrote:

> Mill Gold (and bigger) has quad real. We're working on "widely
> available" :-)

Hope it works.

John

BGB

unread,
Jun 7, 2022, 5:47:16 PMJun 7
to
For BGBCC and BJX2 I also have "ifarch", which can push decisions about
which ASM to use, or which version of a function to generate code for,
until code generation time.

This mostly allows things like the C library, etc, to not need to be
fully specialized for the target machine.


However, it does not apply to the final binaries, so would likely
require distributing programs as RIL3 bytecode or similar to make use of.

It is also nowhere near as "general purpose" as a traditional ifdef.



The current design is also not particularly great for a lightweight
compiler backend.

Say, if I wanted a compiler which is able to read in and compile an
image "one function at a time", I will need to significantly redesign
the IR packaging. The current design would effectively require reading
in the entire program image before it is possible to start compiling it.

Or, potentially to allow for a simpler / lighter-weight compilation
stage which more directly translates stack operations into machine-code
without going through a 3AC intermediate stage, ...

This would be a partial rationale for the considered possible move over
to a RIFF based packaging format (with structure-based metadata, more
like in the JVM), as well as some "modest" changes to how things like
variables are handled in the bytecode (1).

However, this would represent a non-trivial level of redesign to the
middle stage of my compiler.



*1: Most likely, variable reference by-name being replaced by a tagged
approach:
...zzzz000, Temporary Variable
...zzzz010, Argument
...zzzz100, Local
...zzzz110, Global
...zzz0001, Literal
...

Where the bits above the Rice-coded tag effectively encode an index into
the corresponding tables.
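
As a rough sketch of how such a reference might be unpacked (hypothetical
names and field widths, not the actual BGBCC code):

#include <stdint.h>

enum VarKind { VK_TEMP, VK_ARG, VK_LOCAL, VK_GLOBAL, VK_LITERAL, VK_OTHER };

/* Decode a tagged variable reference into a kind plus a table index. */
static enum VarKind decode_varref(uint32_t ref, uint32_t *index)
{
    if ((ref & 1) == 0) {               /* 3-bit tag group */
        *index = ref >> 3;
        switch (ref & 7) {
        case 0: return VK_TEMP;         /* ...zzzz000 */
        case 2: return VK_ARG;          /* ...zzzz010 */
        case 4: return VK_LOCAL;        /* ...zzzz100 */
        case 6: return VK_GLOBAL;       /* ...zzzz110 */
        }
    }
    if ((ref & 0xF) == 1) {             /* ...zzz0001 */
        *index = ref >> 4;
        return VK_LITERAL;
    }
    *index = 0;                         /* longer tag codes elided here */
    return VK_OTHER;
}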

While I could (in premise) merge arguments, temporaries, and local
variables into a single table (like in the JVM), BGBCC tends to treat
these as separate types of variables.

Where, say, the goal would be to be able to AOT compile something the
size of Doom or Quake in under roughly 1 or 2 MB of RAM (as opposed to
the 10s of MB needed with my current backend).

BGB

unread,
Jun 7, 2022, 6:21:37 PMJun 7
to
Many approaches also involve trying to invoke "funky magic" behavior
from the SIMD units, some of which would fail in pretty major ways if
attempted with the FPU design used by my ISA.


I went with adding Binary128 and defining "long double" as also being
Binary128 (albeit not necessarily at full precision); however, at present
this is implemented in software.


Ironically, recently I also went and reworked part of my emulator to use
software emulated floating-point (rather than using the native FPU on
x86-64), because I was running into some bugs related to differing
behavior between my BJX2 core and emulator regarding floating-point
operations.

This involved both some differences in the behavior of the
floating-point conversion operators, and the "slightly different"
rounding semantics between the ISAs.


There are pros and cons, but in this case it seemed more useful for
debugging to be able to get consistent behavior between my CPU core and
emulator (this partly came up with a bug where Quake was using
floating-point math to calculate the size of a buffer; rounding
differences were causing the buffer to end up undersized, and the code in
question would then write slightly past the end of the allocated buffer).

Though, there was the "cheap trick" here of multiplying the computed
buffer size by 1.01, which caused the buffer to be slightly bigger than
needed rather than "come up short".
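
The pattern was roughly like the following (hypothetical names, not the
actual Quake source):

#include <stdlib.h>

/* If 'count * scale' lands just below an integer on one FPU and just above
   it on another, truncation yields counts that differ by one, and the
   smaller one undersizes the buffer. Padding by ~1% (the "cheap trick"
   above) hides the discrepancy. */
void *alloc_scaled(int count, float scale, size_t elem_size)
{
    size_t n = (size_t)(count * scale * 1.01f);
    return malloc(n * elem_size);
}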


...

MitchAlsup

unread,
Jun 7, 2022, 6:22:26 PMJun 7
to
It is the totally exceptional instruction that gets emulated. 99.99985% of instructions
generated by the compiler are native (maybe even 100%).
<
Plus I have nothing similar to AVX--using VEC-LOOP and native instructions for
vectorization. VEC-LOOP eliminates any need for SIMD ISA--that is, the implementation
of the moment decides the width of the SIMD path, and the programmer/compiler
accesses this via VEC-LOOP: 1-wide, 2-wide, 3-wide !!, 4-wide, 8-wide, 96-wide !!
and no wide-registers are needed to provide the capability.
<
VEC-LOOP provides vectorized string handling, memory handling, and when I get around
to it, decimal handling.

Michael S

unread,
Jun 7, 2022, 6:23:50 PMJun 7
to
I am not sure that's the right attitude, given what you wrote at the beginning of the post.
If it is true, then there is a good chance that you don't need "close to current double precision"
and don't even need "5 times slower than current double precision".
Judging by the beginning of your post, it sounds likely that with arithmetic about 20 times slower
than current double precision you would end up only 2-3 times slower at the full application level.
And "20 times slower than current double precision" is within reach of a well-coded
quad-precision library. Well, maybe not in throughput in SIMD-friendly kernels. But
measured by latency, a good library will perform even better than "20 times slower than DP".
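
(Rough Amdahl-style arithmetic behind that estimate, with the floating-point
fraction assumed purely for illustration: if FP work is ~10% of runtime, the
application-level slowdown is about 0.9 + 0.1*20 = 2.9x; at ~5% it is about
0.95 + 0.05*20 = 1.95x.)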

MitchAlsup

unread,
Jun 7, 2022, 6:28:00 PMJun 7
to
On Tuesday, June 7, 2022 at 2:59:13 PM UTC-5, John Dallman wrote:
> In article <b0b53c16-1229-449c...@googlegroups.com>,
> jsa...@ecn.ab.ca (Quadibloc) wrote:
>
> > That pretty much explodes my thinking that 60 bits is enough for
> > anyone; apparently longer precisions are indeed of some valid use
> Welcome to reality.
>
> For the field of mathematical modelling I work in, modellers are quite
> carefully designed to get the most coordinate space out of 64-bit real
> numbers with acceptable accuracy. They could doubtless be modified to
> work with 60-bit reals, but the coordinate space would inevitably be
> smaller.
<
General question:: What good are 60-bit register containers if you end up
storing them in 64-bit memory containers?
>
> That would mean that models were portable between platforms with 64-bit
> reals, as they are at present, and models made with 60-bit reals could be
> imported into software running with 64-bit reals. But models that tried
> to use most of the coordinate space possible with 64-bit reals could not
> work correctly on platforms with 60-bit reals. It's not a compelling
> prospect for a software vendor, is it?
<
Importing 60-bit reals with 16-bit exponents will be problematic in
every respect (CDC 6600 and 7600). And it should be pointed out that
CDC 8600 went to 64-bit containers with a mostly compatible ISA.
{CDC 6600 used 60 bits because the implementation technology
could buffer up a single signal to close the operand latches in
3 gates of delay (5× per stage), so they could latch 125 bits but not
128 bits, AND CDC had 12-bit memory modules from earlier designs.}

Michael S

unread,
Jun 7, 2022, 6:31:08 PMJun 7
to
What is your home box and how many cores were used to get 50 MFlops QP?
If it's more than one core on modern Intel/AMD, it sounds rather poor.
In fact, on really modern Intel/AMD able to go above 4.5 GHz even on one core
it sounds poor.

MitchAlsup

unread,
Jun 7, 2022, 6:33:20 PMJun 7
to
On Tuesday, June 7, 2022 at 4:37:53 PM UTC-5, John Dallman wrote:
> In article <b1138a21-218c-4a4c...@googlegroups.com>,
> already...@yahoo.com (Michael S) wrote:
>
> > What is "good performance" ?
> > Is FPU bandwidth/latency a bottleneck now, or other parts (logic?
> > memory access?) more dominant? If the later, by how much?
> The primary bottleneck is memory bandwidth. Integer and branch
> performance comes next, and then FPU. The floating-point work tends to
> come in brief bursts. It isn't big matrix crunches: our kind of modelling
> does not work that way.
<
Agreed that memory BW and latency limit perf more than FP128.
But for FP kernels {LL, BLAS}, integer and branches hardly limit perf.

MitchAlsup

unread,
Jun 7, 2022, 6:38:54 PMJun 7
to
On Tuesday, June 7, 2022 at 4:47:16 PM UTC-5, BGB wrote:
> On 6/7/2022 4:02 PM, Ivan Godard wrote:
> > On 6/7/2022 12:25 PM, BGB wrote:

> > Kinda tough to be good for arbitrary ISA; not hard to be good for an ISA
> > family, selecting for the implemented optional features. See Mill genAsm.
> >
> For BGBCC and BJX2 I also have "ifarch", which can push decisions about
> which ASM to use, or which version of a function to generate code for,
> until code generation time.
>
> This mostly allows things like the C library, etc, to not need to be
> fully specialized for the target machine.
>
This is what VVM solves. Even the C (and other) library do not need to be
"optimized" for the machine at hand. The My 66000 machine at hand uses
VEC-LOOP to run original code at the perf possible in THAT implementation.
<
THAT is how you "get out of the game" of:
a) widening registers each iteration,
b) adding another 60 instructions per iteration
c) changing SW every iteration
d) changing compiler every iteration
e) changing libraries every iteration
f) specializing libraries when new SW is configured for a machine
g) you don't even need to configure SW per machine
.........
>
> However, it does not apply to the final binaries, so would likely
> require distributing programs as RIL3 bytecode or similar to make use of.
<
It does under VVM
>
> It is also nowhere near as "general purpose" as a traditional ifdef.
>
Unless #ifdef is unnecessary.
>
>
> The current design is also not particularly great for a lightweight
> compiler backend.
<
SIMD is just generally bad for everything associated with SW.

John Dallman

unread,
Jun 7, 2022, 7:34:28 PMJun 7
to
In article <0ae277a4-c63d-4156...@googlegroups.com>,
already...@yahoo.com (Michael S) wrote:

> Judging by the beginning of your post, it sounds likely that with arithmetic
> about 20 times slower than current double precision you would end up
> only 2-3 times slower at the full application level.

Quite possibly, but the number of customers who would use that would be
tiny. This modeller is used interactively.

John

MitchAlsup

unread,
Jun 7, 2022, 8:09:11 PMJun 7
to
Hardware often gets to where it needs to go by being "a little" wider--for example
HW can perform FDIV with a 57×57 multiplier tree rather than resort to FP128.
I, myself, need about 64 bits of fraction to correctly calculate the coefficients
to my fast transcendentals. But, since these are calculated before the die is
printed, speed is of little importance. I need a 58×58 multiplier in order to achieve
1 misrounding every ~2^37 calculations.
<
HW, it seems, forgets to give SW calculation increments that are appropriate
to the job at hand--precisely because of the problem of testing the device sufficiently.
{Something SW rarely does.}

BGB

unread,
Jun 7, 2022, 10:18:48 PMJun 7
to
On 6/7/2022 5:38 PM, MitchAlsup wrote:
> On Tuesday, June 7, 2022 at 4:47:16 PM UTC-5, BGB wrote:
>> On 6/7/2022 4:02 PM, Ivan Godard wrote:
>>> On 6/7/2022 12:25 PM, BGB wrote:
>
>>> Kinda tough to be good for arbitrary ISA; not hard to be good for an ISA
>>> family, selecting for the implemented optional features. See Mill genAsm.
>>>
>> For BGBCC and BJX2 I also have "ifarch", which can push decisions about
>> which ASM to use, or which version of a function to generate code for,
>> until code generation time.
>>
>> This mostly allows things like the C library, etc, to not need to be
>> fully specialized for the target machine.
>>
> This is what VVM solves. Even the C (and other) library do not need to be
> "optimized" for the machine at hand. The My 66000 machine at hand uses
> VEC-LOOP to run original code at the perf possible in THAT implementation.
> <
> THAT is how you "get out of the game" of:
> a) widening registers each iteration,
> b) adding another 60 instructions per iteration
> c) changing SW every iteration
> d) changing compiler every iteration
> e) changing libraries every iteration
> f) specializing libraries when new SW is configured for a machine
> g) you don't even need to configure SW per machine
> .........

More issues than just SIMD:
Whether the target has 32 or 64 GPRs (partial);
SIMD ops;
MOV.X (128-bit Load/Store);
RGB helper ops;
Single and Half precision Load/Store ops;
...

There are some bigger issues, like "sizeof(void *)", but (sadly) not
even the RIL3 bytecode is able to fully gloss over this (though, .NET
also has a similar issue related to 32-bit vs 64-bit targets with
C++/CLI, it is kind of a thorny issue).

There are also misc features, like whether or not the target has integer
divide or a 64-bit multiplier, but these cases come up less frequently, ...


>>
>> However, it does not apply to the final binaries, so would likely
>> require distributing programs as RIL3 bytecode or similar to make use of.
> <
> It does under VVM
>>
>> It is also nowhere near as "general purpose" as a traditional ifdef.
>>
> Unless #ifdef is unnecessary.

The main difference is that ifdef can be used to modify things at a
syntactic level, and pretty much any other aspect of the C language.

The scope of the ifarch mechanism is more limited:
Enabling or disabling functions and variables;
Enabling or disabling code blocks;
Functioning sort of like "if(0){block}" or "if(1){block}";
Enabling or disabling chunks of inline ASM.

But, notably, is unable to modify:
typedefs or struct/union declarations;
Tokens or other syntactic elements.


>>
>>
>> The current design is also not particularly great for a lightweight
>> compiler backend.
> <
> SIMD is just generally bad for everything associated with SW.

SIMD is actually the lesser of the issues here.

Unlike the "xmmintrin.h" stuff, BGBCC's vector extensions work
regardless of whether the SIMD instructions are enabled or not (if not,
they decay into scalar code or runtime calls, but this is typically not
really any worse than had one just been using scalar code).



At the machine code level, however:
Expecting a different number of GPRs;
Expecting a different pointer size;
...

Are more serious issues, as a mismatch here typically means the code
won't run at all.


Though, for the most part, the 64-bit ABI can retain some level of
binary compatibility by mostly pretending that XGPR doesn't exist (it is
still mostly left as something for ASM to mess with, if enabled);
however, ISRs still need to deal with it; and running code which uses
R0..R63 on a kernel that only saves R0..R31 on context switches would
be... not ideal...


Then again, IIRC, there was a similar sort of issue with trying to use
SSE on Windows 95.

Not sure if there was any good solution at the time...

Thomas Koenig

unread,
Jun 8, 2022, 1:39:19 AMJun 8
to
Michael S <already...@yahoo.com> schrieb:
> On Tuesday, June 7, 2022 at 11:48:45 PM UTC+3, Thomas Koenig wrote:
>> EricP <ThatWould...@thevillage.com> schrieb:
>> > What about double-doubles?
>> > Too slow?
>> On the same machine which gave 250 MFlops for a matrix
>> multiplication for IEEE qp, I also got around 50 MFlops for double
>> double (similar to libquadmath on my home box).
>
> What is your home box and how many cores were used to get 50 MFlops QP?

/proc/cpuinfo tells me it's an AMD Ryzen 7 1700X Eight-Core Processor,
so Zen 1. And I have no idea how it was clocked at the time.

It was a single core, just using gfortran's matmul (which is
OK for moderate sizes of double and has not been tuned at all
for 16-byte reals - we may well overflow the cache there).

> If it's more than one core on modern Intel/AMD, it sounds rather poor.
> In fact, on really modern Intel/AMD able to go above 4.5 GHz even on one core
> it sounds poor.

Like I said, it's my home box (which is about to be replaced anyway).

Thomas Koenig

unread,
Jun 8, 2022, 1:56:38 AMJun 8
to
MitchAlsup <Mitch...@aol.com> schrieb:
> On Tuesday, June 7, 2022 at 4:47:16 PM UTC-5, BGB wrote:
>> On 6/7/2022 4:02 PM, Ivan Godard wrote:
>> > On 6/7/2022 12:25 PM, BGB wrote:
>
>> > Kinda tough to be good for arbitrary ISA; not hard to be good for an ISA
>> > family, selecting for the implemented optional features. See Mill genAsm.
>> >
>> For BGBCC and BJX2 I also have "ifarch", which can push decisions about
>> which ASM to use, or which version of a function to generate code for,
>> until code generation time.
>>
>> This mostly allows things like the C library, etc, to not need to be
>> fully specialized for the target machine.
>>
> This is what VVM solves. Even the C (and other) library do not need to be
> "optimized" for the machine at hand. The My 66000 machine at hand uses
> VEC-LOOP to run original code at the perf possible in THAT implementation.
><
> THAT is how you "get out of the game" of:
> a) widening registers each iteration,
> b) adding another 60 instructions per iteration
> c) changing SW every iteration
> d) changing compiler every iteration
> e) changing libraries every iteration
> f) specializing libraries when new SW is configured for a machine
> g) you don't even need to configure SW per machine

You're right (especially since VVM can just be a simple loop on
simple machines). There is one aspect where it could still be
improved in My 66000, at least in the docs that you sent me (I
have ISA66000-ISA-20.pdf), and discounting the register issue
for VEC that came up when I sent Brian the Fortran codes as
test cases (different languages do have different requirements...)

Right now, My 66000 does every floating point operation in double
precision. This of course gives correct results.

But what if SIMD is used for implementing VVM? If you implement VVM
with a 256 bit-wide SIMD, My 66000 is restricted to four floating
point variables in parallel (as far as I can see). If all people
want to do is to operate on single precision, it could do eight
in parallel; for float16, it could have sixteen in parallel.

Or am I missing something?

Thomas Koenig

unread,
Jun 8, 2022, 1:58:11 AMJun 8
to
John Dallman <j...@cix.co.uk> schrieb:
Out of curiosity: What does it model?

John Dallman

unread,
Jun 8, 2022, 2:44:39 AMJun 8
to
In article <t7pdpg$iec$3...@newsreader4.netcologne.de>,
tko...@netcologne.de (Thomas Koenig) wrote:

> John Dallman <j...@cix.co.uk> schrieb:
> > Quite possibly, but the number of customers who would use that
> > would be tiny. This modeller is used interactively.
>
> Out of curiosity: What does it model?

If I don't say, I can speak a bit more freely.

John

Michael S

unread,
Jun 8, 2022, 3:42:52 AMJun 8
to
With such an approach, higher precision will *never* happen.
At any given point in time, double precision will be faster than quad.
Even if all common hardware has excellent 128-bit FPUs, double would still be
measurably faster due to its lower memory footprint.
If you want it to happen, you should set an absolute threshold on what quad
math is considered fast enough, rather than measuring it relative to the best double.

Michael S

unread,
Jun 8, 2022, 4:16:37 AMJun 8
to
On Wednesday, June 8, 2022 at 8:39:19 AM UTC+3, Thomas Koenig wrote:
> Michael S <already...@yahoo.com> schrieb:
> > On Tue