
Alternative Representations of the Concertina II ISA


Quadibloc

Nov 23, 2023, 11:16:02 PM
The Concertina II ISA, as it stands now, has a
very limited number of opcodes for load and
store memory-reference operations, but a huge
number of opcodes for register-to-register
operate instructions.

This makes it suitable for an implementation where:

Physical memory is some power-of-two multiple of
96 bits in width, for example, 192 bits or 384 bits
wide;

A simple hardware divide-by-three circuit allows
accessing memory normally with the RAM module width
of 64 bits as the fundamental unit;

Applying no conversion allows 24 bits to be the
fundamental unit;

An even simpler _multiply_ by five circuit allows
accessing memory with a 60-bit word as the
fundamental unit.
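As a concrete sketch of these three mappings (one possible reading of
the above, assuming a 192-bit-wide memory; not a specification):

  /* three 64-bit units per 192-bit row: divide-by-three */
  void map64(unsigned n, unsigned *row, unsigned *bit)
  { *row = n / 3; *bit = (n % 3) * 64; }

  /* eight 24-bit units per row: a pure power-of-two split */
  void map24(unsigned n, unsigned *row, unsigned *bit)
  { *row = n >> 3; *bit = (n & 7) * 24; }

  /* 60 bits = five 12-bit granules, sixteen granules per row:
     multiply-by-five, then a power-of-two split */
  void map60(unsigned n, unsigned *row, unsigned *bit)
  { unsigned g = 5 * n; *row = g >> 4; *bit = (g & 15) * 12; }

For example, map60(4) gives row 1, bit 48: the fifth 60-bit word
starts 48 bits into the second 192-bit row.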

So it's possible to take an implementation of the
architecture with three-channel memory, and add the
feature of allocating data memory blocks that look
like they're made up of 48-bit words or 60-bit words.

When a memory-reference instruction refers to an
address which is inside of such a variant block,
its operation would be modified to handle a data
type of an appropriate width for that kind of memory.

Thus:

In 60-bit memory,

Load Floating would be an invalid operation;
Load Double would load a 60-bit floating-point number;
Load Medium would load a 45-bit floating-point number, as
such a float would still be wide enough to be useful.

Load Byte would be an invalid operation.
Load Halfword would load a 15-bit integer;
Load would load a 30-bit integer;
Load Long would load a 60-bit integer.

In 36-bit memory,

Load Floating would load a 36-bit floating-point number;
Load Double would load a 72-bit floating-point number;
Load Medium would load a 54-bit floating-point number;

Load Byte would load a 9-bit byte
Load Halfword would load an 18-bit halfword integer
Load would load a 36-bit integer
Load Long would load a 72-bit integer into a _pair_
of registers, the first of which would be an even-numbered
register, as the registers are only 64 bits long

In 24-bit memory,

Load Floating would be an invalid operation
Load Double would load a 72-bit floating-point number;
Load Medium would load a 48-bit floating-point number;

Load Byte would load a 6-bit character;
Load Halfword would load a 12-bit integer;
Load would load a 24-bit integer;
Load Long would load a 48-bit integer.

Note that the relationship between floating-point
widths and integer widths in instructions is offset
by a factor of two when 24-bit memory is referenced.

And there would be separate opcodes for doing arithmetic
on numbers of these different lengths in the registers.

This is all very well. But in program code, which, as
it is so far specified, can _only_ reside in 64-bit memory,
what about pseudo-immediate values?

Since the instruction format that references pseudo-immediate
values is that of an operate instruction, pseudo-immediates
_could_ be allowed, with considerable wasted space in most
cases, in operate instructions for the data types associated
with variant memory widths.

However, I have thought of a way to allow the CPU to treat
the different memory widths with greater equality.

Instead of devising a whole new ISA for each of the other
memory widths, the *existing* ISA, designed around standard
memory built up from the 64-bit unit, could be given
_alternative representations_ wherein programs could be
stored in memory of other widths, using the same ISA with
minimum modifications.

The 36-bit Representation

In the standard 64-bit representation, a block may begin
with 1100, followed by a 2-bit prefix field for each of the
fourteen 16-bit parts of the remaining seven 32-bit
instruction slots in the block.

So, make that portion of the ISA the only one supported in
the 36-bit representation, but with each block now 288 bits
long, made up of eight 36-bit words, and consisting of a
2-bit prefix field followed by 16 instruction bits, repeated
sixteen times.

But it must still be possible to have block headers if one is
to have pseudo-immediates.

So the first 36-bit instruction slot of a block, if it is to
be a header, would start with 1011100011, and the prefix field
portion of the second 18 bits of that instruction slot would
contain 11. 1011100011 consists of 10 (32-bit instruction)
and 11100011 (header prefix in 32-bit mode).
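Purely as a hypothetical sketch (the bit numbering within the word is
an assumption, not something specified above), that header test
amounts to:

  #include <stdint.h>

  /* nonzero if the first 36-bit word of a block (in the low 36 bits
     of w) is a header: its first 18-bit half starts with 1011100011,
     and the prefix field of its second half is 11 */
  int is_36bit_block_header(uint64_t w)
  {
      unsigned hi18 = (unsigned)((w >> 18) & 0x3FFFF);
      unsigned lo18 = (unsigned)(w & 0x3FFFF);
      return ((hi18 >> 8) == 0x2E3) &&   /* 0x2E3 = 1011100011 binary */
             ((lo18 >> 16) == 0x3);      /* prefix bits of second half */
  }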

So the ISA is modified because in 64-bit memory, only seven
instruction slots, not all eight, can contain 36-bit instructions,
and there are no headers inside 36-bit instructions. But the
modifications are minimal, for the purpose of adapting programs
to the different size of memory, with that size being native
for them.

The 24-bit representation

Since a 24-bit word contains three 8-bit bytes, here, the
form of instructions isn't modified at all. But blocks shrink
from eight instruction slots that are 32 bits in length to
eight 24-bit words, so a block would be 192 bits long.

The amount of wasted space for data in 24-bit types wouldn't
be changed much in many cases by using this representation
instead of the 32-bit representation for program code,
because the clash between instruction widths and data widths
wouldn't be affected. However, _some_ improvement would be
achieved, because now if there are multiple pseudo-immediate
values used in a single block, the 24-bit data items could
at least be *packed* correctly in the block.

The 60-bit representation

Here, a block is 480 bits long.

It consists of two 240-bit sub-blocks, in the following form:

One 15-bit halfword containing 15 prefix bits,

and fifteen 15-bit halfwords to which those bits apply,
turning them into 16-bit halfwords, which are used to form
instructions.
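A minimal sketch of unpacking one such sub-block (which prefix bit
applies to which halfword is an assumption here):

  #include <stdint.h>

  /* sub[0] holds the fifteen prefix bits in its low 15 bits;
     sub[1..15] hold the 15-bit halfwords; out[] receives the
     reconstituted 16-bit halfwords */
  void unpack_subblock(const uint16_t sub[16], uint16_t out[15])
  {
      for (int i = 0; i < 15; i++) {
          unsigned prefix = (sub[0] >> (14 - i)) & 1;
          out[i] = (uint16_t)((prefix << 15) | (sub[i + 1] & 0x7FFF));
      }
  }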

The blocks are aligned on 480-bit boundaries. When code
consists of 32-bit instructions, the first sub-block begins
with an instruction, and the second sub-block begins with the
second half of an instruction.

The first whole 32-bit instruction in a sub-block can be a
header, and so one has a header applying to seven subsequent
instruction slots in the first sub-block, and a header applying
to six subsequent instruction slots in the second sub-block.

John Savard

Quadibloc

Nov 23, 2023, 11:42:35 PM
On Fri, 24 Nov 2023 04:15:57 +0000, Quadibloc wrote:

> In 24-bit memory,
>
> Load Floating would be an invalid operation;
> Load Double would load a 72-bit floating-point number;
> Load Medium would load a 48-bit floating-point number;

Alternatively, it would perhaps be simpler to do it this way:

Load Floating would load a 48-bit floating-point number;
Load Medium would load a 72-bit floating-point number.

This would prioritize structure over function, but only
the compiler would ever know...

The 24-bit representation of code, however, has a bigger problem.

In the 60-bit representation, a 15-bit halfword that can be
addressed easily in 60-bit memory corresponds to a 16-bit area
in the original ISA;

In the 36-bit representation, an 18-bit halfword, easily addressable
in 36-bit memory, does the same.

In the 24-bit representation, though, a problem exists; the simplest
way to deal with that would be, in code consisting entirely of 32-bit
instructions, to allow only every third instruction to be a branch
target.

John Savard

MitchAlsup

Nov 24, 2023, 1:51:02 PM
I have not found a need to have LD/ST floating point instructions
because I don't have FP register file(s) {or SIMD files}. Why do
you seem to want/need a file dedicated to floating point values ??
This saves a huge chunk of OpCode Space.......

Quadibloc

Nov 24, 2023, 5:38:36 PM
On Fri, 24 Nov 2023 18:48:53 +0000, MitchAlsup wrote:

> I have not found a need to have LD/ST floating point instructions
> because I don't have FP register file(s) {or SIMD files}. Why do you
> seem to want/need a file dedicated to floating point values ??
> This saves a huge chunk of OpCode Space.......

Well, for one thing, by giving the integer functional unit one register
file, and the floating-point functional unit its own register file, then
those two functional units are not constrained to be adjacent.

I can stick the Decimal Floating-Point functional unit on the other side
of the floating register file, and the Simple Floating functional unit,
or the Register Packed Decimal functional unit, on the other side of the
integer register file.

But of course my main reason is that this is the way everyone (i.e.
the IBM System/360) does it, so of course I should do it that way!

But having an excess of data types is at least a helpful excuse...

John Savard

Quadibloc

Nov 24, 2023, 5:41:08 PM
On Fri, 24 Nov 2023 04:42:31 +0000, Quadibloc wrote:

> On Fri, 24 Nov 2023 04:15:57 +0000, Quadibloc wrote:
>
>> In 24-bit memory,
>>
>> Load Floating would be an invalid operation;
>> Load Double would load a 72-bit floating-point number;
>> Load Medium would load a 48-bit floating-point number;
>
> Alternatively, it would perhaps be simpler to do it this way:
>
> Load Floating would load a 48-bit floating-point number;
> Load Medium would load a 72-bit floating-point number.
>
> This would prioritize structure over function, but only the compiler
> would ever know...

Silly me. How could I have forgotten that 24-bit memory is the
only place where the "ideal" lengths of floating-point numbers
can exist, so I should use them, rather than try to concoct
something that's 24-bit native.

So

Load Floating would load a 36-bit floating-point number,
Load Medium would load a 48-bit floating-point number, and
Load Double would load a 60-bit floating-point number.

John Savard

Quadibloc

Nov 24, 2023, 5:44:37 PM
On Fri, 24 Nov 2023 22:38:32 +0000, Quadibloc wrote:

> On Fri, 24 Nov 2023 18:48:53 +0000, MitchAlsup wrote:
>
>> I have not found a need to have LD/ST floating point instructions
>> because I don't have FP register file(s) {or SIMD files}. Why do you
>> seem to want/need a file dedicated to floating point values ??
>> This saves a huge chunk of OpCode Space.......

Also, because IEEE 754 calls for a hidden first bit and
gradual underflow, these things make floating-point
arithmetic slightly slower.

Hence, I deal with them at load and store time, by using
the load and store floating point instructions to convert
to and from an *internal representation* of floats. That
in itself precludes using the integer instructions for them.
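Something in the spirit of the following sketch (the internal field
layout here is invented for illustration, not the actual Concertina II
format) is what such a load-time conversion implies:

  #include <stdint.h>

  typedef struct { unsigned sign; int exp; uint64_t sig; } InternalFP;

  /* expand an IEEE 754 double so the arithmetic units never see the
     hidden bit or a subnormal: value = (-1)^sign * sig * 2^(exp-52) */
  InternalFP load_convert(uint64_t bits)
  {
      InternalFP r;
      unsigned e = (unsigned)((bits >> 52) & 0x7FF);
      r.sign = (unsigned)(bits >> 63);
      r.sig  = bits & 0xFFFFFFFFFFFFFull;
      if (e != 0) {
          r.exp = (int)e - 1023;
          r.sig |= 1ull << 52;       /* make the hidden bit explicit */
      } else {
          r.exp = -1022;             /* subnormal: pre-normalize now */
          while (r.sig && !(r.sig >> 52)) { r.sig <<= 1; r.exp--; }
      }
      return r;
  }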

John Savard

Quadibloc

Nov 24, 2023, 5:54:01 PM
On Fri, 24 Nov 2023 04:15:57 +0000, Quadibloc wrote:

> An even simpler _multiply_ by five circuit allows accessing memory with
> a 60-bit word as the fundamental unit.

As well, a multiply by three circuit is used to address the 36-bit memory.

> Instead of devising a whole new ISA for each of the other memory widths,
> the *existing* ISA, designed around standard memory built up from the
> 64-bit unit, could be given _alternative representations_ wherein
> programs could be stored in memory of other widths, using the same ISA
> with minimum modifications.

While this is an "improvement", I do realize that this, along with
supporting multiple memory widths at all, is sheer insanity, even if
I have found ways that it can technically be done.

I do have a rationale for going there. It's simple enough. It comes from
the x86.

Since the irresistible force of availability of applications means that
we are doomed to always have just *one* dominant ISA...

then, to minimize the terrible loss that this entails, that one dominant
ISA should be maximally flexible. To approach, as closely as possible, the
"good old days" where the PDP-10 and the CDC 6600 coexisted alongside the
PDP-11 and the IBM System/360.

This is the driving force behind my insane quest to design an ISA for a CPU
which has the ability to do something which no one seems to want.

John Savard

MitchAlsup

Nov 25, 2023, 3:02:22 PM
Quadibloc wrote:

> On Fri, 24 Nov 2023 18:48:53 +0000, MitchAlsup wrote:

>> I have not found a need to have LD/ST floating point instructions
>> because I don't have FP register file(s) {or SIMD files}. Why do you
>> seem to want/need a file dedicated to floating point values ??
>> This saves a huge chunk of OpCode Space.......

> Well, for one thing, by giving the integer functional unit one register
> file, and the floating-point functional unit its own register file, then
> those two functional units are not constrained to be adjacent.

But Integer Multiply and Divide can share the FPU that does these.
And we have the need to use FP values in deciding to take branches.
So there is a lot of cross pollination of types to FUs. And finally
you have to have 2 function units, one to convert from FP to Int and
one to convert Int to FP.

Let's keep them apart, and now we need an FU bigger than the FMAC unit
to perform integer multiplies and divides. In addition, you need a Bus
from the FPU side to the Condition unit that cooperates with integer
branch instructions.

Does not look like you can keep them "that far apart" and even when you
do there is lots of cross wiring--some of which are more costly in HW
than keeping FP and Int in the same RF.

BGB

Nov 25, 2023, 3:46:40 PM
On 11/24/2023 4:44 PM, Quadibloc wrote:
> On Fri, 24 Nov 2023 22:38:32 +0000, Quadibloc wrote:
>
>> On Fri, 24 Nov 2023 18:48:53 +0000, MitchAlsup wrote:
>>
>>> I have not found a need to have LD/ST floating point instructions
>>> because I don't have FP register file(s) {or SIMD files}. Why do you
>>> seem to want/need a file dedicated to floating point values ??
>>> This saves a huge chunk of OpCode Space.......
>
> Also, because IEEE 754 calls for a hidden first bit, and
> gradual underflow, these things make floating-point
> arithmeitc slightly slower.
>

DAZ/FTZ fixes the gradual underflow issue...
Not the standard solution, but works.


One could maybe also make a case for truncate-only rounding:
Easier to make bit-exact across implementations;
Cheapest option;
...

Though, truncate does occasionally have visible effects on code.
For example, the player's yaw angle in my Quake port was prone to drift,
had to manually add a small bias to stop the drifting.
...

Say, Round-Nearest at least having the advantage that looping relative
computations don't have a tendency to drift towards 0. Though, on the
other side, computations which are stable with truncate rounding will
tend to drift away from zero with round-nearest.

In this case, this applied mostly to 'float' and 'short float' (when set
to be routed through the low-precision unit via a compiler flag), with
the main FPU (used for 'double'; and the others with default options)
still doing proper round-nearest.



I don't imagine a non-hidden bit representation would be much better here.

Renormalization would still be required regularly to avoid other issues,
and manual re-normalization would negatively affect performance.

Operations like FADD/FSUB would also still need to be able to
shift-align operands to be able to do their thing, so non-normalized
values would not save much here either.


Well, with the partial exception that one could instead re-interpret
the floating-point format as more like variable fixed-point, in which
case the compiler would need to deal with these issues (and probably
the programmer as well, say, to specify how many bits are above or
below the decimal point).

But, in this case, may as well save a few bits and just use "normal"
fixed-point (or maybe add a compiler type, like "__fixed(16,16)" or
similar, then have the compiler deal with the shifts and similar...).


> Hence, I deal with them at load and store time, by using
> the load and store floating point instructions to convert
> to and from an *internal representation* of floats. That
> in itself precludes using the integer instructions for them.
>

Possible.

I went with integer operations myself...



> John Savard

Quadibloc

Nov 26, 2023, 8:30:00 PM
On Sat, 25 Nov 2023 19:55:59 +0000, MitchAlsup wrote:

> But Integer Multiply and Divide can share the FPU that does these.

But giving each one its own multiplier means more superscalar
goodness!

But you _do_ have a good point. This kind of "superscalar goodness"
is usually wasted, because most programs don't do an equal amount
of integer and FP arithmetic. So it would be way better to have
twice as many multipliers and dividers that are designed to be used
either way than having units of two kinds.

I'm assuming, though, that one can make the multipliers and dividers
simpler, hence faster, by making them single-use, and that it's
possible to know, for a given use case, what the ratio between integer
and FP operations will be. After all, integer multiplication discards
the MSBs of the result, and floating-point multiplication discards the
LSBs of the result.

> And we have the need to use FP values in deciding to take branches.

What gets used in taking branches is the _condition codes_. They're
conveyed from the integer and floating point functional units, as well
as all the other functional units, to a central place (the program
status word).

John Savard

MitchAlsup

Nov 26, 2023, 9:02:06 PM
Quadibloc wrote:

> On Sat, 25 Nov 2023 19:55:59 +0000, MitchAlsup wrote:

>> But Integer Multiply and Divide can share the FPU that does these.

> But giving each one its own multiplier means more superscalar
> goodness!

> But you _do_ have a good point. This kind of "superscalar goodness"
> is usually wasted, because most programs don't do an equal amount
> of integer and FP arithmetic. So it would be way better to have
> twice as many multipliers and dividers that are designed to be used
> either way than having units of two kinds.

> I'm assuming, though, that one can make the multipliers and dividers
> simpler, hence faster, by making them single-use, and that it's
> possible to know, for a given use case, what the ratio between integer
> and FP operations will be.

In non-scientific code, there are 10 integer mul/divs per 1 FP mul/div
In Scientific code, there are 2 integer mul/divs per 10 FP mul/divs.

> After all, integer multiplication discards
> the MSBs of the result, and floating-point multiplication discards the
> LSBs of the result.

Means almost nothing because an FMAC can end up with those very same bits
as are preferred for Int ×s

>> And we have the need to use FP values in deciding to take branches.

> What gets used in taking branches is the _condition codes_. They're
> conveyed from the integer and floating point functional units, as well
> as all the other functional units, to a central place (the program
> status word).

LD R4,[address of FP value]
BFEQ0 R4,some label

How do you do this on a condition code machine in 2 instructions ??
System/360 had to::

LD F4,[address of FP value]
TST F4
BEQ some label

> John Savard

BGB

Nov 27, 2023, 1:18:42 AM
On 11/26/2023 7:29 PM, Quadibloc wrote:
> On Sat, 25 Nov 2023 19:55:59 +0000, MitchAlsup wrote:
>
>> But Integer Multiply and Divide can share the FPU that does these.
>
> But giving each one its own multiplier means more superscalar
> goodness!
>
> But you _do_ have a good point. This kind of "superscalar goodness"
> is usually wasted, because most programs don't do an equal amount
> of integer and FP arithmetic. So it would be way better to have
> twice as many multipliers and dividers that are designed to be used
> either way than having units of two kinds.
>
> I'm assuming, though, that one can make the multipliers and dividers
> simpler, hence faster, by making them single-use, and that it's
> possible to know, for a given use case, what the ratio between integer
> and FP operations will be. After all, integer multiplication discards
> the MSBs of the result, and floating-point multiplication discards the
> LSBs of the result.
>

If one wants to have the most superscalar goodness, for general case
code, this means mostly lots of simple ALU ops (particularly,
ADD/SUB/AND/OR/SHIFT), and ability to do *lots* of memory Load/Store ops.

Where, say, Load/Store ops and simple ALU ops tend to dominate over
pretty much everything else in terms of usage frequency.


In most code, FPU ops are comparably sparse, as is MUL, and DIV/MOD is
fairly rare in comparison to either. MUL is common enough that you want
it to be fast when it happens, but not so common to where one is dealing
with piles of them all at the same time (except maybe in edge cases,
like the DCT transform in JPEG/MPEG).

If FMUL and MUL can't be co-issued, this is unlikely to matter.


DIV is uncommon enough that it can be 80 or 100 cycles or similar, and
most code will not notice (except in rasterizers or similar, which
actually need fast DIV, but can cheat in that they don't usually need
exact DIV and the divisors are small, allowing for "shortcuts", such as
using lookup tables of reciprocals or similar).


Otherwise, one can debate whether or not having DIV/MOD in hardware
makes sense at all (and if they do have it, "cheap-ish" 68 cycle DIV is
at least "probably" faster than a generic software-only solution).

For cases like divide-by-constant, also, typically it is possible to
turn it in the compiler into a multiply by reciprocal.



>> And we have the need to use FP values in deciding to take branches.
>
> What gets used in taking branches is the _condition codes_. They're
> conveyed from the integer and floating point functional units, as well
> as all the other functional units, to a central place (the program
> status word).
>

Yeah, better to not have this style of condition codes...

Like, there are other ways of doing conditional branches...



> John Savard

Anton Ertl

Nov 27, 2023, 4:44:43 AM
BGB <cr8...@gmail.com> writes:
>On 11/26/2023 7:29 PM, Quadibloc wrote:
>> On Sat, 25 Nov 2023 19:55:59 +0000, MitchAlsup wrote:
>>
>>> But Integer Multiply and Divide can share the FPU that does these.
>>
>> But giving each one its own multiplier means more superscalar
>> goodness!

Having two multipliers that serve both purposes means even more
superscalar goodness for similar area cost. However, there is the
issue of latency. The Willamette ships the integers over to the FPU
for multiplication, and the result back, crossing several clock
domains (at one clock loss per domain crossing), resulting in a
10-cycle latency for integer multiplication. I think that these days
every high-performance core with real silicon invests into separate
GPR and FP/SIMD (including integer SIMD) multipliers.

>In most code, FPU ops are comparably sparse

In terms of executed ops, that depends very much on the code. GP
cores have acquired SIMD cores primarily for FP ops, as can be seen by
both SSE and supporting only FP at first, and only later adding
integer stuff, because it cost little extra. Plus, we have added GPUs
that are now capable of doing huge amounts of FP ops, with uses in
graphics rendering, HPC and AI.

>Otherwise, one can debate whether or not having DIV/MOD in hardware
>makes sense at all (and if they do have it, "cheap-ish" 68 cycle DIV is
>at least "probably" faster than a generic software-only solution).

That debate has been held, and MIPS has hardware integer divide, Alpha
and IA-64 don't have a hardware integer divide; they both have FP
divide instructions.

However, looking at more recent architectures, the RISC-V M extension
(which is part of RV64G and RV32G, i.e., a standard extension) has not
just multiply instructions (MUL, MULH, MULHU, MULHSU, MULW), but also
integer divide instructions: DIV, DIVU, REM, REMU, DIVW, DIVUW, REMW,
and REMUW. ARM A64 also has divide instructions (SDIV, UDIV), but
RISC-V seems significant to me because there the philosophy seems to
be to go for minimalism. So the debate has apparently come to the
conclusion that for general-purpose architectures, you include an
integer divide instruction.

>For cases like divide-by-constant, also, typically it is possible to
>turn it in the compiler into a multiply by reciprocal.

But for that you want a multiply instruction, which in the RISC-V case
means including the M extension, which also includes divide
instructions. Multiplying by the reciprocal may still be faster.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Quadibloc

Nov 27, 2023, 8:27:05 AM
On Mon, 27 Nov 2023 09:17:59 +0000, Anton Ertl wrote:

> That debate has been held, and MIPS has hardware integer divide, Alpha
> and IA-64 don't have a hardware integer divide; they both have FP divide
> instructions.

I didn't know this about the Itanium. All I remembered hearing was that
the Itanium "didn't have a divide instruction", and so I didn't realize
this applied to fixed-point arithmetic only.

John Savard

Anton Ertl

Nov 27, 2023, 11:07:49 AM
You are right, I was wrong.
<http://gec.di.uminho.pt/discip/minf/ac0203/icca03/ia64fpbf1.pdf>
says:

|A number of floating-point operations defined by the IEEE Standard are
|deferred to software by the IA-64 architecture in all its
|implementations
|
|* floating-point divide (integer divide, which is based on the
| floating-point divide operation, is also deferred to software)

The paper goes on to describe that FP division a/b is based on
determining 1/b through Newton-Raphson approximation, using the FMA
instruction, and then multiplying with a. It shows a sequence of 13
instructions for double precision, 1 frcpa and 12 fma or fnma
instructions.

Given that integer division is based on FP division, it probably takes
even more instructions.
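For reference, the iteration the paper describes looks roughly like
this in C (a sketch only: the linear seed below stands in for frcpa,
and a real implementation needs a final FMA-based correction step to
be correctly rounded):

  #include <math.h>

  double nr_div(double a, double b)
  {
      int e;
      double x = 2.0 * frexp(b, &e);          /* x in [1,2); b = x * 2^(e-1) */
      double y = 48.0/17.0 - (32.0/17.0) * x; /* ~4-bit reciprocal seed */
      for (int i = 0; i < 4; i++)             /* accurate bits double each step */
          y = fma(y, fma(-x, y, 1.0), y);     /* y += y * (1 - x*y) */
      return a * ldexp(y, 1 - e);             /* a * (1/x) * 2^(1-e) */
  }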

Meanwhile, a lowly Cortex-A53 with its divide instruction produces an
integer division result with a small quotient in a few cycles. In
other words, if you want to use that method for division, still
provide a division instruction, and then implement it with that
method. This provides the flexibility to switch to some other method
(e.g., the one used by ARM) later.

BGB

Nov 27, 2023, 2:41:20 PM
On 11/27/2023 3:17 AM, Anton Ertl wrote:
> BGB <cr8...@gmail.com> writes:
>> On 11/26/2023 7:29 PM, Quadibloc wrote:
>>> On Sat, 25 Nov 2023 19:55:59 +0000, MitchAlsup wrote:
>>>
>>>> But Integer Multiply and Divide can share the FPU that does these.
>>>
>>> But giving each one its own multiplier means more superscalar
>>> goodness!
>
> Having two multipliers that serve both purposes means even more
> superscalar goodness for similar area cost. However, there is the
> issue of latency. The Willamette ships the integers over to the FPU
> for multiplication, and the result back, crossing several clock
> domains (at one clock loss per domain crossing), resulting in a
> 10-cycle latency for integer multiplication. I think that these days
> every high-performance core with real silicon invests into separate
> GPR and FP/SIMD (including integer SIMD) multipliers.
>

I ended up with different multipliers mostly because the requirements
are different...


>> In most code, FPU ops are comparably sparse
>
> In terms of executed ops, that depends very much on the code. GP
> cores have acquired SIMD cores primarily for FP ops, as can be seen by
> both SSE and supporting only FP at first, and only later adding
> integer stuff, because it cost little extra. Plus, we have added GPUs
> that are now capable of doing huge amounts of FP ops, with uses in
> graphics rendering, HPC and AI.
>

Yeah, this was not to say there is not FPU dense code, or that FP-SIMD
is not useful, but rather that in most general code, ALU and LD/ST ops
tend to dominate by a fair margin.

Similarly, SIMD ops may still be useful, even if they are a relative
minority of the executed instructions (even in code sequences which are
actively using SIMD ops...).


Like, typically, for every SIMD op used, there are also things like:
The loads and stores to get the value from memory and put the results in
memory;
ALU ops to calculate the index into the array or similar that we are
loading from or storing to;
...

Well, along with other ops, like shuffles and similar to get the SIMD
elements into the desired order, etc.


Like, some of this is why it is difficult to get anywhere near the
theoretical 200 MFLOP of the SIMD unit, apart from very contrived
use-cases (such as running neural net code), and had involved wonk, like
operations which combined a SIMD shuffle into the SIMD ADD/SUB/MUL ops.


For a lot of other use cases, I can just be happy enough that the SIMD
ops are "not slow".


>> Otherwise, one can debate whether or not having DIV/MOD in hardware
>> makes sense at all (and if they do have it, "cheap-ish" 68 cycle DIV is
>> at least "probably" faster than a generic software-only solution).
>
> That debate has been held, and MIPS has hardware integer divide, Alpha
> and IA-64 don't have a hardware integer divide; they both have FP
> divide instructions.
>

Technically, also, BJX2 has ended up having both Integer DIV and FDIV
instructions. But, they don't gain all that much, so are still left as
optional features.

The integer divide isn't very fast, but it doesn't matter if it isn't
used all that often.


The FDIV is effectively a boat anchor (around 122 clock cycles).
Though, mostly this was based on the observation that with some minor
tweaks, the integer divide unit could be made to perform floating-point
divide as well.

The main merit though (over a software N-R divider) is that it can
apparently give exact results (my N-R dividers generally can't converge
past the low order 4 bits).



> However, looking at more recent architectures, the RISC-V M extension
> (which is part of RV64G and RV32G, i.e., a standard extension) has not
> just multiply instructions (MUL, MULH, MULHU, MULHSU, MULW), but also
> integer divide instructions: DIV, DIVU, REM, REMU, DIVW, DIVUW, REMW,
> and REMUW. ARM A64 also has divide instructions (SDIV, UDIV), but
> RISC-V seems significant to me because there the philosophy seems to
> be to go for minimalism. So the debate has apparently come to the
> conclusion that for general-purpose architectures, you include an
> integer divide instruction.
>

Yeah.

I mostly ended up adding integer divide so that the RISC-V mode could
support the 'M' extension, and if I have it for RISC-V, may as well also
add it in BJX2 as its own optional extension.


Had also added a "FAZ DIV" special case that made the integer divide
faster, where over a limited range of input values, the integer divide
would be turned into a multiply by reciprocal.

So, say, DIV latency:
64-bit: 68 cycle
32-bit: 36 cycle
32-bit FAZ: 3 cycle.

As it so happens, FAZ also covers a similar range to that typically used
for a rasterizer, but is more limited in that it only handles a range of
values it can calculate exactly. For my hardware rasterizer, a similar
strategy was used, but extended to support bigger divisors at the
tradeoff that it becomes inexact with larger divisors.

However, adding a "Divide two integers quickly, but may give an
inaccurate result" instruction would be a bit niche (and normal C code
couldn't use it without an intrinsic or similar).

Partial observation is that mostly, the actual bit patterns in the
reciprocals tends to be fairly repetitive, so it is possible to
synthesize a larger range of reciprocals using lookup tables and
shift-adjustments.

In the hardware rasterizer, I had experimented with a fixed-point 1/Z
divider for "more accurate" perspective-correct rasterization, but this
feature isn't cheap (fixed-point reciprocal is more complicated), and
ended up not using it for now (in favor of merely dividing S and T by Z
and then multiplying by the interpolated Z again during rasterization,
as a sort of less accurate "poor man's" version).

The "poor man's perspective correct" strategy doesn't eliminate the need
to subdivide primitives, but does allow the primitives to be somewhat
larger without having as much distortion (mostly relevant as my GL
renderer is still mostly CPU bound in the transform and subdivision
stages, *1).


In theory, proper perspective correct would entirely eliminate the need
to subdivide primitives, but would require clipping geometry to the
viewing frustum (otherwise, the texture coordinates freak out for
primitives crossing outside the frustum or crossing the near plane).

Approximate FDIV is possible, but I have typically used a different
strategy for the reciprocal.



*1: Though, ironically, this does also mean that, via multi-texturing,
it is semi-viable to also use lightmap lighting again in GLQuake (since
this doesn't add too much CPU penalty over the non-multitexture case).


However, dynamic lightmaps still aren't viable, as the CPU-side cost of
drawing the dynamic light-sources to the lightmaps, and then uploading
them, is still a bit too much.

I suspect, probably in any case, GLQuake wasn't likely written to run on
a 50MHz CPU.


Might have been a little easier if the games had poly-counts and
draw distances more on par with PlayStation games (say, if the whole
scene is kept less than 500 polys; rather than 1k-2k polys).

Say, Quake being generally around 1k-2k polys for the scene, and roughly
300-500 triangles per alias model, ... Despite the scenes looking crude
with large flat surfaces, most of these surfaces had already been hacked
up a fair bit by the BSP algorithm (though, theoretically, some
annoyances could be reduced if the Quake1 maps were rebuilt using a
modified Q3BSP or similar, as the Quake3 tools natively support vertex
lighting and also don't hack up the geometry quite as much as the
original Quake1 BSP tool did, but alas...).

Wouldn't save much, as I still couldn't legally redistribute the data
files (and my GLQuake port isn't particularly usable absent using a
process to convert all the textures into DDS files and rebuilding the
PAK's and similar, so...).

To have something redistributable, would need to replace all of the
textures, sound effects, alias models, etc. An old (probably long dead)
"OpenQuartz" project had partly done a lot of this, but "got creative"
with the character models in a way I didn't like (would have preferred
something at least "in the same general spirit" as the original models).

Similar sort of annoyances as with FreeDoom, but at least FreeDoom
stayed closer to the original in these areas.



Also, possibly, I may need to rework some things in terms of how TKRA-GL
works to better match up with more recent developments (all this is
still a bit hacky; effectively linking the whole GL implementation to
the binary that uses GL).

Granted, it is also possible I may need to at some point move away from
linking the whole TestKern kernel to any programs that use it as well,
with the tradeoff that programs would no longer be able to launch "bare
metal" in the emulator (but would always need the kernel to also be
present).


>> For cases like divide-by-constant, also, typically it is possible to
>> turn it in the compiler into a multiply by reciprocal.
>
> But for that you want a multiply instruction, which in the RISC-V case
> means including the M extension, which also includes divide
> instructions. Multiplying by the reciprocal may still be faster.
>

Yeah.

For the relevant cases, widening multiply for 32-bit integers generally
has a 3 cycle latency in my case.

Whereas DIV will often have a 36 cycle latency in the general case.
So, the choice is obvious.

Granted, this still generally needs a shift instruction following the
multiply.


Having a "Do a widening 32-bit multiply and then right shift the results
32/33/34/35 bits" would be possible in theory, but a little wonky.

Could possibly make sense if it could also encode a 32-bit immediate:
DMULSH33S Rm, Imm32u, Rn //Int32: Rn=(Rm*Imm32)>>33
DMULSH33U Rm, Imm32u, Rn //UInt32: Rn=(Rm*Imm32)>>33

Probably don't really have the encoding space to pull this off though
(with a 64-bit encoding; Could be doable in theory with an 96-bit
FE-FF-OP encoding though).

Though, as can be noted, this could likely borrow some of the logic from
the FAZ-DIV mechanism or similar.

Would allow handling "y=x/C;" cases in 3L/1T, though integer
division still isn't really all that common (at least, not enough to
justify going through all this just to save using a right-shift op).


> - anton

MitchAlsup

Nov 27, 2023, 8:02:08 PM
BGB wrote:

> On 11/26/2023 7:29 PM, Quadibloc wrote:
>> On Sat, 25 Nov 2023 19:55:59 +0000, MitchAlsup wrote:
>>
>>> But Integer Multiply and Divide can share the FPU that does these.
>>
>> But giving each one its own multiplier means more superscalar
>> goodness!
>>
>> But you _do_ have a good point. This kind of "superscalar goodness"
>> is usually wasted, because most programs don't do an equal amount
>> of integer and FP arithmetic. So it would be way better to have
>> twice as many multipliers and dividers that are designed to be used
>> either way than having units of two kinds.
>>
>> I'm assuming, though, that one can make the multipliers and dividers
>> simpler, hence faster, by making them single-use, and that it's
>> possible to know, for a given use case, what the ratio between integer
>> and FP operations will be. After all, integer multiplication discards
>> the MSBs of the result, and floating-point multiplication discards the
>> LSBs of the result.
>>

> If one wants to have the most superscalar goodness, for general case
> code, this means mostly lots of simple ALU ops (particularly,
> ADD/SUB/AND/OR/SHIFT), and ability to do *lots* of memory Load/Store ops.

> Where, say, Load/Store ops and simple ALU ops tend to dominate over
> pretty much everything else in terms of usage frequency.

LD+ST 33%
Int 45%

> In most code, FPU ops are comparably sparse,

FP 9%
> as is MUL,
MUL 2%
> and DIV/MOD is
DIV 0.3%
> fairly rare in comparison to either. MUL is common enough that you want
> it to be fast when it happens, but not so common to where one is dealing
> with piles of them all at the same time (except maybe in edge cases,
> like the DCT transform in JPEG/MPEG).

MUL is much more common in FORTRAN than in C-like languages as it forms
the underlying math for multidimensional arrays, and more ×s exist in
typical number crunching code (F) than in pointer chasing (C) languages.

> If FMUL and MUL can't be co issued, this is unlikely to matter.

Since both are pipelined, this is a small stall {{Unlike MIPS R2000 where
IMUL was 12 cycles blocking while FMUL was 3 (or was it 4) pipelined.}}

> DIV is uncommon enough that it can be 80 or 100 cycles or similar, and
> most code will not notice (except in rasterizers or similar, which
> actually need fast DIV, but can cheat in that they don't usually need
> exact DIV and the divisors are small, allowing for "shortcuts", such as
> using lookup tables of reciprocals or similar).

We know that FDIV can be performed in 17 cycles in the FMUL unit, and
this has been known since 1989......why are you making it so long ??

0.3% IDIV × 100 cycles and you have added 3 cycles to the average
instruction !!!

> Otherwise, one can debate whether or not having DIV/MOD in hardware
> makes sense at all (and if they do have it, "cheap-ish" 68 cycle DIV is
> at least "probably" faster than a generic software-only solution).

With an 8-bit start, and Newton-Raphson iteration, SW should be able
to do this in mid-30-cycles. {{See ITANIC FDIV SW profile}}

> For cases like divide-by-constant, also, typically it is possible to
> turn it in the compiler into a multiply by reciprocal.

And you should if it helps.

>>> And we have the need to use FP values in deciding to take branches.
>>
>> What gets used in taking branches is the _condition codes_. They're
>> conveyed from the integer and floating point functional units, as well
>> as all the other functional units, to a central place (the program
>> status word).
>>

> Yeah, better to not have this style of condition codes...

RISC-V, MIPS, Mc 88K, My 66000, have no condition codes......

> Like, there are other ways of doing conditional branches...

"Compare to zero and branch" × {Signed, Unsigned, FP16, FP32, FP64}}
for 1 instruction FP branches

Compare and then branch on anything (2 inst::1 int, 1 FP) your heart desires
Branch {NaN, -Infinity, +Infinity, -DeNorm, +Denorm, -Normal, +Normal, Zero,
-zero, +zero, < <= > >= == !=, 0 < x < y, 0 <=x < y, 0 < x <= y, 0 <= x <= y, ... }
Can you even record all these states in a single condition code ??

>> John Savard

MitchAlsup

Nov 27, 2023, 8:06:07 PM
Anton Ertl wrote:

> BGB <cr8...@gmail.com> writes:
>>On 11/26/2023 7:29 PM, Quadibloc wrote:
>>> On Sat, 25 Nov 2023 19:55:59 +0000, MitchAlsup wrote:
>>>
>>>> But Integer Multiply and Divide can share the FPU that does these.
>>>
>>> But giving each one its own multiplier means more superscalar
>>> goodness!

> Having two multipliers that serve both purposes means even more
> superscalar goodness for similar area cost. However, there is the
> issue of latency. The Willamette ships the integers over to the FPU
> for multiplication, and the result back, crossing several clock
> domains (at one clock loss per domain crossing), resulting in a
> 10-cycle latency for integer multiplication. I think that these days
> every high-performance core with real silicon invests into separate
> GPR and FP/SIMD (including integer SIMD) multipliers.

>>In most code, FPU ops are comparably sparse

> In terms of executed ops, that depends very much on the code. GP
> cores have acquired SIMD cores primarily for FP ops, as can be seen by
> both SSE and supporting only FP at first, and only later adding
> integer stuff, because it cost little extra. Plus, we have added GPUs
> that are now capable of doing huge amounts of FP ops, with uses in
> graphics rendering, HPC and AI.

Once SIMD gains Integer operations, the Multiplier has to be built to
do both, might as well use it for more things than just SIMD.

>>Otherwise, one can debate whether or not having DIV/MOD in hardware
>>makes sense at all (and if they do have it, "cheap-ish" 68 cycle DIV is
>>at least "probably" faster than a generic software-only solution).

> That debate has been held, and MIPS has hardware integer divide, Alpha
> and IA-64 don't have a hardware integer divide; they both have FP
> divide instructions.

And all of these lead fruitful long productive lives before taking over
-------Oh wait !!

> However, looking at more recent architectures, the RISC-V M extension
> (which is part of RV64G and RV32G, i.e., a standard extension) has not
> just multiply instructions (MUL, MULH, MULHU, MULHSU, MULW), but also
> integer divide instructions: DIV, DIVU, REM, REMU, DIVW, DIVUW, REMW,
> and REMUW.

All of which are possible in My 66000 using operand sign control, S-bit,
and CARRY when you want 64×64->128 or 128/64->{64 quotient, 64 remainder}

> ARM A64 also has divide instructions (SDIV, UDIV), but
> RISC-V seems significant to me because there the philosophy seems to
> be to go for minimalism. So the debate has apparently come to the
> conclusion that for general-purpose architectures, you include an
> integer divide instruction.

Several forms of integer DIV at least signed and unsigned.....

MitchAlsup

Nov 27, 2023, 8:06:08 PM
Anton Ertl wrote:

> Quadibloc <quad...@servername.invalid> writes:
>>On Mon, 27 Nov 2023 09:17:59 +0000, Anton Ertl wrote:
>>
>>> That debate has been held, and MIPS has hardware integer divide, Alpha
>>> and IA-64 don't have a hardware integer divide; they both have FP divide
>>> instructions.
>>
>>I didn't know this about the Itanium. All I remembered hearing was that
>>the Itanium "didn't have a divide instruction", and so I didn't realize
>>this applied to fixed-point arithmetic only.

> You are right, I was wrong.
> <http://gec.di.uminho.pt/discip/minf/ac0203/icca03/ia64fpbf1.pdf>
> says:

> |A number of floating-point operations defined by the IEEE Standard are
> |deferred to software by the IA-64 architecture in all its
> |implementations
> |
> |* floating-point divide (integer divide, which is based on the
> | floating-point divide operation, is also deferred to software)

> The paper goes on to describe that FP division a/b is based on
> determining 1/b through Newton-Raphson approximation, using the FMA
> instruction, and then multiplying with a. It shows a sequence of 13
> instructions for double precision, 1 frcpa and 12 fma or fnma
> instructions.

> Given that integer division is based on FP division, it probably takes
> even more instructions.

You have to synthesize the bits between 63 and 53 to get 64/64->64.

Quadibloc

Nov 27, 2023, 11:50:34 PM
On Mon, 27 Nov 2023 15:54:03 +0000, Anton Ertl wrote:

> Quadibloc <quad...@servername.invalid> writes:
>>On Mon, 27 Nov 2023 09:17:59 +0000, Anton Ertl wrote:
>>
>>> That debate has been held, and MIPS has hardware integer divide, Alpha
>>> and IA-64 don't have a hardware integer divide; they both have FP
>>> divide instructions.
>>
>>I didn't know this about the Itanium. All I remembered hearing was that
>>the Itanium "didn't have a divide instruction", and so I didn't realize
>>this applied to fixed-point arithmetic only.
>
> You are right, I was wrong.
> <http://gec.di.uminho.pt/discip/minf/ac0203/icca03/ia64fpbf1.pdf> says:
>
> |A number of floating-point operations defined by the IEEE Standard are
> |deferred to software by the IA-64 architecture in all its
> |implementations |
> |* floating-point divide (integer divide, which is based on the |
> floating-point divide operation, is also deferred to software)
>
> The paper goes on to describe that FP division a/b is based on
> determining 1/b through Newton-Raphson approximation, using the FMA
> instruction, and then multiplying with a. It shows a sequence of 13
> instructions for double precision, 1 frcpa and 12 fma or fnma
> instructions.
>
> Given that integer division is based on FP division, it probably takes
> even more instructions.

The fastest _hardware_ implementations of FP division are Goldschmidt
division and Newton-Raphson division, and they both make use of
multiplication hardware.

So using Newton-Raphson division to increase the width of an FP division
result to perform 64-bit integer division won't be too hard.

Since the Wikipedia article didn't go into detail, I had to look further
to find out that you _were_ right about the DEC Alpha, however. It did
have floating divide but not integer divide. The 21264 added floating
square root.

John Savard

Anton Ertl

Nov 28, 2023, 4:18:57 AM
mitch...@aol.com (MitchAlsup) writes:
>Anton Ertl wrote:
>> <http://gec.di.uminho.pt/discip/minf/ac0203/icca03/ia64fpbf1.pdf>
>> says:
>
>> |A number of floating-point operations defined by the IEEE Standard are
>> |deferred to software by the IA-64 architecture in all its
>> |implementations
>> |
>> |* floating-point divide (integer divide, which is based on the
>> | floating-point divide operation, is also deferred to software)
>
>> The paper goes on to describe that FP division a/b is based on
>> determining 1/b through Newton-Raphson approximation, using the FMA
>> instruction, and then multiplying with a. It shows a sequence of 13
>> instructions for double precision, 1 frcpa and 12 fma or fnma
>> instructions.
>
>> Given that integer division is based on FP division, it probably takes
>> even more instructions.
>
>You have to synthesize the bits between 63 and 53 to get 64/64->64.

The paper says that they use double-extended division to synthesize
64/64-bit division; I expect that double-extended uses even more
Newton-Raphson steps (and FMAs) than double precision; and then there
are the additional steps for getting an integer.

The paper only shows signed 16/16 bit division, a 15-instruction
sequence including 1 fcrpa and 3 fma/fnma instructions. So you can
expect that 64/64 is at least 13+15-4=24 instructions long, probably
longer (because of double-extended precision).

Anton Ertl

Nov 28, 2023, 4:33:40 AM
mitch...@aol.com (MitchAlsup) writes:
>Several forms of integer DIV at least signed and unsigned.....

Gforth has unsigned, floored (signed) and symmetric (signed), in
double-cell/single-cell variants (standardized in Forth) as well as in
single/single variants.

Floored rounds the quotient down, symmetric (often called truncated)
rounds the quotient towards 0. There is also something called
Euclidean that always produces a remainder >=0, but the practical
difference from floored is when the divisor is negative, which is
exceedingly rare.

Given the rareness of negative divisors, one might wonder about
signed/unsigned, but at least implementations of current programming
languages would probably make little use of such instructions even if
they were cheaper than the signed/signed ones.

BGB

Nov 28, 2023, 5:33:04 AM
Yeah, pretty much...


>> fairly rare in comparison to either. MUL is common enough that you
>> want it to be fast when it happens, but not so common to where one is
>> dealing with piles of them all at the same time (except maybe in edge
>> cases, like the DCT transform in JPEG/MPEG).
>
> MUL is much more common in FORTRAN than in C-like languages as it forms
> the underlying math for multidimensional arrays, and more ×s exist in
> typical number crunching code (F) than in pointer chasing (C) languages.
>

Yeah.

Multidimensional arrays are comparably rarely used in C land, and when
they are used, they typically have power-of-2 dimensions.


>> If FMUL and MUL can't be co issued, this is unlikely to matter.
>
> Since both are pipelined, this is a small stall {{Unlike MIPS R2000 where
> IMUL was 12 cycles blocking while FMUL was 3 (or was it 4) pipelined.
>

Slow IMUL but fast FMUL: WTF?...
Unless this was only counting single precision, then it could make more
sense.


>> DIV is uncommon enough that it can be 80 or 100 cycles or similar, and
>> most code will not notice (except in rasterizers or similar, which
>> actually need fast DIV, but can cheat in that they don't usually need
>> exact DIV and the divisors are small, allowing for "shortcuts", such
>> as using lookup tables of reciprocals or similar).
>
> We know that FDIV can be performed in 17 cycles in the FMUL unit, and
> this has been known since 1989......why are you making it so long ??
>
> 0.3% IDIV × 100 cycles and you have added 3 cycles to the average
> instruction !!!
>

Can note that using DIV/MOD seems a lot rarer in my case...

But, I can note that my compiler turns constant divide into alternatives:
y=x/C; //~ y=(x*RCP)>>SHR;
y=x%C; //~ y=x-(((x*RCP)>>SHR)*C);

Though, with some extra wonk thrown in so negative values round the
divide towards zero and similar (otherwise, negative values would round
away from zero, which is not the commonly accepted result).

y=((x+((x<0)?(C-1):0))*RCP)>>SHR;

The variable cases are comparably infrequent.
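For a concrete instance of the transformation (constants worked out
here, not taken from BGBCC's actual tables): with C=10, RCP=0xCCCCCCCD
(that is, 2^35/10 rounded up) and SHR=35, the identity holds for every
uint32_t x:

  #include <stdint.h>

  uint32_t div10(uint32_t x)   /* x / 10 without a divide */
  {
      return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
  }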



>> Otherwise, one can debate whether or not having DIV/MOD in hardware
>> makes sense at all (and if they do have it, "cheap-ish" 68 cycle DIV
>> is at least "probably" faster than a generic software-only solution).
>
> With an 8-bit start, and Newton-Raphson iteration, SW should be able
> to do this in mid-30-cycles. {{See ITANIC FDIV SW profile}}
>

I was thinking of integer divide here (being 68 cycle for DIVS.Q/DIVU.Q
in my case; 36 for DIVS.L/DIVU.L).

But, yeah, the FDIV instruction is even slower (120 cycles); and
software Newton-Raphson will win this race...


Partly this is the drawback of using a Shift-Add design.


Say, every clock-cycle, the value is shifted left by 1 bit, and logic
determines whether or not to add a second value to the working value.

For multiply, you can use the bits of one input to mask whether or not
to add the other input. For divide, the carry of the adder can be used
to mask whether to add the inverse of the divisor.

For FDIV, one runs the divider for more clock cycles, and it generates
bits below the decimal point.



Though, at least for the merit of an Integer DIV op, it is faster to use
a 36-cycle integer divide instruction than it is to run a shift-subtract
loop in software.



>> For cases like divide-by-constant, also, typically it is possible to
>> turn it in the compiler into a multiply by reciprocal.
>
> And you should if it helps.
>

Yeah, it does.
Multiply by reciprocal is considerably faster.


>>>> And we have the need to use FP values in deciding to take branches.
>>>
>>> What gets used in taking branches is the _condition codes_. They're
>>> conveyed from the integer and floating point functional units, as well
>>> as all the other functional units, to a central place (the program
>>> status word).
>>>
>
>> Yeah, better to not have this style of condition codes...
>
> RISC-V, MIPS, Mc 88K, My 66000, have no condition codes......
>

Yeah.

In BJX2, there is only SR.T, however, the way it is updated and used is
much narrower in scope.

A case could be made for narrowing its scope further though.


>> Like, there are other ways of doing conditional branches...
>
> "Compare to zero and branch" × {Signed, Unsigned, FP16, FP32, FP64}}
> for 1 instruction FP branches
>

Yeah, I had ended up adding this strategy as well as the prior SR.T
mechanism.


> Compare and then branch on anything (2 inst::1 int, 1 FP) your heart desires
> Branch {NaN, -Infinity, +Infinity, -DeNorm, +Denorm, -Normal, +Normal, Zero,
> -zero, +zero, < <= > >= == !=, 0 < x < y, 0 <=x < y, 0 < x <= y, 0 <= x <= y, ... }
> Can you even record all these states in a single condition code ??
>

Possible.

>>> John Savard

Marko Zec

Nov 28, 2023, 8:16:53 AM
MitchAlsup <mitch...@aol.com> wrote:
...
> 0.3% IDIV × 100 cycles and you have added 3 cycles to the average
> instruction !!!

Not really, more likely it would be a 0.33 average CPI bump, assuming
a 1.0 baseline CPI for non-DIV instructions. Still way too much to be
ignored...

BGB

Nov 28, 2023, 2:00:20 PM
I missed that, but yeah.

Still, it seems the incidence of the integer DIV/MOD/etc instructions
being used is somewhat rarer than this estimate in the code I was measuring.

Checking my compiler output for static use count:
Around 0.05% ...


It might have been more common though, if my compiler didn't
aggressively eliminate the divide by constant cases (and my coding
practices didn't actively avoid using division in general, etc).

Since, between either a slowish DIV instruction, or a software
shift-subtract loop, neither was expected to be all that fast.


On pretty much every system I have used IRL, division has tended to be
fairly slow, so I had tended to avoid it whenever possible.


In cases where fast divide was needed, there were often other
workarounds (such as using a lookup table of reciprocals).


If inexact results are acceptable, it is possible to stretch the table
of reciprocals up to pretty much arbitrary divisors (via normalization
and some shift-related trickery).
In software, this is helped with a CLZ instruction or similar; I had
also used a variant of this effectively in my hardware rasterizer module.

Say:
z=x/y;
y:
0.. 31: Direct multiply by MAX/y (Table A)
32.. 63: Normalized multiply by MAX/y (Table B)
64..127: Use (y>>1) as index to Table B, add 1 to final SHR
128..255: Use (y>>2) as index to Table B, add 2 to final SHR
...

The sizes of Tables A/B can be reduced (via another lookup table) by
making use of repetitions in the bit patterns (though, this is more
practical in Verilog than in software).
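Read as C, the Table B part of that scheme might look like (a sketch,
with RCP_BITS and the table contents chosen arbitrarily; the Table A
path for y<32 is omitted, and results go inexact for larger divisors,
as noted):

  #include <stdint.h>

  #define RCP_BITS 16
  static uint32_t tabB[32];   /* tabB[i] ~ (1<<RCP_BITS) / (i+32) */

  void init_tabB(void)
  {
      for (int i = 0; i < 32; i++)
          tabB[i] = (1u << RCP_BITS) / (uint32_t)(i + 32);
  }

  uint32_t approx_div(uint32_t x, uint32_t y)   /* y >= 32 */
  {
      int extra = 0;
      while (y >= 64) { y >>= 1; extra++; }     /* normalize into table range */
      return (uint32_t)(((uint64_t)x * tabB[y - 32]) >> (RCP_BITS + extra));
  }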

...


Though, getting acceptable results from fixed-point division is a little
more of an issue.


Stephen Fuld

Nov 28, 2023, 2:22:53 PM
On 11/28/2023 11:00 AM, BGB wrote:

snip

> Still, it seems the incidence of the integer DIV/MOD/etc instructions
> being used is somewhat rarer than this estimate in the code I was
> measuring.

Is this a result of your using mostly a few old games as your
"reference" code? i.e. your code base may be different from what gave
rise to the statistics mentioned by others.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

MitchAlsup

Nov 28, 2023, 2:41:04 PM
Goldschmidt is about 2× faster than N-R because:: Goldschmidt inner loop
has 2 independent multiplies, while N-R has 2 dependent multiplies. To
meet IEEE 754 accuracy, the last iteration in Goldschmidt is ½ an N-R
iteration {{you perform the N-R 1st step and then use the sign bit to
determine +1 or +2 and the +0 can be determined while the multiply is
transpiring.}}
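
In C, the contrast looks like this (a sketch without the final
rounding half-step described above; note the two multiplies inside
the loop do not depend on each other):

  #include <math.h>

  double goldschmidt_div(double a, double b)
  {
      int e;
      double x = 2.0 * frexp(b, &e);            /* scale b into [1,2) */
      double r = 48.0/17.0 - (32.0/17.0) * x;   /* reciprocal estimate */
      double n = a * r, d = x * r;              /* invariant: n/d == a/x */
      for (int i = 0; i < 4; i++) {
          r = 2.0 - d;       /* next correction factor */
          n = n * r;         /* independent multiply #1 */
          d = d * r;         /* independent multiply #2 */
      }
      return ldexp(n, 1 - e);                   /* d -> 1, so n -> a/x */
  }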

BGB

Nov 28, 2023, 3:09:27 PM
On 11/28/2023 1:22 PM, Stephen Fuld wrote:
> On 11/28/2023 11:00 AM, BGB wrote:
>
> snip
>
>> Still, it seems the incidence of the integer DIV/MOD/etc instructions
>> being used is somewhat rarer than this estimate in the code I was
>> measuring.
>
> Is this a result of your using mostly a few old games as your
> "reference" code? i.e. your code base may be different from what gave
> rise to the statistics mentioned by others.
>

Possibly.

There is my C library and kernel/runtime code;
And games like Doom and Quake and similar.

I suspect ID also avoided integer division when possible, as it is
rarely used (and division was probably still slow on early/mid-90s
PCs).

Looking up, 486 and Pentium 1 still had ~ 40+ cycle 32-bit DIV/IDIV.
...


Looks like MUL/IMUL had similar latency to DIV/IDIV on the 386 and 486,
but then the latency dropped down a fair bit on the Pentium.

However, Doom makes extensive use of integer multiply.



Given the timings, I suspect it is possible they were also using a
Shift-Add design for the multiply/divide unit.

In my case, I am using Shift-Add for DIV and also 64-bit multiply, but a
faster DSP48 based multiplier for 32-bit multiply.

If I were to make a guess based on the latency, I would guess the P1 was
using something like a radix-16 long-multiply.


Comparably, it seems DIV/IDIV didn't actually get much faster until the
Ryzen.



Have noted that these old games did have a nifty trick for calculating
approximate distance, say:
dx=x0-x1;
dy=y0-y1;
adx=dx^(dx>>31);
ady=dy^(dy>>31);
if(ady>adx)
{ t=adx; adx=ady; ady=t; } //common
// { adx^=ady; ady^=adx; adx^=ady; } //sometimes
d=adx+(ady>>1);
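(For reference: this is the classic "alpha max plus beta min"
approximation to sqrt(dx*dx + dy*dy) with alpha = 1, beta = 1/2; after
the swap, adx holds the larger magnitude, and the estimate overshoots
the true distance by at most about 11.8%.)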


Can be extended to 3D without too much issue, but for 4D, the sorting
step becomes more of an issue.

Kinda makes sense though, as finding the square-root of things is also
expensive...


But, not sure how a lot of this compares with other (non game) coding
practices.


Have noticed though that a lot of these games really do like using lots
of global variables (a lot more so than I would prefer to use in my
coding practices).

Though, this is one case where it is useful to be able to load/store
global variables in a single instruction (and using something like a GOT
may have a significant penalty on code that uses lots of global variables).


BGB

unread,
Nov 28, 2023, 3:13:32 PM11/28/23
to
Well, and/or just use separate multipliers for each, under the rationale
that, at least on an FPGA, it already has the DSP48's whether or not one
uses them (and a lot more DSP48's than I can really make effective use of).

Well, with the caveat that they only natively do signed 18-bit multiply.



>>> Otherwise, one can debate whether or not having DIV/MOD in hardware
>>> makes sense at all (and if they do have it, "cheap-ish" 68 cycle DIV
>>> is at least "probably" faster than a generic software-only solution).
>
>> That debate has been held, and MIPS has hardware integer divide, Alpha
>> and IA-64 don't have a hardware integer divide; they both have FP
>> divide instructions.
>
> And all of these lead fruitful long productive lives before taking over
> -------Oh wait !!
>

MIPS at least did moderately well for a time; it might have done better
if it were more open.

Both Alpha and IA-64 had a lot of questionable seeming design choices.


>> However, looking at more recent architectures, the RISC-V M extension
>> (which is part of RV64G and RV32G, i.e., a standard extension) has not
>> just multiply instructions (MUL, MULH, MULHU, MULHSU, MULW), but also
>> integer divide instructions: DIV, DIVU, REM, REMU, DIVW, DIVUW, REMW,
>> and REMUW.
>
> All of which are possible in My 66000 using operand sign control, S-bit,
> and CARRY when you want 64×64->128 or 128/64->{64 quotient, 64 remainder}
>
>>            ARM A64 also has divide instructions (SDIV, UDIV), but
>> RISC-V seems significant to me because there the philosophy seems to
>> be to go for minimalism.  So the debate has apparently come to the
>> conclusion that for general-purpose architectures, you include an
>> integer divide instruction.
>
> Several forms of integer DIV at least signed and unsigned.....

I had left it as optional, but I guess, optional is not the same as
absent, and my main profiles had ended up including it.


The ops are generally given names like:
DIVS.L, DIVU.L, DIVS.Q, DIVU.Q
MODS.L, MODU.L, MODS.Q, MODU.Q

One may notice a curious level of overlap with the RISC-V ops here...

Though, for 32-bit:
MULS.L, MULU.L: Perform a narrow multiply.
DMULS.L, DMULU.L: Perform a widening multiply.
Along with:
MULS.Q, MULU.Q: 64-bit multiply
MULHS.Q, MULHU.Q: 64-bit multiply, high 64 bits of result


DIV, MOD, and 64-bit multiply are drastically slower than 32-bit
multiply, though.

MitchAlsup

unread,
Nov 30, 2023, 11:27:17 AM11/30/23
to
BGB wrote:

> 64-bit multiply, is drastically slower than 32-bit
> multiply though.

Should only be 1 gate of delay longer......

MitchAlsup

unread,
Nov 30, 2023, 11:27:17 AM11/30/23
to
BGB wrote:

> Have noted that these old games did have a nifty trick for calculating
> approximate distance, say:
> dx=x0-x1;
> dy=y0-y1;
> adx=dx^(dx>>31);
> ady=dy^(dy>>31);
> if(ady>adx)
> { t=adx; adx=ady; ady=t; } //common
> // { adx^=ady; ady^=adx; adx^=ady; } //sometimes
> d=adx+(ady>>1);

Why not::

dx=x0-x1;
dy=y0-y1;
adx=dx^(dx>>31);
ady=dy^(dy>>31);
if(ady>adx)
d=ady+(adx>>1);
else
d=adx+(ady>>1);

Robert Finch

unread,
Nov 30, 2023, 12:08:38 PM11/30/23
to
FPGA land using DSPs, 24x16 multiply is single cycle, 32x32 could be
single cycle if a lower fmax is acceptable, but 64x64 best pipelined for
several cycles (6). Thor had a fast 24x16 multiply as it is good enough
for array address calcs in a lot of circumstances.

The slower 64x64 multiply means that the basic shift and subtract divide
is more appealing compared to NR or Goldschmidt.

MitchAlsup

unread,
Nov 30, 2023, 12:51:09 PM11/30/23
to
Robert Finch wrote:

> On 2023-11-30 11:25 a.m., MitchAlsup wrote:
>> BGB wrote:
>>
>>>               64-bit multiply, is drastically slower than 32-bit
>>> multiply though.
>>
>> Should only be 1 gate of delay longer......

> FPGA land using DSPs, 24x16 multiply is single cycle, 32x32 could be
> single cycle if a lower fmax is acceptable, but 64x64 best pipelined for
> several cycles (6). Thor had a fast 24x16 multiply as it is good enough
> for array address calcs in a lot of circumstances.

Nothing stops you from building a multiplier in LUTs with Booth recoding.
Each pair of LUTs being a 3-input XOR and a 3-input Majority gate. This also
gets rid of BGB's argument about building an incomplete multiplier tree {because
IEEE needs 53 bits while your DSP only supplies 48.}

BGB

unread,
Nov 30, 2023, 1:56:50 PM11/30/23
to
On 11/30/2023 11:08 AM, Robert Finch wrote:
> On 2023-11-30 11:25 a.m., MitchAlsup wrote:
>> BGB wrote:
>>
>>>               64-bit multiply, is drastically slower than 32-bit
>>> multiply though.
>>
>> Should only be 1 gate of delay longer......
>
> FPGA land using DSPs, 24x16 multiply is single cycle, 32x32 could be
> single cycle if a lower fmax is acceptable, but 64x64 best pipelined for
> several cycles (6). Thor had a fast 24x16 multiply as it is good enough
> for array address calcs in a lot of circumstances.
>

Yeah, 18x18s (by extension, 16s and 16u) can be single cycle, no issue.

In my case, 32x32 is 3 cycle.
For 64x64, "all hell breaks loose" if one tries to express it directly.

The limiting factor isn't the first-stage of multipliers, but all the
adders needed to glue the results together. Likely, whatever is going on
inside the DSP48 is a lot faster than what happens with the generic FPGA
fabric.

For small 3x3->6 bit multipliers or similar, these can be built directly
from LUTs, and for very small values this can be favorable to using a DSP48.


Only reason the multiply in Binary64 FMUL works OK is because I
effectively ignore all the low-order bits that would have existed.
Though, arguably, this could be part of why N-R can't converge on the
last 4 bits of the result (since this part seemingly involves sub-ULP
shenanigans that would depend on the low-order bits of the internal
multiply, which don't exist in this case).



When I looked at it originally, doing a direct 64-bit multiply wasn't
terribly attractive because it would be both fairly expensive and not
that much faster than a plain software option built on 32-bit widening
multiply.


Say (with ADD.X, 64x64 -> 128):
SHAD.Q R4, -32, R6 | SHAD.Q R5, -32, R7 //1c
MOV 0, R3 | DMULU.L R4, R5, R16 //1c
DMULU.L R6, R4, R18 //1c *1
DMULU.L R4, R7, R19 //1c
DMULS.L R6, R7, R17 //1c
MOVLLD R18, R3, R20 | SHAD.Q R18, -32, R21 //1c
MOVLLD R19, R3, R22 | SHAD.Q R19, -32, R23 //2c
ADD.X R20, R22, R2 //2c
ADD.X R2, R16, R2 //1c
RTS //2c
So, ~ 13 cycles (with 2c interlock penalty).

Or, say 64x64->64:
SHAD.Q R4, -32, R6 | SHAD.Q R5, -32, R7 //1c
MOV 0, R3 | DMULU.L R4, R5, R16 //1c
DMULU.L R6, R4, R18 //2c *1
DMULU.L R4, R7, R19 //3c
ADD R18, R19, R17 //2c
SHAD.Q R17, 32, R17 //2c
ADD R16, R17, R2 //1c
RTS //2c
Here, ~ 14 cycles (5 cycles interlock penalty).


*1: Can't remember for certain ATM whether DMULS.L or DMULU.L is needed
for these parts. Mostly affects the contribution of sign to the
high-order results. But, for 64-bit narrow multiply, we can mostly
ignore the sign, as it doesn't change anything in the low 64 bits of the
result.
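In portable C, the same low-64-bit decomposition looks roughly like
this (a sketch of the structure, not BGBCC's actual output):

#include <stdint.h>

/* 64x64 -> low 64 bits via 32-bit widening multiplies. The ah*bh
   product shifts entirely out of the low half, and sign does not
   matter here, matching the note above. */
uint64_t mul64_lo(uint64_t a, uint64_t b)
{
    uint32_t al = (uint32_t)a, ah = (uint32_t)(a >> 32);
    uint32_t bl = (uint32_t)b, bh = (uint32_t)(b >> 32);
    uint64_t lo  = (uint64_t)al * bl;
    uint64_t mid = (uint64_t)ah * bl + (uint64_t)al * bh;
    return lo + (mid << 32);
}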



Though, there will be some additional overhead for the function call,
which would be avoided with a CPU instruction.

Even with a slow 68-cycle MULS.Q, when function call overheads are
considered, it still isn't too far from break-even.

In my case, this part is optional, and can be controlled with a compiler
command-line option.


> The slower 64x64 multiply means that the basic shift and subtract divide
> is more appealing compared to NR or Goldschmidt.
>

Yes, agreed.

These require a fast full-precision multiplier (or two for
Goldschmidt?), which I don't have either in this case.


But, luckily, I had noted that Shift-and-Subtract could trivially be
turned into Shift-and-Add, and that the same general algorithm could be
used (with some minor tweaks) to also function as a 64-bit multiplier.
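A rough C model of that shared structure (the real unit would be a
Verilog state machine; this is only the algorithmic skeleton, with no
handling of division by zero):

#include <stdint.h>

/* Restoring shift-subtract divide: one compare/subtract per bit. */
uint64_t shiftadd_divu(uint64_t n, uint64_t d, uint64_t *rem)
{
    uint64_t q = 0, r = 0;
    int i;
    for (i = 63; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1); /* shift in next dividend bit */
        if (r >= d) { r -= d; q |= 1ull << i; }
    }
    if (rem) *rem = r;
    return q;
}

/* The same shift/add datapath, stepped the other way, multiplies. */
uint64_t shiftadd_mulu(uint64_t a, uint64_t b)
{
    uint64_t p = 0;
    int i;
    for (i = 0; i < 64; i++)
        if ((b >> i) & 1)
            p += a << i;
    return p;
}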

It isn't fast, but:
It works...
It is moderately affordable.
Can technically give a 64-bit multiply op as well.
Even if, for performance, still better to do 64b MUL in software.

Though, the difference between the ops isn't too unreasonable when
function-call overhead is considered.


And, for general case DIV and MOD, the ops were still faster than the
software versions (even when the software version may have optimizations
like mapping small divisors through a lookup table of reciprocals).


This was mostly because of overheads:
Function calls aren't free (may need to spill stuff, shuffle values
around in registers, etc);
Special-casing the small divisors requires additional logic, and adds a
layer of function-call overhead for the general case.

And, vs a 36-cycle DIVS.L, it isn't hard to burn this just on
function-call related overheads.



BGB

unread,
Nov 30, 2023, 2:17:35 PM11/30/23
to
But...

Then one needs to deal with the full latency of a multiplier built from
LUT's and CARRY4's, which isn't going to be small...

One could do this, but would likely need to add a few cycles of latency
to the FMUL.



As-is, only around 34 bits of the DSP output are used if one is using
them as unsigned 17-bit multipliers. There is still a shortfall of a
few bits, but this can be patched over using 3-bit LUT multipliers or
similar.

It seems like the ADDer capability of the DSPs isn't getting used, but
this is partly a matter of whether one can get it to synthesize in a
useful way.

Say:
D <= A * B + C;
Which means one needs the 'C' input before the multiply can start (and
it happens to work in a way that is more useful for small MAC
operations than for building a bigger multiplier).



Luckily at least, my FMUL does produce a full-width mantissa for
Binary64, with the cut-off low-end part of the multiplier mostly
affecting the ULP rounding.


BGB

unread,
Nov 30, 2023, 2:43:44 PM11/30/23
to
That also works; I think this is closer to the form used in ROTT
(where I had first noticed it being used).


Swapping the values is easier to extend to 3D though, say:
dx=x0-x1;
dy=y0-y1;
dz=z0-z1;
adx=dx^(dx>>31);
ady=dy^(dy>>31);
adz=dz^(dz>>31);
if(ady>adx)
{ t=adx; adx=ady; ady=t; }
if(adz>ady)
{ t=ady; ady=adz; adz=t; }
if(ady>adx)
{ t=adx; adx=ady; ady=t; }
d=adx+(ady>>1)+(adz>>2);


There isn't a strong reason to prefer one option over another here,
except to avoid the XOR swap, as in this case the XOR's are sequentially
dependent on each other.


Not so pretty for 4D though, as swaps quickly come to dominate things.

There is a way of doing it with a "poor man's square root", but it would
be a lot hairier-looking than the above, and for fixed-point it requires
knowing where the decimal point is (whereas the above manages to work
while remaining agnostic as to the location of the decimal point).

Luckily at least, one doesn't really need to deal with 4D spheres with
fixed-point all that often.


MitchAlsup

unread,
Nov 30, 2023, 3:32:24 PM11/30/23
to
BGB wrote:

> On 11/30/2023 10:23 AM, MitchAlsup wrote:
>> BGB wrote:
>>
>>> Have noted that these old games did have a nifty trick for calculating
>>> approximate distance, say:
>>>    dx=x0-x1;
>>>    dy=y0-y1;
>>>    adx=dx^(dx>>31);
>>>    ady=dy^(dy>>31);
>>>    if(ady>adx)
>>>      { t=adx; adx=ady; ady=t; }         //common
>>> //  { adx^=ady; ady^=adx; adx^=ady; }  //sometimes
>>>    d=adx+(ady>>1);
>>
>> Why not::
>>
>>     dx=x0-x1;
>>     dy=y0-y1;
>>     adx=dx^(dx>>31);
>>     ady=dy^(dy>>31);
>>     if(ady>adx)
>>         d=ady+(adx>>1);
>>     else
>>         d=adx+(ady>>1);

As long as ISA has max() and min():: Why not::

    dx=x0-x1;
    dy=y0-y1;
    adx=dx^(dx>>31);
    ady=dy^(dy>>31);
d = max( adx, ady ) + (min( adx, ady ) >> 1);

>

BGB

unread,
Nov 30, 2023, 5:02:28 PM11/30/23
to
Yeah, I guess that could work as well for 2D...


Though, makes more sense to use the "__int32_min" intrinsic or similar,
rather than the generic C library "min()"/"max()" macros.
#define min(a, b) (((a)<(b))?(a):(b))
#define max(a, b) (((a)>(b))?(a):(b))

Though, apparently the existence of these macros in stdlib.h is
controversial (they are apparently sort of a POSIX legacy feature, and
not part of the C standard).

Mostly, due to implementation issues in BGBCC, ?: needs an actual
branch, so it is ironically worse for performance in this case than
using an explicit if/else...

A case could be made for C11 "_Generic", but would need to implement it.


But, yeah, apart from compiler issues, in theory the min/max part could
be 3 instructions.


If done in ASM, might be something like:
// R4..R7 (x0, y0, x1, y1)
SUB R4, R6, R16 | SUB R5, R7, R17
SHAD R16, -31, R18 | SHAD R17, -31, R19
XOR R16, R18, R16 | XOR R17, R19, R17
CMPGT R17, R16
CSELT R16, R17, R18 | CSELT R17, R16, R19
SHAD R19, -1, R19
ADD R18, R19, R2


I don't have actual MIN/MAX ops though.
Though, it seems RISC-V does have them in the Bitmanip extension.

Could probably add them in-theory (maybe along with FMIN/FMAX).

I guess, if I did add FMIN/FMAX, I would technically have enough in
place to "more or less" be able to support the F/D extensions in RISC-V
mode as well at this point, which (if I re-enable the BJX2-LDOP
extension) would in turn theoretically be enough to push it
potentially up to RV64G in userland (or maybe RV64GC if I got around to
finishing the RV-C decoder).


I guess, RV64GC could be useful as this is generally what the Linux
userland is using (but, RV64IM_ZfinxZdinx, not so much...).

But, F/D could have F0..F31 be mapped to R32..R63 in RISC-V mode, ...


But, annoyingly, without the Privileged spec (or going off and trying
to clone a bunch of SiFive's hardware-level interface, "more correct"
CSR handling, ...), there would be little hope of running an unmodified
Linux kernel on it.

But, I guess in theory, if one has RV64GC, the kernel could be ported to
work with the unusual interrupt handling and similar. Theoretically,
there are now mechanisms in place in the design to allow the CPU to
handle interrupts directly in RISC-V mode, avoiding the need for a mixed
ISA kernel (though, very little else would be glossed over at this level).


But, would there be much point in running RV64 Linux on the BJX2
core?... Seems like it would pretty much defeat the whole point in this
case (say, vs using a CPU actually designed to run RV64 as its native ISA).


Well, and/or mostly just run much of the userland in RV64 mode and then
mimic the Linux kernel's syscalls...

MitchAlsup

unread,
Nov 30, 2023, 6:16:55 PM11/30/23
to
max() and min() are single instructions in my ISA. On the FP side, they
even get NaNs put in the proper places, too.

> Though, apparently the existence of these macros in stdlib.h is
> controversial (they are apparently sort of a POSIX legacy feature, and
> not part of the C standard).

It took Brian quite some time to recognize that crap and turn it all into
max() and min() instructions.

> Mostly, due to implementation issues in BGBCC, ?: needs an actual
> branch, so is ironically actually worse for performance in this case
> than using an explicit if/else...

My point exactly--but perhaps you should spend the effort to make your
compiler recognize these as single branch-free instructions.

> A case could be made for C11 "_Generic", but would need to implement it.


> But, yeah, apart from compiler issues, in theory the min/max part could
> be 3 instructions.

max() and min() are easy to perform in a single cycle in both int and FP

> If done in ASM, might be something like:
> // R4..R7 (x0, y0, x1, y1)
> SUB R4, R6, R16 | SUB R5, R7, R17
> SHAD R16, -31, R18 | SHAD R17, -31, R19
> XOR R16, R18, R16 | XOR R17, R19, R17
> CMPGT R17, R16
> CSELT R16, R17, R18 | CSELT R17, R16, R19
> SHAD R19, -1, R19
> ADD R18, R19, R2

And you are claiming this has an advantage over::

MAX R9,R8,R7
MIN R10,R8,R7
SRL R10,R10,#1
LEA R10,[R9,R10,#-1]


> I don't have actual MIN/MAX ops though.

perhaps you should reconsider...

> Though, it seems RISC-V does have them in the Bitmanip extension.

> Could probably add them in-theory (maybe along with FMIN/FMAX).

I use the same FU to perform Int max() and FP max()--remember once you
special case out the problems, IEEE is a linearly ordered set of numbers.

Robert Finch

unread,
Nov 30, 2023, 7:53:59 PM11/30/23
to
Q+ has min3 / max3 for minimum or maximum of three values.
But only fmin / fmax for float.

MitchAlsup

unread,
Nov 30, 2023, 9:47:22 PM11/30/23
to
Robert Finch wrote:
>>
>>> Could probably add them in-theory (maybe along with FMIN/FMAX).
>>
>> I use the same FU to perform Int max() and FP max()--remember once you
>> special case out the problems, IEEE is a linearly ordered set of numbers.

> Q+ has min3 / max3 for minimum or maximum of three values.
> But only fmin / fmax for float.

Interesting, but I did not have enough room in the 3-operand subGroup.

How do you determine the mid( x, y, z ) ?? allowing::

mx = max( x, y, z );
mn = min( x, y, z );
mi = mid( x, y, z );

??

Chris M. Thomasson

unread,
Nov 30, 2023, 9:56:29 PM11/30/23
to
Think along the lines of two n-ary points, 3-ary here:

Sorry for the pseudo-code:
____________________
// two n-ary points... 3-ary here:
p0 = (-1, 0, 0);
p1 = (1, 0, 0);

// difference and mid...
dif = p1 - p0;
mid = p0 + dif * .5;

// plot some points on the canvas...
plot_point(p0);
plot_point(mid);
plot_point(p1);
____________________

?

Actually, we have a min and a max, therefore we have a normalized
difference. .5 can be half way between min and max?

Robert Finch

unread,
Nov 30, 2023, 11:40:10 PM11/30/23
to
I had not thought to add a mid() operation. It would be fairly easy to
do, but I am not sure using the opcode space would be worth it. There
are about 20 opcodes available. Three operands pretty much use up a
32-bit opcode, as register specs are six bits each. There is only one
bit left over for sub-grouping. There are MUX, CMOVxx, and PTRDIF, all
with 3 operands, too.

if ((x >= y && x <= z) || (x <= y && x >= z))
mi = x;
else if ((y >= x && y <= z) || (y <= x && y >= z))
mi = y;
else
mi = z;
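With min3/max3 available, one branch-free possibility is sum minus
extremes (a sketch with the hypothetical helpers inlined; assumes the
sum does not overflow the type):

#include <stdint.h>

/* mid = (x + y + z) - max3 - min3; illustrative, not actual Q+ code */
int64_t mid3(int64_t x, int64_t y, int64_t z)
{
    int64_t mx = x > y ? x : y; /* max3... */
    int64_t mn = x < y ? x : y; /* ...and min3 */
    if (z > mx) mx = z;
    if (z < mn) mn = z;
    return (x + y + z) - mx - mn;
}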