Alternative Representations of the Concertina II ISA

Quadibloc

unread,

Nov 23, 2023, 11:16:02 PM11/23/23

to

The Concertina II ISA, as it stands now, has a
very limited number of opcodes for load and
store memory-reference operations, but a huge
number of opcodes for register-to-register
operate instructions.

This makes it suitable for an implementation where:

Physical memory is some power-of-two multiple of
96 bits in width, for example, 192 bits or 384 bits
wide;

A simple hardware divide-by-three circuit allows
accessing memory normally with the RAM module width
of 64 bits as the fundamental unit;

Applying no conversion allows 24 bits to be the
fundamental unit;

An even simpler _multiply_ by five circuit allows
accessing memory with a 60-bit word as the
fundamental unit.

So it's possible to take an implementation of the
architecture with three-channel memory, and add the
feature of allocating data memory blocks that look
like they're made up of 48-bit words or 60-bit words.

When a memory-reference instruction refers to an
address which is inside of such a variant block,
its operation woudl be modified to handle a data
type of an appropriate width for that kind of memory.

Thus:

In 60-bit memory,

Load Floating would be an invalid operation;
Load Double would load a 60-bit floating-point number;
Load Medium would load a 45-bit floating-point number, as
such a float would still be wide enough to be useful.

Load Byte would be an invalid operation.
Load Halfword would load a 15-bit integer;
Load would load a 30-bit integer;
Load Long would load a 60-bit integer.

In 36-bit memory,

Load Floating would load a 36-bit floating-point number
Load Double would load a 72-bit floating-point nuber
Load Medium would load a 54-bit floating-point number

Load Byte would load a 9-bit byte
Load Halfword would load an 18-bit halfword integer
Load would load a 36-bit integer
Load Long would load a 72-bit integer into a _pair_
of registers, the first of which would be an even-numbered
register, as the registers are only 64 bits long

In 24-bit memory,

Load Floating would be an invalid operation
Load Double would load a 72-bit floating-point number;
Load Medium would load a 48-bit floating-point number;

Load Byte would load a 6-bit character;
Load Halfword would load a 12-bit integer;
Load would load a 24-bit integer;
Load Long would load a 48-bit integer.

Note that the relationship between floating-point
widths and integer widths in instructions is offset
by a factor of two when 24-bit memory is referenced.

And there would be separate opcodes for doing arithmetic
on numbers of these different lengths in the registers.

This is all very well. But in program code, which, as
it is so far specified, can _only_ reside in 64-bit memory,
what about pseudo-immediate values?

Since the instruction format that references pseudo-immediate
values is that of an operate instruction, pseudo-immediates
_could_ be allowed, with considerable wasted space in most
cases, in operate instructions for the data types associated
with variant memory widths.

However, I have thought of a way to allow the CPU to treat
the different memory widths with greater equality.

Instead of devising a whole new ISA for each of the other
memory widths, the *existing* ISA, designed around standard
memory built up from the 64-bit unit, could be given
_alternative representations_ wherein programs could be
stored in memory of other widths, using the same ISA with
minimum modifications.

The 36-bit Representation

In the standard 64-bit representation, a block may begin
with 1100, followed by a 2-bit prefix field for each of the
remaining fourteen 16-bit parts of the remaining seven
32-bit instruction slots in the block.

So, make that portion of the ISA the only one supported in
the 36-bit representation, but with each 36-bit word, in
a block, now 288 bits long, with eight 36-bit words, consisting
of a prefix field, followed by 16 instruction bits, repeated
sixteen times.

But it must still be possible to have block headers if one is
to have pseudo-immediates.

So the first 36-bit instruction slot of a block, if it is to
be a header, would start with 1011100011, and the pre- field
portion of the second 18 bits of that instruction slot would
contain 11. 1011100011 consists of 10 (32-bit instruction)
and 11100011 (header prefix in 32-bit mode).

So the ISA is modified because in 64-bit memory, only seven
instruction slots, not all eight, can contain 36-bit instructions,
and there are no headers inside 36 bit instructions. But the
modifications are minimal, for the purpose of adapting programs
to the different size of memory, with that size being native
for them.

The 24-bit representation

Since a 24-bit word contains three 8-bit bytes, here, the
form of instructions isn't modified at all. But blocks shrink
from eight instruction slots that are 32-bits in length to
eight 24-bit words, so a block would be 192 bits long.

The amount of wasted space for data in 24-bit types wouldn't
be changed much in many cases by using this representation
instead of the 32-bit representation for program code,
because the clash between instruction widths and data widths
wouldn't be affected. However, _some_ improvement would be
achieved, because now if there are multiple pseudo-immediate
values used in a single block, the 24-bit data items could
at least be *packed* correctly in the block.

The 60-bit representation

Here, a block is 480 bits long.

It consists of two 240-bit sub-blocks, in the following form:

One 15 bit halfword containing 15 prefix bits,

and fifteen 15-bit halfwords to which those bits apply,
turning them into 16-bit halfwords, which are used to form
instructions.

The blocks are aligned on 480-bit boundaries. When code
consists of 32-bit instructions, the first sub-block begins
with an instruction, and the second sub-block begins with the
second half of an instruction.

The first whole 32-bit instruction in a sub-block can be a
header, and so one has a header applying to seven subsequent
instruction slots in the first sub-block, and a header applying
to six subsequent instruction slots in the second sub-block.

John Savard

Quadibloc

unread,

Nov 23, 2023, 11:42:35 PM11/23/23

to

On Fri, 24 Nov 2023 04:15:57 +0000, Quadibloc wrote:

> In 24-bit memory,
>
> Load Floating would be an invalid operation Load Double would load a
> 72-bit floating-point number;
> Load Medium would load a 48-bit floating-point number;

Alternatively, it would perhaps be simpler to do it this way:

Load Floating would load a 48-bit floating-point number;
Load Medium would load a 72-bit floating-point number.

This would prioritize structure over function, but only
the compiler would ever know...

The 24-bit representation of code, however, has a bigger problem.

In the 60-bit representation, a 15-bit halfword that can be
addressed easily in 60-bit memory corresponds to a 16-bit area
in the original ISA;

In the 36-bit representation, an 18-bit halfword, easily addressible
in 36-bit memory, does the same.

In the 24-bit representation, though, a problem exists; the simplest
way to deal with that would be, in code consisting entirely of 32-bit
instructions, to allow only every third instruction to be a branch
target.

John Savard

MitchAlsup

unread,

Nov 24, 2023, 1:51:02 PM11/24/23

to

I have not found a need to have LD/ST floating point instructions
because I don't have FP register file(s) {or SIMD files}. Why do
you seem to want/need a file dedicated to floating point values ??
This saves a huge chunk of OpCode Space.......

Quadibloc

unread,

Nov 24, 2023, 5:38:36 PM11/24/23

to

On Fri, 24 Nov 2023 18:48:53 +0000, MitchAlsup wrote:

> I have not found a need to have LD/ST floating point instructions
> because I don't have FP register file(s) {or SIMD files}. Why do you
> seem to want/need a file dedicated to floating point values ??
> This saves a huge chunk of OpCode Space.......

Well, for one thing, by giving the integer functional unit one register
file, and the floating-point functional unit its own register file, then
those two functional units are not constrained to be adjacent.

I can stick the Decimal Floating-Point functional unit on the other side
of the floating register file, and the Simple Floating functional unit,
or the Register Packed Decimal functional unit, on the other side of the
integer register file.

But of course my main reason is that this is the way everyone (i.e.
the IBM System/360) does it, so of course I should do it that way!

But having an excess of data types is at least a helpful excuse...

John Savard

Quadibloc

unread,

Nov 24, 2023, 5:41:08 PM11/24/23

to

On Fri, 24 Nov 2023 04:42:31 +0000, Quadibloc wrote:

> On Fri, 24 Nov 2023 04:15:57 +0000, Quadibloc wrote:
>
>> In 24-bit memory,
>>
>> Load Floating would be an invalid operation Load Double would load a
>> 72-bit floating-point number;
>> Load Medium would load a 48-bit floating-point number;
>
> Alternatively, it would perhaps be simpler to do it this way:
>
> Load Floating would load a 48-bit floating-point number;
> Load Medium would load a 72-bit floating-point number.
>
> This would prioritize structure over function, but only the compiler
> would ever know...

Silly me. How could I have forgotten that 24-bit memory is the
only place where the "ideal" lengths of floating-point numbers
can exist, so I should use them, rather than try to concoct
something that's 24-bit native.

So

Load Floating would load a 36-bit floating-point number,
Load Medium would load a 48-bit floating-point number, and
Load Double would load a 60-bit floating-point number.

John Savard

Quadibloc

unread,

Nov 24, 2023, 5:44:37 PM11/24/23

to

On Fri, 24 Nov 2023 22:38:32 +0000, Quadibloc wrote:

> On Fri, 24 Nov 2023 18:48:53 +0000, MitchAlsup wrote:
>
>> I have not found a need to have LD/ST floating point instructions
>> because I don't have FP register file(s) {or SIMD files}. Why do you
>> seem to want/need a file dedicated to floating point values ??
>> This saves a huge chunk of OpCode Space.......

Also, because IEEE 754 calls for a hidden first bit, and
gradual underflow, these things make floating-point
arithmeitc slightly slower.

Hence, I deal with them at load and store time, by using
the load and store floating point instructions to convert
to and from an *internal representation* of floats. That
in itself precludes using the integer instructions for them.

John Savard

Quadibloc

unread,

Nov 24, 2023, 5:54:01 PM11/24/23

to

On Fri, 24 Nov 2023 04:15:57 +0000, Quadibloc wrote:

> An even simpler _multiply_ by five circuit allows accessing memory with
> a 60-bit word as the fundamental unit.

As well, a multiply by three circuit is used to address the 36-bit memory.

> Instead of devising a whole new ISA for each of the other memory widths,
> the *existing* ISA, designed around standard memory built up from the
> 64-bit unit, could be given _alternative representations_ wherein
> programs could be stored in memory of other widths, using the same ISA
> with minimum modifications.

While this is an "improvement", I do realize that this, along with even
supporting multiple memory widths, even if I have found ways that it can
technically be done, is sheer insanity.

I do have a rationale for going there. It's simple enough. It comes from
the x86.

Since the irresistible force of availability of applications means that
we are doomed to always have just *one* dominant ISA...

then, to minimize the terrible loss that this entails, that one dominant
ISA should be maximally flexible. To approach, as closely as possible, the
"good old days" where the PDP-10 and the CDC 6600 coexisted alongside the
PDP-11 and the IBM System/360.

This is the driving force behind my insane quest to design an ISA for a CPU
which has the ability to do something which no one seems to want.

John Savard

MitchAlsup

unread,

Nov 25, 2023, 3:02:22 PM11/25/23

to

Quadibloc wrote:

> On Fri, 24 Nov 2023 18:48:53 +0000, MitchAlsup wrote:

>> I have not found a need to have LD/ST floating point instructions
>> because I don't have FP register file(s) {or SIMD files}. Why do you
>> seem to want/need a file dedicated to floating point values ??
>> This saves a huge chunk of OpCode Space.......

> Well, for one thing, by giving the integer functional unit one register
> file, and the floating-point functional unit its own register file, then
> those two functional units are not constrained to be adjacent.

But Integer Multiply and Divide can share the FPU that does these.
And we have the need to use FP values in deciding to take branches.
So there is a lot of cross pollination of types to FUs. And finally
you have to have 2 function units one to convert from FP to Int and
one to convert INT to FP.

Lets keep them apart and now we need a FU bigger than the FMAC unit
to perform integer multiplies and divides. In addition, you need a Bus
from the FPU side to the Condition unit that cooperates with integer
branch instructions.

Does not look like you can keep them "that far apart" and even when you
do there is lots of cross wiring--some of which are more costly in HW
than keeping FP and Int in the same RF.

BGB

unread,

Nov 25, 2023, 3:46:40 PM11/25/23

to

On 11/24/2023 4:44 PM, Quadibloc wrote:
> On Fri, 24 Nov 2023 22:38:32 +0000, Quadibloc wrote:
>
>> On Fri, 24 Nov 2023 18:48:53 +0000, MitchAlsup wrote:
>>
>>> I have not found a need to have LD/ST floating point instructions
>>> because I don't have FP register file(s) {or SIMD files}. Why do you
>>> seem to want/need a file dedicated to floating point values ??
>>> This saves a huge chunk of OpCode Space.......
>
> Also, because IEEE 754 calls for a hidden first bit, and
> gradual underflow, these things make floating-point
> arithmeitc slightly slower.
>

DAZ/FTZ fixes the gradual underflow issue...
Not the standard solution, but works.

One could maybe also make a case for truncate-only rounding:
Easier to make bit-exact across implementations;
Cheapest option;
...

Though, truncate does occasionally have visible effects on code.
For example, the player's yaw angle in my Quake port was prone to drift,
had to manually add a small bias to stop the drifting.
...

Say, Round-Nearest at least having the advantage that looping relative
computations don't have a tendency to drift towards 0. Though, on the
other side, computations which are stable with truncate rounding will
tend to drift away from zero with round-nearest.

In this case, this applied mostly to 'float' and 'short float' (when set
to be routed through the low-precision unit via a compiler flag), with
the main FPU (used for 'double'; and the others with default options)
still doing proper round-nearest.

I don't imagine a non-hidden bit representation would be much better here.

Renormalization would still be required regularly to avoid other issues,
and manual re-normalization would negatively effect performance.

Operations like FADD/FSUB would also still need to be able to
shift-align operands to be able to do their thing, so non-normalized
values would not save much here either.

Well, with the partial exception if one instead re-interprets the
floating-point to be more like variable-fixed-point, in which case the
compiler would need to deal with these issues (and probably the
programmer as well, say, to specify how many bits are above or below the
decimal point).

But, in this case, may as well save a few bits and just use "normal"
fixed-point (or maybe add a compiler type, like "__fixed(16,16)" or
similar, then have the compiler deal with the shifts and similar...).

> Hence, I deal with them at load and store time, by using
> the load and store floating point instructions to convert
> to and from an *internal representation* of floats. That
> in itself precludes using the integer instructions for them.
>

Possible.

I went with integer operations myself...

> John Savard

Quadibloc

unread,

Nov 26, 2023, 8:30:00 PM11/26/23

to

On Sat, 25 Nov 2023 19:55:59 +0000, MitchAlsup wrote:

> But Integer Multiply and Divide can share the FPU that does these.

But giving each one its own multiplier means more superscalar
goodness!

But you _do_ have a good point. This kind of "superscalar goodness"
is usually wasted, because most programs don't do an equal amount
of integer and FP arithmetic. So it would be way better to have
twice as many multipliers and dividers that are designed to be used
either way than having units of two kinds.

I'm assuming, though, that one can make the multipliers and dividers
simpler, hence faster, by making them single-use, and that it's
possible to know, for a given use case, what the ratio between integer
and FP operations will be. After all, integer multiplication discards
the MSBs of the result, and floating-point multiplication discards the
LSBs of the result.

> And we have the need to use FP values in deciding to take branches.

What gets used in taking branches is the _condition codes_. They're
conveyed from the integer and floating point functional units, as well
as all the other functional units, to a central place (the program
status word).

John Savard

MitchAlsup

unread,

Nov 26, 2023, 9:02:06 PM11/26/23

to

Quadibloc wrote:

> On Sat, 25 Nov 2023 19:55:59 +0000, MitchAlsup wrote:

>> But Integer Multiply and Divide can share the FPU that does these.

> But giving each one its own multiplier means more superscalar
> goodness!

> But you _do_ have a good point. This kind of "superscalar goodness"
> is usually wasted, because most programs don't do an equal amount
> of integer and FP arithmetic. So it would be way better to have
> twice as many multipliers and dividers that are designed to be used
> either way than having units of two kinds.

> I'm assuming, though, that one can make the multipliers and dividers
> simpler, hence faster, by making them single-use, and that it's
> possible to know, for a given use case, what the ratio between integer
> and FP operations will be.

In non-scientific code, there are 10 integer mul/divs per 1 FP mul/div
In Scientific code, there are 2 integer mul/divs per 10 FP mul/divs.

> After all, integer multiplication discards
> the MSBs of the result, and floating-point multiplication discards the
> LSBs of the result.

Means almost nothing because a FMAC can end up with those very same bits
as are preferred for Int ×s

>> And we have the need to use FP values in deciding to take branches.

> What gets used in taking branches is the _condition codes_. They're
> conveyed from the integer and floating point functional units, as well
> as all the other functional units, to a central place (the program
> status word).

LD R4,[address of FP value]
BFEQ0 R4,some label

How do you do this on a condition code machine in 2 instructions ??
System 460 had to::

LD F4,[address of FP value]
TST F4
BEQ some label

> John Savard

BGB

unread,

Nov 27, 2023, 1:18:42 AM11/27/23

to

On 11/26/2023 7:29 PM, Quadibloc wrote:
> On Sat, 25 Nov 2023 19:55:59 +0000, MitchAlsup wrote:
>
>> But Integer Multiply and Divide can share the FPU that does these.
>
> But giving each one its own multiplier means more superscalar
> goodness!
>
> But you _do_ have a good point. This kind of "superscalar goodness"
> is usually wasted, because most programs don't do an equal amount
> of integer and FP arithmetic. So it would be way better to have
> twice as many multipliers and dividers that are designed to be used
> either way than having units of two kinds.
>
> I'm assuming, though, that one can make the multipliers and dividers
> simpler, hence faster, by making them single-use, and that it's
> possible to know, for a given use case, what the ratio between integer
> and FP operations will be. After all, integer multiplication discards
> the MSBs of the result, and floating-point multiplication discards the
> LSBs of the result.
>

If one wants to have the most superscalar goodness, for general case
code, this means mostly lots of simple ALU ops (particularly,
ADD/SUB/AND/OR/SHIFT), and ability to do *lots* of memory Load/Store ops.

Where, say, Load/Store ops and simple ALU ops tend to dominate over
pretty much everything else in terms of usage frequency.

In most code, FPU ops are comparably sparse, as is MUL, and DIV/MOD is
fairly rare in comparison to either. MUL is common enough that you want
it to be fast when it happens, but not so common to where one is dealing
with piles of them all at the same time (except maybe in edge cases,
like the DCT transform in JPEG/MPEG).

If FMUL and MUL can't be co issued, this is unlikely to matter.

DIV is uncommon enough that it can be 80 or 100 cycles or similar, and
most code will not notice (except in rasterizers or similar, which
actually need fast DIV, but can cheat in that they don't usually need
exact DIV and the divisors are small, allowing for "shortcuts", such as
using lookup tables of reciprocals or similar).

Otherwise, one can debate whether or not having DIV/MOD in hardware
makes sense at all (and if they do have it, "cheap-ish" 68 cycle DIV is
at least "probably" faster than a generic software-only solution).

For cases like divide-by-constant, also, typically it is possible to
turn it in the compiler into a multiply by reciprocal.

>> And we have the need to use FP values in deciding to take branches.
>
> What gets used in taking branches is the _condition codes_. They're
> conveyed from the integer and floating point functional units, as well
> as all the other functional units, to a central place (the program
> status word).
>

Yeah, better to not have this style of condition codes...

Like, there are other ways of doing conditional branches...

> John Savard

Anton Ertl

unread,

Nov 27, 2023, 4:44:43 AM11/27/23

to

BGB <cr8...@gmail.com> writes:
>On 11/26/2023 7:29 PM, Quadibloc wrote:
>> On Sat, 25 Nov 2023 19:55:59 +0000, MitchAlsup wrote:
>>
>>> But Integer Multiply and Divide can share the FPU that does these.
>>
>> But giving each one its own multiplier means more superscalar
>> goodness!

Having two multipliers that serve both purposes means even more
superscalar goodness for similar area cost. However, there is the
issue of latency. The Willamette ships the integers over to the FPU
for multiplication, and the result back, crossing several clock
domains (at one clock loss per domain crossing), resulting in a
10-cycle latency for integer multiplication. I think that these days
every high-performance core with real silicon invests into separate
GPR and FP/SIMD (including integer SIMD) multipliers.

>In most code, FPU ops are comparably sparse

In terms of executed ops, that depends very much on the code. GP
cores have acquired SIMD cores primarily for FP ops, as can be seen by
both SSE and supporting only FP at first, and only later adding
integer stuff, because it cost little extra. Plus, we have added GPUs
that are now capable of doing huge amounts of FP ops, with uses in
graphics rendering, HPC and AI.

>Otherwise, one can debate whether or not having DIV/MOD in hardware
>makes sense at all (and if they do have it, "cheap-ish" 68 cycle DIV is
>at least "probably" faster than a generic software-only solution).

That debate has been held, and MIPS has hardware integer divide, Alpha
and IA-64 don't have a hardware integer divide; they both have FP
divide instructions.

However, looking at more recent architectures, the RISC-V M extension
(which is part of RV64G and RV32G, i.e., a standard extension) has not
just multiply instructions (MUL, MULH, MULHU, MULHSU, MULW), but also
integer divide instructions: DIV, DIVU, REM, REMU, DIVW, DIVUW, REMW,
and REMUW. ARM A64 also has divide instructions (SDIV, UDIV), but
RISC-V seems significant to me because there the philosophy seems to
be to go for minimalism. So the debate has apparently come to the
conclusion that for general-purpose architectures, you include an
integer divide instruction.

>For cases like divide-by-constant, also, typically it is possible to
>turn it in the compiler into a multiply by reciprocal.

But for that you want a multiply instruction, which in the RISC-V case
means including the M extension, which also includes divide
instructions. Multiplying by the reciprocal may still be faster.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Quadibloc

unread,

Nov 27, 2023, 8:27:05 AM11/27/23

to

On Mon, 27 Nov 2023 09:17:59 +0000, Anton Ertl wrote:

> That debate has been held, and MIPS has hardware integer divide, Alpha
> and IA-64 don't have a hardware integer divide; they both have FP divide
> instructions.

I didn't know this about the Itanium. All I remembered hearing was that
the Itanium "didn't have a divide instruction", and so I didn't realize
this applied to fixed-point arithmetic only.

John Savard

Anton Ertl

unread,

Nov 27, 2023, 11:07:49 AM11/27/23

to

You are right, I was wrong.
<http://gec.di.uminho.pt/discip/minf/ac0203/icca03/ia64fpbf1.pdf>
says:

|A number of floating-point operations defined by the IEEE Standard are
|deferred to software by the IA-64 architecture in all its
|implementations
|
|* floating-point divide (integer divide, which is based on the
| floating-point divide operation, is also deferred to software)

The paper goes on to describe that FP division a/b is based on
determining 1/b through Newton-Raphson approximation, using the FMA
instruction, and then multiplying with a. It shows a sequence of 13
instructions for double precision, 1 frcpa and 12 fma or fnma
instructions.

Given that integer division is based on FP division, it probably takes
even more instructions.

Meanwhile, a lowly Cortex-A53 with its divide instruction produces an
integer division result with a small quotient in a few cycles. In
other words, if you want to use that method for division, still
provide a division instruction, and then implement it with that
method. This provides the flexibility to switch to some other method
(e.g., the one used by ARM) later.

BGB

unread,

Nov 27, 2023, 2:41:20 PM11/27/23

to

On 11/27/2023 3:17 AM, Anton Ertl wrote:
> BGB <cr8...@gmail.com> writes:
>> On 11/26/2023 7:29 PM, Quadibloc wrote:
>>> On Sat, 25 Nov 2023 19:55:59 +0000, MitchAlsup wrote:
>>>
>>>> But Integer Multiply and Divide can share the FPU that does these.
>>>
>>> But giving each one its own multiplier means more superscalar
>>> goodness!
>
> Having two multipliers that serve both purposes means even more
> superscalar goodness for similar area cost. However, there is the
> issue of latency. The Willamette ships the integers over to the FPU
> for multiplication, and the result back, crossing several clock
> domains (at one clock loss per domain crossing), resulting in a
> 10-cycle latency for integer multiplication. I think that these days
> every high-performance core with real silicon invests into separate
> GPR and FP/SIMD (including integer SIMD) multipliers.
>

I ended up with different multipliers mostly because the requirements
are different...

>> In most code, FPU ops are comparably sparse
>
> In terms of executed ops, that depends very much on the code. GP
> cores have acquired SIMD cores primarily for FP ops, as can be seen by
> both SSE and supporting only FP at first, and only later adding
> integer stuff, because it cost little extra. Plus, we have added GPUs
> that are now capable of doing huge amounts of FP ops, with uses in
> graphics rendering, HPC and AI.
>

Yeah, this was not to say there is not FPU dense code, or that FP-SIMD
is not useful, but rather that in most general code, ALU and LD/ST ops
tend to dominate by a fair margin.

Similarly, SIMD ops may still be useful, even if they are a relative
minority of the executed instructions (even in code sequences which are
actively using SIMD ops...).

Like, typically, for every SIMD op used, there are also things like:
The loads and stores to get the value from memory and put the results in
memory;
ALU ops to calculate the index into the array or similar that we are
loading from or storing to;
...

Well, along with other ops, like shuffles and similar to get the SIMD
elements into the desired order, etc.

Like, some of this is why it is difficult to get anywhere near the
theoretical 200 MFLOP of the SIMD unit, apart from very contrived
use-cases (such as running neural net code), and had involved wonk like
operations which combined a SIMD shuffle into the SIMD ADD/SUB/MUL ops.

For a lot of other use cases, I can just be happy enough that the SIMD
ops are "not slow".

>> Otherwise, one can debate whether or not having DIV/MOD in hardware
>> makes sense at all (and if they do have it, "cheap-ish" 68 cycle DIV is
>> at least "probably" faster than a generic software-only solution).
>
> That debate has been held, and MIPS has hardware integer divide, Alpha
> and IA-64 don't have a hardware integer divide; they both have FP
> divide instructions.
>

Technically, also, BJX2 has ended up having both Integer DIV and FDIV
instructions. But, they don't gain all that much, so are still left as
optional features.

The integer divide isn't very fast, but it doesn't matter if it isn't
used all that often.

The FDIV is effectively a boat anchor (around 122 clock cycles).
Though, mostly this was based on the observation that with some minor
tweaks, the integer divide unit could be made to perform floating-point
divide as well.

The main merit though (over a software N-R divider) is that it can
apparently give exact results (my N-R dividers generally can't converge
past the low order 4 bits).

> However, looking at more recent architectures, the RISC-V M extension
> (which is part of RV64G and RV32G, i.e., a standard extension) has not
> just multiply instructions (MUL, MULH, MULHU, MULHSU, MULW), but also
> integer divide instructions: DIV, DIVU, REM, REMU, DIVW, DIVUW, REMW,
> and REMUW. ARM A64 also has divide instructions (SDIV, UDIV), but
> RISC-V seems significant to me because there the philosophy seems to
> be to go for minimalism. So the debate has apparently come to the
> conclusion that for general-purpose architectures, you include an
> integer divide instruction.
>

Yeah.

I mostly ended up adding integer divide so that the RISC-V mode could
support the 'M' extension, and if I have it for RISC-V, may as well also
add it in BJX2 as its own optional extension.

Had also added a "FAZ DIV" special case that made the integer divide
faster, where over a limited range of input values, the integer divide
would be turned into a multiply by reciprocal.

So, say, DIV latency:
64-bit: 68 cycle
32-bit: 36 cycle
32-bit FAZ: 3 cycle.

As it so happens, FAZ also covers a similar range to that typically used
for a rasterizer, but is more limited in that it only handles a range of
values it can calculate exactly. For my hardware rasterizer, a similar
strategy was used, but extended to support bigger divisors at the
tradeoff that it becomes inexact with larger divisors.

However, adding a "Divide two integers quickly, but may give an
inaccurate result" instruction would be a bit niche (and normal C code
couldn't use it without an intrinsic or similar).

Partial observation is that mostly, the actual bit patterns in the
reciprocals tends to be fairly repetitive, so it is possible to
synthesize a larger range of reciprocals using lookup tables and
shift-adjustments.

In the hardware rasterizer, I had experimented with a fixed-point 1/Z
divider for "more accurate" perspective-correct rasterization, but this
feature isn't cheap (fixed-point reciprocal is more complicated), and
ended up not using it for now (in favor of merely dividing S and T by Z
and then multiplying by the interpolated Z again during rasterization,
as a sort of less accurate "poor man's" version).

The "poor man's perspective correct" strategy doesn't eliminate the need
to subdivide primitives, but does allow the primitives to be somewhat
larger without having as much distortion (mostly relevant as my GL
renderer is still mostly CPU bound in the transform and subdivision
stages, *1).

In theory, proper perspective correct would entirely eliminate the need
to subdivide primitives, but would require clipping geometry to the
viewing frustum (otherwise, the texture coordinates freak out for
primitives crossing outside the frustum or crossing the near plane).

Approximate FDIV is possible, but I have typically used a different
strategy for the reciprocal.

*1: Though, ironically, this does also mean that, via multi-texturing,
it is semi-viable to also use lightmap lighting again in GLQuake (since
this doesn't add too much CPU penalty over the non-multitexture case).

However, dynamic lightmaps still aren't viable, as the CPU-side cost of
drawing the dynamic light-sources to the lightmaps, and then uploading
them, is still a bit too much.

I suspect, probably in any case, GLQuake wasn't likely written to run on
a 50MHz CPU.

Might have been a little easier if the games had poly-counts and and
draw distances more on par with PlayStation games (say, if the whole
scene is kept less than 500 polys; rather than 1k-2k polys).

Say, Quake being generally around 1k-2k polys for the scene, and roughly
300-500 triangles per alias model, ... Despite the scenes looking crude
with large flat surfaces, most of these surfaces had already been hacked
up a fair bit by the BSP algorithm (though, theoretically, some
annoyances could be reduced if the Quake1 maps were rebuilt using a
modified Q3BSP or similar, as the Quake3 tools natively support vertex
lighting and also don't hack up the geometry quite as much as the
original Quake1 BSP tool did, but alas...).

Wouldn't save much, as I still couldn't legally redistribute the data
files (and my GLQuake port isn't particularly usable absent using a
process to convert all the textures into DDS files and rebuilding the
PAK's and similar, so...).

To have something redistributable, would need to replace all of the
textures, sound effects, alias models, etc. An old (probably long dead)
"OpenQuartz" project had partly done a lot of this, but "got creative"
with the character models in a way I didn't like (would have preferred
something at least "in the same general spirit" as the original models).

Similar sort of annoyances as with FreeDoom, but at least FreeDoom
stayed closer to the original in these areas.

Also, possibly, I may need to rework some things in terms of how TKRA-GL
works to better match up with more recent developments (all this is
still a bit hacky; effectively linking the whole GL implementation to
the binary that uses GL).

Granted, it is also possible I may need to at some point move away from
linking the whole TestKern kernel to any programs that use it as well,
with the tradeoff that programs would no longer be able to launch "bare
metal" in the emulator (but would always need the kernel to also be
present).

>> For cases like divide-by-constant, also, typically it is possible to
>> turn it in the compiler into a multiply by reciprocal.
>
> But for that you want a multiply instruction, which in the RISC-V case
> means including the M extension, which also includes divide
> instructions. Multiplying by the reciprocal may still be faster.
>

Yeah.

For the relevant cases, widening multiply for 32-bit integers generally
has a 3 cycle latency in my case.

Whereas DIV will often have a 36 cycle latency in the general case.
So, the choice is obvious.

Granted, this still generally needs a shift instruction following the
multiply.

Having a "Do a widening 32-bit multiply and then right shift the results
32/33/34/35 bits" would be possible in theory, but a little wonky.

Could possibly make sense if it could also encode a 32-bit immediate:
DMULSH33S Rm, Imm32u, Rn //Int32: Rn=(Rm*Imm32)>>33
DMULSH33U Rm, Imm32u, Rn //UInt32: Rn=(Rm*Imm32)>>33

Probably don't really have the encoding space to pull this off though
(with a 64-bit encoding; Could be doable in theory with an 96-bit
FE-FF-OP encoding though).

Though, as can be noted, this could likely borrow some of the logic from
the FAZ-DIV mechanism or similar.

Would allow handling "y=x/C;" cases in 3L/1T though, but, integer
division still isn't really all that common (at least, not enough to
justify going through all this just to save using a right-shift op).

> - anton

MitchAlsup

unread,

Nov 27, 2023, 8:02:08 PM11/27/23

to

BGB wrote:

> On 11/26/2023 7:29 PM, Quadibloc wrote:
>> On Sat, 25 Nov 2023 19:55:59 +0000, MitchAlsup wrote:
>>
>>> But Integer Multiply and Divide can share the FPU that does these.
>>
>> But giving each one its own multiplier means more superscalar
>> goodness!
>>
>> But you _do_ have a good point. This kind of "superscalar goodness"
>> is usually wasted, because most programs don't do an equal amount
>> of integer and FP arithmetic. So it would be way better to have
>> twice as many multipliers and dividers that are designed to be used
>> either way than having units of two kinds.
>>
>> I'm assuming, though, that one can make the multipliers and dividers
>> simpler, hence faster, by making them single-use, and that it's
>> possible to know, for a given use case, what the ratio between integer
>> and FP operations will be. After all, integer multiplication discards
>> the MSBs of the result, and floating-point multiplication discards the
>> LSBs of the result.
>>

> If one wants to have the most superscalar goodness, for general case
> code, this means mostly lots of simple ALU ops (particularly,
> ADD/SUB/AND/OR/SHIFT), and ability to do *lots* of memory Load/Store ops.

> Where, say, Load/Store ops and simple ALU ops tend to dominate over
> pretty much everything else in terms of usage frequency.

LD+ST 33%
Int 45%

> In most code, FPU ops are comparably sparse,

FP 9%
> as is MUL,
MUL 2%
> and DIV/MOD is
DIV 0.3%

> fairly rare in comparison to either. MUL is common enough that you want
> it to be fast when it happens, but not so common to where one is dealing
> with piles of them all at the same time (except maybe in edge cases,
> like the DCT transform in JPEG/MPEG).

MUL is much more common in FORTRAN than in C-like languages as it forms
the underlying math for multidimensional arrays, and more ×s exist in
typical number crunching code (F) than in pointer chasing (C) languages.

> If FMUL and MUL can't be co issued, this is unlikely to matter.

Since both are pipelined, this is a small stall {{Unlike MIPS R2000 where
IMUL was 12 cycles blocking while FMUL was 3 (or was it 4) pipelined.

> DIV is uncommon enough that it can be 80 or 100 cycles or similar, and
> most code will not notice (except in rasterizers or similar, which
> actually need fast DIV, but can cheat in that they don't usually need
> exact DIV and the divisors are small, allowing for "shortcuts", such as
> using lookup tables of reciprocals or similar).

We know that FDIV can be performed in 17 cycles in the FMUL unit, and
this has been known since 1989......why are you making it so long ??

0.3% IDIV × 100 cycles and you have added 3 cycles to the average
instruction !!!

> Otherwise, one can debate whether or not having DIV/MOD in hardware
> makes sense at all (and if they do have it, "cheap-ish" 68 cycle DIV is
> at least "probably" faster than a generic software-only solution).

With an 8-bit start, and Newton-Raphson iteration, SW should be able
to do this in mid-30-cycles. {{See ITANIC FDIV SW profile}}

> For cases like divide-by-constant, also, typically it is possible to
> turn it in the compiler into a multiply by reciprocal.

And you should if it helps.

>>> And we have the need to use FP values in deciding to take branches.
>>
>> What gets used in taking branches is the _condition codes_. They're
>> conveyed from the integer and floating point functional units, as well
>> as all the other functional units, to a central place (the program
>> status word).
>>

> Yeah, better to not have this style of condition codes...

RISC-V, MIPS, Mc 88K, My 66000, have no condition codes......

> Like, there are other ways of doing conditional branches...

"Compare to zero and branch" × {Signed, Unsigned, FP16, FP32, FP64}}
for 1 instruction FP branches

Compare and then branch on anything (2 inst::1 int, 1 FP) your heart desires
Branch {NaN, -Infinity, +Infinity, -DeNorm, +Denorm, -Normal, +Normal, Zero,
-zero, +zero, < <= > >= == !=, 0 < x < y, 0 <=x < y, 0 < x <= y, 0 <= x <= y, ... }
Can you even record all these states in a single condition code ??

>> John Savard

MitchAlsup

unread,

Nov 27, 2023, 8:06:07 PM11/27/23

to

Anton Ertl wrote:

> BGB <cr8...@gmail.com> writes:
>>On 11/26/2023 7:29 PM, Quadibloc wrote:
>>> On Sat, 25 Nov 2023 19:55:59 +0000, MitchAlsup wrote:
>>>
>>>> But Integer Multiply and Divide can share the FPU that does these.
>>>
>>> But giving each one its own multiplier means more superscalar
>>> goodness!

> Having two multipliers that serve both purposes means even more
> superscalar goodness for similar area cost. However, there is the
> issue of latency. The Willamette ships the integers over to the FPU
> for multiplication, and the result back, crossing several clock
> domains (at one clock loss per domain crossing), resulting in a
> 10-cycle latency for integer multiplication. I think that these days
> every high-performance core with real silicon invests into separate
> GPR and FP/SIMD (including integer SIMD) multipliers.

>>In most code, FPU ops are comparably sparse

> In terms of executed ops, that depends very much on the code. GP
> cores have acquired SIMD cores primarily for FP ops, as can be seen by
> both SSE and supporting only FP at first, and only later adding
> integer stuff, because it cost little extra. Plus, we have added GPUs
> that are now capable of doing huge amounts of FP ops, with uses in
> graphics rendering, HPC and AI.

Once SIMD gains Integer operations, the Multiplier has to be built to
do both, might as well use it for more things than just SIMD.

>>Otherwise, one can debate whether or not having DIV/MOD in hardware
>>makes sense at all (and if they do have it, "cheap-ish" 68 cycle DIV is
>>at least "probably" faster than a generic software-only solution).

> That debate has been held, and MIPS has hardware integer divide, Alpha
> and IA-64 don't have a hardware integer divide; they both have FP
> divide instructions.

And all of these lead fruitful long productive lives before taking over
-------Oh wait !!

> However, looking at more recent architectures, the RISC-V M extension
> (which is part of RV64G and RV32G, i.e., a standard extension) has not
> just multiply instructions (MUL, MULH, MULHU, MULHSU, MULW), but also
> integer divide instructions: DIV, DIVU, REM, REMU, DIVW, DIVUW, REMW,
> and REMUW.

All of which are possible in My 66000 using operand sign control, S-bit,
and CARRY when you want 64×64->128 or 128/64->{64 quotient, 64 remainder}

> ARM A64 also has divide instructions (SDIV, UDIV), but
> RISC-V seems significant to me because there the philosophy seems to
> be to go for minimalism. So the debate has apparently come to the
> conclusion that for general-purpose architectures, you include an
> integer divide instruction.

Several forms of integer DIV at least signed and unsigned.....

MitchAlsup

unread,

Nov 27, 2023, 8:06:08 PM11/27/23

to

Anton Ertl wrote:

> Quadibloc <quad...@servername.invalid> writes:
>>On Mon, 27 Nov 2023 09:17:59 +0000, Anton Ertl wrote:
>>
>>> That debate has been held, and MIPS has hardware integer divide, Alpha
>>> and IA-64 don't have a hardware integer divide; they both have FP divide
>>> instructions.
>>
>>I didn't know this about the Itanium. All I remembered hearing was that
>>the Itanium "didn't have a divide instruction", and so I didn't realize
>>this applied to fixed-point arithmetic only.

> You are right, I was wrong.
> <http://gec.di.uminho.pt/discip/minf/ac0203/icca03/ia64fpbf1.pdf>
> says:

> |A number of floating-point operations defined by the IEEE Standard are
> |deferred to software by the IA-64 architecture in all its
> |implementations
> |
> |* floating-point divide (integer divide, which is based on the
> | floating-point divide operation, is also deferred to software)

> The paper goes on to describe that FP division a/b is based on
> determining 1/b through Newton-Raphson approximation, using the FMA
> instruction, and then multiplying with a. It shows a sequence of 13
> instructions for double precision, 1 frcpa and 12 fma or fnma
> instructions.

> Given that integer division is based on FP division, it probably takes
> even more instructions.

You have to synthesize the bits between 63 and 53 to get 64/64->64.

Quadibloc

unread,

Nov 27, 2023, 11:50:34 PM11/27/23

to

On Mon, 27 Nov 2023 15:54:03 +0000, Anton Ertl wrote:

> Quadibloc <quad...@servername.invalid> writes:
>>On Mon, 27 Nov 2023 09:17:59 +0000, Anton Ertl wrote:
>>
>>> That debate has been held, and MIPS has hardware integer divide, Alpha
>>> and IA-64 don't have a hardware integer divide; they both have FP
>>> divide instructions.
>>
>>I didn't know this about the Itanium. All I remembered hearing was that
>>the Itanium "didn't have a divide instruction", and so I didn't realize
>>this applied to fixed-point arithmetic only.
>
> You are right, I was wrong.
> <http://gec.di.uminho.pt/discip/minf/ac0203/icca03/ia64fpbf1.pdf> says:
>
> |A number of floating-point operations defined by the IEEE Standard are
> |deferred to software by the IA-64 architecture in all its
> |implementations |
> |* floating-point divide (integer divide, which is based on the |
> floating-point divide operation, is also deferred to software)
>
> The paper goes on to describe that FP division a/b is based on
> determining 1/b through Newton-Raphson approximation, using the FMA
> instruction, and then multiplying with a. It shows a sequence of 13
> instructions for double precision, 1 frcpa and 12 fma or fnma
> instructions.
>
> Given that integer division is based on FP division, it probably takes
> even more instructions.

The fastest _hardware_ implementations of FP division are Goldschmidt
division and Newton-Raphson divisiion, and they both make use of
multiplication hardware.

So using Newton-Raphson division to increase the width of an FP division
result to perform 64-bit integer division won't be too hard.

Since the Wikipedia article didn't go into detail, I had to look further
to find out that you _were_ right about the DEC Alpha, however. It did
have floating divide but not integer divide. The 21264 added floating
square root.

John Savard

Anton Ertl

unread,

Nov 28, 2023, 4:18:57 AM11/28/23

to

mitch...@aol.com (MitchAlsup) writes:

>Anton Ertl wrote:
>> <http://gec.di.uminho.pt/discip/minf/ac0203/icca03/ia64fpbf1.pdf>
>> says:
>
>> |A number of floating-point operations defined by the IEEE Standard are
>> |deferred to software by the IA-64 architecture in all its
>> |implementations
>> |
>> |* floating-point divide (integer divide, which is based on the
>> | floating-point divide operation, is also deferred to software)
>
>> The paper goes on to describe that FP division a/b is based on
>> determining 1/b through Newton-Raphson approximation, using the FMA
>> instruction, and then multiplying with a. It shows a sequence of 13
>> instructions for double precision, 1 frcpa and 12 fma or fnma
>> instructions.
>
>> Given that integer division is based on FP division, it probably takes
>> even more instructions.
>
>You have to synthesize the bits between 63 and 53 to get 64/64->64.

The paper says that they use double-extended division to synthesize
64/64-bit division; I expect that double-extended uses even more
Newton-Raphson steps (and FMAs) than double precision; and then there
are the additional steps for getting an integer.

The paper only shows signed 16/16 bit division, a 15-instruction
sequence including 1 fcrpa and 3 fma/fnma instructions. So you can
expect that 64/64 is at least 13+15-4=24 instructions long, probably
longer (because of double-extended precision).

Anton Ertl

unread,

Nov 28, 2023, 4:33:40 AM11/28/23

to

mitch...@aol.com (MitchAlsup) writes:
>Several forms of integer DIV at least signed and unsigned.....

Gforth has unsigned, floored (signed) and symmetric (signed), in
double-cell/single-cell variants (standardized in Forth) as well as in
single/single variants.

Floored rounds the quotient down, symmetric (often called truncated)
rounds the quotient towards 0. There is also something called
Euclidean that always produces a remainder >=0, but the practical
difference from floored is when the divisor is negative, which is
exceedingly rare.

Given the rareness of negative divisors, one might wonder about
signed/unsigned, but at least implementations of current programming
languages would probably make little use of such instructions even if
they were cheaper than the signed/signed ones.

BGB

unread,

Nov 28, 2023, 5:33:04 AM11/28/23

to

Yeah, pretty much...

>> fairly rare in comparison to either. MUL is common enough that you
>> want it to be fast when it happens, but not so common to where one is
>> dealing with piles of them all at the same time (except maybe in edge
>> cases, like the DCT transform in JPEG/MPEG).
>
> MUL is much more common in FORTRAN than in C-like languages as it forms
> the underlying math for multidimensional arrays, and more ×s exist in
> typical number crunching code (F) than in pointer chasing (C) languages.
>

Yeah.

Multidimensional arrays are comparably rarely used in C land, and when
they are used, they typically have power-of-2 dimensions.

>> If FMUL and MUL can't be co issued, this is unlikely to matter.
>
> Since both are pipelined, this is a small stall {{Unlike MIPS R2000 where
> IMUL was 12 cycles blocking while FMUL was 3 (or was it 4) pipelined.
>

Slow IMUL but fast FMUL: WTF?...
Unless this was only counting single precision, then it could make more
sense.

>> DIV is uncommon enough that it can be 80 or 100 cycles or similar, and
>> most code will not notice (except in rasterizers or similar, which
>> actually need fast DIV, but can cheat in that they don't usually need
>> exact DIV and the divisors are small, allowing for "shortcuts", such
>> as using lookup tables of reciprocals or similar).
>
> We know that FDIV can be performed in 17 cycles in the FMUL unit, and
> this has been known since 1989......why are you making it so long ??
>
> 0.3% IDIV × 100 cycles and you have added 3 cycles to the average
> instruction !!!
>

Can note that using DIV/MOD seems a lot rarer in my case...

But, I can note that my compiler turns constant divide into alternatives:
y=x/C; //~ y=(x*RCP)>>SHR;
y=x%C; //~ y=x-(((x*RCP)>>SHR)*C);

Though, with some extra wonk thrown in so negative values round the
divide towards zero and similar (otherwise, negative values would round
away from zero, which is not the commonly accepted result).

y=((x+((x<0)?(C-1):0))*RCP)>>SHR;

The variable cases are comparably infrequent.

>> Otherwise, one can debate whether or not having DIV/MOD in hardware
>> makes sense at all (and if they do have it, "cheap-ish" 68 cycle DIV
>> is at least "probably" faster than a generic software-only solution).
>
> With an 8-bit start, and Newton-Raphson iteration, SW should be able
> to do this in mid-30-cycles. {{See ITANIC FDIV SW profile}}
>

I was thinking of integer divide here (being 68 cycle for DIVS.Q/DIVU.Q
in my case; 36 for DIVS.L/DIVU.L).

But, yeah, the FDIV instruction is even slower (120 cycles); and
software Newton-Raphson will win this race...

Partly this is the drawback of using a Shift-Add design.

Say, every clock-cycle, the value is shifted left by 1 bit, and logic
determines whether or not to add a second value to the working value.

For multiply, you can use the bits of one input to mask whether or not
to add the other input. For divide, the carry of the adder can be used
to mask whether to add the inverse of the divisor.

For FDIV, one runs the divider for more clock cycles, and it generates
bits below the decimal point.

Though, at least for the merit of an Integer DIV op, it is faster to use
a 36-cycle integer divide instruction than it is to run a shift-subtract
loop in software.

>> For cases like divide-by-constant, also, typically it is possible to
>> turn it in the compiler into a multiply by reciprocal.
>
> And you should if it helps.
>

Yeah, it does.
Multiply by reciprocal is considerably faster.

>>>> And we have the need to use FP values in deciding to take branches.
>>>
>>> What gets used in taking branches is the _condition codes_. They're
>>> conveyed from the integer and floating point functional units, as well
>>> as all the other functional units, to a central place (the program
>>> status word).
>>>
>
>> Yeah, better to not have this style of condition codes...
>
> RISC-V, MIPS, Mc 88K, My 66000, have no condition codes......
>

Yeah.

In BJX2, there is only SR.T, however, the way it is updated and used is
much narrower in scope.

A case could be made for narrowing its scope further though.

>> Like, there are other ways of doing conditional branches...
>
> "Compare to zero and branch" × {Signed, Unsigned, FP16, FP32, FP64}}
> for 1 instruction FP branches
>

Yeah, I had ended up adding this strategy as well as the prior SR.T
mechanism.

> Compare and then branch on anything (2 inst::1 int, 1 FP) your heart
> desires Branch {NaN, -Infinity, +Infinity, -DeNorm, +Denorm, -Normal,
> +Normal, Zero, -zero, +zero, < <= > >= == !=, 0 < x < y, 0 <=x < y, 0 <
> x <= y, 0 <= x <= y, ... }
> Can you even record all these states in a single condition code ??
>

Possible.

>>> John Savard

Marko Zec

unread,

Nov 28, 2023, 8:16:53 AM11/28/23

to

MitchAlsup <mitch...@aol.com> wrote:
...
> 0.3% IDIV ? 100 cycles and you have added 3 cycles to the average
> instruction !!!

Not really, more likeliky it would be a 0.33 average CPI bump, assuming
a 1.0 baseline CPI for non-DIV instructions. Still way to much to be
ignored...

BGB

unread,

Nov 28, 2023, 2:00:20 PM11/28/23

to

I missed that, but yeah.

Still, it seems the incidence of the integer DIV/MOD/etc instructions
being used is somewhat rarer than this estimate in the code I was measuring.

Checking my compiler output for static use count:
Around 0.05% ...

It might have been more common though, if my compiler didn't
aggressively eliminate the divide by constant cases (and my coding
practices didn't actively avoid using division in general, etc).

Since, between either a slowish DIV instruction, or a software
shift-subtract loop, neither was expected to be all that fast.

On pretty much every system I have used IRL, division has tended to be
fairly slow, so I had tended to avoid it whenever possible.

In cases where fast divide was needed, there were often other
workarounds (such as using a lookup table of reciprocals).

If inexact results are acceptable, it is possible to stretch the table
of reciprocals up to pretty much arbitrary divisors (via normalization
and some shift-related trickery).
In software, this is helped with a CLZ instruction or similar, but had
also used a variant of this effectively in my hardware rasterizer module.

Say:
z=x/y;
y:
0.. 31: Direct multiply by MAX/y (Table A)
32.. 63: Normalized multiply by MAX/y (Table B)
64..127: Use (y>>1) as index to Table B, add 1 to final SHR
128..255: Use (y>>2) as index to Table B, add 2 to final SHR
...

The sizes of Tables A/B can be reduced (via another lookup table) by
making use of repetitions in the bit patterns (though, this is more
practical in Verilog than in software).

...

Though, getting acceptable results from fixed-point division is a little
more of an issue.

Stephen Fuld

unread,

Nov 28, 2023, 2:22:53 PM11/28/23

to

On 11/28/2023 11:00 AM, BGB wrote:

snip

> Still, it seems the incidence of the integer DIV/MOD/etc instructions
> being used is somewhat rarer than this estimate in the code I was
> measuring.

Is this a result of your using mostly a few old games as your
"reference" code? i.e. your code base may be different from what gave
rise to the statistics mentioned by others.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

MitchAlsup

unread,

Nov 28, 2023, 2:41:04 PM11/28/23

to

Goldschmidt is about 2× faster than N-R because:: Goldschmidt inner loop
has 2 independent multiplies, while N-R has 2 dependent multiplies. To
meet IEEE 754 accuracy, the last iteration in Goldschmide is ½ a N-R
iteration {{you perform the N-R 1st step and then use the sign bit to
determine +1 or +2 and the +0 can be determined while the multiply is
transpiring.}}

BGB

unread,

Nov 28, 2023, 3:09:27 PM11/28/23

to

On 11/28/2023 1:22 PM, Stephen Fuld wrote:
> On 11/28/2023 11:00 AM, BGB wrote:
>
> snip
>
>> Still, it seems the incidence of the integer DIV/MOD/etc instructions
>> being used is somewhat rarer than this estimate in the code I was
>> measuring.
>
> Is this a result of your using mostly a few old games as your
> "reference" code? i.e. your code base may be different from what gave
> rise to the statistics mentioned by others.
>

Possibly.

There is my C library and kernel/runtime code;
And games like Doom and Quake and similar.

I suspect probably ID also avoided integer division when possible as
well, as it is rarely used (and was probably still slow back in
early/mid 90s PCs).

Looking up, 486 and Pentium 1 still had ~ 40+ cycle 32-bit DIV/IDIV.
...

Looks like MUL/IMUL had similar latency to DIV/IDIV on the 386 and 486,
but then the latency dropped down a fair bit on the Pentium.

However, Doom makes extensive use of integer multiply.

Given the timings, I suspect it is possible they were also using a
Shift-Add design for the multiply/divide unit.

In my case, I am using Shift-Add for DIV and also 64-bit multiply, but a
faster DSP48 based multiplier for 32-bit multiply.

If I were to make a guess based on the latency, I would guess the P1 was
using something like a radix-16 long-multiply.

Comparably, it seems DIV/IDIV didn't actually get much faster until the
Ryzen.

Have noted that these old games did have a nifty trick for calculating
approximate distance, say:
dx=x0-x1;
dy=y0-y1;
adx=dx^(dx>>31);
ady=dy^(dy>>31);
if(ady>adx)
{ t=adx; adx=ady; ady=t; } //common
// { adx^=ady; ady^=adx; adx^=ady; } //sometimes
d=adx+(ady>>1);

Can be extended to 3D without too much issue, but for 4D, the sorting
step becomes more of an issue.

Kinda makes sense though, as finding the square-root of things is also
expensive...

But, not sure how a lot of this compares with other (non game) coding
practices.

Have noticed though that a lot of these games really do like using lots
of global variables (a lot more so than I would prefer to use in my
coding practices).

Though, this is one case where it is useful to be able to load/store
global variables in a single instruction (and using something like a GOT
may have a significant penalty on code that uses lots of global variables).

BGB

unread,

Nov 28, 2023, 3:13:32 PM11/28/23

to

Well, and/or just use separate multipliers for each, under the rationale
that, at least on an FPGA, it already has the DSP48's whether or not one
uses them (and a lot more DSP48's than I can really make effective use of).

Well, with the caveat that they only natively do signed 18-bit multiply.

>>> Otherwise, one can debate whether or not having DIV/MOD in hardware
>>> makes sense at all (and if they do have it, "cheap-ish" 68 cycle DIV
>>> is at least "probably" faster than a generic software-only solution).
>
>> That debate has been held, and MIPS has hardware integer divide, Alpha
>> and IA-64 don't have a hardware integer divide; they both have FP
>> divide instructions.
>
> And all of these lead fruitful long productive lives before taking over
> -------Oh wait !!
>

MIPS at least did moderately well for a time, might have done better if
it were more open.

Both Alpha and IA-64 had a lot of questionable seeming design choices.

>> However, looking at more recent architectures, the RISC-V M extension
>> (which is part of RV64G and RV32G, i.e., a standard extension) has not
>> just multiply instructions (MUL, MULH, MULHU, MULHSU, MULW), but also
>> integer divide instructions: DIV, DIVU, REM, REMU, DIVW, DIVUW, REMW,
>> and REMUW.
>
> All of which are possible in My 66000 using operand sign control, S-bit,
> and CARRY when you want 64×64->128 or 128/64->{64 quotient, 64 remainder}
>
>> ARM A64 also has divide instructions (SDIV, UDIV), but
>> RISC-V seems significant to me because there the philosophy seems to
>> be to go for minimalism. So the debate has apparently come to the
>> conclusion that for general-purpose architectures, you include an
>> integer divide instruction.
>
> Several forms of integer DIV at least signed and unsigned.....

I had left it as optional, but I guess, optional is not the same as
absent, and my main profiles had ended up including it.

The ops are generally given names like:
DIVS.L, DIVU.L, DIVS.Q, DIVU.Q
MODS.L, MODU.L, MODS.Q, MODU.Q

One may notice a curious level of overlap with the RISC-V ops here...

Though, for 32-bit:
MULS.L, MULU.L: Perform a narrow multiply.
DMULS.L, DMULU.L: Perform a widening multiply.
Along with:
MULS.Q, MULU.Q: 64-bit multiply
MULHS.Q, MULHU.Q: 64-bit multiply, high 64 bits of result

DIV, MOD, and 64-bit multiply, is drastically slower than 32-bit
multiply though.

MitchAlsup

unread,

Nov 30, 2023, 11:27:17 AM11/30/23

to

BGB wrote:

> 64-bit multiply, is drastically slower than 32-bit
> multiply though.

Should only be 1 gate of delay longer......

MitchAlsup

unread,

Nov 30, 2023, 11:27:17 AM11/30/23

to

BGB wrote:

> Have noted that these old games did have a nifty trick for calculating
> approximate distance, say:
> dx=x0-x1;
> dy=y0-y1;
> adx=dx^(dx>>31);
> ady=dy^(dy>>31);
> if(ady>adx)
> { t=adx; adx=ady; ady=t; } //common
> // { adx^=ady; ady^=adx; adx^=ady; } //sometimes
> d=adx+(ady>>1);

Why not::

dx=x0-x1;
dy=y0-y1;
adx=dx^(dx>>31);
ady=dy^(dy>>31);
if(ady>adx)

d=adx+(ady>>1);
else
d=ady+(adx>>1);

Robert Finch

unread,

Nov 30, 2023, 12:08:38 PM11/30/23

to

FPGA land using DSPs, 24x16 multiply is single cycle, 32x32 could be
single cycle if a lower fmax is acceptable, but 64x64 best pipelined for
several cycles (6). Thor had a fast 24x16 multiply as it is good enough
for array address calcs in a lot of circumstances.

The slower 64x64 multiply means that the basic shift and subtract divide
is more appealing compared to NR or Goldschmidt.

MitchAlsup

unread,

Nov 30, 2023, 12:51:09 PM11/30/23

to

Robert Finch wrote:

> On 2023-11-30 11:25 a.m., MitchAlsup wrote:
>> BGB wrote:
>>
>>> 64-bit multiply, is drastically slower than 32-bit
>>> multiply though.
>>
>> Should only be 1 gate of delay longer......

> FPGA land using DSPs, 24x16 multiply is single cycle, 32x32 could be
> single cycle if a lower fmax is acceptable, but 64x64 best pipelined for
> several cycles (6). Thor had a fast 24x16 multiply as it is good enough
> for array address calcs in a lot of circumstances.

Nothing stops you from building a multiplier in LUTs with Boothe recoding.
Each pair of LUTs being a 3-input XOR and a 3-input Majority gate. This also
gets rid of BGBs argument on building an incomplete multiplier tree {because
IEE needs 53-bits while your DSP only supplies 48.}

BGB

unread,

Nov 30, 2023, 1:56:50 PM11/30/23

to

On 11/30/2023 11:08 AM, Robert Finch wrote:
> On 2023-11-30 11:25 a.m., MitchAlsup wrote:
>> BGB wrote:
>>
>>> 64-bit multiply, is drastically slower than 32-bit
>>> multiply though.
>>
>> Should only be 1 gate of delay longer......
>
> FPGA land using DSPs, 24x16 multiply is single cycle, 32x32 could be
> single cycle if a lower fmax is acceptable, but 64x64 best pipelined for
> several cycles (6). Thor had a fast 24x16 multiply as it is good enough
> for array address calcs in a lot of circumstances.
>

Yeah, 18x18s (by extension, 16s and 16u) can be single cycle, no issue.

In my case, 32x32 is 3 cycle.
For 64x64, "all hell breaks loose" if one tries to express it directly.

The limiting factor isn't the first-stage of multipliers, but all the
adders needed to glue the results together. Likely, wherever is going on
inside the DSP48 is a lot faster than what happens with the generic FPGA
fabric.

For small 3x3->6 bit multipliers or similar, these can be built directly
from LUTs, and for very small values this can be favorable to using a DSP48.

Only reason the multiply in Binary64 FMUL works OK, is because I
effectively ignore all the low-order bits that would have existed.
Though, arguably, this could be part of why N-R can't converge on the
last 4 bits of the result (since this part seemingly involves sub-ULP
shenanigans that would depend on the low-order bits of the internal
multiply, which don't exist in this case).

When I looked at it originally, doing a direct 64-bit multiply wasn't
terribly attractive because it would be both fairly expensive and not
that much faster than a plain software option built on 32-bit widening
multiply.

Say (with ADD.X, 64x64 -> 128):
SHAD.Q R4, -32, R6 | SHAD.Q R5, -32, R7 //1c
MOV 0, R3 | DMULU.L R4, R5, R16 //1c
DMULU.L R6, R4, R18 //1c *1
DMULU.L R4, R7, R19 //1c
DMULS.L R6, R7, R17 //1c
MOVLLD R18, R3, R20 | SHAD.Q R18, -32, R21 //1c
MOVLLD R19, R3, R22 | SHAD.Q R19, -32, R23 //2c
ADD.X R20, R22, R2 //2c
ADD.X R2, R16, R2 //1c
RTS //2c
So, ~ 13 cycles (with 2c interlock penalty).

Or, say 64x64->64:
SHAD.Q R4, -32, R6 | SHAD.Q R5, -32, R7 //1c
MOV 0, R3 | DMULU.L R4, R5, R16 //1c
DMULU.L R6, R4, R18 //2c *1
DMULU.L R4, R7, R19 //3c
ADD R18, R19, R17 //2c
SHAD.Q R17, 32, R17 //2c
ADD R16, R17, R2 //1c
RTS //2c
Here, ~ 14 cycles (5 cycles interlock penalty).

*1: Can't remember for certain ATM whether DMULS.L or DMULU.L is needed
for these parts. Mostly effects the contributions of sign in the
high-order results. But, for 64-bit narrow multiply, we can mostly
ignore the sign, as it doesn't change anything in the low 64 bits of the
result.

Though, there will be some additional overhead for the function call,
which would be avoided with a CPU instruction.

Even with a slow 68-cycle MULS.Q, when function call overheads are
considered, it still isn't too far from break-even.

In my case, this part is optional, and can be controlled with a compiler
command-line option.

> The slower 64x64 multiply means that the basic shift and subtract divide
> is more appealing compared to NR or Goldschmidt.
>

Yes, agreed.

These require a fast full-precision multiplier (or two for
Goldschmidt?), which I don't have either in this case.

But, luckily, had noted that Shift-and-Subtract was trivially turned
into Shift-and-Add, and that the same general algorithm could be used
(with some minor tweaks) to also function as a 64-bit multiplier.

It isn't fast, but:
It works...
Is is moderately affordable.
Can technically give a 64-bit multiply op as well.
Even if, for performance, still better to do 64b MUL in software.

Though, the difference between the ops isn't too unreasonable when
function-call overhead is considered.

And, for general case DIV and MOD, the ops were still faster than the
software versions (even when the software version may have optimizations
like mapping small divisors through a lookup table of reciprocals).

This was mostly because of overheads:
Function calls aren't free (may need to spill stuff, shuffle values
around in registers, etc);
Special-casing the small divisors requires additional logic, and adds a
layer of function-call overhead for the general case.

And, vs a 36 cycle DIVS.L, is isn't hard to burn this just with function
call related overheads.

BGB

unread,

Nov 30, 2023, 2:17:35 PM11/30/23

to

But...

Then one needs to deal with the full latency of a multiplier built from
LUT's and CARRY4's, which isn't going to be small...

One could do this, but would likely need to add a few cycles of latency
to the FMUL.

As is, only around 34 bits of the DSP output are used if one is using
them as unsigned 17 bit multipliers. There is still a short-fall of a
few bits, but this can be patched over using 3-bit LUT multipliers or
similar.

It seems like the ADDer capability of the DSPs isn't getting used, but
this is partly a case of being able to get it to synthesize in a useful way.

Say:
D <= A * B + C;
Which means, one needs the 'C' input before they can do the multiply
(and happens to work in a way that is more useful for small MAC
operations than for building a bigger multiplier).

Luckily at least, my FMUL does produce a full-width mantissa for
Binary64, with the low-end part of the multiplier being cut off mostly
effecting the ULP rounding.

BGB

unread,

Nov 30, 2023, 2:43:44 PM11/30/23

to

That also works, I think this is closer to the form it was used in ROTT
(where I had first noticed it being used).

Swapping the values is easier to extend to 3D though, say:
dx=x0-x1;
dy=y0-y1;
dz=z0-z1;
adx=dx^(dx>>31);
ady=dy^(dy>>31);
adz=dz^(dz>>31);

if(ady>adx)
{ t=adx; adx=ady; ady=t; }

if(adz>ady)
{ t=ady; ady=adz; adz=t; }

if(ady>adx)
{ t=adx; adx=ady; ady=t; }

d=adx+(ady>>1)+(adz>>2);

There isn't a strong reason to prefer one option over another here,
except to avoid the XOR swap, as in this case the XOR's are sequentially
dependent on each other.

Not so pretty for 4D though, as swaps quickly come to dominate things.

There is a way of doing it with a "poor man's square root", but it would
be a lot more hairy looking than the above, and for fixed-point requires
knowing where the decimal point is at (whereas the above manages to work
while also remaining agnostic as to the location of the decimal point).

Luckily at least, one doesn't really need to deal with 4D spheres with
fixed-point all that often.

MitchAlsup

unread,

Nov 30, 2023, 3:32:24 PM11/30/23

to

BGB wrote:

> On 11/30/2023 10:23 AM, MitchAlsup wrote:
>> BGB wrote:
>>
>>> Have noted that these old games did have a nifty trick for calculating
>>> approximate distance, say:
>>>    dx=x0-x1;
>>>    dy=y0-y1;
>>>    adx=dx^(dx>>31);
>>>    ady=dy^(dy>>31);
>>>    if(ady>adx)
>>>      { t=adx; adx=ady; ady=t; }         //common
>>> // { adx^=ady; ady^=adx; adx^=ady; } //sometimes
>>>    d=adx+(ady>>1);
>>
>> Why not::
>>
>>     dx=x0-x1;
>>     dy=y0-y1;
>>     adx=dx^(dx>>31);
>>     ady=dy^(dy>>31);
>>     if(ady>adx)
>>         d=adx+(ady>>1);
>>     else
>>         d=ady+(adx>>1);

As long as ISA has max() and min():: Why not::

    dx=x0-x1;
    dy=y0-y1;
    adx=dx^(dx>>31);
    ady=dy^(dy>>31);

d = min( adx, ady ) + (max( adx, ady ) >> 1);

>

BGB

unread,

Nov 30, 2023, 5:02:28 PM11/30/23

to

Yeah, I guess that could work as well for 2D...

Though, makes more sense to use the "__int32_min" intrinsic or similar,
rather than the generic C library "min()"/"max()" macros.
#define min(a, b) (((a)<(b))?(a):(b))
#define max(a, b) (((a)>(b))?(a):(b))

Though, apparently the existence of these macros in stdlib.h is
controversial (they are apparently sort of a POSIX legacy feature, and
not part of the C standard).

Mostly, due to implementation issues in BGBCC, ?: needs an actual
branch, so is ironically actually worse for performance in this case
than using an explicit if/else...

A case could be made for C11 "_Generic", but would need to implement it.

But, yeah, apart from compiler issues, in theory the min/max part could
be 3 instructions.

If done in ASM, might be something like:
// R4..R7 (x0, y0, x1, y1)
SUB R4, R6, R16 | SUB R5, R7, R17
SHAD R16, -31, R18 | SHAD R17, -31, R19
XOR R16, R18, R16 | XOR R17, R19, R17
CMPGT R17, R16
CSELT R16, R17, R18 | CSELT R17, R16, R19
SHAD R19, -1, R19
ADD R18, R19, R2

I don't have actual MIN/MAX ops though.
Though, it seems RISC-V does have them in the Bitmanip extension.

Could probably add them in-theory (maybe along with FMIN/FMAX).

I guess, if I did add FMIN/FMAX, I would technically have enough in
place to "more or less" be able to support the F/D extensions in RISC-V
mode as well at this point, which (if I re-enable the BJX2-LDOP
extension) would in-turn be enough to theoretically be able to push it
potentially op to RV64G in userland (or maybe RV64GC if I got around to
finishing the RV-C decoder).

I guess, RV64GC could be useful as this is generally what the Linux
userland is using (but, RV64IM_ZfinxZdinx, not so much...).

But, F/D could have F0..F31 be mapped to R32..R63 in RISC-V mode, ...

But, annoyingly, without the Privileged spec, nor going off and trying
to clone a bunch of SiFive's hardware level interface or similar," more
correct" CSR handling, ..., there would be little hope of running an
unmodified Linux kernel on it.

But, I guess in theory, if one has RV64GC, the kernel could be ported to
work with the unusual interrupt handling and similar. Theoretically,
there are now mechanisms in place in the design to allow the CPU to
handle interrupts directly in RISC-V mode, avoiding the need for a mixed
ISA kernel (though, very little else would be glossed over at this level).

But, would there be much point in running RV64 Linux on the BJX2
core?... Seems like it would pretty much defeat the whole point in this
case (say, vs using a CPU actually designed to run RV64 as its native ISA).

Well, and/or mostly just run much of the userland in RV64 mode and then
mimic the Linux kernel's syscalls...

MitchAlsup

unread,

Nov 30, 2023, 6:16:55 PM11/30/23

to

max() and min() are single instructions in my ISA. On the FP side, they
even get NaNs put in the proper places, too.

> Though, apparently the existence of these macros in stdlib.h is
> controversial (they are apparently sort of a POSIX legacy feature, and
> not part of the C standard).

It took Brian quite some time to recognize that crap and turn it all into
max() and min() instructions.

> Mostly, due to implementation issues in BGBCC, ?: needs an actual
> branch, so is ironically actually worse for performance in this case
> than using an explicit if/else...

My point exactly--but perhaps you should spend the effort to make your
compiler recognize these as single branch-free instructions.

> A case could be made for C11 "_Generic", but would need to implement it.

> But, yeah, apart from compiler issues, in theory the min/max part could
> be 3 instructions.

max() and min() are easy to perform in a single cycle in both int and FP

> If done in ASM, might be something like:
> // R4..R7 (x0, y0, x1, y1)
> SUB R4, R6, R16 | SUB R5, R7, R17
> SHAD R16, -31, R18 | SHAD R17, -31, R19
> XOR R16, R18, R16 | XOR R17, R19, R17
> CMPGT R17, R16
> CSELT R16, R17, R18 | CSELT R17, R16, R19
> SHAD R19, -1, R19
> ADD R18, R19, R2

And you are claiming this has an advantage over::

MAX R9,R8,R7
MIN R10,R8,R7
SRL R10,R10,#1
LEA R10,[R9,R10,#-1]

> I don't have actual MIN/MAX ops though.

perhaps you should reconsider...

> Though, it seems RISC-V does have them in the Bitmanip extension.

> Could probably add them in-theory (maybe along with FMIN/FMAX).

I use the same FU to perform Int max() and FP max()--remember once you
special case out the problems, IEEE is a linearly ordered set of numbers.

Robert Finch

unread,

Nov 30, 2023, 7:53:59 PM11/30/23

to

Q+ has min3 / max3 for minimum or maximum of three values.
But only fmin / fmax for float.

MitchAlsup

unread,

Nov 30, 2023, 9:47:22 PM11/30/23

to

Robert Finch wrote:
>>
>>> Could probably add them in-theory (maybe along with FMIN/FMAX).
>>
>> I use the same FU to perform Int max() and FP max()--remember once you
>> special case out the problems, IEEE is a linearly ordered set of numbers.

> Q+ has min3 / max3 for minimum or maximum of three values.
> But only fmin / fmax for float.

Interesting, but I did not have enough room in the 3-operand subGroup.

How do you determine the mid( x, y, z ) ?? allowing::

mx = max( x, y, z );
mn = min( x, y, z );
mi = mid( x, y, z );

??

Chris M. Thomasson

unread,

Nov 30, 2023, 9:56:29 PM11/30/23

to

Think along the lines of two n-ary points, 3-ary here:

Sorry for the pseudo-code:
____________________
// two n-ary points... 3-ary here:
p0 = (-1, 0, 0);
p1 = (1, 0, 0);

// difference and mid...
dif = p1 - p0;
mid = p0 + dif * .5;

// plot some points on the canvas...
plot_point(p0);
plot_point(mid);
plot_point(p1);
____________________

?

Actually, we have a min and a max, therefore we have a normalized
difference. .5 can be half way between min and max?

Robert Finch

unread,

Nov 30, 2023, 11:40:10 PM11/30/23

to

I had not thought to add a mid() operation. It would be fairly easy to
do but I am not sure using the opcode space would be worth it. There is
about 20 opcodes available. 3-operands pretty much use up a 32-bit
opcode as register specs are six bits. There is only one bit left over
for sub-grouping. There is MUX, CMOVxx, and PTRDIF all with 3-operands too.

if (x > y && x < z)
mi = x;
else if (y > x && y < z)
mi = y;
else
mi = z;

Chris M. Thomasson

unread,

Dec 1, 2023, 12:06:36 AM12/1/23

to

Imvvvho, an interesting question is how to get a equipotential in 3d
space when there are an infinite number of them... One of my thoughts is
to gain three points in a field line, then use the surface normal of the
triangle for an equipotential.

BGB

unread,

Dec 1, 2023, 12:16:14 AM12/1/23

to

OK.

>> Mostly, due to implementation issues in BGBCC, ?: needs an actual
>> branch, so is ironically actually worse for performance in this case
>> than using an explicit if/else...
>
> My point exactly--but perhaps you should spend the effort to make your
> compiler recognize these as single branch-free instructions.
>

I guess, it could be possible in theory to use pattern recognition on
the ASTs to turn this into a call to an intrinsic. Might need to look
into this.

For the most part, hadn't really done much of any of this sort of
pattern recognition.

>> A case could be made for C11 "_Generic", but would need to implement it.
>
>
>> But, yeah, apart from compiler issues, in theory the min/max part
>> could be 3 instructions.
>
> max() and min() are easy to perform in a single cycle in both int and FP
>

Theoretically, it can be done in two instructions, which is still faster
than what ?: generates currently.

Say:
z=(x<y)?x:y;

Compiling to something like, say:
CMPGT R9, R8
BF .L0
MOV R9, R10
BRA .L1
.L0:
MOV R8, R10
BRA .L1
.L1:
MOV R10, R11

Whereas, the normal "if()" is smart enough to turn it into predication,
but ?: needs to use a temporary and doesn't have a predication special case.

Well, and also predication needs to be handled partly in the frontend
(in the generated RIL3 bytecode), but is generally omitted if the output
is going to a RIL object (so, if compiling for a static library, it
always falls back to actual branches).

Partly this is in turn because my 3AC backend isn't really smart enough
to pull this off.

Putting effort into it wasn't really a high priority though, as ?: in C
seems to be relatively infrequently used.

But, as-is, one will get a 2-op sequence if they use the intrinsic.

>> If done in ASM, might be something like:
>>    // R4..R7 (x0, y0, x1, y1)
>>    SUB R4, R6, R16   | SUB R5, R7, R17
>>    SHAD R16, -31, R18 | SHAD R17, -31, R19
>>    XOR R16, R18, R16 | XOR R17, R19, R17
>>    CMPGT R17, R16
>>    CSELT R16, R17, R18 | CSELT R17, R16, R19
>>    SHAD R19, -1, R19
>>    ADD R18, R19, R2
>
> And you are claiming this has an advantage over::
>
>     MAX R9,R8,R7
>     MIN R10,R8,R7
>     SRL R10,R10,#1
>     LEA R10,[R9,R10,#-1]
>

At least in so far as it doesn't depend on features that don't currently
exist.

>
>> I don't have actual MIN/MAX ops though.
>
> perhaps you should reconsider...
>

I didn't have them previously as I could express them in a 2-op
sequence, which seemed "good enough".

Though, this was before some more recent changes which (in the name of
improving FPGA timing) effectively turned CMPxx and similar into 2 cycle
ops.

Though, MIN/MAX ops could possibly allow:
MIN Imm10u, Rn // Rn=(Rn<Imm)?Rn:Imm

Which could potentially allow implementing "__int32_clamp()" in 2 or 3
instructions, which is an improvement over needing 6 instructions (with
pretty much all of these instructions ending up with a 2-cycle latency).

>> Though, it seems RISC-V does have them in the Bitmanip extension.
>
>> Could probably add them in-theory (maybe along with FMIN/FMAX).
>
> I use the same FU to perform Int max() and FP max()--remember once you
> special case out the problems, IEEE is a linearly ordered set of numbers.

Yeah.

I was doing both CMP and FCMP cases via the ALU.
This is why likely both could be added at the same time, since the logic
for adding one would likely apply to the others.

MitchAlsup

unread,

Dec 1, 2023, 1:32:40 PM12/1/23

to

About 93% of ICMP is used in FCMP.

Quadibloc

unread,

Dec 1, 2023, 8:28:18 PM12/1/23

to

On Fri, 24 Nov 2023 22:53:57 +0000, Quadibloc wrote:

> While this is an "improvement", I do realize that this, along with even
> supporting multiple memory widths, even if I have found ways that it can
> technically be done, is sheer insanity.
>
> I do have a rationale for going there. It's simple enough. It comes from
> the x86.
>
> Since the irresistible force of availability of applications means that
> we are doomed to always have just *one* dominant ISA...
>
> then, to minimize the terrible loss that this entails, that one dominant
> ISA should be maximally flexible. To approach, as closely as possible,
> the "good old days" where the PDP-10 and the CDC 6600 coexisted
> alongside the PDP-11 and the IBM System/360.
>
> This is the driving force behind my insane quest to design an ISA for a
> CPU which has the ability to do something which no one seems to want.

And for some people, the KDF 9, with a 48-bit word, was something to
sing about - here, to the tune of "The British Grenadiers":

Some talk of I.B.M.s and some of C.D.C.s,
Of Honeywells and Burroughses, and such great names as these,
But of all the world's computers, there's none that is so fine,
As the English Electric Leo Marconi Kay - Dee - Eff Nine!

Some talk of thirty-two bit words, and some of twenty-four,
Of disks and drums and datacells, and megabytes of core,
But for those who've written usercode there's nothing can outshine,
The subroutine jump nesting store of the Kay - Dee - Eff Nine!

It's a good thing I had saved this in a text file on my computer.

A web search, including a search on Google Groups, was no longer
able to turn this up for me. Also, the KDF 9 was lauded in another
posting for reasons more directly related to this scheme of mine:

(Quoting a brochure for the computer)
The KDF 9 word has 48 bits...

It may be used as...

Eight 6-bit Alphanumeric Characters
One 48-bit Fixed-Point Number
Two 24-bit (Half length) Fixed-Point Numbers
Half of a 96-bit (Double length) Fixed-Point Number
One 48-Bit Floating-Point Number
Two 24-bit (Half length) Floating-Point Numbers
Half of a 96-Bit (Double length) Floating-Point Number
Three 16-Bit (Fixed-Point) Integers
Six 8-Bit Instruction Syllables
(end quote)

An instruction was 1, 2, or 3 syllables; an address was 15 bits.
O, memory! We shall not see its like again.

John Savard

Terje Mathisen

unread,

Dec 2, 2023, 1:24:21 PM12/2/23

to

Possibly due to wanting to not use more than one branch predictor entry?

Maybe because ady rarely was larger than adx?

Besides, your version has the two terms swapped. :-)

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Terje Mathisen

unread,

Dec 2, 2023, 1:34:34 PM12/2/23

to

Robert Finch wrote:
>
> Q+ has min3 / max3 for minimum or maximum of three values.
> But only fmin / fmax for float.
>

For symmetry I would assume/like a median3 as well, so that min3,
median3, max3 would return a sorted list?

Chris M. Thomasson

unread,

Dec 2, 2023, 2:35:51 PM12/2/23

to

On 12/2/2023 10:24 AM, Terje Mathisen wrote:
> MitchAlsup wrote:
>> BGB wrote:
>>
>>> Have noted that these old games did have a nifty trick for
>>> calculating approximate distance, say:
>>>    dx=x0-x1;
>>>    dy=y0-y1;
>>>    adx=dx^(dx>>31);
>>>    ady=dy^(dy>>31);
>>>    if(ady>adx)
>>>      { t=adx; adx=ady; ady=t; }         //common
>>> // { adx^=ady; ady^=adx; adx^=ady; } //sometimes
>>>    d=adx+(ady>>1);
>>
>> Why not::
>>
>>      dx=x0-x1;
>>      dy=y0-y1;
>>      adx=dx^(dx>>31);
>>      ady=dy^(dy>>31);
>>      if(ady>adx)
>>          d=adx+(ady>>1);
>>      else         d=ady+(adx>>1);
>
> Possibly due to wanting to not use more than one branch predictor entry?
>
> Maybe because ady rarely was larger than adx?
>
> Besides, your version has the two terms swapped. :-)

Check this out, I think it might be relevant, Dog Leg Hypotenuse:

https://forums.parallax.com/discussion/147522/dog-leg-hypotenuse-approximation

MitchAlsup

unread,

Dec 2, 2023, 3:47:21 PM12/2/23

to

Terje Mathisen wrote:

> Robert Finch wrote:
>>
>> Q+ has min3 / max3 for minimum or maximum of three values.
>> But only fmin / fmax for float.
>>
> For symmetry I would assume/like a median3 as well, so that min3,
> median3, max3 would return a sorted list?

That was my point 8-10 posts ago.

> Terje

MitchAlsup

unread,

Dec 2, 2023, 3:47:22 PM12/2/23

to

Terje Mathisen wrote:

> MitchAlsup wrote:
>> BGB wrote:
>>
>>> Have noted that these old games did have a nifty trick for calculating
>>> approximate distance, say:
>>>    dx=x0-x1;
>>>    dy=y0-y1;
>>>    adx=dx^(dx>>31);
>>>    ady=dy^(dy>>31);
>>>    if(ady>adx)
>>>      { t=adx; adx=ady; ady=t; }         //common
>>> // { adx^=ady; ady^=adx; adx^=ady; } //sometimes
>>>    d=adx+(ady>>1);
>>
>> Why not::
>>
>>     dx=x0-x1;
>>     dy=y0-y1;
>>     adx=dx^(dx>>31);
>>     ady=dy^(dy>>31);
>>     if(ady>adx)
>>         d=adx+(ady>>1);
>>     else         d=ady+(adx>>1);

> Possibly due to wanting to not use more than one branch predictor entry?

> Maybe because ady rarely was larger than adx?

> Besides, your version has the two terms swapped. :-)

Just mimicking the original.

> Terje

BGB

unread,

Dec 2, 2023, 6:37:47 PM12/2/23

to

These two versions will give different results, since the "if(ady>adx)"
was the "when to switch values" check, rather than the "when to run the
calculation as-is".

But, yeah, I didn't notice this at first.

In my compiler, both will likely be translated into predicated sequences
(though, the "fastest possible" would be to use conditional select, or
MIN/MAX if it were available, *).

*: I have now added MIN/MAX as an optional feature, along with logic for
RISC-V's 'FSGNJx' ops, mostly for sake of filling in a few holes needed
to add support for RISC-V's F and D extensions in RV64 mode (which can
now be extended, in theory, out to RV64IMAFD, at least supporting all
the userland parts of the ISA).

This is along with also needing to add logic for a single-precision FDIV
and similar as well. Though, single precision FSQRT is, at-the-moment,
just using the crude approximate version.

MIN/MAX and FMIN/FMAX have been added to BJX2 as well, though there is
not currently any plan to add FSGNJx or similar to BJX2 (these ops don't
really seem worth the encoding space).

>> Terje

Terje Mathisen

unread,

Dec 3, 2023, 9:29:19 AM12/3/23

to

Chris M. Thomasson wrote:
> Check this out, I think it might be relevant, Dog Leg Hypotenuse:
>
> https://forums.parallax.com/discussion/147522/dog-leg-hypotenuse-approximation
>
>

Quoting from that link:

> However, the best solution I found with just shifts and adds was this:
>
> hypot ~= hi + (lo + max( 0, lo+lo+lo-hi ) ) / 8

with a max error of just 2.8%!

That is very good, I get a result in 5 cycles on a cpu with min/max
opcodes, after a rewrite to:

hypot = (hi*8 + lo + max(0,lo*3-hi))>>3;

hi = max(a,b)
lo = min(a,b)

t = lo*3 ;; LEA, single cycle
t2 = hi*8 + lo ;; ditto

t = max(0,t)

t += t2

t >>= 3

Terje Mathisen

unread,

Dec 3, 2023, 9:36:45 AM12/3/23

to

I saw that a bit later: I've been on a sailing trip in the Caribbean
with very sporadic network connections, basically just when in reach of
a French island.

MitchAlsup

unread,

Dec 3, 2023, 12:02:27 PM12/3/23

to

Terje Mathisen wrote:

> MitchAlsup wrote:
>> Terje Mathisen wrote:
>>
>>> Robert Finch wrote:
>>>>
>>>> Q+ has min3 / max3 for minimum or maximum of three values.
>>>> But only fmin / fmax for float.
>>>>
>>> For symmetry I would assume/like a median3 as well, so that min3,
>>> median3, max3 would return a sorted list?
>>
>> That was my point 8-10 posts ago.

> I saw that a bit later: I've been on a sailing trip in the Caribbean
> with very sporadic network connections, basically just when in reach of
> a French island.

Must be a tough life............

> Terje

Chris M. Thomasson

unread,

Dec 3, 2023, 3:23:12 PM12/3/23

to

Fwiw, the Sierra people are sailing around as well. Ken and Roberta
Williams. :^) The retired life. ;^) Just kidding. They are hard at work
on some games. One game in particular. Nice to see them working again:

https://www.colossalcave3d.com

Mix colossalcave3d with Half Life! ;^)

https://en.wikipedia.org/wiki/Half-Life_(video_game)

;^)

BGB

unread,

Dec 3, 2023, 5:25:10 PM12/3/23

to

On 12/3/2023 8:29 AM, Terje Mathisen wrote:
> Chris M. Thomasson wrote:
>> Check this out, I think it might be relevant, Dog Leg Hypotenuse:
>>
>> https://forums.parallax.com/discussion/147522/dog-leg-hypotenuse-approximation
>>
> Quoting from that link:
>
>> However, the best solution I found with just shifts and adds was this:
>>
>> hypot ~= hi + (lo + max( 0, lo+lo+lo-hi ) ) / 8
>
> with a max error of just 2.8%!
>
> That is very good, I get a result in 5 cycles on a cpu with min/max
> opcodes, after a rewrite to:
>
> hypot = (hi*8 + lo + max(0,lo*3-hi))>>3;
>
>
> hi = max(a,b)
> lo = min(a,b)
>
> t = lo*3 ;; LEA, single cycle
> t2 = hi*8 + lo ;; ditto
>
> t = max(0,t)
>
> t += t2
>
> t >>= 3
>

Yeah, seems like a good option if one needs a better approximation than
the original version.

Which, I guess exists because:
d=adx+ady;

Is too crude even for a lot of stuff where precision doesn't really
matter (say, excluding rendering objects that are too far away from the
camera, or ranking objects by relative distance, etc).

One use-case for a 4D distance was mostly for trying to rebuild RGB555
-> Indexed lookup tables for a color palette, where I was doing a
theoretical distance of:
dr=cr1-cr0;
dg=cg1-cg0;
db=cb1-cb0;
cy0=(8*cg0+5*cr0+3*cb0)/16;
cy1=(8*cg1+5*cr1+3*cb1)/16;
dy=cy1-cy0;
d=sqrt_apx(dr*dr+dg*dg+db*db+2*dy*dy);

Where, it is better to try to preserve the luminance of a pixel than its
exact color, so luma is handled specially as an additional parameter, in
addition to just the R/G/B distances.

Where, say, in this case sqrt_apx might be, say:
v>>((31-__int_clz(v))/2)

Which, granted, does make the assumption of having a CLZ instruction.
Though, as a downside, for fixed-point, requires more complicated logic
to deal with the location of the decimal point, and is a lot more
"noisy" than the other strategy (since this square-root approximation is
discontinuous).

But, the tradeoff is more that the multiplies and square-root
approximation may be faster than the cost of sorting out the 4 distances.

Though, later ended up instead pre-building the lookup table in the form
of an 8-bit BMP image which was added to the resource section of the
shell binary, which was a lot faster than trying to rebuild it
dynamically during GUI start-up. Couldn't come up with a good algo to
rebuild the RGB555 -> 8-bit lookup tables in a time-frame that wasn't
annoyingly slow at 50MHz. Granted, if one does both a primary and
secondary match for dithering, this results in a 66K BMP image, which is
annoyingly large for something which will only be accessed once, when
starting up the GUI mode.

Half tempting to consider defining an LZ compressed 8-bit BMP variant
for this use-case.

Though, I am left to suspect that part of the appeal of 6x6x6 216 color
may have been that it would have been computationally cheaper to rebuild
such an RGB555 lookup table for this than it is for a more free-form
color palette.

Well, and also things like KD-Tree based palette optimization (followed
by rebuilding the RGB555 lookup table) is too slow in this case to be
considered for real-time use. I suspect 90s era OS's did not use
dynamically optimized palettes though.

...

> Terje
>

BGB

unread,

Dec 3, 2023, 6:34:20 PM12/3/23

to

Half-Life was a game I played a lot back when I was in high-school.

Hadn't really played any of the other Sierra games at the time, and I
guess all this was around the time that Sierra imploded.

Now, several decades later, I am right at the end of my 30s...
Time marches on I guess...

Had recently noted while compiling Doom in both RISC-V and BJX2:
RV64IM: "-ffunction-sections" "-Wl,-gc-sections"
RV64 can beat out BJX2 XG2 in terms of smaller ".text" with "-Os".
With "-O2" or "-O3", RV64 generates a bigger ".text" section.
This is with both using the same C library implementation.
Without "-ffunction-sections", RV64 still loses with "-Os"
The actual size difference of the binaries is a fair bit larger.
It seems that ELF has a lot more space overhead due to metadata.

Both cases seem to be significantly smaller than the x86-64 versions.
WSL: Has smaller ".text" but bigger ELF;
Win64: Bigger ".text" but smaller EXE;
Comparably, PE/COFF has a lot less metadata.

There is around a 2x size difference between WSL and Win64, with the
Win64 case having around 3x the ".text" size of RV64 and BJX2 (~ 340K in
this case).

BJX2 in Baseline mode with "/Os" would still beat RV64IM "-Os", but this
is not a fair comparison (would need to compare against RV64GC to be fair).

I guess would need to evaluate whether it is more ISA related or
compiler related that it is not reliably beating RV64IM in terms of code
size.

...

Terje Mathisen

unread,

Dec 5, 2023, 12:40:14 PM12/5/23

to

Indeed.

Our last such trip was in 2018 (Panama City to Jamaica on the Royal
Clipper), this trip was supposed to take place in 2020 but then Covid
happened.

Back in Oslo now, and I just got my very first positive Covid test, so I
must have been exposed to the virus during the trip, thankfully I didn't
really get sick until we got to Oslo Airport in -17C.

Terje
PS. These trips are for orienteering, we ran 10-11 races over 16 days.
I've uploaded headcam videos to youtube from all of them. On those you
can see both my forward view and a map snippet with moving marker which
shows where I'm running.

Paul A. Clayton

unread,

Dec 5, 2023, 5:53:34 PM12/5/23

to

On 11/27/23 8:03 PM, MitchAlsup wrote:
> Anton Ertl wrote:
[snip]
>> However, looking at more recent architectures, the RISC-V M
>> extension (which is part of RV64G and RV32G, i.e., a
>> standard extension) has not just multiply instructions (MUL,
>> MULH, MULHU, MULHSU, MULW), but also integer divide >> instructions: DIV, DIVU, REM, REMU, DIVW, DIVUW, REMW, and
>> REMUW.
>
> All of which are possible in My 66000 using operand sign
> control, S-bit, and and CARRY when you want 64×64->128 or
> 128/64->{64 quotient, 64 remainder}

What about multiply-high-no-carry-in? There might be cases
where such could be used for division with a reused
(possibly compile-time constant) divisor even without the
carry-in. (Producing the carry-in by a low multiply seems
to have significant overhead in energy, throughput, and/or
latency.) A small immediate operand might be shifted into
the most significant position to facilitate division by a
small constant.

A division-specific instruction (divide-using-reciprocal)
could do more sophisticated manipulation of an immediate and
perhaps provide an instruction a compiler could use with
less knowledge of the dividend. Yet there might be uses for
a simple multiply-high-no-carry-in.

Anton Ertl

unread,

Dec 7, 2023, 8:47:41 AM12/7/23

to

"Paul A. Clayton" <paaron...@gmail.com> writes:
>What about multiply-high-no-carry-in? There might be cases
>where such could be used for division with a reused
>(possibly compile-time constant) divisor even without the
>carry-in. (Producing the carry-in by a low multiply seems
>to have significant overhead in energy, throughput, and/or
>latency.)

What's that operation supposed to produce? How does it differ from,
e.g., RISC-V MULH/MULHU/MULHSU? And if there is a difference, what
makes you think that it still works for division?

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

MitchAlsup

unread,

Dec 7, 2023, 1:56:51 PM12/7/23

to

Paul A. Clayton wrote:

> On 11/27/23 8:03 PM, MitchAlsup wrote:
>> Anton Ertl wrote:
> [snip]
>>> However, looking at more recent architectures, the RISC-V M
>>> extension (which is part of RV64G and RV32G, i.e., a
>>> standard extension) has not just multiply instructions (MUL,
>>> MULH, MULHU, MULHSU, MULW), but also integer divide >> instructions: DIV, DIVU, REM, REMU, DIVW, DIVUW, REMW, and
>>> REMUW.
>>
>> All of which are possible in My 66000 using operand sign
>> control, S-bit, and and CARRY when you want 64×64->128 or
>> 128/64->{64 quotient, 64 remainder}

> What about multiply-high-no-carry-in?

CARRY R9,{{O}} // carry applies to next inst
// no carry in yes carry out
MUL R10,Rx,Ry // {R9,R10} contain result

BGB

unread,

Dec 7, 2023, 5:44:27 PM12/7/23

to

On 12/7/2023 7:39 AM, Anton Ertl wrote:
> "Paul A. Clayton" <paaron...@gmail.com> writes:
>> What about multiply-high-no-carry-in? There might be cases
>> where such could be used for division with a reused
>> (possibly compile-time constant) divisor even without the
>> carry-in. (Producing the carry-in by a low multiply seems
>> to have significant overhead in energy, throughput, and/or
>> latency.)
>
> What's that operation supposed to produce? How does it differ from,
> e.g., RISC-V MULH/MULHU/MULHSU? And if there is a difference, what
> makes you think that it still works for division?
>

For division by reciprocal, one may need to be able to adjust the right
shift (say, 32..34 bits), and add a bias to the low-order bits if the
input is negative (say, adding "Rcp-1").

Without the variable shift part, one will not get reliable results for
some divisors (IIRC, 7, 17, 31, etc).

Without a bias, it will round towards negative infinity rather than
towards zero (standard for division is to always round towards zero).

As-is, signed division logic (say, "y=x/7;") would be something like:
MOV 0x49249249, R1
DMULS.L R4, R1, R2
ADD -1, R1
CMPGE 0, R4
ADD?F R2, R1, R2
SHAD.Q R2, -33, R2

Or, if one used jumbo encodings:
DMULS.L R4, 0x49249249, R2
CMPGE 0, R4 // SR.T=(R4>=0)
ADD?F R2, 0x49249248, R2
SHAD.Q R2, -33, R2

So, sadly, simply taking the high result isn't quite sufficient.

In my case, I could almost add a dedicated instruction for this, but I
don't expect it would be used often enough to justify doing so.

Internally, it would leverage logic that already exists for DMACS.L and
FAZDIV (which basically already does this), but the existence of this
instruction would effectively mandate that FAZDIV exists...

But, if the compiler already knows FAZDIV exists, then it can do:
MOV 7, R1
DIVS.L R4, R1, R2
And still get a roughly 3 cycle integer division... (At least, if R1 is
1 to 63 ...).

(Or, if DIV exists but FAZDIV does not, it still works but takes 36
cycles instead...).

Or, the basic case will always work, and would not depend on the
existence of the FAZDIV mechanism.

> - anton

BGB

unread,

Dec 7, 2023, 5:54:41 PM12/7/23

to

Self-correction, bias is wrong here, should be something more like:
ADD?F R2, 0x1B6DB6DB6, R2

I was writing out the algorithm from memory, so I may be prone to error...

I am still not entirely sure, but in any case, a bias value does need to
be added for negative inputs...

MitchAlsup

unread,

Dec 7, 2023, 9:46:30 PM12/7/23

to

BGB wrote:

> On 12/7/2023 7:39 AM, Anton Ertl wrote:
>> "Paul A. Clayton" <paaron...@gmail.com> writes:
>>> What about multiply-high-no-carry-in? There might be cases
>>> where such could be used for division with a reused
>>> (possibly compile-time constant) divisor even without the
>>> carry-in. (Producing the carry-in by a low multiply seems
>>> to have significant overhead in energy, throughput, and/or
>>> latency.)
>>
>> What's that operation supposed to produce? How does it differ from,
>> e.g., RISC-V MULH/MULHU/MULHSU? And if there is a difference, what
>> makes you think that it still works for division?
>>

> For division by reciprocal, one may need to be able to adjust the right
> shift (say, 32..34 bits), and add a bias to the low-order bits if the
> input is negative (say, adding "Rcp-1").

> Without the variable shift part, one will not get reliable results for
> some divisors (IIRC, 7, 17, 31, etc).

> Without a bias, it will round towards negative infinity rather than
> towards zero (standard for division is to always round towards zero).

No:: C99 division is so specified, but there are 3 other (semi) popular
definitions.

> As-is, signed division logic (say, "y=x/7;") would be something like:
> MOV 0x49249249, R1
> DMULS.L R4, R1, R2
> ADD -1, R1
> CMPGE 0, R4
> ADD?F R2, R1, R2
> SHAD.Q R2, -33, R2

> Or, if one used jumbo encodings:
> DMULS.L R4, 0x49249249, R2
> CMPGE 0, R4 // SR.T=(R4>=0)
> ADD?F R2, 0x49249248, R2
> SHAD.Q R2, -33, R2

In 1975 I wrote PDP-11/40 microcode that performed serial DIV algorithm,
and had the opportunity in the loop to short circuit if all the bits of
the denominator had been consumed. {{CMU had an add on board with student
programmable microcode.}}

Thus::
DIV R9,R5,#7

should not take more than 5 cycles. Nor should
Thusly::
DIV R9,R5,#-7

!!

> So, sadly, simply taking the high result isn't quite sufficient.

It is more practical to make HW DIV fast, than to have everyone {and his
brother} invent new ways to avoid dividing.

BGB

unread,

Dec 7, 2023, 11:16:01 PM12/7/23

to

On 12/7/2023 8:44 PM, MitchAlsup wrote:
> BGB wrote:
>
>> On 12/7/2023 7:39 AM, Anton Ertl wrote:
>>> "Paul A. Clayton" <paaron...@gmail.com> writes:
>>>> What about multiply-high-no-carry-in? There might be cases
>>>> where such could be used for division with a reused
>>>> (possibly compile-time constant) divisor even without the
>>>> carry-in. (Producing the carry-in by a low multiply seems
>>>> to have significant overhead in energy, throughput, and/or
>>>> latency.)
>>>
>>> What's that operation supposed to produce? How does it differ from,
>>> e.g., RISC-V MULH/MULHU/MULHSU? And if there is a difference, what
>>> makes you think that it still works for division?
>>>
>
>> For division by reciprocal, one may need to be able to adjust the
>> right shift (say, 32..34 bits), and add a bias to the low-order bits
>> if the input is negative (say, adding "Rcp-1").
>
>> Without the variable shift part, one will not get reliable results for
>> some divisors (IIRC, 7, 17, 31, etc).
>
>> Without a bias, it will round towards negative infinity rather than
>> towards zero (standard for division is to always round towards zero).
>
> No:: C99 division is so specified, but there are 3 other (semi) popular
> definitions.
>

Existing software will break if it isn't round-towards-zero.

Though, yeah, if one doesn't care about this, it is possible to
eliminate the biasing step.

>
>> As-is, signed division logic (say, "y=x/7;") would be something like:
>>    MOV      0x49249249, R1
>>    DMULS.L R4, R1, R2
>>    ADD      -1, R1
>>    CMPGE    0, R4
>>    ADD?F    R2, R1, R2
>>    SHAD.Q   R2, -33, R2
>
>> Or, if one used jumbo encodings:
>>    DMULS.L R4, 0x49249249, R2
>>    CMPGE    0, R4               // SR.T=(R4>=0)
>>    ADD?F    R2, 0x49249248, R2
>>    SHAD.Q   R2, -33, R2
>
> In 1975 I wrote PDP-11/40 microcode that performed serial DIV algorithm,
> and had the opportunity in the loop to short circuit if all the bits of
> the denominator had been consumed. {{CMU had an add on board with student
> programmable microcode.}}
>
> Thus::
>       DIV    R9,R5,#7
>
> should not take more than 5 cycles. Nor should
> Thusly::
>       DIV    R9,R5,#-7
>
> !!
>

Timings for the example, fixing bias:
DMULS.L R4, 0x49249249, R2 //2c
CMPGE 0, R4 //2c
ADD?F 0x1B6DB6DB6, R2 //2c
SHAD.Q R2, -33, R2 //1c
Latency, 7 cycles, 3 cycle interlock penalty.

>> So, sadly, simply taking the high result isn't quite sufficient.
>
> It is more practical to make HW DIV fast, than to have everyone {and his
> brother} invent new ways to avoid dividing.

Or, at least make it semi-fast in the compiler.

Then again, I go skimming though my compiler output and encounter this
nugget:
MOV.Q (SP, 608), RQ4
BSR __va64_arg_l
MOV R2, RQ12
MOV RQ12, RQ8
// pdpc201/btshx_supa.c:1597
MOV RQ8, RQ7
EXTU.L RQ7
MOV RQ7, RQ13
MOV RQ10, RQ4
MOV RQ13, RD5
BSR tk_sprint_hex
MOV RQ2, RQ10
// pdpc201/btshx_supa.c:1598

For:
s1=va_arg(lst, char *);
ct=tk_sprint_hex(ct, (u32)s1);

Things check out as per the internal rules, but this is a stupid number
of MOV's...

Say, if a person were hand-optimizing this, they could do, say:
MOV.Q (SP, 608), RQ4
BSR __va64_arg_l
EXTU.L R2, RQ5
// pdpc201/btshx_supa.c:1597
MOV RQ10, RQ4
BSR tk_sprint_hex
MOV RQ2, RQ10
// pdpc201/btshx_supa.c:1598

But, alas, this is still the sort of stuff I am dealing with from my
compiler...

Then notices:
The compiler was using "stale" logic here for type conversion:
Load variable into a scratch register;
Do the operation on scratch register;
Store register back to the variable.
Rather than the newer pattern:
Fetch source and destination variables as registers;
Do conversion between registers.
The code itself was stale, apparently not having gotten the memo that
pointers are 64-bits now....

But, yeah, fixing up the type conversion case should at least eliminate
a few of the MOV's (still not good, but should be "better").

...

Also found and fixed another bug where apparently the logic trying to
auto-cast a value to a _Bool value would result in it being converted to
_Bool twice in a row, say:
CMPQEQ 0, R8
MOVNT R9
CMPEQ 0, R9
MOVNT R10

...

But, I guess, the stuff one finds if they go skimming through an ASM
output dump...

...

Anton Ertl

unread,

Dec 8, 2023, 10:34:59 AM12/8/23

to

BGB <cr8...@gmail.com> writes:
>On 12/7/2023 7:39 AM, Anton Ertl wrote:
>> "Paul A. Clayton" <paaron...@gmail.com> writes:
>>> What about multiply-high-no-carry-in? There might be cases
>>> where such could be used for division with a reused
>>> (possibly compile-time constant) divisor even without the
>>> carry-in. (Producing the carry-in by a low multiply seems
>>> to have significant overhead in energy, throughput, and/or
>>> latency.)
>>
>> What's that operation supposed to produce? How does it differ from,
>> e.g., RISC-V MULH/MULHU/MULHSU? And if there is a difference, what
>> makes you think that it still works for division?
>>
>
>For division by reciprocal, one may need to be able to adjust the right
>shift (say, 32..34 bits), and add a bias to the low-order bits if the
>input is negative (say, adding "Rcp-1").
>
>Without the variable shift part, one will not get reliable results for
>some divisors (IIRC, 7, 17, 31, etc).
>
>Without a bias, it will round towards negative infinity rather than
>towards zero (standard for division is to always round towards zero).

Nope. E.g., Forth-83 standardizes floored division; in Forth-94 and
Forth-2012, it's implementation-defined. More generally,
<https://en.wikipedia.org/wiki/Modulo_operation> lists lots of
languages that perform floored or Euclidean modulo, and I would hope
that their division operation agrees with the modulo operation.

As for division by multiplying with the reciprocal, I have written a
paper on it [ertl19kps], but the most important sentence of that paper
is:

|If you read only one paper about the topic, my recommendation is
|Robison’s [Rob05].

My take on this improves the latency of the unsigned case on some CPUs
by avoiding the shift and bias, but it costs an additional
multiplication.

>FAZDIV

What is FAZDIV?

And you failed to answer what multiply-high-no-carry-in does.

@InProceedings{ertl19kps,
author = {M. Anton Ertl},
title = {Integer Division by Multiplying with the
Double-Width Reciprocal},
crossref = {kps19},
pages = {75--84},
url = {http://www.complang.tuwien.ac.at/papers/ertl19kps.pdf},
url-slides = {http://www.complang.tuwien.ac.at/papers/ertl19kps-slides.pdf},
abstract = {Earlier work on integer division by multiplying with
the reciprocal has focused on multiplying with a
single-width reciprocal, combined with a correction
and followed by a shift. The present work explores
using a double-width reciprocal to allow getting rid
of the correction and shift.}
}

@Proceedings{kps19,
title = {20. Kolloquium Programmiersprachen und Grundlagen
der Programmierung (KPS)},
booktitle = {20. Kolloquium Programmiersprachen und Grundlagen
der Programmierung (KPS)},
year = {2019},
key = {kps19},
editor = {Martin Pl\"umicke and Fayez Abu Alia},
url = {https://www.hb.dhbw-stuttgart.de/kps2019/kps2019_Tagungsband.pdf}
}

@InProceedings{robison05,
author = "Arch D. Robison",
title = "{$N$}-Bit Unsigned Division Via {$N$}-Bit
Multiply-Add",
OPTeditor = "Paolo Montuschi and Eric (Eric Mark) Schwarz",
booktitle = "{Proceedings of the 17th IEEE Symposium on Computer
Arithmetic (ARITH-17)}",
publisher = "IEEE Computer Society Press",
ISBN = "0-7695-2366-8",
ISBN-13 = "978-0-7695-2366-8",
year = "2005",
bibdate = "Wed Jun 22 07:02:55 2005",
bibsource = "http://www.math.utah.edu/pub/tex/bib/fparith.bib",
URL = "http://www.acsel-lab.com/arithmetic/arith17/papers/ARITH17_Robison.pdf",
abstract = "Integer division on modern processors is expensive
compared to multiplication. Previous algorithms for
performing unsigned division by an invariant divisor,
via reciprocal approximation, suffer in the worst case
from a common requirement for $ n + 1 $ bit
multiplication, which typically must be synthesized
from $n$-bit multiplication and extra arithmetic
operations. This paper presents, and proves, a hybrid
of previous algorithms that replaces $ n + 1 $ bit
multiplication with a single fused multiply-add
operation on $n$-bit operands, thus reducing any
$n$-bit unsigned division to the upper $n$ bits of a
multiply-add, followed by a single right shift. An
additional benefit is that the prerequisite
calculations are simple and fast. On the Itanium 2
processor, the technique is advantageous for as few as
two quotients that share a common run-time divisor.",
acknowledgement = "Nelson H. F. Beebe, University of Utah, Department
of Mathematics, 110 LCB, 155 S 1400 E RM 233, Salt Lake
City, UT 84112-0090, USA, Tel: +1 801 581 5254, FAX: +1
801 581 4148, e-mail: \path|be...@math.utah.edu|,
\path|be...@acm.org|, \path|be...@computer.org|
(Internet), URL:
\path|http://www.math.utah.edu/~beebe/|",
keywords = "ARITH-17",
pagecount = "9",

BGB

unread,

Dec 8, 2023, 12:54:11 PM12/8/23

to

OK, how about:
Languages like C, Java, C#, etc, expect division to always round towards
zero.

>> FAZDIV
>
> What is FAZDIV?
>

It was the name I came up with for a CPU feature.

But, basically:
There is a "SlowMulDiv" unit that implements Shift-Add /
Shift-and-Subtract, but isn't very fast.

There is a 32-bit integer multiplier (that can do integer
multiply-accumulate), which is faster.

I added a case where the integer multiplier can detect a divide that it
can handle itself, and then asserts a signal to the SlowMulDiv unit that
it should treat this one as a NOP. The multiplier then builds a
reciprocal and similar using several lookup tables, feeds this through
the multiplier, and processes the result.

In the current implementation though, it only deals with numbers between
1 and 63. Everything else is left to the SlowMulDiv unit.

The logic in the main pipeline sees that the multiplier had handled this
case instead, and uses the multiplier's output rather than the dividers.

But, used the term "FAZDIV" which more or less just means "Fast Divide"
(also based on the pattern that 'st' followed quickly/immediately by a
constant sound can get mutated into a 'Z' sound, at least in some US
English dialects).

The actual difference it makes is fairly small though, as while it is at
least fast, it ended up being used relatively infrequently as a lot of
code tends to fairly aggressively avoid integer divide (and the compiler
deals with divide-by-constant cases itself).

> And you failed to answer what multiply-high-no-carry-in does.
>

I am not sure, I didn't come up with that one.

I can imagine though what a "multiply instruction intended specifically
to perform division via multiply by reciprocal" would do.

But, have uncertainty as to whether it would be a good idea to expose
the mechanism directly in an ISA.

If it existed, its main use case would likely be shaving a few clock
cycles off the normal divide by constant case.

Anton Ertl

unread,

Dec 9, 2023, 3:32:45 AM12/9/23

to

BGB <cr8...@gmail.com> writes:
>On 12/8/2023 9:20 AM, Anton Ertl wrote:
>> BGB <cr8...@gmail.com> writes:
>>> Without a bias, it will round towards negative infinity rather than
>>> towards zero (standard for division is to always round towards zero).
>>
>> Nope. E.g., Forth-83 standardizes floored division; in Forth-94 and
>> Forth-2012, it's implementation-defined. More generally,
>> <https://en.wikipedia.org/wiki/Modulo_operation> lists lots of
>> languages that perform floored or Euclidean modulo, and I would hope
>> that their division operation agrees with the modulo operation.

...

>OK, how about:
>Languages like C, Java, C#, etc, expect division to always round towards
>zero.

Yes, there are programming languages that standardize truncated (aka
symmetric) division. BTW, C89 is not one of them, C99 ff. are. There
are also programming languages that standardize other kinds of
division (or at least modulo), or fail to standardize the behaviour
fully when one or both of the operands is negative.

Paul A. Clayton

unread,

Dec 9, 2023, 1:33:45 PM12/9/23

to

On 12/7/23 8:39 AM, Anton Ertl wrote:
> "Paul A. Clayton" <paaron...@gmail.com> writes:
>> What about multiply-high-no-carry-in? There might be cases
>> where such could be used for division with a reused
>> (possibly compile-time constant) divisor even without the
>> carry-in. (Producing the carry-in by a low multiply seems
>> to have significant overhead in energy, throughput, and/or
>> latency.)
>
> What's that operation supposed to produce? How does it differ from,
> e.g., RISC-V MULH/MULHU/MULHSU? And if there is a difference, what
> makes you think that it still works for division?

RISC-V's multiply high variants give an exact result and require
carry-in from the lower result. What I was proposing is a multiply
that had the same overhead as a multiply-low (ordinary multiply)
but computed a close approximation of the multiply-high result.
Multiplying by the reciprocal would give a close approximation of
a division result; in some cases a correct result would be
provided. (Hardware might be able to provide an exact division
result more efficiently than software correction of a close
approximation.)

Hardware that performed a full (double-width result) multiply
would probably not benefit from such since alternative use or
power-gating of unused portions of the multiplier could introduce
excessive overhead. Hardware that ran the multiplication through
twice to get a high result might be able to benefit.

The use case seems so limited and the benefits so questionable
that this might well be silly, but it seemed an obvious
possibility.

Paul A. Clayton

unread,

Dec 9, 2023, 1:33:46 PM12/9/23

to

On 12/7/23 1:55 PM, MitchAlsup wrote:
> Paul A. Clayton wrote:
>
>> On 11/27/23 8:03 PM, MitchAlsup wrote:
>>> Anton Ertl wrote:
>> [snip]
>>>> However, looking at more recent architectures, the RISC-V M
>>>> extension (which is part of RV64G and RV32G, i.e., a standard
>>>> extension) has not just multiply instructions (MUL,
>>>> MULH, MULHU, MULHSU, MULW), but also integer divide >>
>>>> instructions: DIV, DIVU, REM, REMU, DIVW, DIVUW, REMW, and
>>>> REMUW.
>>>
>>> All of which are possible in My 66000 using operand sign
>>> control, S-bit, and and CARRY when you want 64×64->128 or
>>> 128/64->{64 quotient, 64 remainder}
>
>> What about multiply-high-no-carry-in?
>
>       CARRY   R9,{{O}}    // carry applies to next inst
>                           // no carry in yes carry out
>       MUL     R10,Rx,Ry   // {R9,R10} contain result

I was thinking of a single register result.

BGB

unread,

Dec 9, 2023, 3:59:34 PM12/9/23

to

Yeah. This is closer to what I was doing for floating-point operations.

For 32-bit multiply, the multiplier always internally produces a
widening 64-bit result, but then sign or zero extends it for the 32-bit
cases. Since for C, normal int*int or similar will discard any
high-order results in the case of overflow; and otherwise following
every multiply with a sign or zero extension would be wasteful (but is
ironically closer to how early versions of my ISA would have worked; but
I later split up these cases).

I had considered a possible 64*64->128 widening multiply, since
internally the 64-bit multiplier also produces an intermediate 128-bit
result (then returns the low or high half; and doing a multiply for each
half takes twice as long).

But, thus far, have not done so...

MitchAlsup

unread,

Dec 10, 2023, 1:31:20 PM12/10/23

to

Paul A. Clayton wrote:

> On 12/7/23 1:55 PM, MitchAlsup wrote:
>> Paul A. Clayton wrote:
>>
>>> On 11/27/23 8:03 PM, MitchAlsup wrote:
>>>> Anton Ertl wrote:
>>> [snip]
>>>>> However, looking at more recent architectures, the RISC-V M
>>>>> extension (which is part of RV64G and RV32G, i.e., a standard
>>>>> extension) has not just multiply instructions (MUL,
>>>>> MULH, MULHU, MULHSU, MULW), but also integer divide >>
>>>>> instructions: DIV, DIVU, REM, REMU, DIVW, DIVUW, REMW, and
>>>>> REMUW.
>>>>
>>>> All of which are possible in My 66000 using operand sign
>>>> control, S-bit, and and CARRY when you want 64×64->128 or
>>>> 128/64->{64 quotient, 64 remainder}
>>
>>> What about multiply-high-no-carry-in?
>>
>>       CARRY   R9,{{O}}    // carry applies to next inst
>>                           // no carry in yes carry out
>>       MUL     R10,Rx,Ry   // {R9,R10} contain result

> I was thinking of a single register result.

How can a single 64-bit register hold a 128-bit result ??

Thomas Koenig

unread,

Dec 11, 2023, 5:43:46 AM12/11/23

to

MitchAlsup <mitch...@aol.com> schrieb:

> Paul A. Clayton wrote:
>
>> On 12/7/23 1:55 PM, MitchAlsup wrote:

>>>> What about multiply-high-no-carry-in?
>>>
>>>       CARRY   R9,{{O}}    // carry applies to next inst
>>>                           // no carry in yes carry out
>>>       MUL     R10,Rx,Ry   // {R9,R10} contain result
>
>> I was thinking of a single register result.
>
> How can a single 64-bit register hold a 128-bit result ??

Sometimes, you don't need the whole result; the upper bits suffice.

Consider dividing a 32-bit unsigned number n on a 32-bit system
by 25 (equally applicable to similar examples with 64-bit systems,
but I happen to have the numbers at hand).

In plain C, you could do this by

uint32_t res = ((uint64_t) n * (uint64_t) 1374389535) >> 35;

where any compiler worth its salt will optimize this to a multiply
high followed by a shift by three bits. Compilers have done
this for a long time, for division by a constant. Example, for
32-bit POWER:

#include <stdint.h>

uint32_t div25 (uint32_t n)
{
return n / 25;
}

yields

div25:
lis 9,0x51eb
ori 9,9,0x851f
mulhwu 3,3,9
srwi 3,3,3
blr

where the (lis/ori is POWER's way of synthesizing a 32-bit contant,
and 0x51eb851f = 1374389535).

If you can only access the high part via CARRY, this clobbers a
register that you do not need to do otherwise.

MitchAlsup

unread,

Dec 11, 2023, 7:17:42 PM12/11/23

to

Thomas Koenig wrote:

> MitchAlsup <mitch...@aol.com> schrieb:

>
> #include <stdint.h>

> uint32_t div25 (uint32_t n)
> {
> return n / 25;
> }

> yields

> div25:

> lis 9,0x51eb // cycle 1
> ori 9,9,0x851f // cycle 2
> mulhwu 3,3,9 // cycle 3-6
> srwi 3,3,3 // cycle 7
> blr // cycle 8

div25:
CARRY R8,{{O}} // cycle 1
MUL R1,R1,#0x51eb851f // cycle 1-4
SLL R1,R8,#35 // cycle 5
RET // cycle 6

MitchAlsup

unread,

Dec 11, 2023, 7:17:43 PM12/11/23

to

Thomas Koenig wrote:

> MitchAlsup <mitch...@aol.com> schrieb:
>> Paul A. Clayton wrote:
>>
>>> On 12/7/23 1:55 PM, MitchAlsup wrote:

>>>>> What about multiply-high-no-carry-in?
>>>>
>>>>       CARRY   R9,{{O}}    // carry applies to next inst
>>>>                           // no carry in yes carry out
>>>>       MUL     R10,Rx,Ry   // {R9,R10} contain result
>>
>>> I was thinking of a single register result.
>>
>> How can a single 64-bit register hold a 128-bit result ??

> Sometimes, you don't need the whole result; the upper bits suffice.

> Consider dividing a 32-bit unsigned number n on a 32-bit system
> by 25 (equally applicable to similar examples with 64-bit systems,
> but I happen to have the numbers at hand).

> In plain C, you could do this by

> uint32_t res = ((uint64_t) n * (uint64_t) 1374389535) >> 35;

This only gives you 29 bits of significance.

What if you also need the remainder ??

Thomas Koenig

unread,

Dec 12, 2023, 12:52:26 AM12/12/23

to

MitchAlsup <mitch...@aol.com> schrieb:

Because that's all it takes, dividing by 25 reduces the number
of significant bits.

> What if you also need the remainder ??

Multiply and subtract.

Thomas Koenig

unread,

Dec 12, 2023, 1:09:44 AM12/12/23

to

MitchAlsup <mitch...@aol.com> schrieb:

Then let's use a 64-bit example:

#include <stdint.h>

uint64_t div (uint64_t n)
{
return n / 37;
}

which becomes

div:
.LFB0:
.cfi_startproc
movabsq $-2492803253203993461, %rax
mulq %rdi
movq %rdx, %rax
shrq $5, %rax
ret

on x86_64 (so, use the high result only and shift it five bits to
the right) and, with Power10,

div:
pli 10,3714566310
pli 9,232160395
rldimi 9,10,32,0
mulhdu 3,3,9
srdi 3,3,5
blr

(The first three instructions are pasting together the constant, which
of course My 66000 can do much better :-)

A hypothetical My66000 code could be

mulhu r1, r1, #-2492803253203993461
srl r1, r1, #5
ret

BGB

unread,

Dec 12, 2023, 4:00:43 AM12/12/23

to

And, for comparison, on BJX2 (with the MULQ extension):
MOV -2492803253203993461, R3
MULHU.Q R4, R3, R2
SHLD.Q R2, -5, R2
RTS
...

Though, this will take 24 bytes to encode, and wont be fast.

Also, currently the only instructions that take 64 bit immediate values are:
MOV Imm64, Rn
ADD Imm64, Rn
PLDCXH Imm64, Xn //4x Binary16 to 4x Binary32

...

Robert Finch

unread,

Dec 12, 2023, 11:27:48 AM12/12/23

to

Q+ is almost the same as MY66000

muluh r1,r1,#-2492803253203993461
lsr r1,r1,#5
ret

but ret is only two bytes.

BGB

unread,

Dec 12, 2023, 2:13:41 PM12/12/23

to

In my case, RTS is 16 bits in Baseline mode, 32 bits in XG2 mode (either
way, it effectively performs a branch to the address held in the link
register).

In my case, it is limited both by the instruction encoding rules
currently only allowing for a 96-bit instruction (which, in turn, and
only express a 64-bit immediate for a few ops).

To be able to encode a multiply with a 64-bit immediate would
effectively require a 128-bit instruction encoding (Say: FE-FE-FF-OP,
with the FF being able to glue a 17-bit immediate onto some instructions
that wouldn't normally be able to have an immediate, and 24+24+17=65).

...

MitchAlsup

unread,

Dec 12, 2023, 3:47:54 PM12/12/23

to

When subroutines are aligned to cache line boundaries, that adds nothing.

BGB

unread,

Dec 12, 2023, 4:52:34 PM12/12/23

to

Technically, yes.

Or, even if not, very little, given that RET / RTS is comparably
infrequent...

In other news, did at least go and tweak decoding rules slightly such
that an Op64 prefix can now glue an Imm17s onto most "normal" 3R
instructions, with a few exceptions:
4R and 3R/3RI forms:
These may add either an Imm17s, 4th register, or 3R+Imm9.
This includes the Integer MAC and FMAC instructions.
FPU and FP-SIMD ops, where this encoding encodes a rounding-mode.

This change is N/A for ops not otherwise following the "OP Rm, Ro, Rn"
format.

Or: FFw0-0Vii-F0nm-ZeiZ OP Rm, Imm17s, Rn

This case is mostly intended to allow encoding an immediate form for
many ops which don't otherwise have an immediate form (but does have the
usual drawbacks associated with being a 64-bit encoding).

Thomas Koenig

unread,

Dec 16, 2023, 8:51:20 AM12/16/23

to

Thomas Koenig <tko...@netcologne.de> schrieb:

> A hypothetical My66000 code could be
>
> mulhu r1, r1, #-2492803253203993461
> srl r1, r1, #5
> ret

By "hypothetical" I mean that it isn't in the ISA at the moment.

MitchAlsup

unread,

Dec 16, 2023, 2:21:50 PM12/16/23

to

If I knew what mulhu does I could illuminate::
a) how many bits get multiplied?
b) how many bits are produced?

But assuming h means high, and the widths are 64-bit operands and
128-bit results::

CARRY R1,{{O}{I}}
MUL R2,R1,#-2492803253203993461
SHR R1,R2,<0,5>
RET

Thomas Koenig

unread,

Dec 16, 2023, 5:26:11 PM12/16/23

to

MitchAlsup <mitch...@aol.com> schrieb:

That works, but has two disadvantages over the non-Carry
version: It is longer (by one instruction modifier),
and R2 is overwritten with a value which is not needed.
This is why I think that having a "multiply high signed"
and "multiply high unsigned" operation is a good idea even
when My 66000's Carry is implemented.

MitchAlsup

unread,

Dec 16, 2023, 6:12:12 PM12/16/23

to

Heck:: if you are worried about length::

DIV R1,R1,#27
RET

2 words instead of 6.

Thomas Koenig

unread,

Dec 17, 2023, 2:52:16 AM12/17/23

to

This is an optimization problem with three objective functions:
Code size, register use and speed. Any solution in which at which
it is no longer possible to improve one of the objective functions
without making the others worse is a valid solution to that problem
(part of the Pareto front, if you want to use that terminology).

The version with the DIV with constant is the smallest, so it
trumps everything else on size. It is, however, likely to be
slower than the other two. It is what I would a compiler to emit
if optimizing for size (-Os), and when not optimizing.

The version with the Carry is likely faster than the DIV with
constant version, but is bigger and uses two registers instead
of one. It is what I would expect a compiler to generate in
the absence of multiply high unsigned and any -O option, except
for -Os.

The hypothetical version with "multiply high unsigned" is smaller
than the version with Carry and is better on register use, so
it is better in that sense. If it existed, I would expect a
compiler to always generate that for any optimization level
except -Os. (Small caveat: Maybe hardware has a fiendishly
clever algorithm to divide by 37 in four or less cycles; this
would then be known to the compiler, and it could chose accordingly).

MitchAlsup

unread,

Dec 17, 2023, 4:55:58 PM12/17/23

to

Since I do IDIV in the FMAC unit, and integers are not normalized::
FMAC passes both operands through the normalizer's Find First logic
and knows the significance of the numerator and denominator.

Thus, for this case, 27 is detected to have only 5-bits of significance
and the 1-bit per loop* only takes 5 cycles. Giving a fall through time
of 8-cycles total.

(*) assuming a 1-bit at a time loop--which I would not do--but I
still do the normalization steps to make the integers smell like
FP numbers, so, 1 Goldschmidt step and 1 Newton-Raphson step and
you have 8 cycle IDIV for denominators with less than 8-bits of
significance, 10-cycles for denominators of less then 16-bits,...

> The version with the Carry is likely faster than the DIV with
> constant version, but is bigger and uses two registers instead
> of one. It is what I would expect a compiler to generate in
> the absence of multiply high unsigned and any -O option, except
> for -Os.

What if we solve the problem by making IDIV fast instead !!

Given a fast IDIV, is there any use for IMULH ??

> The hypothetical version with "multiply high unsigned" is smaller
> than the version with Carry and is better on register use, so

I should point out that having the constant attached to the
MUL (as is were) you have already saves a register compared to
ISAs without large constants.

> it is better in that sense. If it existed, I would expect a
> compiler to always generate that for any optimization level
> except -Os. (Small caveat: Maybe hardware has a fiendishly
> clever algorithm to divide by 37 in four or less cycles; this
> would then be known to the compiler, and it could chose accordingly).

This is one gigantic reason to use FDIV to perform IDIV--and another
smaller one not to separate register files. FIDVs are in the 17-20
cycle range, why should IDIV not also be in that range ??

Paul A. Clayton

unread,

Dec 17, 2023, 10:14:37 PM12/17/23

to

On 12/17/23 4:55 PM, MitchAlsup wrote:
[snip]

> What if we solve the problem by making IDIV fast instead !!
>
> Given a fast IDIV, is there any use for IMULH ??

I saw using IMULH (or better a "division based on providing a
reciprocal") as a method to perform common subexpression
elimination on the iterative refinement of the reciprocal for
constants and for reused values. Storing and reusing the
reciprocal just seemed to make sense, though sometimes recomputing
may have less overhead.

Latency is also not the only consideration. Energy use is also a
consideration.

This technique might also be applicable when divisors are not
identical, but I doubt that would be very useful (though I vaguely
recall reading that some scientific computation had such divisions
somewhat commonly).

Hardware memoization of reciprocals has been proposed. I merely
thought that allowing a compiler to cache (and reuse) reciprocals
might be useful.

Terje Mathisen

unread,

Dec 18, 2023, 6:01:00 AM12/18/23

to

I disagree:

The R2 overwrite really doesn't matter since the OoO circuitry will
quickly notice that the value isn't used at all, and the extra CARRY
modifier does not make the code any slower: It will still take exactly
the latency of MUL + SHR, so presumably 5-6 total cycles.

Since MULHU is very rarely used in regular code (pretty much only in
constant divisions that got replaced with reciprocal muls), this tiny
overhead is a very good deal.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Thomas Koenig

unread,

Dec 19, 2023, 8:57:12 AM12/19/23

to

Terje Mathisen <terje.m...@tmsw.no> schrieb:

You're correct that the register use will probably be elided in
an OoO machine, but the register is still overwritten. This does
not matter for a function that does just that division, but in
larger functions, this can lead to an additional spill/restore.

> Since MULHU is very rarely used in regular code (pretty much only in
> constant divisions that got replaced with reciprocal muls), this tiny
> overhead is a very good deal.

My 66000 has very low register usage, especially due to the encoding
of constants. Another reason are indexed and scaled loads and
stores, which allows fewer registers because, for code like

do i=1,n
a(i) = b(i) + c(i)
end do

there is no need for having a pointer run through a, b and c,
requiring fewer registers.

This is why putting floating point values and integer values into
the same register set actually works well for My 66000, and would
not work so well for others.

Still, when I see an inefficiency, even minor, I tend to remark
on it :-)

MitchAlsup

unread,

Dec 19, 2023, 1:35:58 PM12/19/23

to

Thomas Koenig wrote:

> Still, when I see an inefficiency, even minor, I tend to remark
> on it :-)

And that is why we love you.