
Compact representation for common integer constants


JohnG

May 5, 2021, 3:25:53 AM
Hey All,

I've had this for ages and figured I'd toss it out for people's consideration.

Instead of a complete binary representation, represent common integers as a product of a few small prime factors together with a left shift.

Example: an 8-bit value with 1 bit for the presence of a factor of 3, 2 bits for a factor of 5, 25, or 125, and the remaining 5 bits for a left shift. So this would cover integers that were multiples of 1, 3, 5, 15, 25, 75, 125, or 375, then left-shifted between 0 and 31 bits.
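As a concrete illustration, a minimal C sketch of a decoder for this 8-bit format (the exact field layout is my reading of the example above, purely illustrative):

#include <stdint.h>

/* bit 0     : presence of a factor of 3
   bits 1..2 : power of 5 (5^0 .. 5^3)
   bits 3..7 : left-shift amount (0..31) */
static uint64_t decode_const8(uint8_t enc)
{
    static const uint32_t pow5[4] = { 1, 5, 25, 125 };
    uint64_t v = pow5[(enc >> 1) & 3];
    if (enc & 1)
        v *= 3;
    return v << ((enc >> 3) & 31);
}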

It seemed like a nice way to encode a lot of useful constants in a compact opcode format. Variants of the basic idea are hopefully obvious.

-JohnG

Ivan Godard

May 5, 2021, 11:29:32 AM
You wouldn't want to actually do the implied multiplies, so this would
become a look-up table indexed by the byte. But if you do that then you
could have any frequent values in the table, not just those implied by
the factors. That's how Mill popCons (Popular Constants) work, with the
nicety that they are encoded only using otherwise wasted entropy in the
binary ISA.
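A minimal sketch of that table-lookup shape in C (table contents invented here for illustration; the real popCon values come from corpus statistics):

#include <stdint.h>

static const uint64_t popcon[256] = {
    0, 1, 2, 3, 4, 5, 8, 10, 16, 100, 255, 1000, 0xFFFF,
    /* ...remaining entries chosen by profiling... */
};

static uint64_t decode_popcon(uint8_t idx)   /* the byte indexes the table */
{
    return popcon[idx];
}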

BGB

May 5, 2021, 4:07:43 PM
My experiments with shifted-constants were not particularly promising:
Normal integer constants tend to be highly clustered around zero;
There are few obvious common patterns for most larger constants.

And, for this example, there is not much reason why multiples of 3 or 5 would be significantly more common than non-multiples, especially when combined with needing hardware to scale things by a multiple of 3 or 5.

This can maybe deal slightly better with "casual" constants, like
"i=10000;" or similar, but these cases are a relative minority IME, and
probably not worth the cost of needing special handling in hardware.


The clustering near zero is also lopsided, with significantly more
positive constants than negative ones. The probability also seems to
fall off with roughly the square of the distance from zero.



The majority of constants which are not clustered near zero tend to fall
into one of several categories:
Hard-coded memory addresses or similar;
Bit-masks or similar;
Floating-point literals.


Memory Addresses: Not a whole lot one can do there, but these do make a
strong case for some sort of relative addressing, such as PC-Relative or
having a Global Pointer or similar.

In which case, the number of bits needed is proportional to
text+data+bss, which for "most binaries" (at least of the ones I am
testing) tends to fall in the range of 20..24 bits.

Though, for relative addressing for structs and arrays, ~ 8..12 bits
seems to be "mostly sufficient" (I am having OK results with a 9-bit
displacement for load/store ops; but as noted, I typically need
"something bigger" to deal with PC-rel or GBR-rel).

In my case, since my C ABI is mostly built around GBR relative
addressing, the number of full-width addresses (48-bits, in theory) is
relatively small (in practice, most of these are references to the MMIO
area at 0000_F0000000, *1).

*1: I am almost half-tempted to eliminate this region if using 48-bit
addressing with the MMU enabled (JQ+MMU). This would force using 48-bit
MMIO addresses if the MMU is enabled, but would effectively give an
entirely flat userland region (vs, "mostly flat but with a random hole
just below the 4GB mark for MMIO and similar").



Bit Masks:
These are the most likely to benefit from some form of shifted encoding,
but need to be able to encode several types of patterns (to be useful).
0000_0010_0000_0000 / 0000_0000_0800_0000 / ...
0000_007F_FFFF_FFFF / 0000_0000_01FF_FFFF / ...
0000_3FE0_0000_0000 / 0000_0000_01FC_0000 / ...
FFFF_FFEF_FFFF_FFFF / FFFF_FFFF_F7FF_FFFF / ...
FFFF_FF80_0000_0000 / FFFF_FFFF_FE00_0000 / ...
FFFF_C01F_FFFF_FFFF / FFFF_FFFF_FE03_FFFF / ...
...

Here, one needs to hit "approximately" the right area, and have rules
for filling in the rest of the bits.

In an ideal case, one would need 16-bits of payload, with a shift, and a
few bits as a fill pattern selector. To save bits, the shift could be a
multiple of 4 or 8 bits.

This would take ~ 22-bits, which is a bit steep.
Cramming it down to 12 to 16 bits is possible, but makes it a lot less
useful (eg: 10-bit literal, 4-bit nybble-shift, 2-bit fill pattern).
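A sketch of that crammed-down variant in C (taking the 10/4/2 split above literally; the fill-pattern semantics are one guessed assignment, not a real ISA's):

#include <stdint.h>

/* enc[9:0]   : 10-bit literal
   enc[13:10] : nybble shift (0..15 nybbles, i.e. 0..60 bits)
   enc[15:14] : fill selector (bit 0 = ones below, bit 1 = ones above) */
static uint64_t decode_mask16(uint16_t enc)
{
    uint64_t lit   = enc & 0x3FF;
    unsigned shift = ((enc >> 10) & 15) * 4;
    unsigned fill  = (enc >> 14) & 3;
    uint64_t v = lit << shift;
    if (fill & 1)
        v |= ~(~0ULL << shift);             /* fill 1s below the literal */
    if ((fill & 2) && (shift + 10 < 64))
        v |= ~0ULL << (shift + 10);         /* fill 1s above the literal */
    return v;
}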


Implementation is more difficult, as then one either needs to spend the
resources to handle it in the decoder, or re-purpose another unit (such
as the shift unit). Doing it in a way that leaves it "sufficiently
useful" (vs simpler options), is more difficult.

Shift-style units are kind of expensive, so aren't a good thing to put
in the instruction decoder.


Floating point literals:
If one has a few free spots with 16-bit immediate values, and a Half-Float converter, then this actually works fairly well (a majority of the floating-point constants that appear in code can be expressed exactly as Half-Float).

Main exceptions (can't be encoded exactly, or at all):
Powers of 10, eg: 1000.0, 0.1, 0.001, ...
Series of 9s, eg: 9999.0
Values that fall outside the exponent range
Values derived from irrational numbers
...

However, for: 1.0, 1.5, 0.75, 0.5, ...
They work pretty well.
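The conversion itself is essentially a bit repack; a rough sketch (normals and zero only; a real converter also handles subnormals, Inf, and NaN):

#include <stdint.h>
#include <string.h>

static double half_to_double(uint16_t h)
{
    uint64_t sign = (uint64_t)(h >> 15) << 63;
    uint64_t exp  = (h >> 10) & 0x1F;         /* binary16 bias is 15 */
    uint64_t frac = h & 0x3FF;
    uint64_t bits = (exp == 0 && frac == 0)
        ? sign                                /* +/- zero */
        : sign | ((exp - 15 + 1023) << 52)    /* rebias to binary64 */
               | (frac << 42);                /* 10-bit frac -> top of 52 */
    double d;
    memcpy(&d, &bits, sizeof d);              /* bit-cast to binary64 */
    return d;
}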

Even E4.F4 microfloats (FP8) or similar "aren't completely useless" (and
may provide a more compact way to express things like FP-SIMD vectors).

These formats may also be able to express a range of patterns which, while not themselves FP constants, happen to exist as bit patterns which can be represented exactly using packed FP16 or FP8. This includes a certain subset of bit-masks.


The "exception" cases tend not to follow an obvious pattern (in terms of
global statistics) so are "at best" handled with a lookup table, whose
optimal contents will tend to vary from one application to another.

In this latter case, the main viable options then end up being to encode
constants inline, or fetch them from a software managed lookup table,
with the usual sorts of tradeoffs.

Ivan Godard

May 5, 2021, 5:29:39 PM
This matches our experience with popCons very well. Let me add that not
only "half-float" works, but also quarter-float, eighth-float, etc. Low
precision/low range FP constants are quite common; the only full-sized
erratic (not in tabular data) floats are irrationals.

Mill code uses relative addressing throughout, so absolute addresses are
very rare. Besides 0..10, powers of two and ten seem to be the only ones
worth recognizing in our test corpus.

However, special-casing constants seems to be only a marginal gain in
general: at most 25-30% seem to be able to use the facility. The big win
is to have some way to compose a full size constant on the code-side
only, in one instruction without going through the data-side by loads
etc. Needing to compose a 128-bit constant from 23-bit literals is very
painful, but so is using a data fetch. I like Mitch's solution, although
it requires (or falls out of, if you prefer) bundle-ization in the
pre-decoder.

MitchAlsup

May 5, 2021, 8:07:39 PM
On Wednesday, May 5, 2021 at 3:07:43 PM UTC-5, BGB wrote:
> On 5/5/2021 2:25 AM, JohnG wrote:
> > Hey All,
> >
> > I've had this for ages and figured I'd toss it out for people's consideration.
> >
> > Instead of a complete binary representation, represent common integers as 1 times a count of a small number of prime factors and a left shift.
> >
> > Example, an 8-bit value with 1 bit for the presence of a factor of 3, and 2 bits for factors of 5, 25, and 125, and the remaining 5 bits for left shift. So this would cover integers that were multiples of 1, 3, 5, 15, 25, 75, 125, and 375 then left-shifted between 0 and 31 bits.
> >
> > It seemed like a nice way to encode a lot of useful constants in a compact opcode format. Variants of the basic idea are hopefully obvious.
> >
> My experiments with shifted-constants were not particularly promising:
> Normal integer constants tend to be highly clustered around zero;
<
With 80%+ of them positive
<
> There are few obvious common patterns for most larger constants.
>
> And, for this example, not much reason why multiples of 3 or 5 would be
> significantly more common than non-multiples of 3 or 5, now combined
> with needing hardware to scale things by a multiple of 3 or 5.
>
> This can maybe deal slightly better with "casual" constants, like
> "i=10000;" or similar, but these cases are a relative minority IME, and
> probably not worth the cost of needing special handling in hardware.
<
Note: 10,000 fits in 14-bits.....
>
>
> The clustering near zero is also lopsided, with significantly more
> positive constants than negative ones. The probability also seems to
> fall off with roughly the square of the distance from zero.
<
4-to-6 positive constants to 1 negative constant for integer stuff
For logical masking stuff there is a bit more on the negative side.
>
>
>
> The majority of constants which are not clustered near zero tend to fall
> into one of several categories:
> Hard-coded memory addresses or similar;
> Bit-masks or similar;
> Floating-point literals.
>
>
> Memory Addresses: Not a whole lot one can do there, but these do make a
> strong case for some sort of relative addressing, such as PC-Relative or
> having a Global Pointer or similar.
<
Which My 66000 has...
>
> In which case, the number of bits needed is proportional to
> text+data+bss, which for "most binaries" (at least of the ones I am
> testing) tends to fall in the range of 20..24 bits.
<
I am seeing 2% of the addresses needing to be 64 bits in size on a
64-bit address space machine.
>
> Though, for relative addressing for structs and arrays, ~ 8..12 bits
> seems to be "mostly sufficient" (I am having OK results with a 9-bit
> displacement for load/store ops; but as noted, I typically need
> "something bigger" to deal with PC-rel or GBR-rel).
<
I remember that the 16-bit immediates on Mc 88K were a lot more useful
in SPEC{89,93} than the 13-bit immediates in SPARC.
>
> In my case, since my C ABI is mostly built around GBR relative
> addressing, the number of full-width addresses (48-bits, in theory) is
> relatively small (in practice, most of these are references to the MMIO
> area at 0000_F0000000, *1).
<
Why does each I/O device not have its own "base register" which you load
and then talk to it using small displacements??
>
> *1: I am almost half-tempted to eliminate this region if using 48-bit
> addressing with the MMU enabled (JQ+MMU). This would force using 48-bit
> MMIO addresses if the MMU is enabled, but would effectively give an
> entirely flat userland region (vs, "mostly flat but with a random hole
> just below the 4GB mark for MMIO and similar").
>
>
>
> Bit Masks:
> These are the most likely to benefit from some form of shifted encoding,
> but need to be able to encode several types of patterns (to be useful).
> 0000_0010_0000_0000 / 0000_0000_0800_0000 / ...
> 0000_007F_FFFF_FFFF / 0000_0000_01FF_FFFF / ...
> 0000_3FE0_0000_0000 / 0000_0000_01FC_0000 / ...
> FFFF_FFEF_FFFF_FFFF / FFFF_FFFF_F7FF_FFFF / ...
> FFFF_FF80_0000_0000 / FFFF_FFFF_FE00_0000 / ...
> FFFF_C01F_FFFF_FFFF / FFFF_FFFF_FE03_FFFF / ...
> ...
<
Mc 88K had a MAKe instruction which created a 000111000 kind of mask
using two 5-bit values: offset and size of the 1s field. My 66000 dropped
this easy-to-perform instruction because immediates are more efficient
in execution time.
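A guess at those semantics in C (whether size==0 means an empty or a full 32-bit field is an assumption here):

#include <stdint.h>

static uint32_t make_mask(unsigned offset, unsigned size) /* two 5-bit fields */
{
    uint32_t run = (size & 31) ? ((1u << (size & 31)) - 1) : ~0u;
    return run << (offset & 31);  /* e.g. offset=3, size=3 -> 000111000 */
}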
PDP-10 had a pretty workable scheme for float constants built from a
reasonably sized constant.

My 66000 looked at this and decided not to "waste" the 16-bit immediate
OpCode space and save it for future use.

MitchAlsup

May 5, 2021, 8:09:58 PM
However, a 32-bit constant can be used to hold a 64-bit FP value when the
OpCode decoder can see that the constant consumer is double precision.

BGB

May 5, 2021, 11:57:34 PM
Quarter-Float is essentially what FP8 is, though in BJX2 only comes in
packed form via conversion ops (4xFP8 to 4xFP16, or 2xFP8 to 2xFP32).

These don't currently have constant-load forms, so loading a vector this
way is a multi-op sequence.


The 8-bit format also comes in both signed and unsigned variants:
FP8U: E4.F4, bias=7
FP8S: S.E4.F3, bias=7

When used for image data, FP8 has similar image quality to RGB555,
although it can express HDR images and preserves more detail in darker
parts of the image.
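A hedged sketch of decoding both layouts (normal numbers only; zero/subnormal handling omitted):

#include <stdint.h>
#include <math.h>

static float fp8u_to_float(uint8_t v)   /* E4.F4, bias=7 */
{
    return ldexpf(1.0f + (v & 15) / 16.0f, ((v >> 4) & 15) - 7);
}

static float fp8s_to_float(uint8_t v)   /* S.E4.F3, bias=7 */
{
    float m = ldexpf(1.0f + (v & 7) / 8.0f, ((v >> 3) & 15) - 7);
    return (v & 0x80) ? -m : m;
}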


> Mill code uses relative addressing throughout, so absolute addresses are
> very rare. Besides 0..10, powers of two and ten seem to be the only ones
> worth recognizing n our test corpus.
>
> However, special-casing constants seems to be only a marginal gain in
> general: at most 25-30% seem to be able to use the facility. The big win
> is to have some way to compose a full size constant on the code-side
> only, in one instruction without going through the data-side by loads
> etc. Needing to compose a 128-bit constant from 23-bit literals is very
> painful, but so is using a data fetch. I like Mitch's solution, although
> it requires (or falls out of, if you refer) bundle-ization in the
> pre-decoder.

The BJX2 ISA can compose 32 or 64-bit constants inline within a single
clock-cycle (via "jumbo prefixes"), though doing so still requires using
64 or 96 bits worth of encoding space.

Only a few instructions can encode a full-width 64-bit immediate though, eg:
MOV Imm64, Rn
ADD Imm64, Rn
...

Most other operations are currently limited to a 33 bit immediate.

Though 56-bit immediate values technically exist, only certain ops can
actually use them (eg, ADD/SUB/...). Things like memory loads are
in practice currently limited to a 33-bit displacement.


Branches are even more limited, so a longer branch will require
composing the destination address manually via LEA or ALU ops or similar.

So, while it could be theoretically possible to support a Jumbo-Branch,
at the moment there are no plans to do so (for resource cost and timing
reasons).



Because vectors are composed via register pairs, a 128-bit constant load
can be implemented as two 64-bit constant loads, and can load an
arbitrary 128-bit constant within 2 clock-cycles using 192 bits.
However, usually less space is needed.


However, if one had a table of constants, and the constants were used
repeatedly (with a register kept around for the table, ...); a lookup
table could potentially use less code space than encoding these
constants inline.

Potentially, something like "MOV.Q (PC, Disp9), Rn" could also be used,
but would not save anything space-wise over encoding the constant inline
via jumbo prefixes.

More so, the jumbo prefixes take a single cycle and the value can be used
immediately, whereas the memory load has a 3-cycle latency, ...

BGB

May 5, 2021, 11:57:48 PM
On 5/5/2021 7:07 PM, MitchAlsup wrote:
> On Wednesday, May 5, 2021 at 3:07:43 PM UTC-5, BGB wrote:
>> On 5/5/2021 2:25 AM, JohnG wrote:
>>> Hey All,
>>>
>>> I've had this for ages and figured I'd toss it out for people's consideration.
>>>
>>> Instead of a complete binary representation, represent common integers as 1 times a count of a small number of prime factors and a left shift.
>>>
>>> Example, an 8-bit value with 1 bit for the presence of a factor of 3, and 2 bits for factors of 5, 25, and 125, and the remaining 5 bits for left shift. So this would cover integers that were multiples of 1, 3, 5, 15, 25, 75, 125, and 375 then left-shifted between 0 and 31 bits.
>>>
>>> It seemed like a nice way to encode a lot of useful constants in a compact opcode format. Variants of the basic idea are hopefully obvious.
>>>
>> My experiments with shifted-constants were not particularly promising:
>> Normal integer constants tend to be highly clustered around zero;
> <
> With 80%+ of them positive
> <

Yes, granted.

>> There are few obvious common patterns for most larger constants.
>>
>> And, for this example, not much reason why multiples of 3 or 5 would be
>> significantly more common than non-multiples of 3 or 5, now combined
>> with needing hardware to scale things by a multiple of 3 or 5.
>>
>> This can maybe deal slightly better with "casual" constants, like
>> "i=10000;" or similar, but these cases are a relative minority IME, and
>> probably not worth the cost of needing special handling in hardware.
> <
> Note: 10,000 fits in 14-bits.....

True, but as can be noted, most 3R encodings need a jumbo prefix in my
case to hold a value larger than 9 bits.

Eg:
ADD R12, 291, R29
Can be encoded using a 32-bit form, whereas:
ADD R12, 9999, R29
Needs a 64-bit encoding.


>>
>>
>> The clustering near zero is also lopsided, with significantly more
>> positive constants than negative ones. The probability also seems to
>> fall off with roughly the square of the distance from zero.
> <
> 4-to-6 positive constants to 1 negative constant for integer stuff
> For logical masking stuff there is a bit more on the negative side.

Yep.

>>
>>
>>
>> The majority of constants which are not clustered near zero tend to fall
>> into one of several categories:
>> Hard-coded memory addresses or similar;
>> Bit-masks or similar;
>> Floating-point literals.
>>
>>
>> Memory Addresses: Not a whole lot one can do there, but these do make a
>> strong case for some sort of relative addressing, such as PC-Relative or
>> having a Global Pointer or similar.
> <
> Which My 66000 has...

Likewise, BJX2 has these as well.

Not every ISA has them though...


>>
>> In which case, the number of bits needed is proportional to
>> text+data+bss, which for "most binaries" (at least of the ones I am
>> testing) tends to fall in the range of 20..24 bits.
> <
> I am seeing 2% of the addresses needing to be 64 bits in size on a
> 64-bit address space machine.

Pretty much.

I have a 48-bit space, but thus far pretty much everything sits in the
low 4GB.

I figure I may as well use the full pointer size though, since the
relative cost isn't too drastic, and dealing with 32/64 ABI
compatibility issues is likely to be a bigger issue in the future than
needing to use ~ 3% more RAM.


Code size isn't really affected much either way, though it could be
slightly better if 48-bit Disp24 encodings or similar were reintroduced,
but this is unlikely at the moment.


I had actually expected a bigger gain from the 24-bit encodings. As can
be noted, in the cases where they were relevant, despite eliminating a
fair chunk of the 32-bit encodings, the overall savings were fairly
modest.

Say, program is:
70% 16-bit ops, 30% 32-bit ops.
And, we eliminate 1/3 of the 32-bit ops:
70% 16-bit ops, 20% 32-bit ops, 10% 24-bit ops.

Binary is only 4% smaller...
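(Working that through: the average op size goes from 0.7*16 + 0.3*32 = 20.8 bits to 0.7*16 + 0.2*32 + 0.1*24 = 20.0 bits, i.e. roughly 4% smaller.)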


For comparison, LZ4 packing tends to save ~ 40-60%, though with the
tradeoff that it is necessary to unpack the code before it can be
executed (so, doesn't make much sense with a small RAM space).


Some sort of hardware-level LZ packing could be possible, in theory, but
would do evil things to the L1 I$; it would effectively need to be
multiple L1 caches glued together in order to be able to support the
in-flight LZ unpacking. The impact on the compiler would likely also be
kinda evil as it would effectively need to perform both code generation
and dictionary compression at the same time (and the constraints imposed
on it are likely to make it somewhat less effective than a traditional
LZ compressor).


>>
>> Though, for relative addressing for structs and arrays, ~ 8..12 bits
>> seems to be "mostly sufficient" (I am having OK results with a 9-bit
>> displacement for load/store ops; but as noted, I typically need
>> "something bigger" to deal with PC-rel or GBR-rel).
> <
> I remember that the 16-bit immediates on Mc 88K were a lot more useful
> in SPEC{89,93} than the 13-bit immediates in SPARC.

Could be, dunno...


As noted, a 9-bit displacement with a struct, for 32 and 64-bit
load/store, allows a struct of 2K or 4K.

Given a majority of structs tend to be less than 512 bytes, it works out
OK for the most part.

Albeit, large structs end up needing 64-bit jumbo encodings to deal with
the field displacements.


Some of this is partly a tradeoff of being able to support 16-bit
instructions, WEX encodings, predicated instructions, ... These things
did cut a few bits off the encoding space.

This was compensated for originally by being able to readily load
constants into registers when needed.


>>
>> In my case, since my C ABI is mostly built around GBR relative
>> addressing, the number of full-width addresses (48-bits, in theory) is
>> relatively small (in practice, most of these are references to the MMIO
>> area at 0000_F0000000, *1).

> <
> Why does each I/O device not have its own "base register" wich you load
> and then talk at it using small displacements ??

Each MMIO device does have a base address and range...

But one needs to load an address to it, somewhere...

so, eg, something like:
volatile uint32_t *somedevice_regs32;
volatile uint64_t *somedevice_regs64;

void somedevice_init()
{
    somedevice_regs32 = (uint32_t *)0x0000F00CD000ULL;
    somedevice_regs64 = (uint64_t *)somedevice_regs32;
}

uint64_t somedevice_access()
{
    uint32_t fl;
    uint64_t v;
    fl = somedevice_regs32[SOMEDEVICE_STATUS];
    /* ... check status or whatever ... */
    v = somedevice_regs64[SOMEDEVICE_DATAQ];
    return(v);
}


If I make the change I mentioned, it would be necessary to instead
write, say:
somedevice_regs32 = (uint32_t *)0xF000000CD000ULL;
somedevice_regs64 = (uint64_t *)somedevice_regs32;

But, they would correspond to the same address as far as the hardware is
concerned.

There is also an MMIO bypass range at 0xC00000000000...

So:
void *mmu_cant_touch_this;
mmu_cant_touch_this = (void *)0xC00001234000ULL;
But, userland code "can't touch this" either...


Though, with the tradeoff that the L1 cache won't necessarily see them as
the same memory. If a store is performed to one address and then read
from another which happens to alias to the same underlying memory, the
results of the first store may not (necessarily) be visible...

However, in simple cases, the direct-mapped L1 is likely to evict one
location before it loads the other. This pattern will fall on its face
if the MMU is used though.


It is possible that, in the future, the caches could be made to keep
track of this stuff; possibly the L1 caches would signal their intentions
for a cache-line to the L2, which would keep track of things and then
notify the L1 caches about cases where they need to flush certain cache
lines.


>>
>> *1: I am almost half-tempted to eliminate this region if using 48-bit
>> addressing with the MMU enabled (JQ+MMU). This would force using 48-bit
>> MMIO addresses if the MMU is enabled, but would effectively give an
>> entirely flat userland region (vs, "mostly flat but with a random hole
>> just below the 4GB mark for MMIO and similar").
>>
>>
>>
>> Bit Masks:
>> These are the most likely to benefit from some form of shifted encoding,
>> but need to be able to encode several types of patterns (to be useful).
>> 0000_0010_0000_0000 / 0000_0000_0800_0000 / ...
>> 0000_007F_FFFF_FFFF / 0000_0000_01FF_FFFF / ...
>> 0000_3FE0_0000_0000 / 0000_0000_01FC_0000 / ...
>> FFFF_FFEF_FFFF_FFFF / FFFF_FFFF_F7FF_FFFF / ...
>> FFFF_FF80_0000_0000 / FFFF_FFFF_FE00_0000 / ...
>> FFFF_C01F_FFFF_FFFF / FFFF_FFFF_FE03_FFFF / ...
>> ...
> <
> Mc 88K has MAKe instruction which created a 000111000 kind of mask
> using two 5-bit values. Offset and size of the 1s field. My 66000 droped
> this easy to perform instruction because immediates are more efficient
> in execution time.

OK.
I have a few spots left for "Imm16, Rn" ops, but I use them sparingly...

Currently:
LDIZ, LDIN, ADD, LDISH, FLDCH, -, -, -

LDIZ/LDIN: Load a Zero or One extended 16-bit constant.
ADD: Add 16-bit constant, sign-extended.
LDISH: Rn=(Rn<<16)|Imm16u
FLDCH: Load a Half-Float Immediate
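Modeled in C, composing a 64-bit constant from these ops looks like (a sketch of the semantics above, not actual compiler output):

#include <stdint.h>

static uint64_t compose64(uint16_t a, uint16_t b, uint16_t c, uint16_t d)
{
    uint64_t r = a;        /* LDIZ  a, Rn : zero-extended load  */
    r = (r << 16) | b;     /* LDISH b, Rn : Rn=(Rn<<16)|Imm16u  */
    r = (r << 16) | c;     /* LDISH c, Rn */
    r = (r << 16) | d;     /* LDISH d, Rn */
    return r;
}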

David Brown

May 6, 2021, 5:23:58 AM
That's the way to do it, IMHO. A table also keeps the size of the
constants independent from the number of bits used for the index. And
it lets you include heavily used constants such as 42 even though they
don't fit into a nice pattern.

By the time you've got perhaps -1, 0, 1, 2, 4, 8, 10, 255 (taking 3 bits
to encode), you've covered a lot. I'd guess a 5-bit table would suffice.
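In C, that 3-bit table would be just (entries as guessed above, purely illustrative):

static const int64_t const3[8] = { -1, 0, 1, 2, 4, 8, 10, 255 };
/* a 3-bit immediate indexes const3; a 5-bit field would allow 32 entries */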

John Levine

May 6, 2021, 12:02:42 PM
According to JohnG <gomij...@gmail.com>:
>Example, an 8-bit value with 1 bit for the presence of a factor of 3, and 2 bits for factors of 5, 25, and 125, and the remaining 5 bits
>for left shift. So this would cover integers that were multiples of 1, 3, 5, 15, 25, 75, 125, and 375 then left-shifted between 0 and 31
>bits.
>
>It seemed like a nice way to encode a lot of useful constants in a compact opcode format. Variants of the basic idea are hopefully obvious.

It's an old idea, which doesn't mean it's a bad one. It is my impression
that you can easily outsmart yourself, and once you get beyond small
integers, there's no consistency in what values are popular.

The VAX, among its zillion addressing modes, had an immediate one with
a six-bit value. Depending on the instruction it was interpreted
as an integer between 0 and 63 or as a floating point number with
a three bit exponent and three bit fraction. That let them represent
integers from 1 to 16, 0.5, and a bunch of less common values.

--
Regards,
John Levine, jo...@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

BGB

May 6, 2021, 1:40:11 PM
On 5/6/2021 11:02 AM, John Levine wrote:
> According to JohnG <gomij...@gmail.com>:
>> Example, an 8-bit value with 1 bit for the presence of a factor of 3, and 2 bits for factors of 5, 25, and 125, and the remaining 5 bits
>> for left shift. So this would cover integers that were multiples of 1, 3, 5, 15, 25, 75, 125, and 375 then left-shifted between 0 and 31
>> bits.
>>
>> It seemed like a nice way to encode a lot of useful constants in a compact opcode format. Variants of the basic idea are hopefully obvious.
>
> It's an old idea which doesn't mean it's a bad one. It is my impression
> that you can easily outsmart yourself and once you get beyond small
> integers, there's no consistency in what values are popular.
>
> The VAX among its zillion addressing modes had an immediate one with
> a six-bit value. Depending on the instruction it was interpreted
> as a integer between 0 and 63 or as a floating point number with
> a three bit exponent and three bit fraction. That let them represent
> integers from 1 to 16, 0.5, and a bunch of less common values.
>


Generally true, though a few patterns exist.
If you can efficiently express:
Small integer values;
Small FP values;
Simple bit masks;
...

This goes a long way.
Small integer values are handled in the obvious way.


Small FP values can either be a truncated form of the value (load the
high N bits), or a small FP representation. Of these, a small FP value
typically does better for representing a useful range of constants.

This is how I (eventually) ended up having an instruction to load a
half-float immediate (which is then converted to Double on load).

Note that the conversion used in this case is essentially just repacking
the bits with a few special cases (the zero exponent being handled in a
"slightly non-standard" way).


Bit-masks are a slightly harder problem, not so much to express them
compactly, but rather to keep them cost-effective to decode. The only
real hope they have of being viable is if they can reuse an existing
shift unit or similar ("X SHL N" or "X ROL N" or similar, *1).

Generally, the range of values and patterns needed goes outside the
range of what is particularly viable for a lookup table.


It is also debatable whether they are particularly worthwhile in an ISA
which is able to encode larger constants directly (at the expense of a
slightly larger instruction and loss of ability to execute in parallel
with other instructions).

*1: Say, one has ops which split a 10-bit immediate into X (6 bits) and
N (4 bits), and then does the equivalent of (X ROL (N*4)), and then
comes in both a Zero and One extended variant.
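That footnoted encoding, modeled in C (field split and one-extension per the description; a sketch, not an actual BJX2 op):

#include <stdint.h>

static uint64_t rol_imm10(uint16_t imm10, int one_extend)
{
    uint64_t x = imm10 & 0x3F;              /* X: 6-bit value          */
    unsigned r = ((imm10 >> 6) & 0xF) * 4;  /* N: 4-bit nybble rotate  */
    if (one_extend)
        x |= ~0x3FULL;                      /* one-extend X to 64 bits */
    return r ? (x << r) | (x >> (64 - r)) : x;   /* X ROL (N*4) */
}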

MitchAlsup

May 6, 2021, 2:05:22 PM
But this, in general, places them in the execute domain rather than the
constant domain (along with the baggage that entails).

Constants should never "eat" any execution time, not the packing of bits
using multiple instructions, nor loading constants from memory. No:
constants should emanate only from the instruction stream.

BGB

May 6, 2021, 4:14:20 PM
In this case, it is the lesser of two evils:
Spend 32 bits to load a shifted-immediate constant;
Spend 64 or 96 bits to load the constant in some other way.


In any case, there will be some execution cost to using a discrete
instruction to load a constant, whether it behaves like a Shift/Rotate
operation, or is effectively just a plain MOV operation.

Either way, one is spending a clock-cycle in this case...


This wouldn't be used in cases where some other instruction is able to
fit the value into a normal immediate field.

Similarly, not going to add ARM style shifted-immediate values to random
other instructions, as this basically opens up a big ugly mess (and
would likely require another dedicated unit for this).



Re-adding something like this as an experiment: doing it for all 3 lanes
costs ~ 2500 LUTs (unreasonably expensive); limiting it to Lane 1
reduces the cost to ~ 800 LUTs (still pretty expensive for what it is).


Most of the cost in this case seemingly goes into the logic for
supporting split-immediate values in the register file (say, the normal
33-bit immed field emitted by the decoder is split into a high part and
a low part for the sake of the EX stages).

This is handled via special virtual registers, which can be used to
select the high or low part of the immed field.

Also note that Jumbo-Loads had been implemented by using immediate
fields from both Lane 1 and 2 as a combined immediate.

Eg:
IMM: Imm33s (sign-extended to 64 bits; already existed).
JIMM: { Imm32B, Imm32A } (already existed)
RIMMH: Imm33[32: 8] (sign-extended to 64 bits, new).
RIMML: Imm33[ 7: 0] (zero-extended to 64 bits, new).

Though, it is possible such a mechanism could have other uses.

The instruction decoder manages converting the Imm10 field into the
usual Imm33 form, but needs new logic to deal with the (6,4) fields.

...

MitchAlsup

May 6, 2021, 8:40:20 PM
from the instruction stream using no instructions.
>
>
> In any case, there will be some execution cost to using a discrete
> instruction to load a constant, whether it behaves like a Shift/Rotate
> operation, or is effectively just a plain MOV operation.
<
The only operations My 66000 performs on constants are:
a) sign extension from 16-bits to 64-bits
b) sign extension from 32-bits to 64-bits
c) expansion of 32-bit FP to 64-bit FP (denorms not allowed in a 32-bit
.....FP constant)
>
> Either way, one is spending a clock-cycle in this case...
<
Nope.
>
>
> This wouldn't be used in cases where some other instruction is able to
> fit the value into a normal immediate field.
>
> Similarly, not going to add ARM style shifted-immediate values to random
> other instructions, as this basically opens up a big ugly mess (and
> would likely require another dedicated unit for this).
>
Strongly agree.
>
> Re-adding something like this as an experiment, doing it for all 3 lanes
> costs ~ 2500 LUTs (unreasonably expensive); Limiting it to Lane 1
> reduces cost to ~ 800 LUTs (still pretty expensive for what it is).
>
Which is why I limited my constant manipulations to the above.

BGB

May 7, 2021, 2:33:06 AM
As noted, at least in my case, not every operation supports immediate
fields in every context, and when they do, they may have size
limitations for other reasons.

A consequence of this is still needing to load values into registers.
The operations which have immediate forms are mostly semi-common ALU ops
and similar.

Eg: ADD/SUB/AND/OR/XOR/SHAD/SHLD/... have immediate forms.
Many other operations do not.


It is possible that it could be generalized, but then the ISA would be
even less RISC-style than it already is...

Then again, I did see recently that someone had listed my ISA somewhere,
but then classified it as a CISC; I don't entirely agree, but alas...


Then again, many people's definitions of "RISC" exclude largish
instruction sets with variable-length instruction encodings, so alas.

But, taken at face value, then one would also need to exclude Thumb2 and
similar from the RISC category.


>>
>>
>> In any case, there will be some execution cost to using a discrete
>> instruction to load a constant, whether it behaves like a Shift/Rotate
>> operation, or is effectively just a plain MOV operation.
> <
> The only operations My 66000 performs on constants is
> a) sign extension from 16-bits to 64-bits
> b) sign extension from 32-bits to 64-bits
> c) expansion of 32-bit FP to 64-bit FP (denorms not allowed in 32-bit FP
> .....constant.

In this case, it is less a constant manipulation per se, and more a
hack to feed an immediate value into both ports of a ROTL or ROTLQ
operation, which can then masquerade as a constant load.

Similarly, the Half-Float constant load isn't so much a new type of
constant-load as an invocation of the Half-To-Double converter,
giving it an immediate value rather than a register.


As can be noted though, pretty much none of the existing FPU ops can use
immediate values apart from the Half-Float converter.


>>
>> Either way, one is spending a clock-cycle in this case...
> <
> Nope.

In some cases, it might actually be able to save clock cycles vs
jumbo-loads, since jumbo ops can't be executed in parallel with other
instructions...


>>
>>
>> This wouldn't be used in cases where some other instruction is able to
>> fit the value into a normal immediate field.
>>
>> Similarly, not going to add ARM style shifted-immediate values to random
>> other instructions, as this basically opens up a big ugly mess (and
>> would likely require another dedicated unit for this).
>>
> Strongly agree.

Yeah, hence why this is not what I am doing...
Adding a whole bunch of new ALU ops with ARM-style immediate values
would be a bit much...


>>
>> Re-adding something like this as an experiment, doing it for all 3 lanes
>> costs ~ 2500 LUTs (unreasonably expensive); Limiting it to Lane 1
>> reduces cost to ~ 800 LUTs (still pretty expensive for what it is).
>>
> Which is why I limited my constant manipulations to the above.

From what I can tell, most of the cost seems to be due to adding the
R8IMML / R8IMMH pseudo-registers.

Granted, this is in a part of the code which is filled with fairly tight
timing and "lots of angry bees" (much of the register interlock and
forwarding machinery and similar also passes through here).

Marcus

May 7, 2021, 3:31:10 AM
On 2021-05-06, BGB wrote:
> On 5/5/2021 4:29 PM, Ivan Godard wrote:

[snip]

>> This matches our experience with popCons very well. Let me add that
>> not only "half-float" works, but also quarter-float, eight-float etc,.
>> Low precision/low range FP constants are quite common; the only
>> full-sized erratic (not in tabular data) floats are irrationals.
>>
>
> Quarter-Float is essentially what FP8 is, though in BJX2 only comes in
> packed form via conversion ops (4xFP8 to 4xFP16, or 2xFP8 to 2xFP32).
>
> These don't currently have constant-load forms, so loading a vector this
> way is a multi-op sequence.
>
>
> The 8-bit format also comes in both a signed and unsigned variants:
>   FP8U:   E4.F4, bias=7
>   FP8S: S.E4.F3, bias=7
>

In MRISC32 I have "float32", "float16" and "float8" (the shorter formats
usually come in packed form). The 32-bit and 16-bit formats are
identical to binary32 and binary16 in IEEE 754, respectively, while
I had to make a custom format for the 8-bit variant - which happens to
be identical to your "FP8S" format (i.e. S.E4.F3, bias=7).

I wonder: Are there any common/official 8-bit IEEE-style
(sign+exponent+fraction) FP formats that are gaining traction (ignoring
posits)?

/Marcus

Terje Mathisen

May 7, 2021, 5:45:47 AM
Marcus wrote:
> In MRISC32 I have "float32", "float16" and "float8" (the shorter formats
> usually come in packed form). The 32-bit and 16-bit formats are
> identical to binary32 and binary16 in IEEE 754, respectively, while
> I had to make a custom format for the 8-bit variant - which happens to
> be identical to your "FP8S" format (i.e. S.E4.F3, bias=7).

With FP16 presumably 1:5:10 right?
>
> I wonder: Are there any common/official 8-bit IEEE-style
> (sign+exponent+fraction) FP formats that are gaining traction (ignoring
> posits)?

FP8 could be either 1:3:4 or 1:4:3, but for full ieee compliance you
really want at least 2n+3 bits in the mantissa when going up in size, so
your choice is the best one imho.

Having 2n+3 means that you never get into double rounding problems by
doing any single operation in the larger size, then rounding back down
to the target precision.

The smallest possible ieee format would need at least sign + two
exponent and two mantissa bits, so FP5 = 1:2:2.

This format would still support all of QNaN/SNaN/Inf/Normal/Sub-normal/Zero.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Stefan Monnier

May 7, 2021, 9:15:39 AM
> It is possible that it could be generalized, but then the ISA would be less
> RISC style then it is already...
>
> Then again, I did see recently that someone had listed my ISA somewhere, but
> then classified it as a CISC; I don't entirely agree, but alas...
>
> Then again, many peoples' definitions of "RISC" exclude largish
> instruction-sets with variable-length instruction encodings, so alas.
>
> But, taken at face value, then one would also need to exclude Thumb2 and
> similar from the RISC category.

I think it's better not to worry about how other people label your ISA.

The Manichean RISC-vs-CISC labeling is stupid anyway: the design space
has many more than 2 spots.

Also I think the "classic RISC" is the result of specific conditions at
that point in time which resulted in a particular sweet spot in the
design space.

But design constraints are quite different now, so applying the same
"quantitative approach" that led to MIPS/SPARC/... will now result in
something quite different. One could argue that such new ISAs
could be called RISC-2020 ;-)


Stefan

JohnG

May 7, 2021, 9:18:10 AM
On Wednesday, May 5, 2021 at 8:29:32 AM UTC-7, Ivan Godard wrote:
> You wouldn't want to actually do the implied multiplies, so this would
> become a look-up table indexed by the byte. But if you do that then you
> could have any frequent values in the table, not just those implied by
> the factors. That's how Mill popCons (Popular Constants) work, with the
> nicety that they are encoded only using otherwise wasted entropy in the
> binary ISA.

Right, but in my example there are only 8 interesting implied multiplies, which would only need an 8x9b ROM or equivalent logic:

[5^n 2-bits][3^n 1-bit]
[00][0] = 9'b0_0000_0001 # 5^0 * 3^0 = 1
[00][1] = 9'b0_0000_0011 # 5^0 * 3^1 = 3
[01][0] = 9'b0_0000_0101 # 5^1 * 3^0 = 5
[01][1] = 9'b0_0000_1111 # 5^1 * 3^1 = 15
[10][0] = 9'b0_0001_1001 # 5^2 * 3^0 = 25
[10][1] = 9'b0_0100_1011 # 5^2 * 3^1 = 75
[11][0] = 9'b0_0111_1101 # 5^3 * 3^0 = 125
[11][1] = 9'b1_0111_0111 # 5^3 * 3^1 = 375

Everything else is just as many left shifts as you want to spend bits on and could even wait until the ALU if it has the ability to shift inputs.

If you have the space and time to have a table of perfect constants, then yeah, that seems like the way to go.

A couple other things I think are asked in other replies that I'll put here just to have them all in one place.

This was intended to encode small integers for arithmetic ops - addition, subtraction, multiplication, offsets, etc. for a compressed opcode space (e.g. Thumb, RVC). If the immediate doesn't have a representation in the compressed opcode, use the full-sized instruction.

Good immediate values for logical ops - and, or, xor - are, as noted, more likely to be bit masks of various widths and offsets and would have a different encoding.

I only did positive integers as I was thinking of an arch with separate add and subtract opcodes, but one could of course burn a bit to flip the sign.

The reason I thought it might be interesting to pull out factors of 3 and 5 is that I saw quite a few instances of small data structures that were 24, 48, 96, etc. bytes, and a lot of human-friendly loops and arrays that were multiples of 10, 100, and 1000.

I just did a kernel build and objdump'd the .o's and looked at the immediates (but not offsets) used. And, for 6-bit immediates at least, it might be a wash. I can reach 0x40, 0x50, 0x60, 0xa0, 0x78, and 0xc0, while 0x0..0x3f can hit 0x38, 0x22, 0x7. And once you get to 7 bits, you might be pretty far down the long-tail and only a dynamic count would make sense.

38722 $0x1 <-- both
17892 $0x8 <-- both
15534 $0x10 <-- both
13068 $0x18 <-- both
6801 $0x20 <-- both
5502 $0x40 <----- mine
4796 $0x30 <-- both
4576 $0x28 <-- both
4209 $0x4 <-- both
2904 $0x38 <-- 6-bit
2595 $0x48 <-- neither
2454 $0x2 <-- both
1936 $0x50 <----- mine
1417 $0x10a0 <-- neither
1328 $0x58 <-- neither
1323 $0x60 <----- mine
1200 $0x3 <-- both
1085 $0xc <-- both
1066 $0x68 <-- neither
1012 $0x22 <-- 6-bit
936 $0x70 <-- neither
780 $0xa0 <----- mine
754 $0x78 <----- mine
724 $0xffffffffffffff80 <-- neither
704 $0x88 <-- neither
686 $0xc0 <----- mine
540 $0x90 <-- neither
526 $0x200 <-- neither
496 $0x7 <-- 6-bit
482 $0x98 <-- neither
449 $0x5 <-- both
439 $0x1000 <-- neither
433 $0x14 <-- both
396 $0xb8 <-- neither
372 $0x6 <-- both
369 $0xa8 <-- neither
319 $0xb0 <-- neither
295 $0xc8 <----- mine
272 $0xfff <-- neither
260 $0xf0 <----- mine
237 $0x9 <-- 6-bit

-JohnG

Marcus

May 7, 2021, 12:54:53 PM
On 2021-05-07, Terje Mathisen wrote:
> Marcus wrote:
>> In MRISC32 I have "float32", "float16" and "float8" (the shorter formats
>> usually come in packed form). The 32-bit and 16-bit formats are
>> identical to binary32 and binary16 in IEEE 754, respectively, while
>> I had to make a custom format for the 8-bit variant - which happens to
>> be identical to your "FP8S" format (i.e. S.E4.F3, bias=7).
>
> With FP16 presumably 1:5:10 right?

Yes (1:5:10, bias 15), as in IEEE 754 (and as specified in section 1.2.4
of the MRISC32 Instruction Set Manual
https://mrisc32.bitsnbites.eu/doc/mrisc32-instruction-set-manual.pdf ).

>>
>> I wonder: Are there any common/official 8-bit IEEE-style
>> (sign+exponent+fraction) FP formats that are gaining traction (ignoring
>> posits)?
>
> FP8 could be either 1:3:4 or 1:4:3, but for full ieee compliance you
> really want at least 2n+3 bits in the mantissa when going up in size, so
> your choice is the best one imho.
>
> Having 2n+3 means that you never get into double rounding problems by
> doing any single operation in the larger size, then rounding back down
> to the target precision.

Good to know - I didn't think of that. I mostly went with the
configuration that made the most sense: The main advantage (IMO) of
floating-point numbers compared to fixed point numbers is the increased
dynamic range, and to get an increase of the dynamic range compared to
8-bit signed integers you need at least four exponent bits (and using
more than that just seemed silly).

>
> The smallest possible ieee format would need at least sign + two
> exponent and two mantissa bits, so FP5 = 1:2:2.
>
> This format would still support all of
> QNaN/SNaN/Inf/Normal/Sub-normal/Zero.
>
> Terje
>

/Marcus

Stephen Fuld

May 7, 2021, 1:08:33 PM
On 5/7/2021 6:15 AM, Stefan Monnier wrote:
>> It is possible that it could be generalized, but then the ISA would be less
>> RISC style then it is already...
>>
>> Then again, I did see recently that someone had listed my ISA somewhere, but
>> then classified it as a CISC; I don't entirely agree, but alas...
>>
>> Then again, many peoples' definitions of "RISC" exclude largish
>> instruction-sets with variable-length instruction encodings, so alas.
>>
>> But, taken at face value, then one would also need to exclude Thumb2 and
>> similar from the RISC category.
>
> I think it's better not to worry about how other people label your ISA.
>
> The manichean RISC-vs-CISC labeling is stupid anyway: the design space
> has many more than 2 spots.

Amen.


> Also I think the "classic RISC" is the result of specific conditions at
> that point in time which resulted in a particular sweet spot in the
> design space.

As was "CISC", at least in the form of the 8080, which became, through
compatibility concerns, the X86.



> But design constraints are quite different now, so applying the same
> "quantitative approach" that lead to MIPS/SPARC/... will now result in
> something quite different.

Yes. Note that some of the original RISC dogmas (e.g. single-cycle
instructions) were eliminated pretty quickly once chip densities
increased enough that it made sense to include multiply/divide
instructions.

But the hardest part is to anticipate where the technology is going so
as to try to make your design reasonably optimal as technology evolves,
while still making something that works well enough in the short term to
be able to get to the long term.


> One could argue that such new ISAs
> could be called RISC-2020 ;-)

Or just not try to name them at all! :-)


--
- Stephen Fuld
(e-mail address disguised to prevent spam)

BGB

May 7, 2021, 1:29:21 PM
On 5/7/2021 4:45 AM, Terje Mathisen wrote:
> Marcus wrote:
>> In MRISC32 I have "float32", "float16" and "float8" (the shorter formats
>> usually come in packed form). The 32-bit and 16-bit formats are
>> identical to binary32 and binary16 in IEEE 754, respectively, while
>> I had to make a custom format for the 8-bit variant - which happens to
>> be identical to your "FP8S" format (i.e. S.E4.F3, bias=7).
>
> With FP16 presumably 1:5:10 right?

At least in my case, yes.


>>
>> I wonder: Are there any common/official 8-bit IEEE-style
>> (sign+exponent+fraction) FP formats that are gaining traction (ignoring
>> posits)?
>
> FP8 could be either 1:3:4 or 1:4:3, but for full ieee compliance you
> really want at least 2n+3 bits in the mantissa when going up in size, so
> your choice is the best one imho.
>
> Having 2n+3 means that you never get into double rounding problems by
> doing any single operation in the larger size, then rounding back down
> to the target precision.
>
> The smallest possible ieee format would need at least sign + two
> exponent and two mantissa bits, so FP5 = 1:2:2.
>
> This format would still support all of
> QNaN/SNaN/Inf/Normal/Sub-normal/Zero.
>

S.E4.F3 has just enough exponent bits to provide a semi-useful dynamic
range.

Whereas S.E3.F4 is only really usable if the bias is explicit /
modifiable. Otherwise, with a fixed bias of 3 or similar, the dynamic
range kinda sucks too much to be useful for all that much.

There are not currently any operations which perform computations on
FP8; currently it is mostly considered a format for storage and
inline vector constants. In these cases, the actual vector computations
would generally be done using FP16 / Binary16 vectors.

In this case though, there was some bias in the choice towards what
would be useful for things like RGBA color data and similar.

Stephen Fuld

May 7, 2021, 1:43:52 PM
On 5/7/2021 9:54 AM, Marcus wrote:
> On 2021-05-07, Terje Mathisen wrote:

snip

>> FP8 could be either 1:3:4 or 1:4:3, but for full ieee compliance you
>> really want at least 2n+3 bits in the mantissa when going up in size,
>> so your choice is the best one imho.
>>
>> Having 2n+3 means that you never get into double rounding problems by
>> doing any single operation in the larger size, then rounding back down
>> to the target precision.
>
> Good to know - I didn't think of that. I mostly went with the
> configuration that made the most sense: The main advantage (IMO) of
> floating-point numbers compared to fixed point numbers is the increased
> dynamic range,

Isn't that the *only* reason to do floating point?

Anton Ertl

May 7, 2021, 2:34:44 PM
Stefan Monnier <mon...@iro.umontreal.ca> writes:
>Also I think the "classic RISC" is the result of specific conditions at
>that point in time which resulted in a particular sweet spot in the
>design space.

There is some truth to this: For CPUs that perform well on
general-purpose code, RISC gave a big performance advantage in the
mid-late 1980s. That advantage eroded as Moore's law gave more
transistors to CISCs, allowing CISCs to implement pipelining (486
1989), superscalar (Pentium 1993), and OoO execution (Pentium Pro
1995).

OTOH, the CISCs did not manage to conquer low-power space, despite
attempts by both Intel (Bonnell, Silvermont ff.) and AMD (Bobcat ff.).
And they also do not compete in the low-area microcontroller market.
AFAIK every Ryzen has several ARM-based microcontrollers inside that
manage various aspects of its operation.

>But design constraints are quite different now, so applying the same
>"quantitative approach" that lead to MIPS/SPARC/... will now result in
>something quite different. One could argue that such new ISAs
>could be called RISC-2020 ;-)

Hmm, RISC-V is pretty close to MIPS (and Patterson argues that it is
pretty close to Berkeley RISC). Maybe the argument is: If the ISA
does not matter much for high-performance, high-power, high-area
cores, because we can fix it in the decoder, better optimize the ISA
for lower-power lower-area applications, where RISC still gives
benefits (of course, if you go for lowest area, you go for something
like the b16-small, but apparently people are not that keen on having
something smaller than a Cortex-M3 or so.

The other entry in the ISA field in recent years is Aarch64, which is
a RISC, and in certain ways closer to MIPS/SPARC than to the original
ARM ISA, but in others it's quite different: condition codes, double
loads and double stores, many addressing modes. And from this ISA
camp we see the A14 Firestorm core which has a significantly higher
IPC than Intel's and AMD's offerings, admittedly at a lower peak clock
rate, but also at a lower power consumption; an 8-wide execution
engine with a very deep reorder buffer (630 entries compared to 352
for Intel's Sunny Cove and 256 for AMD's Zen3) are probably a reason
for this, but the question remains: Was the ISA helpful (or less of a
hindrance than AMD64) for building such a wide and deep core?

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

BGB

May 7, 2021, 4:20:56 PM
Yeah.

Classic RISC:
Fixed size instructions
Typically only a single addressing mode (Reg, Disp)
Typically only supporting aligned memory access
Aims for fixed 1 instruction per cycle
...

BJX2:
Variable length
16/32 base
16/32/64/96 extended
48 possible, but currently unused
Original 48-bit ops can no longer be encoded.
Multiple addressing modes
Unaligned memory access (8/16/32/64)
64/128 bit alignment for 128-bit ops.
Supports explicitly parallel instructions
...


It has a few features in common with VLIW architectures as well, but
differs from traditional VLIW in that these typically use a fixed-size
bundle encoding, whereas BJX2 bundles are variable-length (composed of
32-bit units), with the 64 and 96 bit instruction encodings being
effectively a special-case of the bundle encoding.

There are several DSP architectures which do something similar to this:
Xtensa, Hexagon, ...


Though, originally, inclination toward VLIW support was based on taking
inspiration from the TMS320C6x and IA64, which, granted, use a
fixed-size bundle encoding.

Some of my earlier ideas involved using a more traditional bundle format
(just sort of awkwardly plonked into the code-stream), but this later
transformed into the current approach.


A recent experiment also tested 24-bit instructions, but as noted
elsewhere, while they were effective at reducing the number of 32-bit
ops in size-optimized code, the net savings were fairly small because
32-bit ops are already the minority in the size-optimized case (and they
ran into issues with the baked-in assumption that branch-targets are
16-bit aligned, *, ...).

*: One either needs to jostle around with re-encoding the past several
instructions to re-align the instruction stream, or insert a 24-bit NOP
(if the former fails), or use 24-bit byte-aligned branch encodings
(which ended up costing more than they saved vs the "reencode ops to fix
alignment to allow for using the 16-bit branch ops" strategy).



As for resource cost:
My current BJX2 core costs ~ 4x as much as a minimalist 32-bit RISC
style core.

Or, basically, if I go for:
Fixed-length instructions
One instruction at a time
16x 32-bit GPRs
Aligned-only memory access
No Variable-Shift or FPU ops
No MMU
...

It is possible to use ~ 1/4 the LUTs of the current (full-featured) BJX2
core. The difference isn't quite drastic enough to make a strong use
case for the minimalist core (even as a secondary "IO controller" or
similar).

I can also get a lot of this back (eg, fitting a microcontroller subset
of BJX2 on an XC7S25), mostly by disabling WEX and the MMU and similar.

By disabling the FPU and a few other things, it is also possible to
shoe-horn it into an XC7A15 (basically, the smallest Xilinx FPGA I can
really manage to find on FPGA dev-boards "in the wild").

This in-turn creates a disincentive to have a separate 32-bit ISA, vs
using a slightly more restrictive subset of the 64-bit ISA.


I had also looked briefly at trying to do a 16-bit IO controller, but
then ran into the problem that there is basically no real good way to
plug a 16-bit core into my existing bus architecture.

And, it seems, short of targeting something like an iCE40 or similar,
there isn't much point.

Not sure how this compares with the ASIC space, but I suspect with
modern technologies this is probably "grain of sand" territory.


Could matter more if one is having their logic printed onto a plastic
substrate using semiconductor inks, but given these technologies (in
their commercial form) are managing things like Cortex-M cores, it
probably isn't too major of an issue.
Similarly, building such a printer would currently appear to be too
expensive for the hobbyist space.


...

MitchAlsup

May 7, 2021, 4:33:27 PM
On Friday, May 7, 2021 at 1:34:44 PM UTC-5, Anton Ertl wrote:
> Stefan Monnier <mon...@iro.umontreal.ca> writes:
> >Also I think the "classic RISC" is the result of specific conditions at
> >that point in time which resulted in a particular sweet spot in the
> >design space.
<
I might note that at the time RISC was coined, it was not possible to
put a pipelined 32-bit data path and control unit in any of the
then-current CISC architectures. For example, the 68020 needed a 68881
for floating point and a 68851 for MMU, whereas the MIPS R2000 only
needed the CPU + TLB chip and the FP chip. Also at the time, CISCs
were running 1 instruction every 2 clocks best case.
<
> There is some truth to this: For CPUs that perform well on
> general-purpose code, RISC gave a big performance advantage in the
> mid-late 1980s.
<
To a large part this was due to the memory performance from large SRAM
caches (sometimes 2 caches) and the pipelined data path.
<
> That advantage eroded as Moore's law gave more
> transistors to CISCs, allowing CISCs to implement pipelining (486
> 1989), superscalar (Pentium 1993), and OoO execution (Pentium Pro
> 1995).
<
Once CISCs got pipelines and larger caches, they did indeed suddenly
catch up.
>
> OTOH, the CISCs did not manage to conquer low-power space, despite
> attempts by both Intel (Bonnell, Silvermont ff.) and AMD (Bobcat ff.).
<
It is no wonder, since Qualcomm had an entry that ran in the milliwatt
region whereas the LP x86s were targeted at the sub-10-Watt region.
<
> And they also do not compete in the low-area microcontroller market.
<
Where manufacturers sell chips at $0.10 each.
<
> AFAIK every Ryzen has several ARM-based microcontrollers inside that
> manage various aspects of its operation.
> >But design constraints are quite different now, so applying the same
> >"quantitative approach" that lead to MIPS/SPARC/... will now result in
> >something quite different. One could argue that such new ISAs
> >could be called RISC-2020 ;-)
<
> Hmm, RISC-V is pretty close to MIPS (and Patterson argues that it is
> pretty close to Berkeley RISC). Maybe the argument is: If the ISA
> does not matter much for high-performance, high-power, high-area
> cores, because we can fix it in the decoder, better optimize the ISA
> for lower-power lower-area applications, where RISC still gives
> benefits (of course, if you go for lowest area, you go for something
> like the b16-small, but apparently people are not that keen on having
> something smaller than a Cortex-M3 or so).
<
An R2000 shrunk down to 7nm is smaller than a single I/O pad on the
actual R2000.
>
> The other entry in the ISA field in recent years is Aarch64, which is
> a RISC, and in certain ways closer to MIPS/SPARC than to the original
> ARM ISA, but in others it's quite different: condition codes, double
> loads and double stores, many addressing modes.
<
I would argue that "having address modes" does not eliminate an ISA
from the RISC camp. As long as the address mode still takes one
cycle in address calculation, [Rbase+Rindex<<scale+Displacement]
can remain RISC, while *p++, *--p, and **p cannot.
<
Secondly, there is a lot of potential code-space savings in having
things like LM and STM (or their even more powerful cousins ENTER
and EXIT). And once you have an AGEN sequencer, why not set up the
pipeline and run it at the width of the data path. Thus, My 66000 even
contains MM (memory-to-memory move).
<
Thirdly, constants: this is where most RISCs fell on their faces, what
with instructions being used to paste 32-bit constants together, and
it gets really messy pasting 64-bit constants together. No, the proper
rule here is to expend no (nada, zero, zilch) execution time pasting
constants together. These constants need to service the needs of
integers, logicals, floating point, and global/local memory access.
<
Finally, I will argue that having transcendental instructions that are as
fast as an FDIV is GOOD for the ISA and certainly better than having to
call subroutines of the same functionality.

MitchAlsup

unread,
May 7, 2021, 4:40:15 PM5/7/21
to
Mc 88K had:
  Fixed-size instructions
  [Rbase+IMM16] and [Rbase+Rindex<<scale] address modes
  Aligned-only memory model (since shown to be defective)
  Aimed at 1 IPC, but we did engineer a 6-wide OoO version

My 66000:
  Has fixed-size instruction specifiers with 1 or 2 constants
  [Rbase+IMM16], [Rbase+Rindex<<scale], and [Rbase+Rindex<<scale+disp32/64] address modes
  Misaligned memory model
  Aimed at low burden for LBIO implementations and low burden for GBOoO implementations
  Inherently parallel
  Never needs a NoOp

BGB

unread,
May 7, 2021, 5:15:02 PM5/7/21
to
On 5/7/2021 8:18 AM, JohnG wrote:
> On Wednesday, May 5, 2021 at 8:29:32 AM UTC-7, Ivan Godard wrote:
>> You wouldn't want to actually do the implied multiplies, so this would
>> become a look-up table indexed by the byte. But if you do that then you
>> could have any frequent values in the table, not just those implied by
>> the factors. That's how Mill popCons (Popular Constants) work, with the
>> nicety that they are encoded only using otherwise wasted entropy in the
>> binary ISA.
>
> Right, but in my example, there are only 8 interesting implied multiplies and would only need an 8x9b rom or equivalent logic:
>
> [5^n 2-bits][3^n 1-bit]
> [00][0] = 9'b0_0000_0001 # 5^0 * 3^0 = 1
> [00][1] = 9'b0_0000_0011 # 5^0 * 3^1 = 3
> [01][0] = 9'b0_0000_0101 # 5^1 * 3^0 = 5
> [01][1] = 9'b0_0000_1111 # 5^1 * 3^1 = 15
> [10][0] = 9'b0_0001_1001 # 5^2 * 3^0 = 25
> [10][1] = 9'b0_0100_1011 # 5^2 * 3^1 = 75
> [11][0] = 9'b0_0111_1101 # 5^3 * 3^0 = 125
> [11][1] = 9'b1_0111_0111 # 5^3 * 3^1 = 375
>

Realize that by FPGA standards, an 8b*9b ROM, presumably looking up
18-bit values or similar, is "pretty damn massive"...

Luckily, at least, most FPGAs provide DSP48 units or similar which can
manage a small multiplier, but then one may have to budget for roughly
an extra clock-cycle of latency.
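
In software terms, though, the whole decode is tiny (a minimal C sketch
of my reading of JohnG's format, assuming bit 0 is the factor-of-3 flag,
bits 1..2 the power-of-5 count, and bits 3..7 the left shift):

/* the 8 products listed in the quoted table */
static const unsigned short fac_tab[8] = { 1, 3, 5, 15, 25, 75, 125, 375 };

unsigned long long decode8(unsigned char b) {
    return (unsigned long long)fac_tab[b & 7] << (b >> 3);
}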


> Everything else is just as many left shifts as you want to spend bits on and could even wait until the ALU if it has the ability to shift inputs.
>
> If you have the space and time to have a table of perfect constants, then yeah, that seems like the way to go.
>
> A couple other things I think are asked in other replies that I'll put here just to have them all in one place.
>


Realize, also, that shift is "kinda expensive" in terms of resources...
One doesn't really want to assume that a shift unit is available during
the decode process. ALU/EX is another matter (if one is already known to
be available there), albeit this limits how such a thing may be used.

This is also why many smaller ISAs tend to leave out things like
variable-shift and integer multiply...


Decided to leave it out, but one can use one's imagination for how
things would work on an ISA which lacks both shift instructions and
multiply...

MitchAlsup

unread,
May 7, 2021, 5:28:39 PM5/7/21
to
On Friday, May 7, 2021 at 4:15:02 PM UTC-5, BGB wrote:
> On 5/7/2021 8:18 AM, JohnG wrote:
> > On Wednesday, May 5, 2021 at 8:29:32 AM UTC-7, Ivan Godard wrote:
> >> You wouldn't want to actually do the implied multiplies, so this would
> >> become a look-up table indexed by the byte. But if you do that then you
> >> could have any frequent values in the table, not just those implied by
> >> the factors. That's how Mill popCons (Popular Constants) work, with the
> >> nicety that they are encoded only using otherwise wasted entropy in the
> >> binary ISA.
> >
> > Right, but in my example, there are only 8 interesting implied multiplies and would only need an 8x9b rom or equivalent logic:
> >
> > [5^n 2-bits][3^n 1-bit]
> > [00][0] = 9'b0_0000_0001 # 5^0 * 3^0 = 1
> > [00][1] = 9'b0_0000_0011 # 5^0 * 3^1 = 3
> > [01][0] = 9'b0_0000_0101 # 5^1 * 3^0 = 5
> > [01][1] = 9'b0_0000_1111 # 5^1 * 3^1 = 15
> > [10][0] = 9'b0_0001_1001 # 5^2 * 3^0 = 25
> > [10][1] = 9'b0_0100_1011 # 5^2 * 3^1 = 75
> > [11][0] = 9'b0_0111_1101 # 5^3 * 3^0 = 125
> > [11][1] = 9'b1_0111_0111 # 5^3 * 3^1 = 375
> >
> Realize that by FPGA standards, an 8b*9b ROM, presumably looking up
> 18-bit values or similar, is "pretty damn massive"...
<
But brain dead easy to verify........

On the other hand, those constants may not be used all that often.
>
> Luckily at least, most FPGAs provide DSP48 units or similar which can
> manage a small multiplier, but then roughly one may have to budget for
> an extra clock-cycle of latency.
> > Everything else is just as many left shifts as you want to spend bits on and could even wait until the ALU if it has the ability to shift inputs.
> >
> > If you have the space and time to have a table of perfect constants, then yeah, that seems like the way to go.
> >
> > A couple other things I think are asked in other replies that I'll put here just to have them all in one place.
> >
> Realize, also, that shift is "kinda expensive" in terms of resources...
> One doesn't really want to assume that a shift unit is available during
> the decode process. ALU/EX is another matter though (if it is already
> known to be available), albeit limits how such a thing may be used.
<
I did a design that was 6-wide and only had 3 shifters--which were shared
with (i.e. positioned in) the LD/ST units--whereas each of the 6 FUs could
perform an integer operation (IMUL and IDIV were in the FMAC unit).

With base-plus-scaled-index addressing, the number of shifts in typical
code was down in the 2%-5% range; so while you had to have somewhat easy
access to a shifter, in a unit that is not expected to be saturated, you
don't have to over-resource shift capability.
>
> This is also why many smaller ISAs tend to leave out things like
> variable-shift and integer multiply...
<
In this day and age, that is a mistake.

Anton Ertl

unread,
May 7, 2021, 5:47:41 PM5/7/21
to
MitchAlsup <Mitch...@aol.com> writes:
>On Friday, May 7, 2021 at 1:34:44 PM UTC-5, Anton Ertl wrote:
>> The other entry in the ISA field in recent years is Aarch64, which is
>> a RISC, and in certain ways closer to MIPS/SPARC than to the original
>> ARM ISA, but in others it's quite different: condition codes, double
>> loads and double stores, many addressing modes.
><
>I would argue that "having address modes" does not eliminate an ISA
>from the RISC camp.

I agree. It contributes to Aarch64 being relatively far from the
MIPS, though.

>As long as the address mode still takes one
>cycle in address calculation so [Rbase+Rindex<<scale+Displacement]
>can remain RISC, while *p++, *--p, and **p cannot.

ARM, HPPA, Power, and Aarch64 have loads and stores with
autoincrement/decrement and are considered to be RISCs by most.

>Secondly, there is a lost of potential code space savings by having
>things like LM and STM

ARM and Power have load and store multiple instructions. The Aarch64
architects apparently decided that this is not so great, and gave the
users load and store pair instructions, which are quite useful also in
cases where I don't see an ARM compiler generate load/store-multiple.

Thomas Koenig

unread,
May 7, 2021, 6:25:48 PM5/7/21
to
Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:

> ARM, HPPA, Power, and Aarch64 have loads and stores with
> autoincrement/decrement and a considered to be RISCs by most.

And at least for POWER, they are a bit different (and more
powerful):

lbzu ra,D(rb)

will, for example, load ra from rb + D and store rb + D into
rb afterwards.

However, it is better not to use them:

# In some implementations, the Load Algebraic and
# Load with Update instructions may have greater
# latency than other types of Load instructions.
# Moreover, Load with Update instructions may take
# longer to execute in some implementations than the
# corresponding pair of a non-update Load instruction
# and an Add instruction.

MitchAlsup

unread,
May 7, 2021, 6:31:12 PM5/7/21
to
On Friday, May 7, 2021 at 5:25:48 PM UTC-5, Thomas Koenig wrote:
> Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
> > ARM, HPPA, Power, and Aarch64 have loads and stores with
> > autoincrement/decrement and a considered to be RISCs by most.
> And at least for POWER, they are a bit different (and more
> powerful):
>
> lbzu ra,D(rb)
>
> will, for example, load ra from rb + D and store rb + D into
> rb afterwards.
<
I think one could argue that this is causing the instruction to have
more than one result.
>
> However, it is better not to use them:
>
> # In some implementations, the Load Algebraic and
> # Load with Update instructions may have greater
> # latency than other types of Load instructions. More-
> # over, Load with Update instructions may take lon-
> # ger to execute in some implementations than the
> # corresponding pair of a non-update Load instruc-
> # tion and an Add instruction.
<
Register write collisions cause pipeline stalling behavior.

MitchAlsup

unread,
May 7, 2021, 6:45:24 PM5/7/21
to
On Friday, May 7, 2021 at 4:47:41 PM UTC-5, Anton Ertl wrote:
> MitchAlsup <Mitch...@aol.com> writes:
> >On Friday, May 7, 2021 at 1:34:44 PM UTC-5, Anton Ertl wrote:
> >> The other entry in the ISA field in recent years is Aarch64, which is
> >> a RISC, and in certain ways closer to MIPS/SPARC than to the original
> >> ARM ISA, but in others it's quite different: condition codes, double
> >> loads and double stores, many addressing modes.
> ><
> >I would argue that "having address modes" does not eliminate an ISA
> >from the RISC camp.
> I agree. It contributes to Aarch64 being relatively far from the
> MIPS, though.
> >As long as the address mode still takes one
> >cycle in address calculation so [Rbase+Rindex<<scale+Displacement]
> >can remain RISC, while *p++, *--p, and **p cannot.
> ARM, HPPA, Power, and Aarch64 have loads and stores with
> autoincrement/decrement and are considered to be RISCs by most.
> >Secondly, there is a lost of potential code space savings by having
> >things like LM and STM
> ARM and Power have load and store multiple intructions. The Aarch64
> architects apparently decided that this is not so great, and gave the
> users load and store pair instructions, which are quite useful also in
> cases where I don't see an ARM compiler generate load/store-multiple.
<
I added these (ENTER and EXIT in particular) to avoid subroutine prologues
that looked like:
subroutine:
SUB SP,SP,216
MOV R16,[SP+96]
MOV R17,[SP+104]
MOV R18,[SP+112]
MOV R19,[SP+120]
MOV R20,[SP+128]
MOV R21,[SP+136]
MOV R22,[SP+144]
MOV R23,[SP+152]
MOV R24,[SP+160]
MOV R25,[SP+168]
MOV R26,[SP+176]
MOV R27,[SP+184]
MOV R28,[SP+192]
MOV R29,[SP+200]
MOV R0,[SP+208]

And similar for the subroutine epilogue.

Instead, one can simply write:

subroutine:
ENTER R16,R0,100

And perform the same amount of work. Now, since HW knows the list of
registers being saved is IN ORDER in memory, it can read the register file
4-8 registers at a time and store them into 1/4 or 1/2 cache line every cycle.
So on a lowly 1-wide machine the above sequence would take 4 or 5 cycles.
On a more aggressive implementation this can be performed lazily and in
the background (so it might appear to take only 1 cycle.) Returning from
the subroutine does similarly, but since the HW knows* the EXIT performs
a control transfer, the return address can be read and delivered to IP without
flowing through R0 and the FETCH can begin before the register reloads
complete.

(*) it is possible to perform an EXIT that does not perform a control transfer
and these are used when exiting dynamic blocks in ALGOL-like languages
and by the stack walker for TRY-CATCH-THROW.
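
For reference, a rough C model of ENTER's architectural effect (my own
reading of the above; the operand meanings and exact frame layout here
are guesses, not the actual My 66000 spec):

#include <stdint.h>

/* ENTER Rfirst,R0,locals: allocate the frame, save Rfirst..R29 in
   order, then R0 (the return address) above them */
void enter_model(uint64_t r[32], uint64_t **sp,
                 unsigned first, unsigned locals_bytes) {
    unsigned nsave = (29 - first) + 2;         /* Rfirst..R29, plus R0 */
    *sp -= (locals_bytes / 8) + nsave;         /* SUB SP,SP,frame      */
    uint64_t *p = *sp + locals_bytes / 8;      /* saves above the locals */
    for (unsigned i = first; i <= 29; i++)
        *p++ = r[i];                           /* contiguous, IN ORDER */
    *p = r[0];
}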

BGB

unread,
May 7, 2021, 8:45:08 PM5/7/21
to
A 'core' in BJX2 currently has:
  3x 'SHAD' Units:
    2x "wide" funnel shift units (Lanes 1/2);
    1x "narrow" shift unit (Lane 3).
  AGU / AGEN (Lane 1):
    32 or 48-bit base address (depending on operating mode);
    32-bit displacement
      (linear addressing may break with large displacements).
  Memory Port (Lane 1):
    Partially spans Lane 2 to support 128-bit Load/Store;
    Plugs into the L1 D$.
  IMUL (Lane 1):
    32*32 -> 64.
  3x ALU:
    1x ALU-A (Lane 1), Full Capability (ADD / SUB / CMP / ...);
    2x ALU-B (Lanes 2/3), Partial Capability (ADD / SUB).
  1x FPU (SIMD Capable, Spans Lanes 1/2/3):
    FADD / FSUB / FMUL (Double);
    Packed ADD / SUB / MUL (Single / Half);
    Contains:
      FADD/FSUB Unit;
      FMUL Unit;
      Converters / Misc Glue Logic.
  3x CONV (Integer Conversion):
    Sign/Zero Extension, and a few other misc operations.

Then, other stages:
  L1 I$
  Decoder Unit
  GPR Register-File Unit
  PreBranch Unit (Branch Predictor)
  ...

The "wide" shift units can produce shift and rotate, and can be ganged
together to perform a 128-bit shift.

The "narrow" shift unit can perform a 64-bit arithmetic or logical shift
(but excludes rotate and similar to save cost).


AGU: Essentially, it calculates a 32-bit address in the low bits, and
may do +1/0/-1 on (47:32), with bits (63:48) being copied unchanged (and
generally ignored by the L1 cache). If the scaled displacement exceeds
the 32-bit limit, then addressing may break down.

Technically, a 33-bit scaled index with a 36-bit partial address would
be "better"; it is mostly a trade-off of timing and resource cost
(partly because the AGU and also LEA are single-cycle operations).
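
In rough C terms, the behavior is something like (a simplified sketch,
edge cases glossed over):

#include <stdint.h>

uint64_t agen(uint64_t base, int32_t disp_scaled) {
    /* 32-bit add in the low bits */
    uint64_t lo  = (uint32_t)base + (uint64_t)(uint32_t)disp_scaled;
    /* carry/borrow out of the low 32 bits gives the +1/0/-1 on (47:32) */
    int adj = (lo >> 32) ? (disp_scaled >= 0 ?  1 : 0)
                         : (disp_scaled <  0 ? -1 : 0);
    uint64_t mid = ((base >> 32) + (uint64_t)(int64_t)adj) & 0xFFFF;
    /* bits (63:48) copied unchanged */
    return (base & 0xFFFF000000000000ull) | (mid << 32) | (uint32_t)lo;
}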


The IMUL unit only exists in Lane 1, and performs a 32-bit widening
multiply. Typically, 64-bit multiply is faked using multiple 32-bit
multiplies (with 64-bit multiply being both infrequent and expensive).


ALU: The full-featured ALU only exists in Lane 1, with Lanes 2 and 3
getting shaved-down ALUs (ADD/SUB/AND/OR/XOR).

The Lane 1 ALU-A unit and Lane 2's ALU-B can be ganged to support
128-bit ADD/SUB/CMP/AND/OR/XOR. Each ALU itself only operates on 64 bits
at a time, but sending a few bits between them allows the carry-select
mechanism to operate as if it were 128 bits wide.

This works mostly because 64- and 128-bit ADD/SUB/CMP have a
2-cycle latency.

Some other misc stuff, eg: CLZ/CTZ, UTX2 (Texture Compression), ..., is
also routed through ALU-A.



There is only a single FPU, but effectively it spans all 3 lanes, as
this was needed to be able to support 128-bit FP-SIMD operations. Note
that SIMD operations are implemented via pipelining within the FPU.

As noted, FADD is a 6-cycle operation, whereas PADDF (4x Single) is 10
cycles, so SIMD mostly wins here.

FP conversion ops are routed through different locations:
Double <-> Integer: routed through FADD.
Double <-> Single/Half: FPU, via converter units.
...

FCMP is implemented via the ALU.

FRCP / FDIV / FSQRT: Faked in software via the C runtime.
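
For FDIV, one common software approach (illustrative only; not
necessarily what the runtime actually does) is a crude exponent-flip
reciprocal estimate refined by Newton-Raphson steps:

#include <stdint.h>
#include <string.h>

double soft_recip(double b) {     /* normals only; 0/Inf/NaN omitted */
    uint64_t bits;
    memcpy(&bits, &b, 8);
    bits = 0x7FDE000000000000ull - bits;   /* rough first guess at 1/b */
    double x;
    memcpy(&x, &bits, 8);
    for (int i = 0; i < 6; i++)            /* each step roughly doubles
                                              the number of good bits */
        x = x * (2.0 - b * x);
    return x;                              /* a/b == a * soft_recip(b) */
}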


There was the Long-Double-Extension, which added "Truncated Binary128"
support (~ S.15.80), but this has the drawback of being "rather
expensive". Full Binary128 support currently falls well outside the FPGA
resource budget.

There were plans that the FMULX unit could be made to also be able to
perform 64-bit integer multiply, but this depends on "actually being
able to afford it". This mechanism would involve being able to
reconfigure some of the low-order units of the multiplier via "plumbing
tricks" (to switch between a "square" and "triangular" configuration).

By analogy, it would be equivalent to having a big isosceles triangle,
then being able to cut two of the corners off, and move them to the
bottom of the triangle, forming a square (with a big interior area which
does not move).


Where, the triangle represents an FMUL, and the square represents an
IMUL, and the non-moving part represents the high-order bits.


Units like the TLB aren't directly connected to the execute pipeline,
but are instead connected to the L1 caches via the ringbus. Operations
like LDTLB are internally special-case MMIO operations which
communicate with the TLB as if it were an IO device. Several
registers are forwarded to the TLB though (MMCR, KRR, ...). While being
"theoretically" an MMU register, TTB and similar aren't actually visible
to the MMU (and are more intended to be used by the OS / firmware's
page-table walker).


>>
>> This is also why many smaller ISAs tend to leave out things like
>> variable-shift and integer multiply...
> <
> In this day and age, that is a mistake.

It does suck, and probably isn't ideal for anything meant for performance.

Sorta works in microcontrollers though which aren't usually meant for
performance-sensitive applications.

BGB

unread,
May 7, 2021, 10:07:37 PM5/7/21
to
BJX2 allows misaligned access, and also does not need NOPs.

Misaligned access may degrade performance in some cases though, doesn't
work with certain instructions, ...

If one triggers an interlock on a memory load or similar, there will be
a partial pipeline stall, and it will behave as-if a NOP were present.
Triggering an interlock on various other instructions may also trigger
this behavior.

Interlocks currently are in 2nd place (behind cache misses) for wasting
clock-cycles (improving the efficiency of memory has, in effect, also
increased the proportion of clock-cycles wasted on pipeline interlock
stalls).

I had more recently added logic to my C compiler's "wexifier" to shuffle
instructions around to try to reduce interlock penalties. At the moment,
it is disabled by default, as the wexifier seems to be buggy in some
as-yet-undetermined way (it seems to produce bundles which don't
work correctly on the FPGA implementation, but I can't isolate the
behavior well enough to determine the cause).


BJX2:
  Rb + Imm<<Sc
  Rb + Ri<<Sc

  Rb: R2..R15, R18..R31
  Ri: R0, R2..R14, R18..R31

  R0/R1: Used to encode special cases
  R15 (SP): Only usable as a base register
  R16/R17: Reserved (for the possibility of more special cases)

  Sc: Typically hard-wired based on the element type

  Base registers encoded via special cases:
    PC, GBR, TBR
  Index registers encoded via special cases:
    R0, Sc=0 (Allows access to misaligned struct members)

However, since pretty much all this is handled in the decoder, I don't
really feel it qualifies as distinct address modes (and, as far as the
AGU is concerned, (Rb+Ri<<Sc) is the only mode).

Generally, (Rb) and (Rb,0) are treated as equivalent.



Given the lack of autoincrement modes or similar, of the separate PC and
PC&(~3) addressing forms, ... BJX2 in practice actually has fewer
distinct addressing modes than SuperH.

The SuperH ISA also had a few edge cases where multiple memory accesses
would be performed by a single instruction (and other violations of
Load/Store), but was generally classified as a RISC.


One thing BJX2 does have, that many RISCs lack, is a LEA instruction.
LEA generally reuses the encodings where a zero-extended store would not
make sense, so these cases are interpreted as LEA instead.

This was mostly because LEA helps in various edge cases, such as for:
Composing function pointers;
Long-distance branches;
Compound addressing within structures;
Eg: Accessing an element within an array within a structure.
...
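
For the array-within-a-structure case, in C terms (illustrative):

struct T { int hdr[4]; int arr[16]; };

int *elem(struct T *t, int i) {
    /* &t->arr[i] == t + 16 + i*4; a LEA can form (t + 16) without
       doing a load, leaving a single indexed access (Rb + Ri << 2) */
    return &t->arr[i];
}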

BGB

unread,
May 7, 2021, 11:25:17 PM5/7/21
to
Or: Why did I not keep auto-increment in my ISAs...

At best, they don't buy much; at worst, one has to produce multiple
register writes from the same instruction, which adds complexity, and
(in a scalar implementation) is likely to also require a pipeline
interlock to fake the presence of an ADD instruction...

It is "slightly less bad" if using a hard-wired register via a
side-channel (such as PUSH/POP and SP), but as can be noted, PUSH/POP no
longer exists in my ISA either (they didn't "bring home the bacon"
enough to justify the costs of their continued existence).

Related:
"Why are there no branch delay slots?"
...

Marcus

unread,
May 8, 2021, 6:33:40 AM5/8/21
to
I can think of a few more advantages, such as the normalized
representation that simplifies the implementation of many algorithms
(e.g. sqrt, reciprocals, etc). You also get the nice property that
multiplication produces a result that has the same storage size as the
inputs, unlike integer multiplication where you need twice the size
for the output compared to the input (which has led to many different
special solutions in ISA:s over the years, like the HI/LO registers
in MIPS). FP became popular in DSP:s mostly to eliminate the hassles of
avoiding overflow and underflow in fixed point arithmetic. And so on
(many of the advantages are of course side effects of the increased
dynamic range - but I wouldn't go as far as to say that it's the only
reason to use or implement floating-point).
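
Concretely, the width point (in C):

#include <stdint.h>

uint64_t imul_wide(uint32_t a, uint32_t b) {
    return (uint64_t)a * b;   /* the exact product needs twice the bits */
}

float fmul_same(float a, float b) {
    return a * b;             /* the rounded product keeps the input size */
}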

/Marcus

MitchAlsup

unread,
May 8, 2021, 10:17:31 AM5/8/21
to
Auto INCs/DECs often occur several in a row, and a good value
propagator can find these and perform several LDs/STs and
compress the auto INCs/DECs arithmetic into a single add/sub.
>
> It is "slightly less bad" if using a hard-wired register via a
> side-channel (such as PUSH/POP and SP), but as can be noted, PUSH/POP no
> longer exists in my ISA either (they didn't "bring home the bacon"
> enough to justify the costs of their continued existence).
>
> Related:
> "Why are there no branch delay slots?"
<
With a 20% branch density and a single cycle of branch-target latency,
branch delay slots can save that cycle--well, closer to 50% of that
cycle, as only 50% of BDSs are filled doing useful work, while up to 70%
are filled with something other than NoOps.

In anything other than a 1-wide simple pipeline, BDSs are a useless
complexity that gets in the way of other things one wants to do in the
pipeline. For example, by FETCHing 4-wide and scanning forward in
an instruction buffer, one can find branches long before they become
"executed" and have the target instructions arrive before the branch
is through DECODE. Thus, unconditional branches may take 0
(zero nada zilch no) cycles even without BDSs. This affords many
opportunities to reduce the penalty for branches. Also, in wider
implementations, BDSs are only a complexity that does not save
execution cycles, and they make the rest of the HW considerably more
complex.
<
> ...

MitchAlsup

unread,
May 8, 2021, 10:19:27 AM5/8/21
to
On Saturday, May 8, 2021 at 5:33:40 AM UTC-5, Marcus wrote:
> On 2021-05-07, Stephen Fuld wrote:
> > On 5/7/2021 9:54 AM, Marcus wrote:
> >> On 2021-05-07, Terje Mathisen wrote:
> >
> > snip
> >
> >>> FP8 could be either 1:3:4 or 1:4:3, but for full ieee compliance you
> >>> really want at least 2n+3 bits in the mantissa when going up in size,
> >>> so your choice is the best one imho.
> >>>
> >>> Having 2n+3 means that you never get into double rounding problems by
> >>> doing any single operation in the larger size, then rounding back
> >>> down to the target precision.
> >>
> >> Good to know - I didn't think of that. I mostly went with the
> >> configuration that made the most sense: The main advantage (IMO) of
> >> floating-point numbers compared to fixed point numbers is the increased
> >> dynamic range,
> >
> > Isn't that the *only* reason to do floating point?
> >
> I can think of a few more advantages, such as the normalized
> representation that simplifies the implementation of many algorithms
> (e.g. sqrt, reciprocals, etc). You also get the nice property that
> multiplication produces a result that has the same storage size as the
> inputs, unlike integer multiplication where you need twice the size
<
There are just about as many uses for a FP multiply to produce an
exact result as there are for integer multiplies.

John Levine

unread,
May 8, 2021, 11:31:18 AM5/8/21
to
According to Marcus <m.de...@this.bitsnbites.eu>:
>>> Good to know - I didn't think of that. I mostly went with the
>>> configuration that made the most sense: The main advantage (IMO) of
>>> floating-point numbers compared to fixed point numbers is the increased
>>> dynamic range,
>>
>> Isn't that the *only* reason to do floating point?
>
>I can think of a few more advantages, such as the normalized
>representation that simplifies the implementation of many algorithms ...

The huge advantage is that it takes care of scaling automatically so
the programmer doesn't have to worry about it in every operation.

When people were designing early computers in the 1940s, they knew
about floating point but didn't include it because they thought that
the programmer could easily keep track of the scale, so the extra
hardware complexity wasn't worth it. This was true if the programmer
was John von Neumann, not so true otherwise, so FP hardware showed up
on the 704 in the mid-1950s.


> You also get the nice property that
>multiplication produces a result that has the same storage size as the
>inputs, unlike integer multiplication where you need twice the size

Well, no. The fraction part of a product is twice as long as the
fraction parts of the inputs, but FP hardware can helpfully round it if
you're lucky, truncate it if you aren't. On S/360, floating multiply
took float inputs and produced a double output. S/370 added double
to quad. Beyond that, I guess you were on your own.





--
Regards,
John Levine, jo...@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

Stephen Fuld

unread,
May 8, 2021, 12:33:51 PM5/8/21
to
On 5/8/2021 8:31 AM, John Levine wrote:
> According to Marcus <m.de...@this.bitsnbites.eu>:
>>>> Good to know - I didn't think of that. I mostly went with the
>>>> configuration that made the most sense: The main advantage (IMO) of
>>>> floating-point numbers compared to fixed point numbers is the increased
>>>> dynamic range,
>>>
>>> Isn't that the *only* reason to do floating point?
>>
>> I can think of a few more advantages, such as the normalized
>> representation that simplifies the implementation of many algorithms ...
>
> The huge advantage is that it takes care of scaling automatically so
> the programmer doesn't have to worry about it in every operation.
>
> When people were designing early computers in the 1940s, they knew
> about floating point but didn't include it because they thought that
> the programer could easily keep track of the scale so the extra
> hardware complexity wasn't worth it. This was true if the programmer
> was John von Neumann, not so true otherwise, so FP hardware showed up
> on the 704 in the early 1950s.

While all of that is true, there were other alternatives. COBOL
supported, and still does, automatic scaling of fixed point numbers. I
don't know if other languages support this.

Of course, the time period you were discussing was before high level
languages, and writing the code in assembler was probably "a bridge too
far" for most programmers at the time.

Anton Ertl

unread,
May 8, 2021, 1:05:37 PM5/8/21
to
Thomas Koenig <tko...@netcologne.de> writes:
>Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
>
>> ARM, HPPA, Power, and Aarch64 have loads and stores with
>> autoincrement/decrement and a considered to be RISCs by most.
>
>And at least for POWER, they are a bit different (and more
>powerful):
>
> lbzu ra,D(rb)
>
>will, for example, load from ra from rb + D and store rb + d into
>rb afterwards.

If D is a constant, that's like Aarch64's and ARM's pre-indexed
addressing mode.

>However, it is better not to use them:
>
># In some implementations, the Load Algebraic and
># Load with Update instructions may have greater
># latency than other types of Load instructions. More-
># over, Load with Update instructions may take lon-
># ger to execute in some implementations than the
># corresponding pair of a non-update Load instruc-
># tion and an Add instruction.

Interestingly, the ARMv8 Instruction Set Overview
<http://www.cs.princeton.edu/courses/archive/spr19/cos217/reading/ArmInstructionSetOverview.pdf> says:

|3.3.1 Register Indexed Addressing
|
|The A64 instruction set extends on 32-bit T32 addressing modes,
|allowing a 64-bit index register to be added to the 64-bit base
|register, with optional scaling of the index by the access
|size. Additionally it provides for sign or zero-extension of a 32-bit
|value within an index register, again with optional scaling.
|
|These register index addressing modes provide a useful performance
|gain if they can be performed within a single cycle, and it is
|believed that at least some implementations will be able to do
|this. However, based on implementation experience with AArch32, it is
|expected that other implementations will need an additional cycle to
|execute such addressing modes.
|
|Rationale: The architects intend that implementations should be free
|to fine-tune the performance trade-offs within each implementation,
|and note that providing an instruction which in some implementations
|takes two cycles, is preferable to requiring the dynamic grouping of
|two independent instructions in an implementation that can perform
|this address arithmetic in a single cycle.

Stefan Monnier

unread,
May 8, 2021, 1:34:22 PM5/8/21
to
>>>> Good to know - I didn't think of that. I mostly went with the
>>>> configuration that made the most sense: The main advantage (IMO) of
>>>> floating-point numbers compared to fixed point numbers is the increased
>>>> dynamic range,
>>> Isn't that the *only* reason to do floating point?
>>I can think of a few more advantages, such as the normalized
>>representation that simplifies the implementation of many algorithms ...
> The huge advantage is that it takes care of scaling automatically so
> the programmer doesn't have to worry about it in every operation.

While this is mostly true in general, the comments above were made in
the context of an 8-bit floating-point format, where the exponent's range
is sufficiently limited that it is still something the programmer very
much has to worry about.


Stefan

Ivan Godard

unread,
May 8, 2021, 2:53:46 PM5/8/21
to
On 5/8/2021 7:19 AM, MitchAlsup wrote:
> On Saturday, May 8, 2021 at 5:33:40 AM UTC-5, Marcus wrote:
>> On 2021-05-07, Stephen Fuld wrote:
>>> On 5/7/2021 9:54 AM, Marcus wrote:
>>>> On 2021-05-07, Terje Mathisen wrote:
>>>
>>> snip
>>>
>>>>> FP8 could be either 1:3:4 or 1:4:3, but for full ieee compliance you
>>>>> really want at least 2n+3 bits in the mantissa when going up in size,
>>>>> so your choice is the best one imho.
>>>>>
>>>>> Having 2n+3 means that you never get into double rounding problems by
>>>>> doing any single operation in the larger size, then rounding back
>>>>> down to the target precision.
>>>>
>>>> Good to know - I didn't think of that. I mostly went with the
>>>> configuration that made the most sense: The main advantage (IMO) of
>>>> floating-point numbers compared to fixed point numbers is the increased
>>>> dynamic range,
>>>
>>> Isn't that the *only* reason to do floating point?
>>>
>> I can think of a few more advantages, such as the normalized
>> representation that simplifies the implementation of many algorithms
>> (e.g. sqrt, reciprocals, etc). You also get the nice property that
>> multiplication produces a result that has the same storage size as the
>> inputs, unlike integer multiplication where you need twice the size
> <
> There are just about as many uses for a FP multiply to produce an
> exact result as there are for integer multiplies.

I always liked the B6500 arithmetic, which had only one instruction for
both integer and FP - if the result fit in the integral value set you
got an integer, and an FP if not, and all ops didn't care which they got.
IIRC, the largest integer had 41 bits, sign-magnitude.

John Levine

unread,
May 8, 2021, 3:19:12 PM5/8/21
to
According to Stephen Fuld <sf...@alumni.cmu.edu.invalid>:
>> The huge advantage is that it takes care of scaling automatically so
>> the programmer doesn't have to worry about it in every operation.

>While all of that is true, there were other alternatives. COBOL
>supported, and still does, automatic scaling of fixed point numbers. I
>don't know if other languages support this.

COBOL gives you fixed scaling, e.g. PIC 9999V999 has four digits before
the implicit decimal point and three after. When you do arithmetic
it'll align the decimal point, but you don't get automatic scaling
unless you tell the compiler that the data is COMP-1 or COMP-2 so
it uses the internal floating point representation.
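
The PIC 9999V999 style amounts to fixed-point with an implicit scale
factor; in C terms, something like (a sketch, not what COBOL compilers
actually emit):

typedef long long fix3;                 /* value held as value * 1000 */

fix3 fix3_add(fix3 a, fix3 b) { return a + b; }      /* same scale */

fix3 fix2_to_fix3(long long v2) { return v2 * 10; }  /* align V99 to V999 */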

Scaling crisis of the day:

The NASDAQ stock exchange's computers represent prices as 32 bit
unsigned integers, with units being 1/100 of a cent or 1/10000 of a
dollar. Since the largest 32 bit integer is 4294967295 the highest
price it can represnt is $429,496.7295. The price of Berkshire
Hathaway reached $437,131 this week. Oops. They say they'll have a
fix later this month.
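
A quick C check of that limit:

#include <stdio.h>

int main(void) {
    unsigned p = 4294967295u;                    /* largest 32-bit price */
    printf("$%u.%04u\n", p / 10000, p % 10000);  /* prints $429496.7295  */
    return 0;
}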

The next highest price is about $5000, and Berkshire's CEO Warren
Buffett has said for decades that he'll never split the shares like
everyone else does.

BGB

unread,
May 8, 2021, 5:14:17 PM5/8/21
to
Yeah...

Exponent:
15: Inf / NaN
14: 128.0 .. 256.0
13: 64.0 .. 128.0
12: 32.0 .. 64.0
11: 16.0 .. 32.0
10: 8.0 .. 16.0
9: 4.0 .. 8.0
8: 2.0 .. 4.0
7: 1.0 .. 2.0
6: 0.5 .. 1.0
5: 0.25 .. 0.5
4: 0.125 .. 0.25
3: 0.063 .. 0.125
2: 0.031 .. 0.063
1: 0.016 .. 0.031
0: Zero / Denormal


In the FP8S variant, you have precision of, say:
...
0.500, 0.563, 0.625, 0.688, 0.750, 0.813, 0.875, 0.938
1.000, 1.125, 1.250, 1.375, 1.500, 1.625, 1.750, 1.875
2.000, 2.250, 2.500, 2.750, 3.000, 3.250, 3.500, 3.750
...
128, 144, 160, 176, 192, 208, 224, 240

Or, compared with conventional formats, its precision and dynamic range
are kinda garbage.


But, for certain types of tasks, it is still sufficient.
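
For reference, the table above corresponds to decoding FP8S as 1:4:3
with an exponent bias of 7, roughly (my own sketch; NaN handling and
denormals glossed over):

#include <math.h>
#include <stdint.h>

float fp8s_to_float(uint8_t v) {
    int s = v >> 7, e = (v >> 3) & 15, m = v & 7;
    if (e == 0)  return s ? -0.0f : 0.0f;            /* zero / denormal */
    if (e == 15) return s ? -INFINITY : INFINITY;    /* Inf / NaN       */
    float f = (1.0f + m / 8.0f) * ldexpf(1.0f, e - 7);
    return s ? -f : f;
}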

Thomas Koenig

unread,
May 9, 2021, 5:28:07 AM5/9/21
to
BGB <cr8...@gmail.com> schrieb:

> IMUL (Lane 1)
> 32*32 -> 64

Do you have two instructions (one for signed and one for unsigned)
or three (one for the lower half, one for signed high, one for
unsigned high)? The latter version could save you some ALU
complexity and some latency in the (probably common) case where
only a 32*32 multiplication is needed, at the cost of added
instructions for the 32*32-> 64 bit case.

MitchAlsup

unread,
May 9, 2021, 10:51:03 AM5/9/21
to
In My 66000's case, 64×64->64 is the base function and comes in
signed and unsigned varietals. When one wants 64×64->128, one
uses the CARRY instruction-modifier. CARRY provides access to
all forms of wider-than-std operations {Shifts, ADD, SUB, IMUL, UMUL,
IDIV, UDIV, FADD, FSUB, FMUL, and Kahan-Babuška Summation}

So one adds but a single instruction-modifier and gets 2 handfuls
of instructions.

Anton Ertl

unread,
May 9, 2021, 2:20:57 PM5/9/21
to
Not really sure what you mean by the latter. In the early 1950s,
writing programs included steps that most of us don't know about these
days, such as layout of instructions on the rotating memory for
performance, where each instruction included the address of the next
one. What we see as machine code since the 1960s does not have to
deal with these complications, assembly language is even higher level,
Fortran I higher level yet.

As for fixed vs. floating point, I guess that is a cross-cutting
concern. Sure you can argue that, if you spend a lot of time on
low-level steps such as coding layout, spending some time on range
analysis is minor change, so fixed point is acceptable.

OTOH, if you have a nice numerical routine for a certain fixed-point
number range and notice that you need the same computation, but for a
different numeric range, do you want to repeat all the low-level work?
Ok, maybe you can change just a few immediate values and leave the
code as-is otherwise, but if the code includes optimizations based on
the knowledge of the values, that's not possible. In such a scenario,
floating point offers advantages, just as it does now.

What held back floating point for a long time was the slowness and/or
high cost of FP hardware, but at least in general-purpose computers
that's a thing of the past.

Marcus

unread,
May 9, 2021, 2:21:34 PM5/9/21
to
Yes that makes sense. Specifically for FP8, the increased dynamic range
is by far the most important trait. I've even seen examples of using
1:5:2 in DNN:s (deep neural networks) where dynamic range is often more
important than precision.

>
> Stefan
>

/Marcus

Marcus

unread,
May 9, 2021, 2:21:42 PM5/9/21
to
MRISC32:
  Rb + Imm

MRISC32 also has LEA (called "LDEA" - LoaD Effective Address). It
essentially uses the output from the AGU as the result (bypassing
the load-from-memory stage), and has a few different use cases.

Apart from preloading memory addresses into registers, it can be used
for simple arithmetic of the form A + B << N (where N is 0, 1, 2 or 3),
which can be used for implementing x * 3 and x * 5, for instance.

Another very useful use case is to load vector registers with a
"stride". In vector mode the AGU will generate addresses on the
form:

(1) addr[k] = Rb + Im * k (Immediate form stride)
(2) addr[k] = Rb + (Ri << Sc) * k (Register form stride)
(3) addr[k] = Rb + Vi[k] << Sc (Gather/scatter)

Forms (1) and (2) construct addresses with a constant offset
between each address (I call these "stride based" load/store).

The LDEA instruction can thus be used for loading a series of
numbers into a vector register, e.g. like this:

LDEA V1, Z, #1 ; V1 = {0, 1, 2, 3, 4, 5, ...}

LDI S1, #7
LDEA V1, S1, #3 ; V1 = {7, 10, 13, 16, 19, ...}

This is a common operation in vectorized code, and on classic SIMD
ISA:s you usually load a predefined constant (e.g. from memory) in this
case.

Since the MRISC32 ISA allows each implementation to define the vector
register size, using predefined vector constants is not a good solution
since the vector register size is not known at compile time.

/Marcus

Stephen Fuld

unread,
May 9, 2021, 2:21:44 PM5/9/21
to
On 5/8/2021 12:19 PM, John Levine wrote:
> According to Stephen Fuld <sf...@alumni.cmu.edu.invalid>:
>>> The huge advantage is that it takes care of scaling automatically so
>>> the programmer doesn't have to worry about it in every operation.
>
>> While all of that is true, there were other alternatives. COBOL
>> supported, and still does, automatic scaling of fixed point numbers. I
>> don't know if other languages support this.
>
> COBOL gives you fixed scaling, e.g. PIC 9999V999 has four digits before
> the implicit decimal point and three after. When you do arithmetic
> it'll align the decimal point, but you don't get automatic scaling
> unless you tell the compiler that the data is COMP-1 or COMP-2 so
> it uses the internal floating point representation.

Again, true, of course. I guess I thought the "automatic scaling" was
included in the "dynamic" part of "dynamic range" that Marcus mentioned
earlier. But of course YMMV.



> Scaling crisis of the day:
>
> The NASDAQ stock exchange's computers represent prices as 32 bit
> unsigned integers, with units being 1/100 of a cent or 1/10000 of a
> dollar. Since the largest 32 bit integer is 4294967295 the highest
> price it can represnt is $429,496.7295. The price of Berkshire
> Hathaway reached $437,131 this week. Oops. They say they'll have a
> fix later this month.

Wow! Interesting. I wonder what the fix will be?


> The next highest price is about $5000, and Berkshire's CEO Warren
> Buffett has said for decades that he'll never split the shares like
> everyone else does.

Thanks, John. I didn't know that.

Stefan Monnier

unread,
May 9, 2021, 2:35:30 PM5/9/21
to
> Yes that makes sense. Specifically for FP8, the increased dynamic range
> is by far the most important trait. I've even seen examples of using
> 1:5:2 in DNN:s (deep neural networks) where dynamic range is often more
> important than precision.

Indeed 1:5:2 is arguably more useful than 1:4:3.


Stefan

Stefan Monnier

unread,
May 9, 2021, 2:45:06 PM5/9/21
to
An alternative of course is to use a log-base representation. So your
8bit data is split 1:7 between a sign and an exponent, with no mantissa
at all, and then you choose your dynamic range by picking the base.
Multiplication is easy but addition takes more effort.
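
Sketch of the easy part, with an assumed bias of 64 and the sign in the
top bit (both choices mine):

#include <stdint.h>

uint8_t lmul(uint8_t x, uint8_t y) {
    uint8_t s = (x ^ y) & 0x80;    /* sign of product = XOR of signs */
    /* multiply == add the exponents; overflow/saturation omitted */
    return s | (((x & 0x7F) + (y & 0x7F) - 64) & 0x7F);
}

It's the addition of two such numbers that needs the table.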

But if those FP8 are only used for transport and always converted to
FP16 or FP32 before actual computations, then the ease of
addition/multiplication is irrelevant and the only worry is how to
convert to/from that (and converting from FP8 is easy since it's small
enough to use a lookup table).


Stefan

MitchAlsup

unread,
May 9, 2021, 4:50:08 PM5/9/21
to
On Sunday, May 9, 2021 at 1:21:44 PM UTC-5, Stephen Fuld wrote:
> On 5/8/2021 12:19 PM, John Levine wrote:
> > According to Stephen Fuld <sf...@alumni.cmu.edu.invalid>:
> >>> The huge advantage is that it takes care of scaling automatically so
> >>> the programmer doesn't have to worry about it in every operation.
> >
> >> While all of that is true, there were other alternatives. COBOL
> >> supported, and still does, automatic scaling of fixed point numbers. I
> >> don't know if other languages support this.
> >
> > COBOL gives you fixed scaling, e.g. PIC 9999V999 has four digits before
> > the implicit decimal point and three after. When you do arithmetic
> > it'll align the decimal point, but you don't get automatic scaling
> > unless you tell the compiler that the data is COMP-1 or COMP-2 so
> > it uses the internal floating point representation.
> Again, true, of course. I guess I thought the "automatic scaling" is
> included in the "dynamic" part of "dynamic range" that Marcus mentioned
> earlier. But of course YMMV.
> > Scaling crisis of the day:
> >
> > The NASDAQ stock exchange's computers represent prices as 32 bit
> > unsigned integers, with units being 1/100 of a cent or 1/10000 of a
> > dollar. Since the largest 32 bit integer is 4294967295 the highest
> > price it can represnt is $429,496.7295. The price of Berkshire
> > Hathaway reached $437,131 this week. Oops. They say they'll have a
> > fix later this month.
<
> Wow! Interesting. I wonder what the fix will be?
<
The only way to get BH to split their stock would be the threat of delisting.
<
> > The next highest price is about $5000, and Berkshire's CEO Warren
> > Buffett has said for decades that he'll never split the shares like
> > everyone else does.
<
Unless someone forces his hand.

MitchAlsup

unread,
May 9, 2021, 4:53:04 PM5/9/21
to
On Sunday, May 9, 2021 at 1:20:57 PM UTC-5, Anton Ertl wrote:
> Stephen Fuld <sf...@alumni.cmu.edu.invalid> writes:
> >On 5/8/2021 8:31 AM, John Levine wrote:
> >> When people were designing early computers in the 1940s, they knew
> >> about floating point but didn't include it because they thought that
> >> the programer could easily keep track of the scale so the extra
> >> hardware complexity wasn't worth it. This was true if the programmer
> >> was John von Neumann, not so true otherwise, so FP hardware showed up
> >> on the 704 in the early 1950s.
> >
> >While all of that is true, there were other alternatives. COBOL
> >supported, and still does, automatic scaling of fixed point numbers. I
> >don't know if other languages support this.
> >
> >Of course, the time period you were discussing was before high level
> >languages, and writing the code in assembler was probably "a bridge too
> >far" for most programmers at the time.
<
> Not really sure what you mean with the latter. In the early 1950s,
> writing programs included steps that most of us don't know about these
> days, such as layout of instructions on the rotating memory for
> performance, where each instruction included the address of the next
> one. What we see as machine code since the 1960s does not have to
> deal with these complications, assembly language is even higher level,
> Fortran I higher level yet.
<
Before that we had "patch panels" where the program was expressed in WIREs
>
> As for fixed vs. floating point, I guess that is a cross-cutting
> concern. Sure you can argue that, if you spend a lot of time on
> low-level steps such as coding layout, spending some time on range
> analysis is minor change, so fixed point is acceptable.
>
> OTOH, if you have a nice numerical routine for a certain fixed-point
> number range and notice that you need the same computation, but for a
> different numeric range, do you want to repeat all the low-level work?
> Ok, maybe you can change just a few immediate values and leave the
> code as-is otherwise, but if the code includes optimizations based on
> the knowledge of the values, that's not possible. In such a scenario,
> floating point offers advantages, just as it does now.
<
My how lazy we have become ! Even with high quality FP, numerical
analysis is still mandatory--except that it is seldom performed !?!
>
> What held back floating point for a long time was the slowness and/or
> high cost of FP hardware, but at least in general-purpose computers
> that's a thing of the past.
<
Since mid 1980s

MitchAlsup

unread,
May 9, 2021, 4:53:47 PM5/9/21
to
Useful precision is often 1-bit (yes or no)
>
> >
> > Stefan
> >
>
> /Marcus

MitchAlsup

unread,
May 9, 2021, 4:59:43 PM5/9/21
to
How many branches are farther than 1/8 GB away with code compiled
from high level languages ??
<
> > Compound addressing within structures;
> > Eg: Accessing an element within an array within a structure.
> > ...
> >
> MRISC32 also has LEA (called "LDEA" - LoaD Effective Address). It
> essentially uses the output from the AGU as the result (bypassing
> the load-from-memory stage), and has a few different use cases.
>
I know one compiler writer that would enunciate "LEA" as 'Lee'-'ah'
>
> Apart from preloading memory addresses into registers, it can be used
> for simple arithmetic on the form A + B << N (where N is 0, 1, 2 or 3),
> which can be used for implementing x * 3 and x * 5, for instance.
>
Better still is to make integer multiply 3-cycles.
>
> Another very useful use case is to load vector registers with a
> "stride". In vector mode the AGU will generate addresses on the
> form:
>
> (1) addr[k] = Rb + Im * k (Immediate form stride)
> (2) addr[k] = Rb + (Ri << Sc) * k (Register form stride)
> (3) addr[k] = Rb + Vi[k] << Sc (Gather/scatter)
>
Both stride and gather forms fall out "for free" in VVM. {So, I am
agreeing with you that LEA is a valuable instruction.}
>
> Forms (1) and (2) construct addresses with a constant offset
> between each address (I call these "stride based" load/store).
<
Don't forget the stride-0 form in GPUs where every thread wants to
bang on the auto-update memory reference location.

MitchAlsup

unread,
May 9, 2021, 5:01:19 PM5/9/21
to
On Sunday, May 9, 2021 at 1:45:06 PM UTC-5, Stefan Monnier wrote:
> Stefan Monnier [2021-05-09 14:35:28] wrote:
> >> Yes that makes sense. Specifically for FP8, the increased dynamic range
> >> is by far the most important trait. I've even seen examples of using
> >> 1:5:2 in DNN:s (deep neural networks) where dynamic range is often more
> >> important than precision.
> > Indeed 1:5:2 is arguably more useful than 1:4:3.
> An alternative of course is to use a log-base representation. So your
> 8bit data is split 1:7 between a sign and an exponent, with no mantissa
> at all and then you choose your dynamic range by picking the base.
> Multiplication is easy but addition is takes more effort.
<
Really ?!?!?
Both are implemented with table lookups, with the concatenated values as
the address.

Stefan Monnier

unread,
May 9, 2021, 5:10:39 PM5/9/21
to
>> Wow! Interesting. I wonder what the fix will be?
> The only way to get BH to split their stock would be the threat of delisting.

They could also threaten to let it wrap around ;-)


Stefan

Stefan Monnier

unread,
May 9, 2021, 5:16:17 PM5/9/21
to
>> An alternative of course is to use a log-base representation. So your
>> 8bit data is split 1:7 between a sign and an exponent, with no mantissa
>> at all and then you choose your dynamic range by picking the base.
>> Multiplication is easy but addition is takes more effort.
> Really ?!?!?
> Both are implemented with table look ups with concatenated values as the
> address.

Really? A 64k entry table seems rather expensive for such
a functionality. Then again, if the output is FP8, maybe it can be
implemented as just a big mess of auto-generated and+or+not since many
inputs map to the same output.


Stefan

BGB

unread,
May 9, 2021, 5:34:56 PM5/9/21
to
The normal direct branch op in BJX2 has a 20-bit displacement, so can
reach +/- 1MB.

A Jumbo+LEA can do +/- 2GB, so:
LEA.B (PC, Disp33s), R3
JMP R3

But, this is also possible (+/- 32MB):
MOV #Imm25s, R0
BRA R0


In general, these don't really come up at present because, of my test
programs, all of them have < 1MB in the ".text" section.


>>> Compound addressing within structures;
>>> Eg: Accessing an element within an array within a structure.
>>> ...
>>>
>> MRISC32 also has LEA (called "LDEA" - LoaD Effective Address). It
>> essentially uses the output from the AGU as the result (bypassing
>> the load-from-memory stage), and has a few different use cases.
>>
> I know one compiler writer that would enunciate "LEA" as 'Lee'-'ah'

I always thought of it like "Lee".

MitchAlsup

unread,
May 9, 2021, 7:03:11 PM5/9/21
to
The IBM 1620 ("Can't Add, Doesn't Even Try") laid the groundwork: if the
operands are small enough, it becomes cheaper to just read out the correct
result than to build the logic to calculate the result.
>
>
> Stefan

MitchAlsup

unread,
May 9, 2021, 7:08:34 PM5/9/21
to
On Sunday, May 9, 2021 at 4:34:56 PM UTC-5, BGB wrote:
> On 5/9/2021 3:59 PM, MitchAlsup wrote:

> > <
> > How many branches are farther than 1/8 GB away with code compiled
> > from high level languages ??
> > <
> The normal direct branch op in BJX2 has a 20-bit displacement, so can
> reach +/- 1MB.
>
> A Jumbo+LEA can do +/- 2GB, so:
> LEA.B (PC, Disp33s), R3
> JMP R3
>
> But, this is also possible (+/- 32MB):
> MOV #Imm25s, R0
> BRA R0
>
My 66000 has 16-bit actual word displacement (18-bit byte address) for
conditional branches and 26-bit unconditional branches (and calls) (28-bit
effective range).

But JMP instructions can have 32-bit or 64-bit immediates, so you
predicate over the JMP with an inverted condition.
>
> In general, these don't really come up at present because, of my test
> programs, all of them have < 1MB in the ".text" section.
<
In practice, few branches (<1%) are of those dimensions.
<
> >>> Compound addressing within structures;
> >>> Eg: Accessing an element within an array within a structure.
> >>> ...
> >>>
> >> MRISC32 also has LEA (called "LDEA" - LoaD Effective Address). It
> >> essentially uses the output from the AGU as the result (bypassing
> >> the load-from-memory stage), and has a few different use cases.
> >>
> > I know one compiler writer that would enunciate "LEA" as 'Lee'-'ah'
> I always thought of it like "Lee".
<
It was funny to hear him say "lee-ah", but he was from Russia so I gave him
a break.

John Levine

unread,
May 9, 2021, 11:07:44 PM5/9/21
to
According to Stefan Monnier <mon...@iro.umontreal.ca>:
>>> Wow! Interesting. I wonder what the fix will be?
>> The only way to get BH to split their stock would be the threat of delisting.

BH is listed on the NYSE which uses a different internal
representation. This is solely the NASDAQ's problem.

Buffett is 90 years old and has named a successor. They also have
class B shares which are worth 1/1500 of a class A share but only have
1/10000 of a class A vote. The obvious thing to do once Buffett is no
longer on the scene is to convert everything to class B so the price
is in the $300 range and all shares have the same vote.

>They could also threaten to let it wrap around ;-)

That had occurred to me.

Several decades ago I recall a somewhat similar issue when BH was the
first listed stock whose price exceeded $10,000. At that point the
problem was stock price tables were printed in newspapers, and the
typesetting software a lot of papers used wasn't expecting five digit
prices. They fixed that somehow, too.

BGB

unread,
May 9, 2021, 11:22:49 PM5/9/21
to
There are actually more cases:
MULS: 32*32->32, Result is sign-extended from low 32 bits
MULU: 32*32->32, Result is zero-extended from low 32 bits
DMULS: 32*32->64, 64-bit signed result
DMULU: 32*32->64, 64-bit unsigned result

The former give the typical behaviors one expects in C; the latter give
the widened results.

These exist as 3R forms, so:
DMULU R4, R5, R7 // R7 = R4 * R5


Originally, there were also multiply ops which multiplied two inputs and
then stored a pair of results in R0 and R1, more like the original
SuperH multiply ops, but I dropped these for various reasons.

There are cases where DMACS or DMACU instructions could be useful:
DMACU R4, R5, R7 // R7 = R4 * R5 + R7

But, I don't currently have these.


Eg (64-bit signed multiply):
SHADQ R4, -32, R6 | SHADQ R5, -32, R7 //1c
DMULS R4, R7, R16 //1c
DMACS R5, R6, R16 //3c (2c penalty)
SHADQ R16, 32, R2 //3c (2c penalty)
DMACU R4, R5, R2 //1c
RTS

Though, while fewer instructions than the current form, the above
construction would still be pretty bad in terms of interlock penalties.

SHADQ R4, -32, R6 | SHADQ R5, -32, R7 //1c
DMULS R5, R6, R16 //1c
DMULS R4, R7, R17 //1c
DMULU R4, R5, R18 //1c
ADD R16, R17, R19 //2c (1c penalty, DMULS R17)
SHADQ R19, 32, R2 //2c (1c penalty, ADD R19)
ADD R18, R2 //1c
RTS

Both cases would have approximately the same clock-cycle count (assuming
both cases have a 3-cycle latency).
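
For reference, the decomposition both sequences implement is just (in C;
for the low 64 bits, the signedness of the cross terms washes out):

#include <stdint.h>

uint64_t mul64(uint64_t a, uint64_t b) {
    uint64_t al = (uint32_t)a, ah = a >> 32;
    uint64_t bl = (uint32_t)b, bh = b >> 32;
    /* low product plus the two cross terms; the high*high term
       falls entirely outside the low 64 bits */
    return al * bl + ((al * bh + ah * bl) << 32);
}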

( Where recently, I have gotten around to modifying things such that the
multiplier is now fully pipelined... )



Otherwise, my time recently has mostly been consumed by debugging...

Then trying to see if I can get stuff to pass timing at 75MHz again
(hopefully without wrecking stuff quite as badly this time). This
sub-effort also revealed a few bugs (*), though there are still some
bugs I have yet to resolve...

*: Eg, after boosting the core to 75MHz while leaving the MMIO bus at
50MHz, stuff was breaking in simulation due to the L2 Ringbus <->
MMIO-Bus bridge not waiting for the MMIO-Bus to return to a READY state
before returning back to an idle state.

It was then possible for the response to travel the rings and get back
to the L1, which then allows execution to continue, with the CPU core
then issuing another MMIO request, which then travels along the rings
back to the MMIO bridge, in less time than it took for the 'OK -> READY'
transition to happen on the MMIO bus...

The way the bridge was designed, it would then try to initiate a
request, see that the MMIO-Bus state was 'OK', and use whatever result
was present (losing the request or returning garbage).

This may have been happening at 50MHz as well, and could have possibly
been leading to some of the bugs I had seen.


Or such...

John Levine

unread,
May 10, 2021, 12:06:15 AM5/10/21
to
According to Anton Ertl <an...@mips.complang.tuwien.ac.at>:
>As for fixed vs. floating point, I guess that is a cross-cutting
>concern. Sure you can argue that, if you spend a lot of time on
>low-level steps such as coding layout, spending some time on range
>analysis is minor change, so fixed point is acceptable.

We forget how flaky and unreliable those early computers were. They
used orders of magnitude more components than the most complex
electronic systems ever had before. The Williams tubes used in the
early 1950s just barely worked and needed endless tuning and fiddling.
Even once they switched to core, tubes burned out, components went out
of spec, solder joints cracked, and uptime was measured in hours if
you were lucky, minutes if you weren't.

So a compelling reason to leave out floating point was that all the extra
logic made the computer even less reliable. It's impressive that they
risked it on the 704.

I have read that the practical limit on the length
of a FORTRAN program on the 704 was set by the fact that the compiler
had to compile it in one run between hardware failures.

>What held back floating point for a long time was the slowness and/or
>high cost of FP hardware, but at least in general-purpose computers
>that's a thing of the past.

On small systems, I suppose. Floating point was standard or at least a
widely provided option on large computers by the early 1960s.

Ivan Godard

unread,
May 10, 2021, 12:31:11 AM5/10/21
to
On 5/9/2021 9:06 PM, John Levine wrote:
> According to Anton Ertl <an...@mips.complang.tuwien.ac.at>:
>> As for fixed vs. floating point, I guess that is a cross-cutting
>> concern. Sure you can argue that, if you spend a lot of time on
>> low-level steps such as coding layout, spending some time on range
>> analysis is minor change, so fixed point is acceptable.
>
> We forget how flaky and unreliable those early computers were. They
> used orders of magnitude more components than the most complex
> electronic systems ever had before. The Williams tubes used in the
> early 1950s just barely worked and needed endless tuning and fiddling.
> Even once they switched to core, tubes burned out, components went out
> of spec, solder joints cracked, and uptime was measured in hours if
> you were lucky, minutes if you weren't.
>
> So a compelling reason to leave out floating point was that all the extra
> logic made the computer even less reliable. It's impressive that they
> risked it on the 704.
>
> I have read that the practical limit on the length
> of a FORTRAN program on the 704 was set by fact that the compiler
> had to compile it in one run between hardware failures.
>
>> What held back floating point for a long time was the slowness and/or
>> high cost of FP hardware, but at least in general-purpose computers
>> that's a thing of the past.
>
> On small systems, I suppose. Floating point was standard or at least a
> widely provided option on large computers by the early 1960s.
>

Leaving out functionality is not the only solution to flaky hardware.
B6500 system 101 had an MTTF of 8 minutes after Jake Vigil dumped a cup
of coffee into mem mod zero. So when system 102 came up, with an MTTF of
four hours, all the software team (all 23 of us for a new OS, five
compilers, and all the ancillaries) went over there. Which meant that
you couldn't get any time on it.

So I stayed on 101. And for me, and only me, it would stay up for half
an hour, which was enough to boot and get a compile through. Why me? I
remain convinced that it was because after every successful compile I
would pat the console D&D lights and thank it.

BGB

unread,
May 10, 2021, 1:02:13 AM5/10/21
to
On 5/9/2021 6:08 PM, MitchAlsup wrote:
> On Sunday, May 9, 2021 at 4:34:56 PM UTC-5, BGB wrote:
>> On 5/9/2021 3:59 PM, MitchAlsup wrote:
>
>>> <
>>> How many branches are farther than 1/8 GB away with code compiled
>>> from high level languages ??
>>> <
>> The normal direct branch op in BJX2 has a 20-bit displacement, so can
>> reach +/- 1MB.
>>
>> A Jumbo+LEA can do +/- 2GB, so:
>> LEA.B (PC, Disp33s), R3
>> JMP R3
>>
>> But, this is also possible (+/- 32MB):
>> MOV #Imm25s, R0
>> BRA R0
>>
> My 66000 has 16-bit actual word displacement (18-bit byte address) for
> conditional branches and 26-bit unconditional branches (and calls) (28-bit
> effective range).
>
> But JMP instructions can have 32-bit or 64-bit immediates so you predicate
> over the JMP with an inverted condition for the predicate.

All these cases are 20 bits in my case.

I had a few times considered eliminating the fixed BT/BF instructions in
favor of using the BRA?T and BRA?F encodings instead (semantically
equivalent), but haven't done so yet (partly inertia, partly I don't
like breaking binary compatibility unless there is a good reason).

Though, the main reason to do this would be to reclaim ~ 21 bits of
encoding space...


>>
>> In general, these don't really come up at present because, of my test
>> programs, all of them have < 1MB in the ".text" section.
> <
> In practice, few (<1%) are branches of those dimensions.
> <

In my current test programs, it rests at 0%, since any larger
branches tend to be indirect calls through a function pointer or similar
(*2).

*2: Though, one of these programs has ~ 1.2 MB of ".text" on an
x86-64 build (for both native Win64 and WSL), and 760K as BJX2 (in speed
optimized mode).



Similarly, it won't affect any local branches unless by some absurd
chance a *single function* were to exceed 1MB in the ".text" section.

Excluding the possibility of excessively large procedurally generated
"switch()" blocks or similar (eg: 10k+ case labels), this is unlikely.


But, these cases don't break the ISA, merely the limits of the existing
branch encodings. I could, in theory, add wider encodings, but then I
would need to deal with them in the pipeline (and making this part any
wider would be a big source of timing hassles).


Well, it is either this, or allow a "jumbo branch" encoding where, say:
Can branch pretty much anywhere in the address space;
Completely ignored by the branch predictor;
Takes ~ 11 or so clock cycles to perform said branch...


In this case, the branch would likely be routed through the ALU (as a
64-bit integer ADD), and then the ADD result is used as a branch
destination.

Where, say, the branch instruction is actually decoded as a hacked
64-bit ADD instruction, and then in EX2 or EX3, some logic is like "Hey,
we just added something to PC!" and invokes the register-indirect branch
mechanism using the ADD result.

But, a lot of this would be mostly because, as-is, there is basically no
good way for the existing PC-relative branch mechanisms to handle a
displacement this large...


Though, it would technically be a bit simpler/cheaper to add a special
case to allow for a Jumbo-encoded "BRA (Abs48)" or similar in this case,
which wouldn't require any new logic on the EX side of things (can jump
anywhere... Just requires using base relocs...).


>>>>> Compound addressing within structures;
>>>>> Eg: Accessing an element within an array within a structure.
>>>>> ...
>>>>>
>>>> MRISC32 also has LEA (called "LDEA" - LoaD Effective Address). It
>>>> essentially uses the output from the AGU as the result (bypassing
>>>> the load-from-memory stage), and has a few different use cases.
>>>>
>>> I know one compiler writer that would enunciate "LEA" as 'Lee'-'ah'
>> I always thought of it like "Lee".
> <
> It was funny to hear him say "lee-ah", but he was from Russia so I gave him
> a break.

OK.

I vary between phonetic mappings, and letter-based mappings.

If I try to say things IRL, sometimes there is a slight delay if my mind
isn't sure how to map a mess of letters and numbers over to a spoken
form. As can be noted, I mostly think in visual images and text with
spoken language more as a 2nd class citizen.


Admittedly, this is also partly why my naming conventions tend to be
based more on patterns or visual aesthetic rather than whether or not
things can be easily spoken.

Terje Mathisen

unread,
May 10, 2021, 1:58:57 AM5/10/21
to
MitchAlsup wrote:
> On Sunday, May 9, 2021 at 1:45:06 PM UTC-5, Stefan Monnier wrote:
>> Stefan Monnier [2021-05-09 14:35:28] wrote:
>>>> Yes that makes sense. Specifically for FP8, the increased dynamic range
>>>> is by far the most important trait. I've even seen examples of using
>>>> 1:5:2 in DNN:s (deep neural networks) where dynamic range is often more
>>>> important than precision.
>>> Indeed 1:5:2 is arguably more useful than 1:4:3.
>> An alternative of course is to use a log-base representation. So your
>> 8bit data is split 1:7 between a sign and an exponent, with no mantissa
>> at all and then you choose your dynamic range by picking the base.
>>> Multiplication is easy but addition takes more effort.
> <
> Really ?!?!?
> Both are implemented with table look ups with concatenated values as the
> address.

Agreed, in a HW implementation you don't even need a full 64K (8+8
bits), since the sign logic can go on in parallel with the table access.

I.e. a 14-bit table (16K entries) for each operation is sufficient, so
that is about 16 KB x 4 = 64 KB of ROM space?

If you for some really strange reason want to implement programmable
rounding for those 8x8->8 bit operations, then I would fake it with
8-bit (+ sign) outputs from the lookup tables and use the trailing bit
for rounding up/down/even, but having nearest_or_even built in to the
tables seems more reasonable.

OTOH, for AI training it seems like most implementations are using
higher precision accumulators, i.e. FP16 or FP32?
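
In C terms, the lookup itself would be something like the sketch
below (the 1:5:2-style split and the ROM name are just illustrative
assumptions, not any particular implementation):

#include <stdint.h>

/* Hypothetical table-driven FP8 multiply: strip the sign bits, index
   a 14-bit ROM with the two 7-bit magnitudes, and resolve the sign
   in parallel. fp8_mul_rom[] is an assumed 16K-entry table with
   round-to-nearest-even baked in. */
extern const uint8_t fp8_mul_rom[1 << 14];

static uint8_t fp8_mul(uint8_t a, uint8_t b)
{
    uint8_t sign = (a ^ b) & 0x80;          /* sign handled separately */
    uint16_t idx = (uint16_t)(((a & 0x7F) << 7) | (b & 0x7F));
    return sign | fp8_mul_rom[idx];         /* 7-bit magnitude result */
}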

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Quadibloc

unread,
May 10, 2021, 2:02:44 AM5/10/21
to
On Sunday, May 9, 2021 at 2:53:04 PM UTC-6, MitchAlsup wrote:

> My how lazy we have become ! Even with high quality FP, numerical
> analysis is still mandatory--except that it is seldom performed !?!

Numerical analysis is indeed necessary even when using floating-point,
but of course that doesn't mean that floating point is never really
needed!

John Savard

Terje Mathisen

unread,
May 10, 2021, 2:05:23 AM5/10/21
to
MitchAlsup wrote:
> On Sunday, May 9, 2021 at 4:34:56 PM UTC-5, BGB wrote:
>> On 5/9/2021 3:59 PM, MitchAlsup wrote:
>>> I know one compiler writer that would enunciate "LEA" as 'Lee'-'ah'
>> I always thought of it like "Lee".
> <
> It was funny to hear him say "lee-ah", but he was from Russia so I gave him
> a break.

That is funny:

'Lea' is an old female name here in Norway (probably from the
Norwegian-language bible), it is always pronounced as two emphasized
syllables, i.e. close to the 'Lee'-'ah' suggestion above.

I never even considered that it could be pronounced any other way!

Quadibloc

unread,
May 10, 2021, 2:05:44 AM5/10/21
to
On Sunday, May 9, 2021 at 5:03:11 PM UTC-6, MitchAlsup wrote:

> IBM 1620 "Can't add, doesn't even try" laid the groundwork where if the
> operands are small enough it becomes cheaper to just read out the correct
> result than to build the logic to calculate the result.

And that reminds me... of the FOCUS Number System.

Where numbers were represented by logarithms, and addition was done by
looking up log((a+b)/a) as a function of log(b/a).

John Savard

Quadibloc

unread,
May 10, 2021, 2:10:39 AM5/10/21
to
On Sunday, May 9, 2021 at 10:06:15 PM UTC-6, John Levine wrote:

> Even once they switched to core, tubes burned out, components went out
> of spec, solder joints cracked, and uptime was measured in hours if
> you were lucky, minutes if you weren't.

> So a compelling reason to leave out floating point was that all the extra
> logic made the computer even less reliable. It's impressive that they
> risked it on the 704.

> I have read that the practical limit on the length
> of a FORTRAN program on the 704 was set by fact that the compiler
> had to compile it in one run between hardware failures.

That may be. But aside from the 704 using core memory, which didn't
have the pattern sensitivity issues of Williams tubes, IBM put a lot
of effort into making it reliable. One thing was that the tubes
were run at a lower-than-normal voltage, since they were being used
for digital amplification in a computer with thousands of them, instead
of in a radio with five of them.

John Savard

Thomas Koenig

unread,
May 10, 2021, 2:35:51 AM5/10/21
to
BGB <cr8...@gmail.com> schrieb:

> Similarly, it wont effect any local branches unless by some absurd
> chance a *single function* were to exceed 1MB in the ".text" section.

> Excluding the possibility of excessively large procedurally generated
> "switch()" blocks or similar (eg: 10k+ case labels), this is unlikely.

One way I have encountered this is automatically generated formulas
from computer algebra systems like Maple.

Look at https://gcc.gnu.org/bugzilla/attachment.cgi?id=41459 for
an example of such code. It isn't easy on compilers (but it does
not have branches).

> But, these cases don't break the ISA, merely the limits of the existing
> branch encodings. I could, in-theory add wider encodings, but then would
> need to deal with them in the pipeline (and making this part any wider
> would be a big source of timing hassles).

Code should be correct as first consideration, fast as a (distant)
second.

I assume you can jump to a register in your ISA, for function
pointers.

If that is the case, you can reverse the test and conditionally branch
over an instruction sequence which loads the target address into
a register (by loading the PC and adding the offset to it, as
determined by the assembler) and then jumps to that register.

[...]

> Though, it would technically be a bit simpler/cheaper to add a special
> case to allow for a Jumbo-encoded "BRA (Abs48)" or similar in this case,
> which wouldn't require any new logic on the EX side of things (can jump
> anywhere... Just requires using base relocs...).

That sounds even better. You have the long instructions, why not use
them?

Ivan Godard

unread,
May 10, 2021, 3:03:41 AM5/10/21
to
On 5/9/2021 11:05 PM, Terje Mathisen wrote:
> MitchAlsup wrote:
>> On Sunday, May 9, 2021 at 4:34:56 PM UTC-5, BGB wrote:
>>> On 5/9/2021 3:59 PM, MitchAlsup wrote:
>>>> I know one compiler writer that would enunciate "LEA" as 'Lee'-'ah'
>>> I always thought of it like "Lee".
>> <
>> It was funny to hear him say "lee-ah", but he was from Russia so I
>> gave him
>> a break.
>
> That is funny:
>
> 'Lea' is an old female name here in Norway (probably from the
> Norwegian-language bible), it is always pronounced as two emphasized
> syllables, i.e. close to the 'Lee'-'ah' suggestion above.
>
> I never even considered that it could be pronounced any other way!
>
> Terje
>

Exists in English too, though often spelled "Leah".
https://en.wikipedia.org/wiki/Lea_(given_name)
https://en.wikipedia.org/wiki/Leah_(given_name)

Thomas Koenig

unread,
May 10, 2021, 3:21:57 AM5/10/21
to
Terje Mathisen <terje.m...@tmsw.no> schrieb:
> MitchAlsup wrote:
>> On Sunday, May 9, 2021 at 4:34:56 PM UTC-5, BGB wrote:
>>> On 5/9/2021 3:59 PM, MitchAlsup wrote:
>>>> I know one compiler writer that would enunciate "LEA" as 'Lee'-'ah'
>>> I always thought of it like "Lee".
>> <
>> It was funny to hear him say "lee-ah", but he was from Russia so I gave him
>> a break.
>
> That is funny:
>
> 'Lea' is an old female name here in Norway (probably from the
> Norwegian-language bible),

It is a fairly common name in Germany.

I always thought it came from Latin, "lioness". Hmm...
You're right, it is also a biblical name.

>it is always pronounced as two emphasized
> syllables, i.e. close to the 'Lee'-'ah' suggestion above.

I would always prounounce the instruction like the name,
le:a .

BGB

unread,
May 10, 2021, 3:40:37 AM5/10/21
to
On 5/10/2021 1:35 AM, Thomas Koenig wrote:
> BGB <cr8...@gmail.com> schrieb:
>
>> Similarly, it wont effect any local branches unless by some absurd
>> chance a *single function* were to exceed 1MB in the ".text" section.
>
>> Excluding the possibility of excessively large procedurally generated
>> "switch()" blocks or similar (eg: 10k+ case labels), this is unlikely.
>
> One way I have encountered this is automatically generated formulas
> from computer algebra sytems like Maple.
>
> Look at https://gcc.gnu.org/bugzilla/attachment.cgi?id=41459 for
> an example of such code. It isn't easy on compilers (but it does
> not have branches).
>

OK.


>> But, these cases don't break the ISA, merely the limits of the existing
>> branch encodings. I could, in-theory add wider encodings, but then would
>> need to deal with them in the pipeline (and making this part any wider
>> would be a big source of timing hassles).
>
> Code should be correct as first consideration, fast as a (distant)
> second.
>
> I assume you can jump to a register in your ISA, for function
> pointers.
>

Yeah, these are possible:
JMP / BRA Rn // Jump to register
JSR / BSR Rn // Call to register
JT / BT Rn // Branch True
JF / BF Rn // Branch False

Naming gets a little funky, and this is a place where my ISA listing and
assembler ended up in disagreement.

So, the assembler uses different mnemonics than the ISA listing (namely
JMP/JSR/JT/JF) mostly to avoid ambiguity over the (PC, Rn) vs (Rn) cases.


> If that is the case, you can reverse the test and optionally branch
> over an instruction sequence which loads the target address into
> a register via loading the PC and adding the offset to it (as
> determined by the assembler) and then jumping to that register.
>
> [...]
>

Well, or use a conditional branch to the register...

The issue is not "what can or can't be done" but rather whether or not
it can be done using a single instruction.


>> Though, it would technically be a bit simpler/cheaper to add a special
>> case to allow for a Jumbo-encoded "BRA (Abs48)" or similar in this case,
>> which wouldn't require any new logic on the EX side of things (can jump
>> anywhere... Just requires using base relocs...).
>
> That sounds even better. You have the long instructions, why not use
> them?
>

Yes, possible...

Did just go and add a "BRA Disp33s" encoding as a test...
But, sadly, directly using the AGU output for the branch (to sidestep
some address hackery) kinda blows out the timing constraints...

Like, it would be nicer to be like:
Well, we can have "48b+(Disp33s*2)" from the AGU, why not just use this
address as the branch destination?... But, alas, timing isn't super
favorable towards this idea...


An Abs48 encoding or similar wouldn't necessarily be subject to this
issue, since from the EX stages' POV, this case is functionally
equivalent to the "branch to register" case.

One possibility:
FAjj-jjjj: LDIZ Imm24u, R0 //existing op
FBjj-jjjj: LDIN Imm24n, R0 //existing op
FEdd-dddd-FAdd-dddd: LDIZ Imm48u, R0
FEdd-dddd-FBdd-dddd: LDIN Imm48n, R0
FFdd-dddd-FAdd-dddd: BRA Abs48
FFdd-dddd-FBdd-dddd: BSR Abs48

Which while not perfect (due to the implications of this encoding; and
inability to be predicated), is at least "not completely awful".

Anton Ertl

unread,
May 10, 2021, 4:39:08 AM5/10/21
to
John Levine <jo...@taugh.com> writes:
>According to Anton Ertl <an...@mips.complang.tuwien.ac.at>:
>>As for fixed vs. floating point, I guess that is a cross-cutting
>>concern. Sure you can argue that, if you spend a lot of time on
>>low-level steps such as coding layout, spending some time on range
>>analysis is minor change, so fixed point is acceptable.
>
>We forget how flaky and unreliable those early computers were. They
>used orders of magnitude more components than the most complex
>electronic systems ever had before. The Williams tubes used in the
>early 1950s just barely worked and needed endless tuning and fiddling.
>Even once they switched to core, tubes burned out, components went out
>of spec, solder joints cracked, and uptime was measured in hours if
>you were lucky, minutes if you weren't.

Heinz Zemanek told us that the relay computers had long uptimes,
while the tube computers failed every five minutes, and then had hours
of downtime for repair, but the argument was that the tube machine
computed more in the five minutes than the relay machine in
hours.

>>What held back floating point for a long time was the slowness and/or
>>high cost of FP hardware, but at least in general-purpose computers
>>that's a thing of the past.
>
>On small systems, I suppose. Floating point was standard or at least a
>widely provided option on large computers by the early 1960s.

But it still was much slower than integer arithmetic on most machines.
E.g., in "Writing Efficient Programs" (1982) Bentley originally used
FP for his traveling salesman example, but switched to integer
arithmetic as one of the optimization steps. When adapting the
example for my course in ~2000, I left that step out because the
expected speed gain on the Coppermine CPU was small.

BGB

unread,
May 10, 2021, 4:42:16 AM5/10/21
to
Though, it is usually pronounced as a single syllable with a diphthong,
rather than two syllables.

Depending on accent, the distinction between "Leah" and "Lee" may be
lost (in much the same way as "pin" vs "pen", or "cot" vs "caught", ...).

Or, while most people (myself included) say "dog" with a single vowel,
many of the locals shift it to a diphthong "daug" / "dauwg", or with
other words like "ball" ("bauwl"), and with their endless obsession with
a certain sport, if one is around them, one hears a whole lot about
"fuutbauwl"...

Granted, I did not originate in the area where I am currently living...

EricP

unread,
May 10, 2021, 10:28:59 AM5/10/21
to
Anton Ertl wrote:
> John Levine <jo...@taugh.com> writes:
>> According to Anton Ertl <an...@mips.complang.tuwien.ac.at>:
>>> As for fixed vs. floating point, I guess that is a cross-cutting
>>> concern. Sure you can argue that, if you spend a lot of time on
>>> low-level steps such as coding layout, spending some time on range
>>> analysis is minor change, so fixed point is acceptable.
>> We forget how flaky and unreliable those early computers were. They
>> used orders of magnitude more components than the most complex
>> electronic systems ever had before. The Williams tubes used in the
>> early 1950s just barely worked and needed endless tuning and fiddling.
>> Even once they switched to core, tubes burned out, components went out
>> of spec, solder joints cracked, and uptime was measured in hours if
>> you were lucky, minutes if you weren't.
>
> Heinz Zemanek told us that the relay computers had long uptimes,
> while the tube computers failed every five minutes, and then had hours
> of downtime for repair, but the argument was that the tube machine
> computed more in the five minutes than the relay machine in
> hours.

Modern floating point was first used by Konrad Zuse for his computers,
Z1 (circa 1935), to Z4 (circa 1945).

Z3 circa 1941 was built with 2,600 relays, had a 22-bit word length
and a clock frequency of about 5–10 Hz.
https://en.wikipedia.org/wiki/Z3_(computer)

Z4 was the first commercial unit.
https://en.wikipedia.org/wiki/Z4_(computer)

(in German)
Die Z3 von Konrad Zuse im Deutschen Museum
https://www.youtube.com/watch?v=aUXnhVrT4CI

A Z3 was rebuilt in 2010
[paywalled]
https://ieeexplore.ieee.org/document/1498716

(in German)
Horst Zuse's Z3 Part 1: Demonstration
https://www.youtube.com/watch?v=WwBsot-yqgc

(in German)
https://www.youtube.com/watch?v=_YR5HhWlOgg

MitchAlsup

unread,
May 10, 2021, 12:22:47 PM5/10/21
to
On Sunday, May 9, 2021 at 10:22:49 PM UTC-5, BGB wrote:
> On 5/9/2021 4:28 AM, Thomas Koenig wrote:
> > BGB <cr8...@gmail.com> schrieb:
> >
> >> IMUL (Lane 1)
> >> 32*32 -> 64
> >
> > Do you have two instructions (one for signed and one for unsigned)
> > or three (one for the lower half, one for signed high, one for
> > unsigned high)? The latter version could save you some ALU
> > complexity and some latency in the (probably common) case where
> > only a 32*32 multiplication is needed, at the cost of added
> > instructions for the 32*32-> 64 bit case.
> >
> There are actually more cases:
> MULS: 32*32->32, Result is sign-extended from low 32 bits
IMUL
> MULU: 32*32->32, Result is zero-extended from low 32 bits
UMUL
> DMULS: 32*32->64, 64-bit signed result
CARRY; IMUL
> DMULU: 32*32->64, 64-bit unsigned result
CARRY; UMUL
>
> The former give the typical behaviors one expects in C, the latter gives
> the widened results.
>
> These exist as 3R forms, so:
> DMULU R4, R5, R7 // R7 = R4 * R5
<
All mine are 2-operand 1-result
>
>
> Originally, there were also multiply ops which multiplied two inputs and
> then stored a pair of results in R0 and R1, more like the original
> SuperH multiply ops, but I dropped these for various reasons.
<
Consumes way more OpCode space than is useful
>
> There are cases where DMACS or DMACU instructions could be useful:
> DMACU R4, R5, R7 // R7 = R4 * R5 + R7
<
IMAC and UMAC
>
> But, I don't currently have these.
>
>
> Eg (64-bit signed multiply):
> SHADQ R4, -32, R6 | SHADQ R5, -32, R7 //1c
> DMULS R4, R7, R16 //1c
> DMACS R5, R6, R16 //3c (2c penalty)
> SHADQ R16, 32, R2 //3c (2c penalty)
> DMACU R4, R5, R2 //1c
> RTS
>
> Though, while fewer instructions than the current form, the above
> construction would still be pretty bad in terms of interlock penalties.
>
> SHADQ R4, -32, R6 | SHADQ R5, -32, R7 //1c
> DMULS R5, R6, R16 //1c
> DMULS R4, R7, R17 //1c
> DMULU R4, R5, R18 //1c
> ADD R16, R17, R19 //2c (1c penalty, DMULS R17)
> SHADQ R19, 32, R2 //2c (1c penalty, ADD R19)
> ADD R18, R2 //1c
> RTS
>
> Both cases would have approximately the same clock-cycle count (assuming
> both cases have a 3-cycle latency).
<
Which is why I used CARRY; xMUL
>
> ( Where recently, I have gotten around to modifying things such that the
> multiplier is now fully pipelined... )
>
>
>
> Otherwise, my time recently has mostly been being consumed by debugging...
<
Sherlock we know you well...........

MitchAlsup

unread,
May 10, 2021, 12:30:14 PM5/10/21
to
On Monday, May 10, 2021 at 12:02:13 AM UTC-5, BGB wrote:
> On 5/9/2021 6:08 PM, MitchAlsup wrote:
> > On Sunday, May 9, 2021 at 4:34:56 PM UTC-5, BGB wrote:
> >> On 5/9/2021 3:59 PM, MitchAlsup wrote:
> >
> >>> <
> >>> How many branches are farther than 1/8 GB away with code compiled
> >>> from high level languages ??
> >>> <
> >> The normal direct branch op in BJX2 has a 20-bit displacement, so can
> >> reach +/- 1MB.
> >>
> >> A Jumbo+LEA can do +/- 2GB, so:
> >> LEA.B (PC, Disp33s), R3
> >> JMP R3
> >>
> >> But, this is also possible (+/- 32MB):
> >> MOV #Imm25s, R0
> >> BRA R0
> >>
> > My 66000 has 16-bit actual word displacement (18-bit byte address) for
> > conditional branches and 26-bit unconditional branches (and calls) (28-bit
> > effective range).
> >
> > But JMP instructions can have 32-bit or 64-bit immediates so you predicate
> > over the JMP with an inverted condition for the predicate.
> All these cases are 20 bits in my case.
<
But you are not fully supporting a 64-bit address space. Not even attempting.
>
> I had a few times considered eliminating the fixed BT/BF instructions in
> favor of using the BRA?T and BRA?F encodings instead (semantically
> equivalent), but haven't done so yet (partly inertia, partly I don't
> like breaking binary compatibility unless there is a good reason).
>
> Though, the main reason to do this would be to reclaim ~ 21 bits of
> encoding space...
> >>
> >> In general, these don't really come up at present because, of my test
> >> programs, all of them have < 1MB in the ".text" section.
> > <
> > In practice, few (<1%) are branches of those dimensions.
> > <
> In my current tests programs, it currently rests at 0%, since any larger
> branches tend to be indirect calls through a function pointer or similar
> (*2).
<
Which is exactly why I could allow them to be more expensive--so little use.
>
> *2: Though, one of these programs is has ~ 1.2 MB of ".text" on an
> x86-64 build (for both native Win64 and WSL), and 760K as BJX2 (in speed
> optimized mode).
>
>
>
> Similarly, it wont effect any local branches unless by some absurd
> chance a *single function* were to exceed 1MB in the ".text" section.
>
> Excluding the possibility of excessively large procedurally generated
> "switch()" blocks or similar (eg: 10k+ case labels), this is unlikely.
>
The person debugging a 10K case label switch statement gets what [s]he deserves.
Which brings to mind: how is IRL pronounced differently than URL ?
>
>
> Admittedly, this is also partly why my naming conventions tend to be
> based more on patterns or visual aesthetic rather than whether or not
> things can be easily spoken.
<
My OpCode spellings harken back more to IBM 360 and DEC PDP-11 than
to 68K or x86-64.

MitchAlsup

unread,
May 10, 2021, 12:36:08 PM5/10/21
to
On Monday, May 10, 2021 at 1:35:51 AM UTC-5, Thomas Koenig wrote:
> BGB <cr8...@gmail.com> schrieb:
> > Similarly, it wont effect any local branches unless by some absurd
> > chance a *single function* were to exceed 1MB in the ".text" section.
>
> > Excluding the possibility of excessively large procedurally generated
> > "switch()" blocks or similar (eg: 10k+ case labels), this is unlikely.
> One way I have encountered this is automatically generated formulas
> from computer algebra sytems like Maple.
>
> Look at https://gcc.gnu.org/bugzilla/attachment.cgi?id=41459 for
> an example of such code. It isn't easy on compilers (but it does
> not have branches).
> > But, these cases don't break the ISA, merely the limits of the existing
> > branch encodings. I could, in-theory add wider encodings, but then would
> > need to deal with them in the pipeline (and making this part any wider
> > would be a big source of timing hassles).
> Code should be correct as first consideration, fast as a (distant)
> second.
>
> I assume you can jump to a register in your ISA, for function
> pointers.
<
The proper word is "through" as in: you can jump through a register.
"Indirectly through" is also correct.
>
> If that is the case, you can reverse the test and optionally branch
> over an instruction sequence which loads the target address into
> a register via loading the PC and adding the offset to it (as
> determined by the assembler) and then jumping to that register.
<
Predicate over the not-to-be-taken branch.

Stefan Monnier

unread,
May 10, 2021, 12:43:14 PM5/10/21
to
> Which brings to mind: how is IRL pronounced differently than URL ?

I thought the question would be: how do you spell "we are ell"?


Stefan

Ivan Godard

unread,
May 10, 2021, 12:49:11 PM5/10/21
to
Do you have any saturating multiplies? Why or why not?

John Levine

unread,
May 10, 2021, 1:51:38 PM5/10/21
to
According to Anton Ertl <an...@mips.complang.tuwien.ac.at>:
>>On small systems, I suppose. Floating point was standard or at least a
>>widely provided option on large computers by the early 1960s.
>
>But it still was much slower than integer arithmetic on most machines.

I happen to have a 360/67 manual here. It and the similar /65 were
workhorse mainframes in the early 1970s.

Memory-to-register times (us):

              integer   float (short)   float (double)
  add          1.4         2.43            2.45
  multiply     4.8         4.4             7.6
  divide       8.7         7.3             14.10

The short float was faster than integer because the number of bits in
the product or dividend was less.

I have manuals for the 386 and 486, where float arithmetic was also
about half the speed of fixed, even though integer arithmetic was
32x32 and float was all extended with 64 fraction bits. Doesn't seem
much slower to me.

If you can write your code using the same number of fixed instructions
as floats, sure, it'll be faster, but if you need to add extra code to
do explicit scaling, I doubt it'll really be faster.
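
For example, with a hypothetical Q16.16 fixed-point format (the
format is just for illustration), the multiply picks up an explicit
scaling step that the FPU performs implicitly:

#include <stdint.h>

/* Illustration only: Q16.16 fixed point. Add costs the same as an
   integer add, but multiply needs an explicit rescaling shift that
   floating point hardware does for free. */
typedef int32_t q16_16;

static q16_16 q_add(q16_16 a, q16_16 b) { return a + b; }

static q16_16 q_mul(q16_16 a, q16_16 b)
{
    return (q16_16)(((int64_t)a * b) >> 16);  /* the extra scaling step */
}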

On the other hand, if you had a 286 or 386 with no 287 or 387 and
were simulating floating point in software, *that* was slow.

Marcus

unread,
May 10, 2021, 2:16:25 PM5/10/21
to
On 2021-05-10, MitchAlsup wrote:
> On Sunday, May 9, 2021 at 4:34:56 PM UTC-5, BGB wrote:
>> On 5/9/2021 3:59 PM, MitchAlsup wrote:
>
>>> <
>>> How many branches are farther than 1/8 GB away with code compiled
>>> from high level languages ??
>>> <
>> The normal direct branch op in BJX2 has a 20-bit displacement, so can
>> reach +/- 1MB.
>>
>> A Jumbo+LEA can do +/- 2GB, so:
>> LEA.B (PC, Disp33s), R3
>> JMP R3
>>
>> But, this is also possible (+/- 32MB):
>> MOV #Imm25s, R0
>> BRA R0
>>
> My 66000 has 16-bit actual word displacement (18-bit byte address) for
> conditional branches and 26-bit unconditional branches (and calls) (28-bit
> effective range).
>
> But JMP instructions can have 32-bit or 64-bit immediates so you predicate
> over the JMP with an inverted condition for the predicate.

MRISC32 has 18-bit word displacement (+/-0.5MB) for conditional
branches, and 21-bit word displacement (+/-4MB) for unconditional jumps
and calls.

Since unconditional jump targets are register-relative (where PC is one
possible register - for PC-relative unconditional branches), it is
possible to extend the range with one additional instruction, e.g:

ADDPCHI LR, #foo@pchi
JL LR, #foo+4@pclo

I would expect that a reasonably advanced microarchitecture will fuse
those two instructions into a single instruction with a full-width
displacement (it's a very distinct 64-bit sequence where only the
immediate fields are variable), while a dumb implementation (like my
current CPU) will take two cycles to complete the jump.

In fact the GCC back-end always emits the two-instruction sequence for
calls (and tail calls), and relies on the linker to relax it to a
single instruction when possible (though I have not implemented linker
relaxation yet).
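
The relaxation decision itself would just be a range check on the
resolved displacement, something like this sketch (field widths as
above; the helper name is made up):

#include <stdint.h>

/* Sketch: can the linker relax the two-instruction sequence down to
   a single JL? Only if the byte displacement, expressed in 32-bit
   instruction words, fits the 21-bit word displacement (+/-4MB). */
static int fits_jl_disp(int64_t byte_disp)
{
    int64_t w = byte_disp >> 2;  /* bytes -> words (alignment assumed) */
    return w >= -(INT64_C(1) << 20) && w < (INT64_C(1) << 20);
}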

>>
>> In general, these don't really come up at present because, of my test
>> programs, all of them have < 1MB in the ".text" section.
> <
> In practice, few (<1%) are branches of those dimensions.
> <

That's what I was betting on.

/Marcus

Marcus

unread,
May 10, 2021, 2:27:52 PM5/10/21
to
MRISC32 has MULQ: Multiply fixed-point Q numbers with saturation (the
saturation is trivial since the only case that needs to be handled is
-1.0 x -1.0 = 0.999..). There's also MULQR (same but with rounding).

...mostly because MRISC32 tries to support DSP:ish things, and because
the hardware cost is negligible (it's a simple addition to the existing
integer multiplier).
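
For a Q31-style format, the whole operation in C is roughly this
sketch (illustrative, not the exact MRISC32 definition):

#include <stdint.h>

/* Sketch of MULQ-style behavior for Q31: the only overflow case in
   a Qn x Qn multiply is -1.0 * -1.0, which would be +1.0, a value
   the format cannot represent, so it saturates to 0.999... */
static int32_t mulq31(int32_t a, int32_t b)
{
    if (a == INT32_MIN && b == INT32_MIN)
        return INT32_MAX;
    return (int32_t)(((int64_t)a * b) >> 31);
}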

/Marcus

MitchAlsup

unread,
May 10, 2021, 2:35:19 PM5/10/21
to
This is what we did in the Mc 88K: the compiler produced the "anywhere"
sequence, and the linker converted it back to "restricted but fits" code
more than 99% of the time.

robf...@gmail.com

unread,
May 10, 2021, 4:53:45 PM5/10/21
to
ANY1 supports a 21-bit branch displacement for branching +/-1MB. It may also store the return address in the target register, allowing a conditional branch to subroutine.
Operation for BEQ:
  Rt = IP + 8
  If (Ra = Rb)
      If (Rc = 63)
          IP = IP + Displacement
      Else
          IP = Rc + Displacement

There really needs to be only a small number of return address registers. So in a couple of designs I have trimmed the target register down to two bits allowing three return address registers, and then extended the constant field of the JAL instruction by three extra bits. It makes quite a difference being able to branch +-16MB for instance instead of 2MB. 2MB is not enough for some code.

MitchAlsup

unread,
May 10, 2021, 5:11:12 PM5/10/21
to
On Monday, May 10, 2021 at 3:53:45 PM UTC-5, robf...@gmail.com wrote:
> ANY1 supports a 21 bit branch displacement for branching +/-1MB. It may also store the return address in the target register allowing conditional branch to subroutine.
<
How often do you find conditional branching to a subroutine to be effective ?

{I remember using this a lot in 8085 assembly coding, but when writing in C I seldom find
it profitable, because all the arguments had to be in the proper registers before the
condition was evaluated, in order to use the conditionality of the call.}

Also, do you have a conditional return, or does the epilogue of the subroutine "get in the way" ??
<
> Operation for BEQ:
> Rt = IP + 8
> If (Ra = Rb)
> If (Rc = 63)
> IP = IP + Displacement
> Else
> IP = Rc + Displacement
>
> There really needs to be only a small number of return address registers. So in a couple of designs I have trimmed the target register down to two bits allowing three return address registers, and then extended the constant field of the JAL instruction by three extra bits. It makes quite a difference being able to branch +-16MB for instance instead of 2MB. 2MB is not enough for some code.
<
I got ±128MB from my ISA.

Ivan Godard

unread,
May 10, 2021, 6:34:33 PM5/10/21
to
On 5/10/2021 2:11 PM, MitchAlsup wrote:
> On Monday, May 10, 2021 at 3:53:45 PM UTC-5, robf...@gmail.com wrote:
>> ANY1 supports a 21 bit branch displacement for branching +/-1MB. It may also store the return address in the target register allowing conditional branch to subroutine.
> <
> How often do you find conditional branching to a subroutine to be effective ?

Very frequent in Mill code, due to predication of calls when EBBs are
folded together to let both then and else run interleaved. Probably rare
otherwise. Consequently asking "how often do you see?" is not very
informative unless also asking whether the compiler takes advantage of
the conditional.

> {I remember using this a lot in 8085 assembly coding, but when writing in C I seldom find
> it profitable because all the arguments had to be in the proper registers before the condition
> to use the conditionality of the call.
>
> Also, do you have a conditional return, or does the epilogue of the subroutine "get in the way" ??

Down with epilogues!

Stefan Monnier

unread,
May 10, 2021, 6:54:46 PM5/10/21
to
Ivan Godard [2021-05-10 15:34:32] wrote:
> On 5/10/2021 2:11 PM, MitchAlsup wrote:
>> On Monday, May 10, 2021 at 3:53:45 PM UTC-5, robf...@gmail.com wrote:
>>> ANY1 supports a 21 bit branch displacement for branching +/-1MB. It may
>>> also store the return address in the target register allowing conditional
>>> branch to subroutine.
>> <
>> How often do you find conditional branching to a subroutine to be effective ?
>
> Very frequent in Mill code, due to predication of calls when EBBs are folded
> together to let both then and else run interleaved. Probably rare
> otherwise. Consequently asking "how often do you see?" is not very
> informative unless also asking whether the compiler takes advantage of
> the conditional.

I guess it's also because Mill's "branch to subroutine" is really a full
function call (including argument passing), which solves:

>> {I remember using this a lot in 8085 assembly coding, but when
>> writing in C I seldom find it profitable because all the arguments
>> had to be in the proper registers before the condition to use the
>> conditionality of the call.


Stefan

Josh Vanderhoof

unread,
May 10, 2021, 7:01:20 PM5/10/21
to
Terje Mathisen <terje.m...@tmsw.no> writes:

> MitchAlsup wrote:
>> On Sunday, May 9, 2021 at 1:45:06 PM UTC-5, Stefan Monnier wrote:
>>> Stefan Monnier [2021-05-09 14:35:28] wrote:
>>>>> Yes that makes sense. Specifically for FP8, the increased dynamic range
>>>>> is by far the most important trait. I've even seen examples of using
>>>>> 1:5:2 in DNN:s (deep neural networks) where dynamic range is often more
>>>>> important than precision.
>>>> Indeed 1:5:2 is arguably more useful than 1:4:3.
>>> An alternative of course is to use a log-base representation. So your
>>> 8bit data is split 1:7 between a sign and an exponent, with no mantissa
>>> at all and then you choose your dynamic range by picking the base.
>>> Multiplication is easy but addition takes more effort.
>> <
>> Really ?!?!?
>> Both are implemented with table look ups with concatenated values as the
>> address.
>
> Agreed, in a HW implementation you don't even need a full 64K (8+8
> bits), since the sign logic can go on in parallel with the table
> access.
>
> I.e. a 14-bit table (16K entries) for each operation is sufficient,
> so that is about 16 KB x 4 = 64 KB of ROM space?

Don't you only need half the table for the commutative ops?

BGB

unread,
May 10, 2021, 8:18:28 PM5/10/21
to
I hacked on it some, and "made it work", mostly by having a new
carry-select adder which implements "(Base48+(Disp33s<<1))".

I have also added logic to the branch predictor so that it uses a carry
bit to detect out-of-range cases (and will disable itself from
predicting the branch if it will go out-of-range).

The EX1 stage then uses a "slightly larger" address calculation for the
branch target.

This effectively eliminates most of the modular-addressed-branch issues.

Now, theoretically, branches can be +/- 8GB.
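
In C-ish terms, the predictor-side check amounts to something like
this loose model (the field widths are as above; the packing is
illustrative):

#include <stdint.h>

/* Loose model of the predictor-side calculation: a 48-bit base plus
   (Disp33s << 1), where any carry/borrow past bit 47 means "out of
   range, don't predict, let EX1 deal with it". */
static int target48(uint64_t pc48, int64_t disp33s, uint64_t *tgt)
{
    uint64_t sum = pc48 + ((uint64_t)disp33s << 1);
    *tgt = sum & ((UINT64_C(1) << 48) - 1);
    return (sum >> 48) == 0;  /* 1 = in range, safe to predict */
}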


While I was at it, I effectively widened the AGU by a few bits, so that
it can also now handle 33-bit displacements (as opposed to falling on
its face if it goes outside of +/- 4GB).

The tweaked AGU logic should now be able to handle +/- 32GB for QWORD
operations.


Still seems to pass timing at 50MHz, so probably OK.


>
> An Abs48 encoding or similar wouldn't necessarily be subject to this
> issue, since from the EX stages' POV, this case is functionally
> equivalent to the "branch to register" case.
>
> One possibility:
>  FAjj-jjjj: LDIZ Imm24u, R0  //existing op
>  FBjj-jjjj: LDIN Imm24n, R0  //existing op
>  FEdd-dddd-FAdd-dddd: LDIZ Imm48u, R0
>  FEdd-dddd-FBdd-dddd: LDIN Imm48n, R0
>  FFdd-dddd-FAdd-dddd: BRA Abs48
>  FFdd-dddd-FBdd-dddd: BSR Abs48
>
> Which while not perfect (due to the implications of this encoding; and
> inability to be predicated), is at least "not completely awful".

These operations now exist...

* FEjj_jjjj_F202_C4jj BRA (PC, disp32u)
* FEjj_jjjj_F202_C5jj BRA (PC, disp32n)
* FFjj_jjjj_FAjj_jjjj BRA #abs48

* FEjj_jjjj_F202_CCjj BSR (PC, disp32u)
* FEjj_jjjj_F202_CDjj BSR (PC, disp32n)
* FFjj_jjjj_FBjj_jjjj BSR #abs48


The Disp33s encodings are "kinda a hack", I had an instruction in Imm10
space which only used 8 bits of the operand, so, YOLO...

As noted, the much larger relative branch encodings would likely be:
* FEjj_jjjj_F0jj_Cjjj BRA (PC, disp44s)
* FEjj_jjjj_F0jj_Djjj BSR (PC, disp44s)


Which, while less hacky, still leaves no good way to
utilize such a large displacement.



In any case, while it isn't exactly inconceivable that ".text" can
exceed 1MB, I don't really expect it will exceed 8GB "anytime soon"...

MitchAlsup

unread,
May 10, 2021, 8:26:28 PM5/10/21
to
On Monday, May 10, 2021 at 7:18:28 PM UTC-5, BGB wrote:
> On 5/10/2021 2:40 AM, BGB wrote:
> > On 5/10/2021 1:35 AM, Thomas Koenig wrote:
> > Yes, possible...
> >
> > Did just go and add a "BRA Disp33s" encoding as a test...
> > But, sadly, directly using the AGU output for the branch (to sidestep
> > some address hackery) kinda blows out the timing constraints...
> >
> > Like, it would be nicer to be like:
> > Well, we can have "48b+(Disp33s*2)" from the AGU, why not just use this
> > address as the branch destination?... But, alas, timing isn't super
> > favorable towards this idea...
> >
> I hacked on it some, and "made it work", mostly by having a new
> carry-select adder which implements "(Base48+(Disp33s<<1))".
>
> I have also added logic to the branch predictor so that it uses a carry
> bit to detect out-of-range cases (and will disable itself from
> predicting the branch if it will go out-of-range).
<
In most cases, the prediction shows up at the same moment as the instruction(s)
so if you have to perform the branch target address calculation and look at its
HoB you are already too late to use the prediction in the current fetch cycle.

Adders (8-64-bits) generally have the following properties:: smallest delay
through the adder = 5 gates, greatest delay = 11 gates (which can be circuit-
designed down to 8 gates, but not in a synthesized form).
>
> The EX1 stage then uses a "slightly larger" address calculation for the
> branch target.
>
> These effectively allows effectively eliminating most of the
> modular-addressed-branch issues.
>
> Now, theoretically, branches can be +/- 8GB.
<
<alt>0177</alt> = ±

robf...@gmail.com

unread,
May 10, 2021, 10:00:37 PM5/10/21
to
> How often do you find conditional branching to a subroutine to be effective ?

Not expecting the conditional branch to subroutine to be very effective. There just happened to be an empty target register field left over in the instruction, so I thought why not make it do some work since the cost is low? But an unconditional branch to subroutine is also available by making the branch condition always true. There is already an unconditional jump and link instruction which can do relative calls, but it forces the address alignment to 16B, whereas the branch to subroutine only needs an 8B alignment.

>Also, do you have a conditional return, or does the epilogue of the subroutine "get in the way" ??

The branch instruction may also be used to perform a conditional return by using the return address register in the target calculation.

> I got ±128MB from my ISA.

For ANY1 the unconditional JAL has a whopping 44-bit target constant field. That’s ±8TB addressing. Not sure that such a range is very valuable, but it falls out of using a 64-bit instruction format.
Now that I think about it, it might be more valuable to trim back some of the constant bits, and use them to add onto the return address for the call. That would allow encoding inline constant parameters to functions and allow variable parameter lists. The compiler could calculate where the return must be and adjust the constant accordingly.

BGB

unread,
May 10, 2021, 11:24:12 PM5/10/21
to
On 5/10/2021 7:26 PM, MitchAlsup wrote:
> On Monday, May 10, 2021 at 7:18:28 PM UTC-5, BGB wrote:
>> On 5/10/2021 2:40 AM, BGB wrote:
>>> On 5/10/2021 1:35 AM, Thomas Koenig wrote:
>>> Yes, possible...
>>>
>>> Did just go and add a "BRA Disp33s" encoding as a test...
>>> But, sadly, directly using the AGU output for the branch (to sidestep
>>> some address hackery) kinda blows out the timing constraints...
>>>
>>> Like, it would be nicer to be like:
>>> Well, we can have "48b+(Disp33s*2)" from the AGU, why not just use this
>>> address as the branch destination?... But, alas, timing isn't super
>>> favorable towards this idea...
>>>
>> I hacked on it some, and "made it work", mostly by having a new
>> carry-select adder which implements "(Base48+(Disp33s<<1))".
>>
>> I have also added logic to the branch predictor so that it uses a carry
>> bit to detect out-of-range cases (and will disable itself from
>> predicting the branch if it will go out-of-range).
> <
> In most cases, the prediction shows up at the same moment as the instruction(s)
> so if you have to perform the branch target address calculation and look at its
> HoB you are already too late to use the prediction in the current fetch cycle.
>
> Adders (8-64-bits) generally have the following properties:: smallest delay
> though adder = 5 gates, greatest delay = 11 gates (which can be circuit
> designed down to 8-gates, but not in a synthesized form).

It is calculated during ID1, in parallel with the main instruction decoder.
It then calculates the fetch address for the next cycle.

This means a correctly predicted branch still has a latency of 2 cycles,
since the new fetch location doesn't start to take effect until the
branch has reached the ID2 stage ( PF IF ID1 ID2 EX1 EX2 EX3 WB ).

Either way, timing tends to be pretty tight here, which is part of
why Mod-4GB branch-addressing was a thing.


But, looking at the carry bit does have a useful property:
* It means I can now actually make the adder narrower without breaking
stuff.

So, instead of being 32b+24b with the address wrapping at the 4GB mark,
I could do a 24-bit adder, then be like "OK, the high-order bit carried,
ignore this branch!"

Ironically, it is possible it could be made to handle the Abs48 branches
if needed, since no adder would be needed in this case.

Non-predicted branches don't occur until EX1.


Can't go that much narrower though with the use of 20-bit branch
displacements. Likewise, the Disp33 branches are ignored by the branch
predictor (and are handled in EX1 along with register-indirect branches).

Branches generated from EX1 don't actually start taking effect until EX2.

EX1: Submit branch target address to branch-initiator mechanism;
EX2: Set pipeline flush bits, submit new address to L1 I$ (PF).
(This part feeds back through the branch-predictor).
EX3: ... Branch is now underway ...


If I did a "long branch" via the ALU:
EX1: ALU does its thing, 1;
EX2: ALU does its thing, 2;
EX3: Submit ALU result to branch-initiator.

Though, long-branch would likely need to also invoke the initiator in
EX1 in order to flush the pipeline (and again in EX3).


>>
>> The EX1 stage then uses a "slightly larger" address calculation for the
>> branch target.
>>
>> These effectively allows effectively eliminating most of the
>> modular-addressed-branch issues.
>>
>> Now, theoretically, branches can be +/- 8GB.
> <
> <alt>0177</alt> = ±

Tries it, ... ±
Hmm...

Terje Mathisen

unread,
May 11, 2021, 4:14:10 AM5/11/21
to
Ivan Godard wrote:
> On 5/10/2021 2:11 PM, MitchAlsup wrote:
>> On Monday, May 10, 2021 at 3:53:45 PM UTC-5, robf...@gmail.com wrote:
>>> ANY1 supports a 21 bit branch displacement for branching +/-1MB. It
>>> may also store the return address in the target register allowing
>>> conditional branch to subroutine.
>> <
>> How often do you find conditional branching to a subroutine to be
>> effective ?
>
> Very frequent in Mill code, due to predication of calls when EBBs are
> folded together to let both then and else run interleaved. Probably rare
>  otherwise. Consequently asking "how often do you see?" is not very
> informative unless also asking whether the compiler takes advantage of
> the conditional.

A conditional function call would be very useful whenever you write any
code that processes a stream of data, using a local buffer which you have
to refill whenever you reach the low-water mark.

I.e.

if (amount < low_limit) fill_buffer()

This happens a lot in most stream/file library code, as well as when
processing bitstreams for a codec.
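
E.g. a byte-stream reader along these lines (a sketch; the names are
made up):

#include <stdio.h>

/* Sketch of the pattern: the hot path is a compare plus (ideally) a
   conditional call, with the refill kept out of line. */
typedef struct {
    unsigned char buf[4096];
    size_t pos, amount;
    FILE *src;
} Reader;

static void fill_buffer(Reader *r)
{
    r->amount = fread(r->buf, 1, sizeof r->buf, r->src);
    r->pos = 0;
}

static int next_byte(Reader *r)
{
    if (r->pos >= r->amount)   /* low-water mark (here: empty) */
        fill_buffer(r);        /* the conditional call */
    return r->pos < r->amount ? r->buf[r->pos++] : -1;  /* -1 = EOF */
}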

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"