
RISC-V vs. Aarch64


Anton Ertl

Dec 24, 2021, 12:00:14 PM
I have recently (despite myself) watched
<https://www.youtube.com/watch?v=Ii_pEXKKYUg>, where Chris Celio
compares some aspects of RV64G, RV64GC, ARM A32, ARM A64, IA-32 and
AMD64. There is also a tech report <https://arxiv.org/abs/1607.02318>
<https://arxiv.org/pdf/1607.02318.pdf>

In this posting I look at the different approaches taken in A64 (aka
Aarch64) and RV64GC:

* RISC-V has simple instructions, e.g., with only one addressing mode
(reg+offset); the base instruction set encodes each instruction
in 32 bits, and the C (compressed) extension adds 16-bit
encodings for the most frequent instructions (a sketch of the
resulting length decoding follows this list).

* A64 has fixed-length 32-bit instructions, but they can be more
complex: A64 has more addressing modes and additional instructions
like load pair and store pair; in particular, the A64 architects
seem to have had few concerns about the register read and write
ports needed per instruction; E.g., a store-pair instruction can
need four read ports, and a load pair instruction can need three
write ports (AFAIK).
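
To make the 16/32-bit mix concrete, a minimal C sketch of the length
rule (an illustration, not from the talk): per the RISC-V spec, a
16-bit parcel whose low two bits are both 1 starts a 32-bit
instruction; anything else is a 16-bit C-extension instruction (RV64GC
has no longer encodings):

  #include <stdint.h>

  /* Length in bytes of the instruction starting at this parcel. */
  static int rv_insn_length(uint16_t parcel)
  {
      return ((parcel & 0x3) == 0x3) ? 4 : 2;
  }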

Celio argues that the instruction density of RV64GC is competitive,
and that the C extension is cheap to implement for a small
implementation.

It's pretty obvious that a small implementation of RV64G is smaller
than a small implementation of A64, and the cost of adding the C
extension to a small implementation of RV64G (to turn it into RV64GC)
is reported in the talk IIRC (it's on the order of 700 transistors, so
still cheap), so you can get a small RV64GC cheaper than a small A64
implementation and have similar code density.

Celio also argues that instruction fusion can overcome the problem of

But I wonder how things turn out for larger implementations: Now
RV64GC needs twice as many decoders to decode, say, 32 bytes of
instructions, and then some additional hardware that selects which of
these decodes are valid (and of course all this extra hardware costs
energy).
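
Using the length rule sketched earlier, here is roughly what that
selection hardware has to compute for a 32-byte fetch block (an
illustration, not Celio's design); note that each instruction-start
position depends on the previous one, which is exactly where the extra
logic and energy go:

  #include <stdint.h>

  /* Mark which 16-bit parcel offsets in a 32-byte block start an
     instruction, assuming the block begins on an instruction boundary. */
  static void mark_insn_starts(const uint16_t parcels[16], int starts[16])
  {
      for (int j = 0; j < 16; j++)
          starts[j] = 0;
      for (int i = 0; i < 16; )
      {
          starts[i] = 1;
          i += ((parcels[i] & 0x3) == 0x3) ? 2 : 1;  /* 32- or 16-bit */
      }
  }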

And with instruction fusion the end result of the decoding step
becomes even more complex.

OTOH, a high-performance A64 implementation probably wants some kind
of register port allocator (possibly with port requirements across
several cycles), so it has its own source of complexity. I wonder if
that's the reason why some high-performance ARMs have microcode
caches; I would normally have thought that microcode caches are
unnecessary for a fixed-length instruction format.

What do people more familiar with hardware design think?

A classic question is the classification of the architecture: Celio
claims that ARM is CISC (using an A32 Load-multiple instruction as
example). These questions are not decided by touchy-feely arguments,
but rather by instruction set properties. All RISCs are load/store
general-purpose instruction sets; even the A32 load/store-multiple
instructions can be considered a variant of that; one might argue that
accessing two pages in one instruction is not RISC, but by that
criterion RISC-V is not RISC (it allows page-crossing unaligned
accesses). One other common trait in RISCs is fixed-length 32-bit
instructions (A32, MIPS, SPARC, 29K, 88K, Power, Alpha), but there are
exceptions: IBM ROMP, ARM Thumb, (IIRC) MIPS-X, and now RISC-V with
the C extension.

An interesting point in the talk is that zero-extension is a common
idiom in RISC-V; the talk does not explain this completely, but
apparently the 32-bit variants of instructions sign-extend (like Alpha
does, while AMD64 and A64 zero-extend), and the ABI passes 32-bit
values around in sign-extended form (including 32-bit unsigned
values).
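
As a concrete illustration of the resulting idiom (not from the talk):
under this convention a uint32_t arrives in a 64-bit register in
sign-extended form, so using it as, say, a 64-bit array offset first
requires clearing the upper 32 bits -- the two-shift sequence compilers
emit on RV64 where no single zero-extend-word instruction is available:

  #include <stdint.h>

  /* reg holds a 32-bit unsigned value in sign-extended form. */
  static uint64_t zext_w(int64_t reg)
  {
      return (uint64_t)reg << 32 >> 32;  /* slli by 32, then srli by 32 */
  }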

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

MitchAlsup

Dec 24, 2021, 1:17:50 PM
On Friday, December 24, 2021 at 11:00:14 AM UTC-6, Anton Ertl wrote:
> I have recently (despite myself) watched
> <https://www.youtube.com/watch?v=Ii_pEXKKYUg>, where Chris Celio
> compares some aspects of RV64G, RV64GC, ARM A32, ARM A64, IA-32 and
> AMD64. There is also a tech report <https://arxiv.org/abs/1607.02318>
> <https://arxiv.org/pdf/1607.02318.pdf>
>
> In this posting I look at the different approaches taken in A64 (aka
> Aarch64) and RV64GC:
>
> * RISC-V has simple instructions, e.g., with only one addressing mode
> (reg+offset); the base instruction set encodes each instruction
> in 32 bits, and the C (compressed) extension adds 16-bit
> encodings for the most frequent instructions.
>
> * A64 has fixed-length 32-bit instructions, but they can be more
> complex: A64 has more addressing modes and additional instructions
> like load pair and store pair; in particular, the A64 architects
> seem to have had few concerns about the register read and write
> ports needed per instruction; E.g., a store-pair instruction can
> need four read ports, and a load pair instruction can need three
> write ports (AFAIK).
>
> Celio argues that the instruction density of RV64GC is competitive,
> and that the C extension is cheap to implement for a small
> implementation.
<
My 66000 ISA has x86-64 instruction set density and is RISC with
a couple of additions--large constants (immediates, displacements,
or both).
>
> It's pretty obvious that a small implementation of RV64G is smaller
> than a small implementation of A64, and the cost of adding the C
> extension to a small implementation of RV64G (to turn it into RV64GC)
> is reported in the talk IIRC (it's on the order of 700 transistors, so
> still cheap), so you can get a small RV64GC cheaper than a small A64
> implementation and have similar code density.
<
Once you have constructed the register file(s), the integer, memory, and
floating point units, the size of the fetcher and decoder is immaterial
(noise).
<
>
> Celio also argues that instruction fusion can overcome the problem of
<
Did you incompletely cut a paragraph ?
>
> But I wonder how things turn out for larger implementations: Now
> RV64GC needs twice as many decoders to decode, say, 32 bytes of
> instructions, and then some additional hardware that selects which of
> these decodes are valid (and of course all this extra hardware costs
> energy).
<
As I argue above, the size of decoders is immaterial. But here you are
using the word decoder where you should be using the word instruction
recognizer. A lot of the logic in a superScalar implementation in the
"decoder" area is flip-flops and wires. Once an instruction is recognized
and routed to where it needs to go, then there is a lot of pipeline setup
logic at least as big as the instruction recognizer.
>
> And with instruction fusion the end result of the decoding step
> becomes even more complex.
<
If the instruction fusing is only performed on back-to-back instructions
it is not very hard. If you break this rule, it becomes at least cubic and
at most NP-complete. So, don't break that rule.
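
A sketch of the adjacency restriction (illustrative only; the uop
record and fusion table are assumptions, not actual hardware): with
fusion limited to back-to-back instructions, the decoder asks one
question per instruction boundary, so the work stays linear in the
decode width; allowing non-adjacent partners means examining every
pair plus everything in between.

  /* Hypothetical decoded-instruction record. */
  typedef struct { int opcode, rd, rs1, rs2; } Insn;

  static int in_fusion_table(int op_a, int op_b)
  {
      (void)op_a; (void)op_b;
      return 0;  /* real table: shift+add, add+load, cmp+branch, ... */
  }

  static int fusible(const Insn *a, const Insn *b)
  {
      return a->rd == b->rs1 && in_fusion_table(a->opcode, b->opcode);
  }

  static void fuse_adjacent(Insn *w, int n)
  {
      for (int i = 0; i + 1 < n; i++)
          if (fusible(&w[i], &w[i + 1]))
              i++;  /* emit one fused uop for the pair, skip its partner */
  }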
<
>
> OTOH, a high-performance A64 implementation probably wants some kind
> of register port allocator (possibly with port requirements across
> several cycles), so it has its own source of complexity. I wonder if
> that's the reason why some high-performance ARMs have microcode
> caches; I would normally have thought that microcode caches are
> unnecessary for a fixed-length instruction format.
>
> What do people more familiar with hardware design think?
<
½ of one, ½ of the other. You do what you think necessary to keep
the expensive HW occupied with work.
>
> A classic question is the classification of the architecture: Celio
> claims that ARM is CISC (using an A32 Load-multiple instruction as
> example). These questions are not decided by touchy-feely arguments,
> but rather by instruction set properties. All RISCs are load/store
> general-purpose instruction sets; even the A32 load/store-multiple
> instructions can be considered a variant of that; one might argue that
> accessing two pages in one instruction is not RISC, but by that
> criterion RISC-V is not RISC (it allows page-crossing unaligned
> accesses). One other common trait in RISCs is fixed-length 32-bit
> instructions (A32, MIPS, SPARC, 29K, 88K, Power, Alpha), but there are
> exceptions: IBM ROMP, ARM Thumb, (IIRC) MIPS-X, and now RISC-V with
> the C extension.
<
a) misaligned access SHOULD be in every architecture; it is mandated by
some features of some languages. Aligned accesses are always higher
performing, but misaligned should cost no more than 1 extra cycle and
as few as zero extra cycles. Page crossings should be allowed--however,
one should handle the situations where crossing a memory boundary is
NOT ALLOWED {MTRR, different RWE on pages, wrap from positive to
negative, ...}
<
b) It is a variant of LM and SM (Enter and EXIT) that makes My 66000
ISA size competitive with x86-64; taking what would have been dozens
of instructions and making them one. Since My 66000 has HW context
switch built in, there is already a path from the cache to the RF that
is multiple registers wide (minimum 4 registers). Might as well use
this path even in a 1-instruction-per-cycle implementation. Also note:
This path does not run through the LD/ST aligner.
<
My 66000 also contains switch instructions where a register provides
the value of the switch index, the instruction contains a limit on the
table, and the table is indexed with the result fed directly to IP
without the data going through a register or the data cache. Should
the index be out of bounds, control is delivered to default: !
<
c) My 66000 has 32-bit fixed-length instruction-specifiers followed
by 1-4 doublewords of constants needed by the instruction. One can
<ahem> decode the instruction and derive pointers to these constants,
the number of the constants, and a pointer to the next instruction in
30 total gates and 4 gates of delay.
<
There is NO JUSTIFIABLE reason that an instruction is not entirely
self-contained!
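
A sketch of the self-contained property under an assumed layout (the
actual My 66000 field positions are not given here; only the 1-4
trailing doublewords are from the text): once a few specifier bits
yield the constant count, the constant pointers and the pointer to the
next instruction follow from the specifier alone:

  #include <stdint.h>

  typedef struct {
      const uint64_t *consts;  /* trailing doubleword constants */
      int             nconsts;
      const uint32_t *next;    /* next 32-bit instruction-specifier */
  } InsnBounds;

  /* nconsts is assumed to come straight from specifier bits. */
  static InsnBounds insn_bounds(const uint32_t *ip, int nconsts)
  {
      InsnBounds b;
      b.consts  = (const uint64_t *)(ip + 1);  /* alignment assumed */
      b.nconsts = nconsts;
      b.next    = ip + 1 + 2 * nconsts;  /* a doubleword = two words */
      return b;
  }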
<
>
> An interesting point in the talk is that zero-extension is a common
> idiom in RISC-V; the talk does not explain this completely, but
> apparently the 32-bit variants of instructions sign-extend (like Alpha
> does, while AMD64 and A64 zero-extend), and the ABI passes 32-bit
> values around in sign-extended form (including 32-bit unsigned
> values).
<
My 66000 ISA has both ZE and SE plus the ability to smash a value
{which may contain more bits than its container} with ZE or SE.
<
My 66000 ABI passes everything around in 64-bit containers. The
filling of these containers performs SE or ZE as desired.

Anton Ertl

Dec 24, 2021, 2:18:28 PM
MitchAlsup <Mitch...@aol.com> writes:
>On Friday, December 24, 2021 at 11:00:14 AM UTC-6, Anton Ertl wrote:
>> Celio also argues that instruction fusion can overcome the problem of
>
>Did you incompletely cut a paragraph ?

Probably I switched to writing a different paragraph and then forgot
to return:-)

What I wanted to write:

|Celio also argues that instruction fusion can overcome the problem of
|latency chains of the simple instructions. E.g., he gives the
|example of an indexed load consisting of a shift, an add, and a load
|on RISC-V; instruction fusion can combine them into one instruction
|with the latency of one instruction.

But actually I don't think Celio discusses the latency directly, it's
just implicit in his talk.

MitchAlsup

Dec 24, 2021, 2:36:56 PM
On Friday, December 24, 2021 at 1:18:28 PM UTC-6, Anton Ertl wrote:
> MitchAlsup <Mitch...@aol.com> writes:
> >On Friday, December 24, 2021 at 11:00:14 AM UTC-6, Anton Ertl wrote:
> >> Celio also argues that instruction fusion can overcome the problem of
> >
> >Did you incompletely cut a paragraph ?
> Probably I switched to writing a different paragraph and then forgot
> to return:-)
> What I wanted to write:
>
> |Celio also argues that instruction fusion can overcome the problem of
> |latency chains of the simple instructions. E.g., he gives the
> |example of an indexed load consisting of a shift, an add, and a load
> |on RISC-V; instruction fusion can combine them into one instruction
> |with the latency of one instruction.
<
You have still "eaten" the size of the 3 instructions in the I$, and I would argue
that it takes less power to directly decode LD Rd,[Rb,Ri<<s+disp] than to
decode {SL Rt,Rindex,#shift; ADD Rt,Rt,Rb; LD Rd,[Rt+disp]}

BGB

Dec 24, 2021, 4:20:36 PM
On 12/24/2021 9:38 AM, Anton Ertl wrote:
> I have recently (despite myself) watched
> <https://www.youtube.com/watch?v=Ii_pEXKKYUg>, where Chris Celio
> compares some aspects of RV64G, RV64GC, ARM A32, ARM A64, IA-32 and
> AMD64. There is also a tech report <https://arxiv.org/abs/1607.02318>
> <https://arxiv.org/pdf/1607.02318.pdf>
>
> In this posting I look at the different approaches taken in A64 (aka
> Aarch64) and RV64GC:
>
> * RISC-V has simple instructions, e.g., with only one addressing mode
> (reg+offset); the base instruction set encodes each instruction
> in 32 bits, and the C (compressed) extension adds 16-bit
> encodings for the most frequent instructions.
>

A 16/32 encoding can save ~ 40-60% IME vs fixed 32-bit instructions.

For example: I was recently experimenting again with a "Fix32" variant
of my ISA, but building the Boot ROM in Fix32 mode basically blows out
the ROM size (had to both omit some stuff and also expand the ROM size
to 48K in this test; vs the 16/32 ISA where it still fits in 32K).

I had experimented in the past, noting that 16/24/32 doesn't save enough
over 16/32 to justify the cost.


For the main ROM, to be able to boot Fix32 on a secondary core (and be
within the 32K limit), I had to put some of the initial Boot ASM in
Fix32 mode. It detects that it is on a secondary core and then goes into
a loop where it spins indefinitely and waits for a special interrupt
(with any secondary cores being "woken up" by a special inter-processor
interrupt post-boot).


I suspect a lower reasonable limit for addressing modes is (Reg,Disp)
and (Reg,Index). Fewer than this leads to unnecessary inefficiency, and
both modes are "very common" in practice.

More elaborate modes quickly run into diminishing returns territory; so,
things like auto-increment and similar are better off left out IMO.



> * A64 has fixed-length 32-bit instructions, but they can be more
> complex: A64 has more addressing modes and additional instructions
> like load pair and store pair; in particular, the A64 architects
> seem to have had few concerns about the register read and write
> ports needed per instruction; E.g., a store-pair instruction can
> need four read ports, and a load pair instruction can need three
> write ports (AFAIK).
>

These are less of an issue if one assumes a minimum width for the core.
If the core is always at least 3-wide, this isn't an issue.

For a 1-wide core, some compromises may be needed. For example, my
Fix32 core mentioned previously is only 1-wide, so needed to omit the
MOV.X instruction. I did end up retaining Jumbo encodings, but they use
a slightly different mechanism from the 3-wide core.


I have noted I can pair a 3-wide core with a 1-wide core, and fit it
(barely) in the FPGA. Partial issue is that I am paying a fairly steep
cost for the 64B burst transfers in the L2/DDR stage, but I kinda need
these to get sufficient memory bandwidth for a RAM-backed framebuffer to
work (otherwise the screen is broken garbage).


Ironically though, the 1-wide core isn't *that* much cheaper than the
3-wide core, despite being significantly less capable.

I could maybe make the core a bit cheaper by leaving out RV64I support,
which avoids a few fairly "costly" mechanisms (Compare-And-Branch;
Arbitrary register as link-register; ...). Or, conversely, a core which
*only* does RV64I.

Still won't save much for the amount of LUTs eaten by the L1D$ and TLB
though (just these two are around 1/3 of the total LUT cost of the
1-wide core).

Meanwhile, things like the instruction decoder seem to mostly
"disappear into the noise" in this case.


> Celio argues that the instruction density of RV64GC is competitive,
> and that the C extension is cheap to implement for a small
> implementation.
>
> It's pretty obvious that a small implementation of RV64G is smaller
> than a small implementation of A64, and the cost of adding the C
> extension to a small implementation of RV64G (to turn it into RV64GC)
> is reported in the talk IIRC (it's on the order of 700 transistors, so
> still cheap), so you can get a small RV64GC cheaper than a small A64
> implementation and have similar code density.
>

Yeah, by the time you have all the other stuff in 'G', the cost of the
'C' extension is probably negligible.


> Celio also argues that instruction fusion can overcome the problem of
>
> But I wonder how things turn out for larger implementations: Now
> RV64GC needs twice as many decoders to decode, say, 32 bytes of
> instructions, and then some additional hardware that selects which of
> these decodes are valid (and of course all this extra hardware costs
> energy).
>
> And with instruction fusion the end result of the decoding step
> becomes even more complex.
>

Instruction fusion seems like a needless complexity IMO. It would be
better if the ISA can be "dense" enough to make the fusion mostly
unnecessary. Though, "where to draw the line" is probably a subject of
debate.

Granted, it could probably be justified if one has already paid the
complexity cost needed to support superscalar (since it should be able
to use the same mechanism to deal with both).

I have come to the realization though that, because my cores lack many
of the tricks that fancier RISC-V cores use, my cores running in RISC-V
mode will at best have somewhat worse performance than something like SweRV.



If I were designing a "similar" ISA to RISC-V (with no status flags), I
would probably leave out full Compare-and-Branch instructions, and
instead have a few "simpler" conditional branches, say:
BEQZ reg, label //Branch if reg==0
BNEZ reg, label //Branch if reg!=0
BGEZ reg, label //Branch if reg>=0
BLTZ reg, label //Branch if reg< 0

While conceptually, this doesn't save much, it would be cheaper to
implement in hardware. Relative compares could then use compare
instructions:
CMPx Rs, Rt, Rd
Where:
(Rs==Rt) => 0;
(Rs> Rt) => 1;
(Rs< Rt) => -1.

Though, one issue with a plain SUB is that it would not work correctly
for comparing integer values the same width as the machine registers (if
the difference is too large, the intermediate value will overflow).
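
A minimal sketch of the CMPx result above, and of why it cannot be a
bare subtraction (an illustration, not BGB's hardware): (a > b) - (a < b)
yields 1, 0, or -1 with no intermediate difference to overflow, whereas
the sign of a - b is wrong whenever the subtraction wraps:

  #include <stdint.h>

  static int64_t cmp_signed(int64_t a, int64_t b)
  {
      return (a > b) - (a < b);  /* 1, 0, or -1 */
  }
  /* SUB-based variant breaks: for a = INT64_MAX, b = -1, a - b wraps
     negative and would misreport "less than". */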


> OTOH, a high-performance A64 implementation probably wants some kind
> of register port allocator (possibly with port requirements across
> several cycles), so it has its own source of complexity. I wonder if
> that's the reason why some high-performance ARMs have microcode
> caches; I would normally have thought that microcode caches are
> unnecessary for a fixed-length instruction format.
>

I wouldn't think this necessary; if I were implementing it, I could do it
more like how I deal with 128-bit SIMD and similar in my existing core,
and map a single instruction to multiple lanes when needed.


> What do people more familiar with hardware design think?
>
> A classic question is the classification of the architecture: Celio
> claims that ARM is CISC (using an A32 Load-multiple instruction as
> example). These questions are not decided by touchy-feely arguments,
> but rather by instruction set properties. All RISCs are load/store
> general-purpose instruction sets; even the A32 load/store-multiple
> instructions can be considered a variant of that; one might argue that
> accessing two pages in one instruction is not RISC, but by that
> criterion RISC-V is not RISC (it allows page-crossing unaligned
> accesses). One other common trait in RISCs is fixed-length 32-bit
> instructions (A32, MIPS, SPARC, 29K, 88K, Power, Alpha), but there are
> exceptions: IBM ROMP, ARM Thumb, (IIRC) MIPS-X, and now RISC-V with
> the C extension.
>

IMO, Load/Store is the big issue...

Load/Store allows decent performance from a simplistic pipeline.

As soon as one deviates from Load/Store, one has effectively thrown a
hand grenade into the mix (one is doomed to either needing a much more
complex decoder, OoO, or paying a cost in terms of a significant
increase in clock-cycle counts per instruction).


In comparison, variable-length instructions and misaligned memory access
are not nearly as destructive. Granted, they are not free either. I
suspect the L1 D$ in my case would be somewhat cheaper if it did not
need to deal with misaligned access.

However, besides "cheap core", it is also nice to be able to have "fast
LZ77 decoding" and similar, which is an area where misaligned memory
access pays off.
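
For instance, the hot loop of an LZ77 match copy (a sketch of the idea,
not BGB's code) can move 8 bytes per step from an arbitrary byte
offset; a fixed-size memcpy compiles down to one misaligned load/store
pair on targets that permit such accesses:

  #include <stdint.h>
  #include <string.h>

  /* Assumes match distance >= 8 and a few bytes of output slack,
     since the loop may over-copy up to 7 bytes past len. */
  static void lz77_copy(uint8_t *dst, const uint8_t *src, size_t len)
  {
      for (size_t i = 0; i < len; i += 8)
          memcpy(dst + i, src + i, 8);  /* misaligned 64-bit load/store */
  }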

Page-crossing doesn't seem to be too big of an issue, since it is rare,
and can be handled with two TLB misses in a row if needed (only the
first TLB miss gets served; when the interrupt returns and the
instruction tries again, it causes another TLB miss).


> An interesting point in the talk is that zero-extension is a common
> idiom in RISC-V; the talk does not explain this completely, but
> apparently the 32-bit variants of instructions sign-extend (like Alpha
> does, while AMD64 and A64 zero-extend), and the ABI passes 32-bit
> values around in sign-extended form (including 32-bit unsigned
> values).
>

It is a tradeoff.

In BJX2, I went with signed values being kept in sign-extended form, and
unsigned values kept in zero-extended form (and casts requiring explicit
sign or zero extension of the results).

This is a general rule even if there are only a minority of cases where
it should matter.

BGB

Dec 24, 2021, 4:27:23 PM
On 12/24/2021 1:36 PM, MitchAlsup wrote:
> On Friday, December 24, 2021 at 1:18:28 PM UTC-6, Anton Ertl wrote:
>> MitchAlsup <Mitch...@aol.com> writes:
>>> On Friday, December 24, 2021 at 11:00:14 AM UTC-6, Anton Ertl wrote:
>>>> Celio also argues that instruction fusion can overcome the problem of
>>>
>>> Did you incompletely cut a paragraph ?
>> Probably I switched to writing a different paragraph and then forgot
>> to return:-)
>> What I wanted to write:
>>
>> |Celio also argues that instruction fusion can overcome the problem of
>> |latency chains of the simple instructions. E.g., he gives the
>> |example of an indexed load consisting of a shift, an add, and a load
>> |on RISC-V; instruction fusion can combine them into one instruction
>> |with the latency of one instruction.
> <
> You have still "eaten" the size of the 3 instructions in the I$, and I would argue
> that it takes less power to directly decode LD Rd,[Rb,Ri<<s+disp] than to
> decode {SL Rt,Rindex,#shift; ADD Rt,Rt,Rb; LD Rd,[Rt+disp]}
> <

Agreed.

Fusion seems like a band-aid, and it isn't free.

If it is handled as a special case in a superscalar decoder, even if the
fusion itself works and is "basically free", one may still pay
indirectly, in that instructions which would otherwise have executed in
parallel cannot, because the to-be-fused instructions are in the way.

MitchAlsup

Dec 24, 2021, 5:06:25 PM
On Friday, December 24, 2021 at 3:20:36 PM UTC-6, BGB wrote:
> On 12/24/2021 9:38 AM, Anton Ertl wrote:
<snip>
>
> I suspect a lower reasonable limit for addressing modes is (Reg,Disp)
> and (Reg,Index). Fewer than this leads to unnecessary inefficiency, and
> both modes are "very common" in practice.
<
I added [Rbase+Rindex<<scale+disp] even if it is only 2%-3% because
it is expressive (more to FORTRAN needs than C needs) and it saves
instruction space. I provided ONLY the above two (2) in M88K and
later when doing x86-64 found out the error of my ways.
<
The least one can add is [Rbase+Rindex<<scale]
>
> More elaborate modes quickly run into diminishing returns territory; so,
> things like auto-increment and similar are better off left out IMO.
<
Agreed except for [Rbase+Rindex<<scale+disp]
<
<snip again>
> Ironically though, the 1-wide core isn't *that* much cheaper than the
> 3-wide core, despite being significantly less capable.
>
> I could maybe make the core a bit cheaper by leaving out RV64I support,
> which avoids a few fairly "costly" mechanisms (Compare-And-Branch;
> Arbitrary register as link-register; ...). Or, conversely, a core which
> *only* does RV64I.
<
When Compare-and-Branch is correctly specified it can be performed
in 1 pipeline cycle. When specified without looking "at the gates",
it seldom is!
>
> Still won't save much for the amount of LUTs eaten by the L1D$ and TLB
> though (just these two are around 1/3 of the total LUT cost of the
> 1-wide core).
>
<snip>
> >
> > And with instruction fusion the end result of the decoding step
> > becomes even more complex.
> >
> Instruction fusion seems like a needless complexity IMO. It would be
> better if the ISA can be "dense" enough to make the fusion mostly
> unnecessary. Though, "where to draw the line" is probably a subject of
> debate.
<
Instruction fusing is worthwhile on a 1-wide machine: with a certain
pipeline organization and a few other tricks, it can execute 2
instructions per cycle.
<
In my 1-wide My 66130 core, I can CoIssue several instruction pairs:
a) any instruction and an unconditional branch
b) any result delivering instruction and a conditional branch or predicate
consuming the result.
c) Store instruction followed by a calculation instruction such that the
total number of register reads is <= 3.
d) any non-branch instruction followed by an unconditional branch
instruction, allowing the branch to be taken at an execution cost of
ZERO cycles.
<
An Excel spreadsheet shows up to 30% performance advantage over a
strict 1-wide machine. {This is within spitting distance of the gain of
In Order 2-wide over In Order 1-wide with less than 20% of the cost}
>
> Granted, it could probably be justified if one has already paid the
> complexity cost needed to support superscalar (since it should be able
> to use the same mechanism to deal with both).
<
I justified it not because of SuperScalar infrastructure (which I do not have)
but because My 66000 ISA <practically> requires an instruction fetch buffer;
so, the vast majority of the time the instructions are present and simply
waiting for the pipeline to get around to them.
>
> I have come to the realization though that, because my cores lack many
> of the tricks that fancier RISC-V cores use, my cores running in RISC-V
> mode will at best have somewhat worse performance than something like SweRV.
<
Yes, not being like the others has certain disadvantages, too.
>
>
>
> If I were designing a "similar" ISA to RISC-V (with no status flags), I
<
Before you head in this direction: I find the RISC-V ISA rather dreadful and
only barely serviceable.
<
> would probably leave out full Compare-and-Branch instructions, and
> instead have a few "simpler" conditional branches, say:
> BEQZ reg, label //Branch if reg==0
> BNEZ reg, label //Branch if reg!=0
> BGEZ reg, label //Branch if reg>=0
> BLTZ reg, label //Branch if reg< 0
>
> While conceptually, this doesn't save much, it would be cheaper to
> implement in hardware.
<
Having done both, I can warn you that your assumption is filled with badly
formed misconceptions. From a purity standpoint you do have a point;
from a gate count perspective and a cycle time perspective you do not.
<
> Relative compares could then use compare
> instructions:
> CMPx Rs, Rt, Rd
> Where:
> (Rs==Rt) => 0;
> (Rs> Rt) => 1;
> (Rs< Rt) => -1.
>
> Though, one issue with a plain SUB is that it would not work correctly
> for comparing integer values the same width as the machine registers (if
> the difference is too large, the intermediate value will overflow).
<
Which is why one needs CMP instructions and not to rely on SUB to do 98%
of the work.
<
> > OTOH, a high-performance A64 implementation probably wants some kind
> > of register port allocator (possibly with port requirements across
> > several cycles), so it has its own source of complexity. I wonder if
> > that's the reason why some high-performance ARMs have microcode
> > caches; I would normally have thought that microcode caches are
> > unnecessary for a fixed-length instruction format.
> >
> I wouldn't think this necessary, if I were implementing it I could do it
> more like how I deal with 128-bit SIMD and similar in my existing core,
> and map a single instruction to multiple lanes when needed.
> > What do people more familiar with hardware design think?
> >
> > A classic question is the classification of the architecture: Celio
> > claims that ARM is CISC (using an A32 Load-multiple instruction as
> > example). These questions are not decided by touchy-feely arguments,
> > but rather by instruction set properties. All RISCs are load/store
> > general-purpose instruction sets; even the A32 load/store-multiple
> > instructions can be considered a variant of that; one might argue that
> > accessing two pages in one instruction is not RISC, but by that
> > criterion RISC-V is not RISC (it allows page-crossing unaligned
> > accesses). One other common trait in RISCs is fixed-length 32-bit
> > instructions (A32, MIPS, SPARC, 29K, 88K, Power, Alpha), but there are
> > exceptions: IBM ROMP, ARM Thumb, (IIRC) MIPS-X, and now RISC-V with
> > the C extension.
> >
> IMO, Load/Store is the big issue...
>
> Load/Store allows decent performance from a simplistic pipeline.
<
Ahem: 4-wide machines do not have simplistic pipelines, so the burden is
only on the lesser implementations and nothing on the higher performance
ones.
>
> As soon as one deviates from Load/Store, one has effectively thrown a
> hand grenade into the mix (one is doomed to either needing a much more
> complex decoder, OoO, or paying a cost in terms of a significant
> increase in clock-cycle counts per instruction).
>
There is the multi-fire reservation station trick used on Athlon and Opteron,
which pretty much makes the problem vanish.
>
> In comparison, variable-length instructions and misaligned memory access
> are not nearly as destructive. Granted, they are not free either. I
> suspect the L1 D$ in my case would be somewhat cheaper if it did not
> need to deal with misaligned access.
<
I would assert that they are better than FREE. They add more performance
than the added gates to do these things cost.
>
> However, besides "cheap core", it is also nice to be able to have "fast
> LZ77 decoding" and similar, which is an area where misaligned memory
> access pays off.
<
¿dynamic bitfields?
>
> Page-crossing doesn't seem to be too big of an issue, since it is rare,
> and can be handled with two TLB misses in a row if needed (only the
> first TLB miss gets served; when the interrupt returns and the
> instruction tries again, it causes another TLB miss).
<
It is only time, and 98%-99% of the time these crossings don't happen
{98% in 4KB pages, 99% in 8KB pages}.
<
> > An interesting point in the talk is that zero-extension is a common
> > idiom in RISC-V; the talk does not explain this completely, but
> > apparently the 32-bit variants of instructions sign-extend (like Alpha
> > does, while AMD64 and A64 zero-extend), and the ABI passes 32-bit
> > values around in sign-extended form (including 32-bit unsigned
> > values).
> >
> It is a tradeoff.
>
> In BJX2, I went with signed values being kept in sign-extended form, and
> unsigned values kept in zero-extended form (and casts requiring explicit
> sign or zero extension of the results).
<
The tradeoff is even harder where the linker fills in the "upper" bits of a
register. If the underlying premise is sign extension, one COULD need 2
registers to hold the "same value" for 2 different large address constants.
<
If the underlying premise is zero extension, certain bit pasting ways using
+ need to be changed to use |.
<
Another reason not to make SW, of any sort, have to paste bits together to
make <larger> constants.
>
> This is a general rule even if there are only a minority of cases where
> it should matter.
<
More than you allude to.

MitchAlsup

Dec 24, 2021, 8:33:35 PM
On Friday, December 24, 2021 at 4:06:25 PM UTC-6, MitchAlsup wrote:
> On Friday, December 24, 2021 at 3:20:36 PM UTC-6, BGB wrote:
> > On 12/24/2021 9:38 AM, Anton Ertl wrote:
> <snip>

> > Ironically though, the 1-wide core isn't *that* much cheaper than the
> > 3-wide core, despite being significantly less capable.
> >
> > I could maybe make the core a bit cheaper by leaving out RV64I support,
> > which avoids a few fairly "costly" mechanisms (Compare-And-Branch;
> > Arbitrary register as link-register; ...). Or, conversely, a core which
> > *only* does RV64I.
> <
> When Compare-and-Branch is correctly specified it can be performed
> in 1 pipeline cycle. When specified without looking "at the gates",
> it seldom is!
> >
> > Still won't save much for the amount of LUTs eaten by the L1D$ and TLB
> > though (just these two are around 1/3 of the total LUT cost of the
> > 1-wide core).
<
For example: With the combined register file of My 66000 ISA, and in
consideration of a lower-end design point, I changed the pipeline by
adding a stage between LD-Align and WriteBack--this stage is
appropriately known as WAIT!
<
WAIT allows me to pipeline FP operations into clock stages of
2-clocks each. This gains 5 more gates of delay; so, a 16 gate machine
gets 16+5+16 gates per FP clock cycle. This should make FP take fewer
cycles (seen at the main pipeline rate) for 1-wide implementations.
<
The integer pipeline looks like:        | EXECUTE | CACHE | ALIGN | WAIT |
The floating point pipeline looks like: | EXECUTE1        | EXECUTE2     |
>
One can put a comparison operation in WAIT so that, for conditional
branches, the comparison is ready by the time the branch needs it.
<
Most FP calculations have memory references interspersed with FP
calculation instructions, so ½ piping the FP unit causes little performance
loss and there are gains made in other places.
> >

Ivan Godard

Dec 24, 2021, 11:37:14 PM
On 12/24/2021 10:17 AM, MitchAlsup wrote:

> c) My 66000 has 32-bit fixed-length instruction-specifiers followed
> by 1-4 doublewords of constants needed by the instruction. One can
> <ahem> decode the instruction and derive pointers to these constants,
> the number of the constants, and a pointer to the next instruction in
> 30 total gates and 4 gates of delay.
> <
> There is NO JUSTIFIABLE reason that an instruction is not entirely
> self-contained!

Really? I suppose that might be true for OOO that is emulating
sequential execution, but what about VLIW and other wide multi-issue?
Chopping off the variable and large parts so they can be recognized in
parallel lets the issue width no longer depend on how many constants you
have.

robf...@gmail.com

Dec 25, 2021, 3:24:58 AM
The My66000 sounds great. I may borrow some ideas from it.

In Thor2021 if the constant is larger than 23 bits, then prefixes are used to extend it. The primary reason to do this was to keep instructions less than 64 bits wide, so the redundancy in the cache was limited. Cache lines overflow to support 16-bit alignment, with data crossing a cache line stored in two places. A second reason was that prefixes allow the constant to extend beyond 64 bits, to 128 or 256 bits. That should make it possible to do 256-bit wide operations with constants, if needed. 128-bit constants may be needed for the floating-point.
However, there is some room in the opcode space to include a slew of instructions that may use a 64-bit constant directly. So, I am debating with myself whether to support larger constants.

The current core has only a 128-bit wide data bus to the outside world, ROM and DRAM. Cache lines are burst loaded across this bus using a 4-word burst. Writes write one 64-bit word at a time across the bus.
In the path from the cache to the register file, loading a whole cache line’s worth of data would definitely improve performance for context switches. However, if the context is not in cache then performance would be limited by the external read. I may modify things to be able to store 128-bit register pairs at once across the external bus. So, I think an octet register load instruction, and a register pair store instruction, may help performance. These could be wrapped into micro-coded instructions to save or restore all registers. There is only a small data cache (32kB), so with frequent access to external memory I think I may get just about as much mileage out of a pair load instruction.

It is interesting because, by making the register file access path wider in an FPGA, the depth remains the same because the file is made out of LUT RAMs. It costs more transistors and power. If the read/write path is eight registers wide then there could be 8 sets of 64 registers, because LUT RAMs are 64-deep anyways.

Thomas Koenig

Dec 25, 2021, 6:14:03 AM
MitchAlsup <Mitch...@aol.com> schrieb:
> On Friday, December 24, 2021 at 3:20:36 PM UTC-6, BGB wrote:
>> On 12/24/2021 9:38 AM, Anton Ertl wrote:
><snip>
>>
>> I suspect a lower reasonable limit for addressing modes is (Reg,Disp)
>> and (Reg,Index). Fewer than this leads to unnecessary inefficiency, and
>> both modes are "very common" in practice.
><
> I added [Rbase+Rindex<<scale+disp] even if it is only 2%-3% because
> it is expressive (more to FORTRAN needs than C needs) and it saves
> instruction space. I provided ONLY the above two (2) in M88K and
> later when doing x86-64 found out the error of my ways.
><
> The least one can add is [Rbase+Rindex<<scale]

Depending on how pressed for bits in opcodes one is, the scale
could also be implied in the size of the data, so a 64-bit word
would have either scale = 0 or scale == 3. I think HP-PA did that.

This would cover the most common case of linear array access.

>>
>> More elaborate modes quickly run into diminishing returns territory; so,
>> things like auto-increment and similar are better off left out IMO.
><
> Agreed except for [Rbase+Rindex<<scale+disp]

In your ISA, you specified the disp as either a 32 or a 64-bit quantity.
So, any offset in the range between -2^31 and 2^31-1 needs two 32-bit
words.

Looking at the (probably more common) case where the offset can fit
into (for example) 16 bits, a good alternative might be

LEA Rx,[Rbase+Rindex<<scale]
LD Rx,[Rx+offset]

which also uses two words and does not use an extra register.
The use case would then be pretty small so that I would probably
tend to leave it out.

Ivan Godard

Dec 25, 2021, 11:08:30 AM
On 12/25/2021 3:14 AM, Thomas Koenig wrote:
> MitchAlsup <Mitch...@aol.com> schrieb:
>> On Friday, December 24, 2021 at 3:20:36 PM UTC-6, BGB wrote:
>>> On 12/24/2021 9:38 AM, Anton Ertl wrote:
>> <snip>
>>>
>>> I suspect a lower reasonable limit for addressing modes is (Reg,Disp)
>>> and (Reg,Index). Fewer than this leads to unnecessary inefficiency, and
>>> both modes are "very common" in practice.
>> <
>> I added [Rbase+Rindex<<scale+disp] even if it is only 2%-3% because
>> it is expressive (more to FORTRAN needs than C needs) and it saves
>> instruction space. I provided ONLY the above two (2) in M88K and
>> later when doing x86-64 found out the error of my ways.
>> <
>> The least one can add is [Rbase+Rindex<<scale]
>
> Depending on how pressed for bits in opcodes one is, the scale
> could also be implied in the size of the data, so a 64-bit word
> would have either scale = 0 or scale == 3. I think HP-PA did that.
>
> This would cover the most common case of linear array access.

Mill does that. Pretty much all the cases that don't work involve
picking a field out of an indexed element of an array of structs, and
nearly all of those require a shift large enough that it wouldn't fit in
the scale encoding of ISAs like x86 anyway.

Anton Ertl

Dec 25, 2021, 11:46:01 AM
MitchAlsup <Mitch...@aol.com> writes:
>On Friday, December 24, 2021 at 1:18:28 PM UTC-6, Anton Ertl wrote:
>> MitchAlsup <Mitch...@aol.com> writes:
>> >On Friday, December 24, 2021 at 11:00:14 AM UTC-6, Anton Ertl wrote:
>> >> Celio also argues that instruction fusion can overcome the problem of
>> >
>> >Did you incompletely cut a paragraph ?
>> Probably I switched to writing a different paragraph and then forgot
>> to return:-)
>> What I wanted to write:
>>
>> |Celio also argues that instruction fusion can overcome the problem of
>> |latency chains of the simple instructions. E.g., he gives the
>> |example of an indexed load consisting of a shift, an add, and a load
>> |on RISC-V; instruction fusion can combine them into one instruction
>> |with the latency of one instruction.
><
>You have still "eaten" the size of the 3 instructions in the I$

Celio argues, and supports it with numbers, that RV64GC is competitive
with A64 in code size.

>and I would argue
>that it takes less power to directly decode LD Rd,[Rb,Ri<<s+disp] than to
>decode {SL Rt,Rindex,#shift; ADD Rt,Rt,Rb; LD Rd,[Rt+disp]}

Yes, energy is certainly an issue. The question is how do RV64GC
decoders compare to A64 decoders for similar-performance cores on
their usual instruction mix.

It seems to me that the RV64GC might benefit more from a uop cache.
And given that high-performance A64 implementations have one,
competing RV64GC implementations will probably have one, too.

EricP

Dec 25, 2021, 12:00:58 PM
Picking the imaginary field out of a complex number vector requires
the full [Rbase+Rindex<<scale+disp] but also requires a separate
scale field that allows *2 of the largest FP size.

I was considering leaving holes for future FP128 data type so that
implies a LEA instruction that allows scale of 1,2,4,8,16,32.
Rounding up the scale field size to 3 bits adds 64,128.

This would probably sit well with that whole quaternion and octonion crowd.


Ivan Godard

Dec 25, 2021, 12:28:16 PM
Yep, that's a case where scaling by size doesn't work. However, random
indices into complex arrays are rare; usually you are indexing over a
loop. Sometimes the compiler can build the scaling into the control
variables. When it can't, it just puts the shift into the prior bundle.
On a wide issue machine that's free.

In our code samples, the only place where it makes a difference is in
sporadic references outside of loops: "A[F()].f". Those are rare enough
that we decided to save the entropy.

Anton Ertl

Dec 25, 2021, 12:52:00 PM
BGB <cr8...@gmail.com> writes:
>> * A64 has fixed-length 32-bit instructions, but they can be more
>> complex: A64 has more addressing modes and additional instructions
>> like load pair and store pair; in particular, the A64 architects
>> seem to have had few concerns about the register read and write
>> ports needed per instruction; E.g., a store-pair instruction can
>> need four read ports, and a load pair instruction can need three
>> write ports (AFAIK).
>>
>
>These are less of an issue if one assumes a minimum width for the core.
>If the core is always at least 3-wide, this isn't an issue.

Why? Sure, you could have one instruction port that supports up to
four reads and the others only support two, and one that supports
three writes, and the others only support one, but that would still be
8 register reads and 5 writes per cycle, and my impression was that
having many register ports is really expensive.

There is also the option of reducing register port requirements
through the bypasses, but what do you do if there is not enough
bypassed data?

>Instruction fusion seems like a needless complexity IMO. It would be
>better if the ISA can be "dense" enough to make the fusion mostly
>unnecessary.

Well, Celio's position is that RISC-V gets code density through the C
extension and uop density through instruction fusion.

In the past we have seen many cases where it was more beneficial to
solve things in microarchitecture rather than architecture: caches
(rather than explicit fast memory), OoO rather than EPIC, dynamic
branch prediction rather than delay slots and static branch
prediction, bypasses rather than transport-triggered architecture,
microarchitectural memory disambiguation rather than IA-64's ALAT.

Some of the benefits of putting features in the microarchitecture are
that

1) You can easily remove them (even within one implementation, by
using the appropriate chicken bit), while architectural features are
forever.

2) When you add them, existing code can benefit from them, while
architectural features need at least a recompilation.

RISC seemed to be the converse case of exposing things in the
architecture that, say, a VAX implementation does in the
microarchitecture, but one could also see VAX as another case of
exposing the microarchitecture in the architecture: the
microarchitecture was microcoded (because the microcode store was
faster than main memory at the time; but that's actually another
example of explicit fast memory mentioned above), so architects
designed instruction sets that made heavy use of microcode.

>If I were designing a "similar" ISA to RISC-V (with no status flags), I
>would probably leave out full Compare-and-Branch instructions, and
>instead have a few "simpler" conditional branches, say:
> BEQZ reg, label //Branch if reg==0
> BNEZ reg, label //Branch if reg!=0
> BGEZ reg, label //Branch if reg>=0
> BLTZ reg, label //Branch if reg< 0
>
>While conceptually, this doesn't save much, it would be cheaper to
>implement in hardware. Relative compares could then use compare
>instructions:
> CMPx Rs, Rt, Rd
>Where:
> (Rs==Rt) => 0;
> (Rs> Rt) => 1;
> (Rs< Rt) => -1.
>
>Though, one issue with a plain SUB is that it would not work correctly
>for comparing integer values the same width as the machine registers (if
>the difference is too large, the intermediate value will overflow).

Take a look at how Mitch Alsup solved this in the 88000.

>However, besides "cheap core", it is also nice to be able to have "fast
>LZ77 decoding" and similar, which is an area where misaligned memory
>access pays off.

If you know that you are doing misaligned accesses, composing a
possibly misaligned load from two loads and a combining instruction
(or a few) looks like a simpler approach.
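
A sketch of that composition (cf. the old MIPS lwl/lwr pair, which
exposed the combining step architecturally), assuming little-endian
byte order and that both aligned doublewords covering the address may
be read:

  #include <stdint.h>

  /* base: aligned doubleword containing the address; off: 0..7. */
  static uint64_t load_unaligned_u64(const uint64_t *base, unsigned off)
  {
      uint64_t lo = base[0], hi = base[1];  /* two aligned loads */
      unsigned s = 8 * off;
      return s ? (lo >> s) | (hi << (64 - s)) : lo;  /* combine */
  }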

I think the reason why everyone has switched to supporting misaligned
accesses is that there is too much software out there that needs it.

Also, you really want it for SIMD stuff (and your LZ77 example is
actually an example of that), so you need to implement it anyway, and
can just as well extend it to the other memory accesses.

>Page-crossing doesn't seem to be too big of an issue, since it is rare,
>and can be handled with two TLB misses in a row if needed (only the
>first TLB miss gets served; when the interrupt returns and the
>instruction tries again, it causes another TLB miss).

If you deal with TLB misses through interrupts, you have to make sure
that the first translation is still in the TLB when the second miss
returns, or have some other way to guarantee eventual progress.

>In BJX2, I went with signed values being kept in sign-extended form, and
>unsigned values kept in zero-extended form

That seems like a natural way to do it. However, I am not sure if
that works well in the presence of type mismatches between caller and
callee in production C code. At least I imagine that's the reason why
ABIs force one kind of extension, and then do the appropriate
extension for the defined type at the usage site if necessary.

MitchAlsup

Dec 25, 2021, 1:06:46 PM
IIUC: The packet of instructions on a Mill obeys the concept of being
"self-contained". It may have bits hither and yon, but they all show up
from a single fetch.

MitchAlsup

Dec 25, 2021, 1:12:37 PM
On Saturday, December 25, 2021 at 5:14:03 AM UTC-6, Thomas Koenig wrote:
> MitchAlsup <Mitch...@aol.com> schrieb:
> > On Friday, December 24, 2021 at 3:20:36 PM UTC-6, BGB wrote:
> >> On 12/24/2021 9:38 AM, Anton Ertl wrote:
> ><snip>
> >>
> >> I suspect a lower reasonable limit for addressing modes is (Reg,Disp)
> >> and (Reg,Index). Fewer than this leads to unnecessary inefficiency, and
> >> both modes are "very common" in practice.
> ><
> > I added [Rbase+Rindex<<scale+disp] even if it is only 2%-3% because
> > it is expressive (more to FORTRAN needs than C needs) and it saves
> > instruction space. I provided ONLY the above two (2) in M88K and
> > later when doing x86-64 found out the error of my ways.
> ><
> > The least one can add is [Rbase+Rindex<<scale]
> Depending on how pressed for bits in opcodes one is, the scale
> could also be implied in the size of the data, so a 64-bit word
> would have either scale = 0 or scale == 3. I think HP-PA did that.
<
Without "+disp" I did this in M88K.
>
> This would cover the most common case of linear array access.
> >>
> >> More elaborate modes quickly run into diminishing returns territory; so,
> >> things like auto-increment and similar are better off left out IMO.
> ><
> > Agreed except for [Rbase+Rindex<<scale+disp]
> In your ISA, you specified the disp as either a 32 or a 64-bit quantity.
> So, any offset in the range between -2^31 and 2^31-1 needs two 32-bit
> words.
<
My 66000 has:
a) [Rbase+disp16] that fits in one 32-bit word
b) [Rbase+Rindex<<scale] that fits in one 32-bit word
c) [Rbase+Rindex<<scale+disp32] that fits in two 32-bit words
d) [Rbase+Rindex<<scale+disp64] that fits in three 32-bit words
<
Displacements are sign extended so the problem above is moot.

Ivan Godard

Dec 25, 2021, 1:52:41 PM
vs:
[Rbase]
[Rbase+disp]
[Rbase+Rindex(<<scale)]
[Rbase+Rindex(<<scale)+disp]
<all above + Rpredicate>

where disp is 0/1/2/4 bytes, R's are belt width (4 bits on Silver),
scale is one bit, opCode varies by slot - 11 bits typical. Total 16-56 bits.

I don't see the use of disp64, though I suppose it falls out for free
from your encoding of wide literals.

BTW, virtually all cases in our corpus are either disp=0 or disp=8, i.e.
16 or 24 bits total (plus index/scale/predicate if any).

BGB

Dec 25, 2021, 2:01:39 PM
On 12/25/2021 5:14 AM, Thomas Koenig wrote:
> MitchAlsup <Mitch...@aol.com> schrieb:
>> On Friday, December 24, 2021 at 3:20:36 PM UTC-6, BGB wrote:
>>> On 12/24/2021 9:38 AM, Anton Ertl wrote:
>> <snip>
>>>
>>> I suspect a lower reasonable limit for addressing modes is (Reg,Disp)
>>> and (Reg,Index). Fewer than this leads to unnecessary inefficiency, and
>>> both modes are "very common" in practice.
>> <
>> I added [Rbase+Rindex<<scale+disp] even if it is only 2%-3% because
>> it is expressive (more to FORTRAN needs than C needs) and it saves
>> instruction space. I provided ONLY the above two (2) in M88K and
>> later when doing x86-64 found out the error of my ways.
>> <
>> The least one can add is [Rbase+Rindex<<scale]
>
> Depending on how pressed for bits in opcodes one is, the scale
> could also be implied in the size of the data, so a 64-bit word
> would have either scale = 0 or scale == 3. I think HP-PA did that.
>
> This would cover the most common case of linear array access.
>

This is also how it works on BJX2.
In most cases, the scale is the element size.

The main exceptions are:
PC, GBR, or TBR is encoded as a base register (byte scale);
R1 is used as an index, which is interpreted as an unscaled R0.


It is possible to encode an explicit scale and (Rb+Ri*Sc+Disp) mode via
an Op64 encoding, but I have left this as optional. I may split these
into two different sub-cases, where the ability to encode a scaled
displacement is treated as separate from the ability to do so with a
non-zero displacement field.


Current scales are 1, 2, 4, and 8.
The MOV.X (128-bit) load/store ops use a scale of 8, which works well
for loading/storing a register pair (primary use case), but is a little
annoying for indexing an array of 128-bit elements (requires a 1-bit
left shift for the index); but this situation is infrequent.

>>>
>>> More elaborate modes quickly run into diminishing returns territory; so,
>>> things like auto-increment and similar are better off left out IMO.
>> <
>> Agreed except for [Rbase+Rindex<<scale+disp]
>
> In your ISA, you specified the disp as either a 32 or a 64-bit quantity.
> So, any offset in the range between -2^31 to 2^31-1 needs two 32-bit
> words.
>
> Looking at the (probably more common) case where the offset can fit
> into (for example) 16 bits, a good alternative might be
>
> LEA Rx,[Rbase+Rindex<<scale]
> LD Rx,[Rx+offset]
>
> which also uses to words and does not use an extra register.
> The use case would then be pretty small so that I would probably
> tend to leave it out.

Such is the issue...

A case which will occur fairly infrequently and is trivially encoded in
a 2 op sequence with only a single cycle of penalty, doesn't make a
particularly strong case.



My usual heuristics are based on:
Is it common?
Is it expensive to encode otherwise?
Does it appear cheap or free as a result of an existing mechanism?
...

So, particularly common cases will end up with more specialized encodings.

Things which are useful or otherwise expensive to implement in software
may get special instructions.

Likewise for things which trivially map to an existing unit or similar
(and have enough of a use-case to justify adding them).


In my case, my decoder works in two sub-stages:
Lookup the internal opcode numbers, form type, ... based on instruction
bits;
Unpack the instruction as given in the instruction form.

The former seems to primarily depend on which and how many bits are
considered in the lookup, and how many bits of output are produced. It
does not seem to care about how many patterns are assigned within this
space.

The latter case mostly cares about how many possible ways exist to
decode the instruction.

Overall decoder cost is moderate. For a 3-wide decoder with both BJX2
and RISC-V support, I am looking at around 4K LUTs. This is less than
either the Lane 1 ALU (6K LUT) or the L1 D$ (8K LUT).


The Lane 2/3 ALUs are a lot smaller, though; they deal with less.
They only do integer arithmetic.

Lane 1 ALU does:
Arithmetic ops;
Compare;
Various format conversions;
CLZ / CTZ;
UTX1 decode;
...

BGB

Dec 25, 2021, 4:07:40 PM
On 12/24/2021 4:06 PM, MitchAlsup wrote:
> On Friday, December 24, 2021 at 3:20:36 PM UTC-6, BGB wrote:
>> On 12/24/2021 9:38 AM, Anton Ertl wrote:
> <snip>
>>
>> I suspect a lower reasonable limit for addressing modes is (Reg,Disp)
>> and (Reg,Index). Fewer than this leads to unnecessary inefficiency, and
>> both modes are "very common" in practice.
> <
> I added [Rbase+Rindex<<scale+disp] even if it is only 2%-3% because
> it is expressive (more to FORTRAN needs than C needs) and it saves
> instruction space. I provided ONLY the above two (2) in M88K and
> later when doing x86-64 found out the error of my ways.
> <
> The least one can add is [Rbase+Rindex<<scale]

There is an implicit scale implied with the Index, but in my case it
is element-sized in the 32-bit encoding.

In BJX2, there is a [Rbase + Rindex<<Sc + Disp] mode, but:
It is optional;
It requires an Op64 encoding;
My compiler isn't smart enough to utilize it effectively.


It is a little cheaper to support:
[Rbase + Rindex<<Sc]
Encoded as:
[Rbase + Rindex<<Sc + 0]


Which doesn't require any additional machinery, but then I need to
distinguish these cases (both cases use the same encoding, just in the
latter case, the displacement is Zero). It is partly definition, but
also has a minor impact in terms of how I define things in terms of the
CPUID feature flags and similar.


>>
>> More elaborate modes quickly run into diminishing returns territory; so,
>> things like auto-increment and similar are better off left out IMO.
> <
> Agreed except for [Rbase+Rindex<<scale+disp]
> <

Granted, it is in a gray area.

Though, lacking this case is not nearly as adverse as lacking
(Reg,Index) or (Reg,Index*Sc) ...



> <snip again>
>> Ironically though, the 1-wide core isn't *that* much cheaper than the
>> 3-wide core, despite being significantly less capable.
>>
>> I could maybe make the core a bit cheaper by leaving out RV64I support,
>> which avoids a few fairly "costly" mechanisms (Compare-And-Branch;
>> Arbitrary register as link-register; ...). Or, conversely, a core which
>> *only* does RV64I.
> <
> When Compare-and-Branch is correctly specified it can be performed
> in 1 pipeline cycle. When specified without looking "at the gates",
> it seldom is!

In the present form, the determination is made in EX1, and the branch is
initiated in EX2 (invalidating whatever is in EX1 at the time). Pushing
it later would likely require using an interlock or similar (so as to
not allow any instructions into the pipeline until the branch direction
can be determined).


The RISC-V style compare-and-branch instructions at present come with a
fairly steep cost in terms of LUTs and timing; otherwise, the results
of the compare would not be available until 1 cycle later.


>>
>> Still wont save much for the amount of LUTs eaten by the L1D$ and TLB
>> though (just these two are around 1/3 of the total LUT cost of the
>> 1-wide core).
>>
> <snip>
>>>
>>> And with instruction fusion the end result of the decoding step
>>> becomes even more complex.
>>>
>> Instruction fusion seems like a needless complexity IMO. It would be
>> better if the ISA can be "dense" enough to make the fusion mostly
>> unnecessary. Though, "where to draw the line" is probably a subject of
>> debate.
> <
> Instruction fusing is worthwhile on a 1-wide machine with certain pipeline
> organization to execute 2 instructions per cycle and a few other tricks.
> <
> In my 1-wide My 66130 core, I can CoIssue several instruction pairs:
> a) any instruction and an unconditional branch
> b) any result delivering instruction and a conditional branch or predicate
> consuming the result.
> c) Store instruction followed by a calculation instruction such that the
> total number of register reads is <= 3.
> d) any non-branch instruction followed by an unconditional branch
> instruction allows the branch to be taken at an execution cost of ZERO
> cycles.
> <
> An eXcel spreadsheet shows up to 30% performance advantage over a
> strict 1-wide machine. {This is within spitting distance of the gain of
> In Order 2-wide over In Order 1-wide with less than 20% of the cost}

OK. My existing core does not support (or perform) any sort of
instruction fusion.


Things like instruction fetch / PC-step are determined in the 'IF'
stage, but this is done in terms of instruction bits and mode flags,
rather than recognizing any instructions in particular.

It seems like one could fetch a block of instructions and then include
an "offset" based on how many instructions were executed from the prior
cycle; however this seems more complicated than the current approach,
and for a machine with 96 bit bundles would likely require fetching in
terms of 256 bit blocks rather than the 96 bits corresponding to the
current bundle (since one likely wouldn't know the exact position of the
bundle, or how far to adjust PC, until ID1 or similar).


>>
>> Granted, it could probably be justified if one has already paid the
>> complexity cost needed to support superscalar (since it should be able
>> to use the same mechanism to deal with both).
> <
> I justified it not because of SuperScalar infrastructure (which I do not have)
> But because My 66000 ISA <practically> requires an instruction fetch buffer;
> So, the vast majority of the time the instructions are present and simply
> waiting for the pipeline to get around to them.

OK. I am still using an approach more like:
Send PC to the I$ during PF;
Get a 96 bit bundle and PC_step during IF;
Advance PC and feed back into I$ (PF for next cycle).

This works, but requires things to be known exactly during IF, and does
not deal well with any uncertainty here.

Similarly, trying to stick either fusion or superscalar logic into the
IF stage seems problematic (cost, and if there are any false positives,
the pipeline state is basically hosed).

>>
>> I have come to the realization though that, because my cores lack many
>> of the tricks that fancier RISC-V cores use, my cores running in RISC-V
>> mode will at-best have somewhat worse performance than something like SweRV.
> <
> Yes, not being like the others has certain disadvantages, too.

Yeah. I can make the BJX2 core run RISC-V, but not "well". To do better
would require doing a core specifically for running RISC-V, but then
there is little purpose to do so apart from interop with my existing
cores on a common bus.

My recent "minicore" also aims to support RISC-V, but is at least
"slightly less absurd" in that it is closer to the common featureset of
BJX2 and RISC-V (it is just sort of a crappy subset of both ISAs). Its
performance isn't likely to be all that impressive though in either case.

>>
>>
>>
>> If I were designing a "similar" ISA to RISC-V (with no status flags), I
> <
> Before you head in this direction, I find RISC-V ISA rather dreadful and
> only barely serviceable.
> <

Possibly, there are things I am not a fan of.

Main reasons I am bothering with it at all:
It is much more popular;
It is supported by GCC;
It (sorta) maps onto my existing pipeline...
Or at least RV64I and RV64IC.
Pretty much everything beyond this is not really 1:1.


However, there are issues:
Pure RISC-V isn't really able to operate the BJX2 core;
BGBCC doesn't yet have a RISC-V target (or ELF object files);
...

This basically limits cross-ISA interaction at present mostly to ASM blobs.


I have yet to work out specifics for how I will do cross ISA interfacing
to a sufficient degree to actually load up RISC-V binaries on the BJX2
core (it can already be done in theory, but they can't do a whole lot as
of yet).

The harder case would be bare-metal, which is basically what I would need
to have any real hope of a Linux port or similar; but this would require
the ability to link object files between compilers and across ISA
boundaries.

The simpler route would just sorta be to load RISC-V ELF binaries
into TestKern.



Trying to port the Linux kernel to build on BGBCC is likely no-go.

Though, even if I did so, most existing Linux on RISC-V ports assume
RV64G or RV64GC, which is a bit out-of-reach at present.


>> would probably leave out full Compare-and-Branch instructions, and
>> instead have a few "simpler" conditional branches, say:
>> BEQZ reg, label //Branch if reg==0
>> BNEZ reg, label //Branch if reg!=0
>> BGEZ reg, label //Branch if reg>=0
>> BLTZ reg, label //Branch if reg< 0
>>
>> While conceptually, this doesn't save much, it would be cheaper to
>> implement in hardware.
> <
> Having done both, I can warn you that your assumption is filled with badly
> formed misconceptions. From a purity standpoint you do have a point;
> from a gate count perspective and a cycle time perspective you do not.
> <

It would on average cost more clock cycles, but it seems like:
Detect 0 on input (already have this to detect Inf);
Look at sign bit;
Initiate branch signal.

Should be cheaper (in terms of latency) than:
Subtract values;
Use carry, zero, and sign bits of result to determine branch;
Initiate branch signal.
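
Sketched in C (a rough model of the logic, not the actual Verilog), the
difference between the two paths is:

  #include <stdint.h>

  /* Branch-on-zero/sign: a wide zero-detect or a single bit test,
     no carry chain involved. */
  int take_beqz(uint64_t r) { return r == 0; }          /* zero detect */
  int take_bltz(uint64_t r) { return (r >> 63) != 0; }  /* sign bit    */

  /* Full compare-and-branch: conceptually a subtract first, so the
     taken/not-taken decision sits behind a carry chain. */
  int take_blt(int64_t rs, int64_t rt) { return rs < rt; }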

Granted, the actual branch mechanism doesn't initiate until EX2 in this
case, which is probably the only reason it works at all. This basically
overrides the next cycle's PC with the branch target generated in EX1,
and then initiates a flush of the rest of the pipeline stages (unless
the branch predictor also handled this).


Granted, doing this would break compatibility with RISC-V, defeating the
whole point. And, otherwise, I tend to leave these out in favor of a
2-op sequence using the SR.T bit.


>> Relative compares could then use compare
>> instructions:
>> CMPx Rs, Rt, Rd
>> Where:
>> (Rs==Rt) => 0;
>> (Rs> Rt) => 1;
>> (Rs< Rt) => -1.
>>
>> Though, one issue with a plain SUB is that it would not work correctly
>> for comparing integer values the same width as the machine registers (if
>> the difference is too large, the intermediate value will overflow).
> <
> Which is why one needs CMP instructions and not to rely on SUB to do 98%
> of the work.
> <

Granted. Traditionally many would use SUB here, but as noted SUB only
works if:
The values are smaller than the register width;
The values are properly sign or zero extended.
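
A quick C illustration of the failure mode (using unsigned wraparound to
stand in for what a bare SUB instruction does):

  #include <stdint.h>

  int lt_via_sub(int64_t a, int64_t b) {
      /* wraparound subtract, like a bare SUB: */
      return (int64_t)((uint64_t)a - (uint64_t)b) < 0;   /* wrong */
  }
  /* e.g. a = INT64_MIN, b = 1: a-b wraps around to INT64_MAX
     (positive), so lt_via_sub says "not less", even though a < b
     is true. A real CMP also folds in the carry bit. */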
I am thinking mostly of lower-end implementations.

In theory, one could split the complex addressing modes into multiple
uOps or similar, and then execute them end-to-end. But, otherwise, one
has a problem.


>>
>> As soon as one deviates from Load/Store, one has effectively thrown a
>> hand grenade into the mix (one is doomed to either needing a much more
>> complex decoder, OoO, or paying a cost in terms of a significant
>> increase in clock-cycle counts per instruction).
>>
> There is the multi-fire reservation station trick used on Athlon and Opteron,
> which pretty much makes the problem vanish.

OK.

>>
>> In comparison, variable-length instructions and misaligned memory access
>> are not nearly as destructive. Granted, they are not free either. I
>> suspect the L1 D$ in my case would be somewhat cheaper if it did not
>> need to deal with misaligned access.
> <
> I would assert that they are better than FREE. They add more performance
> than the added gates to do these things cost.

Granted.

I will note that BJX2 has both variable-length instructions and
misaligned load/store. They are useful at least, but have some LUT cost.

Having binaries be 40-60% smaller, plus the ability to get a significant
speedup in things like LZ77 and Huffman/Rice decoders, is a worthwhile
tradeoff.


>>
>> However, besides "cheap core", it is also nice to be able to have "fast
>> LZ77 decoding" and similar, which is an area where misaligned memory
>> access pays off.
> <
> ¿dynamic bitfields?

Yes, it is also useful for things like efficient Huffman and Rice
decoding, but I didn't mention this.
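
For what it's worth, the LZ77 inner copy looks roughly like this in C (a
sketch; it assumes hardware misaligned access, so each memcpy compiles
down to one load and one store):

  #include <stdint.h>
  #include <string.h>

  /* Copy a match 8 bytes per step instead of 1 byte per step.
     Assumes match distance >= 8 (no overlap within a step) and at
     least 7 bytes of slack past 'len' in the output buffer. */
  void copy_match(uint8_t *dst, const uint8_t *src, size_t len) {
      for (size_t i = 0; i < len; i += 8) {
          uint64_t w;
          memcpy(&w, src + i, 8);    /* misaligned load  */
          memcpy(dst + i, &w, 8);    /* misaligned store */
      }
  }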


>>
>> Page-crossing doesn't seem to be too big of an issue, since it is rare,
>> and can be handled with two TLB misses in a row if needed (only the
>> first TLB miss gets served; when the interrupt returns and the
>> instruction tries again, it causes another TLB miss).
> <
> It is only time, and 98%-99% of the time these crossings don't happen.
> {98% in 4KB pages, 99% in 8KB pages}

I am mostly using 16K pages as my testing implied this to be the local
optimum.

Smaller, and TLB miss rate increases significantly.
Larger, and memory overhead increases without much reduction in miss rate.


> <
>>> An interesting point in the talk is that zero-extension is a common
>>> idiom in RISC-V; the talk does not explain this completely, but
>>> apparently the 32-bit variants of instructions sign-extend (like Alpha
>>> does, while AMD64 and A64 zero-extend), and the ABI passes 32-bit
>>> values around in sign-extended form (including 32-bit unsigned
>>> values).
>>>
>> It is a tradeoff.
>>
>> In BJX2, I went with signed values being kept in sign-extended form, and
>> unsigned values kept in zero-extended form (and casts requiring explicit
>> sign or zero extension of the results).
> <
> The tradeoff is even harder where the linker fills in the "upper" bits of a
> register. If the underlying premise is sign extension, one COULD need 2
> registers to hold the "same value" for 2 different large address constants.
> <
> If the underlying premise is zero extension, certain bit pasting ways using
> + need to be changed to use |.
> <
> Another reason not to make SW, of any sort, have to paste bits together to
> make <larger> constants.

OK.

As noted though, bit pasting isn't needed with Jumbo encodings.
The encoding is kinda confetti, but hardware mostly deals with this.

It does sorta lead to an annoyance for a compiler, though, in that one
needs a bunch of different reloc types to deal with all this (especially
when the same reloc fixup code is dealing with effectively 4 different
ISAs' worth of reloc types, all of them using bit-confetti encodings).


>>
>> This is a general rule even if there are only a minority of cases where
>> it should matter.
> <
> More than you alude.

Possibly.


In general, code needs to keep values in the correct form because there
are cases where it matters:
Load/Store ops ended up using 33 bit displacements;
Some operations always operate on full 64-bit inputs;
...

Going the x86-64 / A64 route effectively requires doubling up nearly
every operation with both 32-bit and 64-bit forms; rather than leaving
most of the ISA as 64-bit except in cases where 64b is too costly and/or
32-bit semantics are required (such as to make C code work as expected).


One thing I have observed is that one can get wonky results from C code
if 'int' values are allowed to go outside of the expected range, more so
when load/store gets involved (the compiler can save and reload a value
and then have the value differ).

So, I ended up with operations like "ADDS.L" and "ADDU.L", whose sole
purpose is to do ADD and similar in ways which produce results that wrap
in the expected ways in the case of integer overflow.
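
Modeled in C, the intended semantics are roughly (my phrasing of it, not
the actual implementation):

  #include <stdint.h>

  /* ADDS.L-style: 32-bit add, result kept sign-extended in a 64-bit
     register, so it wraps exactly like a 32-bit 'int'. */
  int64_t adds_l(int64_t a, int64_t b) {
      return (int32_t)((uint32_t)a + (uint32_t)b);
  }

  /* ADDU.L-style: same, but kept zero-extended ('unsigned int'). */
  uint64_t addu_l(uint64_t a, uint64_t b) {
      return (uint32_t)(a + b);
  }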

...

MitchAlsup

Dec 25, 2021, 4:45:56 PM
There are a couple of other things:
Rbase == 0 means use IP as the base register {Dispψ, Disp32 }
Rindex == 0 means index = 0.
Rbase == 0 AND Disp64 means the 64-bit constant is an absolute address
....{and can be indexed if desired}.
<
So, DISP64 is used for absolute addressing anywhere in the address space.
>
> BTW, virtually all cases in our corpus are either disp=0 or disp=8, i.e.
> 16 or 24 bits total (plus index/scale/predicate if any).
<
I am seeing 2% DISP64
Dispψ means Disp field does not exist in instruction.
<
A surprising amount of STs contain constants to be deposited in memory
<
ST #5,[SP+32]

MitchAlsup

Dec 25, 2021, 4:59:08 PM
On Saturday, December 25, 2021 at 1:01:39 PM UTC-6, BGB wrote:
> On 12/25/2021 5:14 AM, Thomas Koenig wrote:
> > MitchAlsup <Mitch...@aol.com> schrieb:
> >> On Friday, December 24, 2021 at 3:20:36 PM UTC-6, BGB wrote:
> >>> On 12/24/2021 9:38 AM, Anton Ertl wrote:
<snip>
> Such is the issue...
>
> A case which will occur fairly infrequently and is trivially encoded in
> a 2 op sequence with only a single cycle of penalty, doesn't make a
> particularly strong case.
>
>
>
> My usual heuristics are based on:
> Is it common?
> Is it expensive to encode otherwise?
> Does it appear cheap or free as a result of an existing mechanism?
> ...
A good list.
>
> So, particularly common cases will end up with more specialized encodings.
>
> Things which are useful or otherwise expensive to implement in software
> may get special instructions.
<
A switch() instruction ended up in My 66000 ISA. This instruction transfers
control to a location calculated via a table in memory without consuming a
register and without passing the data through the data cache (switch tables
are accessed through the I$), and the index is range checked (transferring
control to default: in the out-of-range condition).
<
This made the cut because there are enough switches in typical code, AND the
table access should not consume a register, AND the table data should not
pollute the D$, and we can use the branch adder in the PARSE stage of the
pipeline to perform the arithmetic and fetch instructions.
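<
In C terms, the target is the ordinary jump-table switch; only how the
table is fetched differs:

  /* A dense switch like this normally lowers to a range check, a table
     load through the D$ into a register, and an indirect jump. The
     switch instruction described above fetches the table through the
     I$ instead, with the range check (out of range -> default)
     folded in. */
  int classify(int i) {
      switch (i) {
      case 0:  return 'a';
      case 1:  return 'b';
      case 2:  return 'c';
      default: return '?';
      }
  }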
<
ENTER and EXIT made the cut because these can do the work of multiple
instructions (up to 19 under the My 66000 ABI) while avoiding pollution of
registers, AND I can put the preserved registers in a place where SW cannot
access the data--eliminating ROP attacks. ENTER and EXIT are faster than
actually executing instructions to perform the same amount of work because
registers can be accessed 4-8 per cycle due to the HW context switch.
>
> Likewise for things which trivially map to an existing unit or similar
> (and have enough of a use-case to justify adding them).
>
>
> In my case, my decoder works in two sub-stages:
> Lookup the internal opcode numbers, form type, ... based on instruction
> bits;
> Unpack the instruction as given in the instruction form.
<
My decoding strategy is to look at 6 bits (Major OpCode) and use this to
determine instruction format {1-operand, 2-operand, 3-operand, memory ref,
shifts, and predicate}. Registers can be routed directly to RF decode ports
in smaller implementations.
<
DECODE (after PARSE) then reads the RF, performs forwarding, binds
CoIssued instructions, calculates branch target addresses, and controls
the FETCH address multiplexer. Operands and OpCodes are routed to
appropriate calculation units.

MitchAlsup

Dec 25, 2021, 5:23:10 PM
On Saturday, December 25, 2021 at 3:07:40 PM UTC-6, BGB wrote:
> On 12/24/2021 4:06 PM, MitchAlsup wrote:
> > On Friday, December 24, 2021 at 3:20:36 PM UTC-6, BGB wrote:
> >> On 12/24/2021 9:38 AM, Anton Ertl wrote:
> > <snip>
> > Instruction fusing is worthwhile on a 1-wide machine with certain pipeline
> > organization to execute 2 instructions per cycle and a few other tricks.
> > <
> > In my 1-wide My 66130 core, I can CoIssue several instruction pairs:
> > a) any instruction and an unconditional branch
> > b) any result delivering instruction and a conditional branch or predicate
> > consuming the result.
> > c) Store instruction followed by a calculation instruction such that the
> > total number of register reads is <= 3.
> > d) any non-branch instruction followed by an unconditional branch
> > instruction allows the branch to be taken at an execution cost of ZERO
> > cycles.
> > <
> > An eXcel spreadsheet shows up to 30% performance advantage over a
> > strict 1-wide machine. {This is within spitting distance of the gain of
> > In Order 2-wide over In Order 1-wide with less than 20% of the cost}
> OK. My existing core does not support (or perform) any sort of
> instruction fusion.
>
>
> Things like instruction fetch / PC-step are determined in the 'IF'
> stage, but this is done in terms of instruction bits and mode flags,
> rather than recognizing any instructions in particular.
>
> It seems like one could fetch a block of instructions and then include
> an "offset" based on how many instructions were executed from the prior
> cycle; however this seems more complicated than the current approach,
<
Once your instruction is no longer fixed at 32-bits, you find yourself asking:
well, how many bits should I fetch every cycle? And you find that 4 words
are appropriate for a low-end machine, 8-16 for a larger machine. In other
words you have committed yourself to having an instruction buffer.
<
Once you have an instruction buffer, you can decode every instruction in
the buffer and develop a unary pointer to the next instruction (30 total gates,
4 gates of delay). So, now if you pick a given instruction you can pick a
second instruction 6 gates later, and pick 3 and 4 four gates later, pick
5,6,7,8 4 gates later, ... You simply skip over the constants during PARSE;
you access the constants from something that looks surprisingly like
a RF in DECODE, performing forwarding, ...
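<
A toy C model of that PARSE step (the names and the length rule here are
made up; the real encoding differs):

  #include <stdint.h>

  #define SLOTS 8    /* 8-word fetch buffer */

  /* Hypothetical length decode: 1 instruction word plus 0-3 trailing
     constant words, determined from a couple of opcode bits. */
  static int insn_words(uint32_t w) { return 1 + ((w >> 30) & 3); }

  /* Each slot's "next instruction" offset depends only on that slot's
     bits, so in hardware all of these resolve in parallel (the
     few-gate unary pointer above); chaining picks is then just a mux
     per pick. */
  void parse(const uint32_t buf[SLOTS], int next[SLOTS]) {
      for (int i = 0; i < SLOTS; i++)
          next[i] = i + insn_words(buf[i]);
  }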
<
> and for a machine with 96 bit bundles would likely require fetching in
> terms of 256 bit blocks rather than the 96 bits corresponding to the
> current bundle (since one likely wouldn't know the exact position of the
> bundle, or how far to adjust PC, until ID1 or similar).
<
Once you have an instruction buffer, you can scan the IB forward looking for
branches; so, you can fetch the target prior to executing the branch and
perform 1-cycle branches (sometimes zero-cycle branches/calls/returns).

>
>
>
> Trying to port the Linux kernel to build on BGBCC is likely no-go.
>
> Though, even if I did so, most existing Linux on RISC-V ports assume
> RV64G or RV64GC, which is a bit out-of-reach at present.
> >> would probably leave out full Compare-and-Branch instructions, and
> >> instead have a few "simpler" conditional branches, say:
> >> BEQZ reg, label //Branch if reg==0
> >> BNEZ reg, label //Branch if reg!=0
> >> BGEZ reg, label //Branch if reg>=0
> >> BLTZ reg, label //Branch if reg< 0
> >>
> >> While conceptually, this doesn't save much, it would be cheaper to
> >> implement in hardware.
> > <
> > Having done both, I can warn you that your assumption is filled with badly
> > formed misconceptions. From a purity standpoint you do have a point;
> > from a gate count perspective and a cycle time perspective you do not.
> > <
> It would on average cost more clock cycles, but it seems like:
> Detect 0 on input (already have this to detect Inf);
> Look at sign bit;
> Initiate branch signal.
>
> Should be cheaper (in terms of latency) than:
> Subtract values;
> Use carry, zero, and sign bits of result to determine branch;
> Initiate branch signal.
<
Yes, that is the trick, here. The compare-branch does not need a full carry
chain--it only needs a degenerate carry chain--and this degenerate chain
is at most ½ the gate delay of a full carry chain. So, the logic ends up as
XOR-great-bit-AND gate.
<
Now, if (when) you CoIssue CMP and BB, you setup the CMP logic so you
can determine if the branch will be taken prior to the point you can generate
the output of the CMP instruction itself. You use the intermediate state of the
CMP instruction to drive the BB instruction. And not only do you execute
2 instructions in 1 cycle, you have already fetched the target instruction
so, you are ready to insert either sequential or target instruction into Decode
in that same clock.
>
<snip>
> >>
> >> Load/Store allows decent performance from a simplistic pipeline.
> > <
> > Ahem: 4-wide machines do not have simplistic pipelines, so the burden is
> > only on the lesser implementations and nothing on the higher performance
> > ones.
<
> I am thinking mostly of lower-end implementations.
>
> In theory, one could split the complex addressing modes into multiple
> uOps or similar, and then execute them end-to-end. But, otherwise, one
> has a problem.
<
For a variety of reasons, you don't want to do that. FORTRAN code will be
accessing the stack interspersed with accesses to common block arrays.
So, you will want:
<
LD Ri,[SP+offset]
LD Rx,[IP+Rj<<3+offset]
<
To be able to "run" down the memory pipeline back-to-back continuously.
<
<snip>
> > More than you alude.
> Possibly.
>
>
> In general, code needs to keep values in the correct form because there
> are cases where it matters:
> Load/Store ops ended up using 33 bit displacements;
> Some operations always operate on full 64-bit inputs;
> ...
>
> Going the x86-64 / A64 route effectively requires doubling up nearly
> every operation with both 32-bit and 64-bit forms; rather than leaving
> most of the ISA as 64-bit except in cases where 64b is too costly and/or
> 32-bit semantics are required (such as to make C code work as expected).
>
>
> One thing I have observed is that one can get wonky results from C code
> if 'int' values are allowed to go outside of the expected range, more so
> when load/store gets involved (the compiler can save and reload a value
> and then have the value differ).
<
My 66000 has instructions to smash values in registers back into their prescribed
ranges:
SL Rchar,Rlong,<8:0>
SL Rhalf,Rlong,<16:0>
SL Rword,Rlong,<32:0>
<
But these are degenerate subsets of real bit manipulation instructions.

Ivan Godard

Dec 25, 2021, 8:34:56 PM
Which removes an address register. vs: Rbase can be a ptr off the belt
or any of 8 specRegs

> Rindex == 0 means index = 0.

vs no Rindex at all (different main opCode)

> Rbase == 0 AND Disp64 means the 64-bit constant is an absolute address
> ....{and can be indexed if desired}.
> <
> So, DISP64 is used for absolute addressing anywhere in the address space.

Absolute addressing is useless in PIC code

>>
>> BTW, virtually all cases in our corpus are either disp=0 or disp=8, i.e.
>> 16 or 24 bits total (plus index/scale/predicate if any).
> <
> I am seeing 2% DISP64

For what, other than absolute address? Which you shouldn't :-)

> Dispψ means Disp field does not exist in instruction.

Shows as "disp<Greek phi>" for my reader - ???

> <
> A surprising amount of STs contain constants to be deposited in memory
> <
> ST #5,[SP+32]

That's cute :-)

Thomas Koenig

Dec 26, 2021, 6:22:11 AM
MitchAlsup <Mitch...@aol.com> schrieb:

> FORTRAN code will be
> accessing the stack interspersed with accesses to common block arrays.

There are several types of Fortran applications that still use
COMMON blocks these days. You can assume that a new program
no longer uses COMMON, it is too error-prone and clumsy.

There are programs with a more-or-less fixed complexity, which ran
minutes or hours on the mainframes and minis of the past and which
now run in seconds on a low-powered laptop, with most of the time
spent on program startup. I maintain a couple of these at work,
and nobody cares about efficiency there.

Then, there are larger packages from the past which are too
expensive (money or hour-wise) to rewrite which depend on COMMON
blocks for what passed for dynamic memory management in pre-Fortran
90 days. The typical style there is to pass around workspaces as
extra arguments (look at LAPACK for examples of the calling sequence),
so there is little direct access to the COMMON block.

New code is written with more concern for OpenMP, OpenACC, Coarrays,
and the object-oriented features of Fortran, and uses allocatable
variables extensively.

What you wrote was certainly true 30 years ago, maybe 20 years ago.
It should no longer be a consideration for a new architecture.

a...@littlepinkcloud.invalid

Dec 26, 2021, 6:24:05 AM
BGB <cr8...@gmail.com> wrote:
>
> If I were designing a "similar" ISA to RISC-V (with no status flags), I
> would probably leave out full Compare-and-Branch instructions, and
> instead have a few "simpler" conditional branches, say:
> BEQZ reg, label //Branch if reg==0
> BNEZ reg, label //Branch if reg!=0
> BGEZ reg, label //Branch if reg>=0
> BLTZ reg, label //Branch if reg< 0
>
> While conceptually, this doesn't save much, it would be cheaper to
> implement in hardware.

Just an aside: A64 has all of these, in addition to conditional
branches depending on the flags. I didn't see any mention of that in
Celio et al.

Andrew.

MitchAlsup

Dec 26, 2021, 3:20:52 PM
The other character I could have used was: ϕ
<
> > <
> > A surprising amount of STs contain constants to be deposited in memory
> > <
> > ST #5,[SP+32]
<
> That's cute :-)
<
Brian gets credit for adding this.

BGB

Dec 26, 2021, 3:23:17 PM
On 12/25/2021 11:10 AM, Anton Ertl wrote:
> BGB <cr8...@gmail.com> writes:
>>> * A64 has fixed-length 32-bit instructions, but they can be more
>>> complex: A64 has more addressing modes and additional instructions
>>> like load pair and store pair; in particular, the A64 architects
>>> seem to have had few concerns about the register read and write
>>> ports needed per instruction; E.g., a store-pair instruction can
>>> need four read ports, and a load pair instruction can need three
>>> write ports (AFAIK).
>>>
>>
>> These are less of an issue if one assumes a minimum width for the core.
>> If the core is always at least 3-wide, this isn't an issue.
>
> Why? Sure, you could have one instruction port that supports up to
> four reads and the others only support two, and one that supports
> three writes, and the others only support one, but that would still be
> 8 register reads and 5 writes per cycle, and my impression was that
> having many register ports is really expensive.
>

If you have a multi-lane core, it already has the needed register ports.
You then spread a single instruction across 2 or 3 lanes.

Say, for example, in my core:
Each instruction natively has 2 read ports, 1 write port;
There are also instructions which read from more registers.

If you use an instruction which reads from the destination register,
this automatically eats Lane 3. If you use an instruction which operates
on 128-bit values, it eats all 3 lanes.

This was also partly why I ended up going primarily with a 3 lane
design. It allowed bundling a lot more cases in primarily 2-lane
operation (the 3rd lane mostly a source of spare register ports and the
occasional ALU op); and it had enough register ports to allow for
128-bit SIMD.


Meanwhile, if I do a 1-wide subset of my ISA, a number of instructions
are no longer usable because this subset is limited to a 3R+1W register
file (going to 2R+1W would no longer allow for register-indexed load/store).


Seemingly, despite the differences, for the CPU core itself, with a
similar baseline feature-set (MMU and FPU, ...), the 3-wide core is only
~ 150% the size of the 1-wide core; but somewhat more capable (and also
faster).


A bigger issue ATM is seemingly, that with 64B L2 transfers, there is ~
15K LUT going to the DDR controller and L2 cache (L2 also eats most of
the Block RAM).

However, it increases memory bandwidth, and allows for putting the video
memory in external DRAM (this only works effectively if the bandwidth is
high enough to be able to keep up with the raster; otherwise the screen
contents turn into garbage).

The VRAM module operates as its own cache, which operates in terms of
256-bit tiles (8x8|4x4 pixels), and in-turn takes another 4K LUT.
Though, it also contains logic to mimic the original MMIO interface.

(So, DDR+L2+VRAM is already ~ 1/3 of the total LUT budget).
Throw in another 2K for the PCM and FM Synth audio modules.

...


So, it is sorta like at present:
XC7A100: Can fit a 3-wide core, and a second 1-wide core;
Or, can use narrower DDR/L2, and fit 2x 3-wide cores,
but need to omit a few features to fit 2x 3-wide.
XC7S50: Can fit a 3-wide core, need to reduce some settings.
XC7S25: Can shoe-horn in a 1-wide (RISC style) core.
Though, an ISA more like RV32I makes more sense on an XC7S25.


Don't have a board ATM with an XC7A35 (such as an Arty A7-35T); should
be able to fit a 1-wide core though as it is a little bigger than the
XC7S25.

The Arty A7-100T is basically the same FPGA as the Nexys A7 which I am
still using (just with a bigger RAM module, but fewer built-in
connectors; I really like having built-in VGA/PS2/SDcard slot though).


> There is also the option of reducing register port requirements
> through the bypasses, but what do you do if there is not enough
> bypassed data?
>

My ISA has side-channels, but these are mostly used for control registers
and a few special registers affected by the ISR mechanism. The CRs are mostly
tied up with internal state within the core.

A few could have arguably been better as GPRs; for example, both ARM and
RISC-V had put the Global and Link registers in GPR space. Personally, I
would assume sticking with one or two dedicated link registers.

Tried to come up with something here, but it ended up fairly similar to
the RISC-V layout (well, except that I would assume making the ABI
register assignments more contiguous).


>> Instruction fusion seems like a needless complexity IMO. It would be
>> better if the ISA can be "dense" enough to make the fusion mostly
>> unnecessary.
>
> Well, Celio's position is that RISC-V gets code density through the C
> extension and uop density through instruction fusion.
>

It can do this, but could be better if fusion were unnecessary.

It likely punishes low-end implementations (with poor performance) more
than it helps (by saving LUTs/gates).

Like, it is a design which helps "very low end" implementations, at the
expense of other "moderately low-end" implementations, which need to
implement "higher end" functionality to compensate for deficiencies.


I generally agree with the 'C' extension though.


> In the past we have seen many cases where it was more beneficial to
> solve things in microarchitecture rather than architecture: caches
> (rather than explicit fast memory), OoO rather than EPIC, dynamic
> branch prediction rather than delay slots and static branch
> prediction, bypasses rather than tranport-triggered architecture,
> microarchitectural memory disambiguation rather than IA-64's ALAT.
>
> Some of the benefits of putting features in the microarchitecture are
> that
>
> 1) You can easily remove them (even within one implementation, by
> using the appropriate chicken bit), while architectural features are
> forever.
>
> 2) When you add them, existing code can benefit from them, while
> architectural features need at least a recompilation.
>
> RISC seemed to be the converse case of exposing things in the
> architecture that, say, a VAX implementation does in the
> microarchitecture, but one could also see VAX as another case of
> exposing the microarchitecture in the architecture: the
> microarchitecture was microcoded (because the microcode store was
> faster than main memory at the time; but that's actually another
> example of explicit fast memory mentioned above), so architects
> designed instruction sets that made heavy use of microcode.
>

As can be noted, I am coming at it from the perspective of having ended
up with a DSP flavored hybrid VLIW/EPIC style ISA (albeit with
variable-sized bundles; more like Xtensa or Hexagon).


In my case, if you write:
OP2 | OP1
Or:
OP3 | OP2 | OP1

It is basically telling the assembler to encode these instructions to
run in parallel...

When compiling C code, the compiler is more speculative, and tries to
shuffle and bundle instructions, but "kinda sucks at doing so".


>> If I were designing a "similar" ISA to RISC-V (with no status flags), I
>> would probably leave out full Compare-and-Branch instructions, and
>> instead have a few "simpler" conditional branches, say:
>> BEQZ reg, label //Branch if reg==0
>> BNEZ reg, label //Branch if reg!=0
>> BGEZ reg, label //Branch if reg>=0
>> BLTZ reg, label //Branch if reg< 0
>>
>> While conceptually, this doesn't save much, it would be cheaper to
>> implement in hardware. Relative compares could then use compare
>> instructions:
>> CMPx Rs, Rt, Rd
>> Where:
>> (Rs==Rt) => 0;
>> (Rs> Rt) => 1;
>> (Rs< Rt) => -1.
>>
>> Though, one issue with a plain SUB is that it would not work correctly
>> for comparing integer values the same width as the machine registers (if
>> the difference is too large, the intermediate value will overflow).
>
> Take a look at how Mitch Alsup solved this in the 88000.
>

It does work as-is, just would maybe be nicer if it was cheaper and had
less impact on timing.

But, I think the idea was that one only needs to figure out the
resultant sign and carry bits, rather than necessarily waiting for the
full result.

May or may not look into this.


>> However, besides "cheap core", it is also nice to be able to have "fast
>> LZ77 decoding" and similar, which is an area where misaligned memory
>> access pays off.
>
> If you know that you are doing misaligned accesses, composing a
> possibly misaligned load from two loads and a combining instruction
> (or a few) looks like a simpler approach.
>

It can be faked easily enough, but as can be noted, the cycle cost of
faking it is higher than that of native hardware support.

In cases where one wants unaligned access on hardware which only does
aligned access, this is where a keyword like "__unaligned" comes into
play. This allows using fast (direct) access for aligned loads/stores,
and a slower (faked) approach for misaligned load/store.
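
The faked sequence looks something like this in C (little-endian sketch
of the two-aligned-loads-plus-combine approach):

  #include <stdint.h>

  /* Misaligned 32-bit load from two aligned loads. Note: may read up
     to 3 bytes past the containing word, which the caller has to know
     is safe; a real implementation also has to mind aliasing rules. */
  uint32_t load32_misaligned(const uint8_t *p) {
      uintptr_t a = (uintptr_t)p;
      const uint32_t *w = (const uint32_t *)(a & ~(uintptr_t)3);
      unsigned sh = (unsigned)(a & 3) * 8;
      if (sh == 0)
          return w[0];                            /* already aligned */
      return (w[0] >> sh) | (w[1] << (32 - sh));  /* combine halves  */
  }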


In this case, both RISC-V and BJX2 assume unaligned load/store
operations for basic types, so no issue here (and there is no direct
equivalent of MOV.X in RISC-V).


> I think the reason why everyone has switched to supporting misaligned
> accesses is that there is too much software out there that needs it.
>
> Also, you really want it for SIMD stuff (and your LZ77 example is
> actually an example of that), so you need to implement it anyway, and
> can just as well extend it to the other memory accesses.
>

Yeah, pretty much.


My case:
B/W/L/Q (8/16/32/64): Unaligned
X (128 bit): 8B Aligned.

The mapping for RISC-V is basically:
MOV.B <-> LB / SB
MOV.W <-> LH / SH
MOV.L <-> LW / SW
MOV.Q <-> LD / SD
MOVU.B <-> LBU
MOVU.W <-> LHU
MOVU.L <-> LWU



>> Page-crossing doesn't seem to be too big of an issue, since it is rare,
>> and can be handled with two TLB misses in a row if needed (only the
>> first TLB miss gets served; when the interrupt returns and the
>> instruction tries again, it causes another TLB miss).
>
> If you deal with TLB misses through interrupts, you have to make sure
> that the first one is still there when the second one returns, or some
> other way to guarantee eventual progress.
>

It generally works so long as one has at least a 2-way TLB.
I am using a 4-way TLB, so it is very unlikely to be an issue.

If one has a 1-way TLB, a situation can arise though where the CPU gets
stuck in an endless cycle of TLB miss handling. Though, this is slightly
less likely if one ensures that no two contiguous pages can map to the
same index; but can still occur if the target PC also happens to fall in
the same page.


I sidestepped TLB issues for the ISRs by making the ISRs always operate
with physically mapped addresses. In terms of the L1 caches, the virtual
and physical spaces are treated as different locations within a larger
address space (reducing the issue of possible aliasing). A similar trick
was used for KRR, so the L1 will avoid matching requests with a cache
line fetched with a different KRR (which may potentially have different
access rights to the target page, *).

But, making all this stuff work reliably is a bit of a pain.


*: Though, the L1 does "cheat" slightly by effectively smashing ~ 192
bits of address and metadata down to ~ 80 bits via XOR hashing. However,
the probability of hash collisions seems "sufficiently unlikely".


>> In BJX2, I went with signed values being kept in sign-extended form, and
>> unsigned values kept in zero-extended form
>
> That seems like a natural way to do it. However, I am not sure if
> that works well in the presence of type mismatches between caller and
> callee in production C code. At least I imagine that's the reason why
> ABIs force one kind of extension, and then do the appropriate
> extension for the defined type at the usage site if necessary.
>

As long as the function has a prototype, the call/return will coerce the
types into the form expected by the called function (as usual).

If no prototype is present, it follows traditional rules:
Small integer types are promoted to their largest machine form;
Float is promoted to Double;
...

For missing prototypes, kinda ended up (mostly) going with 'long':
C90 says 'int', but this truncates pointers, 'long' does not;
Also, 'long' doesn't seem to break anything here.
C99 makes missing prototypes invalid anyways, so...

Other cases mostly still follow the "implicit int" rule:
unsigned x;
Is equivalent to:
unsigned int x;
...

Anton Ertl

Dec 27, 2021, 4:40:38 AM
BGB <cr8...@gmail.com> writes:
>On 12/25/2021 11:10 AM, Anton Ertl wrote:
>> BGB <cr8...@gmail.com> writes:
>>>> * A64 has fixed-length 32-bit instructions, but they can be more
>>>> complex: A64 has more addressing modes and additional instructions
>>>> like load pair and store pair; in particular, the A64 architects
>>>> seem to have had few concerns about the register read and write
>>>> ports needed per instruction; E.g., a store-pair instruction can
>>>> need four read ports, and a load pair instruction can need three
>>>> write ports (AFAIK).
>>>>
>>>
>>> These are less of an issue if one assumes a minimum width for the core.
>>> If the core is always at least 3-wide, this isn't an issue.
>>
>> Why? Sure, you could have one instruction port that supports up to
>> four reads and the others only support two, and one that supports
>> three writes, and the others only support one, but that would still be
>> 8 register reads and 5 writes per cycle, and my impression was that
>> having many register ports is really expensive.
>>
>
>If you have a multi-lane core, it already has the needed register ports.
>You then spread a single instruction across 2 or 3 lanes.

Ok, but then these instructions reduce the utilization of the lanes.
A pre/post-increment load does not provide increased throughput over a
load and an add on such an implementation.

>> Well, Celio's position is that RISC-V gets code density through the C
>> extension and uop density through instruction fusion.
...
>Like, it is a design which helps "very low end" implementations, at the
>expense of other "moderately low-end" implementations, which need to
>implement "higher end" functionality to compensate for deficiencies.

Yes. The question is how much extra complexity this functionality
costs for moderately low-end implementations.

One can consider this the cost of having a wide-range architecture
rather than an architecture designed for a narrower range. In the
case of A64 the idea probably was also that they have A32/T32 for the
very low end, so they were able to design A64 for a narrower range.

>> If you know that you are doing misaligned accesses, composing a
>> possibly misaligned load from two loads and a combining instruction
>> (or a few) looks like a simpler approach.
>>
>
>It can be faked easily enough, but as can be noted, the cycle cost of
>faking it is higher than that of native hardware support.

Why should that be so? The work necessary for doing it explicitly is
the same. At least for the load case. For unaligned stores I expect
that a write-combining store buffer can do some things that are pretty
far from what any ISA that allows only aligned stores has done
explicitly yet; and current (possibly unaligned) stores are probably a
good interface to that hardware structure.

>In cases where one wants unaligned access on hardware which only does
>aligned access, this is where a keyword like "__unaligned" comes into
>play. This allows using fast (direct) access for aligned loads/stores,
>and a slower (faked) approach for misaligned load/store.

Yes, you would need to specify the unaligned accesses in programs.
However, given that most programs have been and are developed on
hardware where that has not been necessary, programmers are unlikely
to get that right, so allowing all memory accesses (exception: for
communicating between threads) to be unaligned is the way to go and
has won.

>In this case, both RISC-V and BJX2 assume unaligned load/store
>operations for basic types, so no issue here (and there is no direct
>equivalent of MOV.X in RISC-V).

What is MOV.X?

>>> In BJX2, I went with signed values being kept in sign-extended form, and
>>> unsigned values kept in zero-extended form
>>
>> That seems like a natural way to do it. However, I am not sure if
>> that works well in the presence of type mismatches between caller and
>> callee in production C code. At least I imagine that's the reason why
>> ABIs force one kind of extension, and then do the appropriate
>> extension for the defined type at the usage site if necessary.
>>
>
>As long as the function has a prototype, the call/return will coerce the
>types into the form expected by the called function (as usual).
>
>If no prototype is present, it follows traditional rules:
> Small integer types are promoted to their largest machine form;
> Float is promoted to Double;
> ...
>
>For missing prototypes, kinda ended up (mostly) going with 'long':
> C90 says 'int', but this truncates pointers, 'long' does not;
> Also, 'long' doesn't seem to break anything here.

Pointers would be a problem for the RISC-V and AMD64 approaches, too,
because the sign/zero-extension would trample over the high 32-bits.
The compiler also notices the problems with pointers, because
operations like dereferencing don't work on integer types.

I am more thinking of problems like having an int variable in one
function, and passing that to a separately compiled function that
expects an unsigned (and similar cases). On RISC-V the callee will
zero-extend the passed value, on AMD64 the caller will zero-extend the
passed value, in your approach the caller will sign-extend the value
and pass it, and the callee will assume that the value has been
zero-extended.
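
Concretely, something like this (two separately compiled files,
mismatched declarations):

  /* file1.c */
  void g();                 /* no parameter info */
  void f(void) {
      int x = -1;
      g(x);                 /* caller passes x in sign-extended form */
  }

  /* file2.c */
  void g(unsigned x) {      /* callee expects/assumes zero-extended */
      /* whichever side the ABI makes responsible for the extension,
         the upper register bits now disagree with the declared type */
  }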

> C99 makes missing prototypes invalid anyways, so...

This does not make programs with missing (or mismatched) prototypes go
away.

BGB

Dec 27, 2021, 1:07:06 PM
On 12/27/2021 2:15 AM, Anton Ertl wrote:
> BGB <cr8...@gmail.com> writes:
>> On 12/25/2021 11:10 AM, Anton Ertl wrote:
>>> BGB <cr8...@gmail.com> writes:
>>>>> * A64 has fixed-length 32-bit instructions, but they can be more
>>>>> complex: A64 has more addressing modes and additional instructions
>>>>> like load pair and store pair; in particular, the A64 architects
>>>>> seem to have had few concerns about the register read and write
>>>>> ports needed per instruction; E.g., a store-pair instruction can
>>>>> need four read ports, and a load pair instruction can need three
>>>>> write ports (AFAIK).
>>>>>
>>>>
>>>> These are less of an issue if one assumes a minimum width for the core.
>>>> If the core is always at least 3-wide, this isn't an issue.
>>>
>>> Why? Sure, you could have one instruction port that supports up to
>>> four reads and the others only support two, and one that supports
>>> three writes, and the others only support one, but that would still be
>>> 8 register reads and 5 writes per cycle, and my impression was that
>>> having many register ports is really expensive.
>>>
>>
>> If you have a multi-lane core, it already has the needed register ports.
>> You then spread a single instruction across 2 or 3 lanes.
>
> Ok, but then these instructions reduce the utilization of the lanes.
> A pre/post-increment load does not provide increased throughput over a
> load and an add on such an implementation.
>

Yes, generally...

One could decode a post-increment op as two ops internally.
MOV.B (R4)+, R5
Decoded as:
ADD 1, R4 | MOV.B (R4), R5

Though, encoding this directly in BJX2 (at the ISA level) would not be
allowed, as it violates the sequential/parallel rule (bundles are not
allowed when the result of executing the instructions sequentially would
differ from them being executed in parallel).


There is no auto-increment addressing in my case though, for the main
reason that it isn't used frequently enough to make much visible impact.
It can be done in 2 ops, and frequently the ADD op can be executed in
parallel with something else.

Eg:
while(cs<ce)
*ct++=*cs++;
As:
.L0:
MOV.B (R5), R7
ADD 1, R5 | MOV.B R7, (R4)
ADD 1, R4 | CMPQGE R5, R6
BF .L0


>>> Well, Celio's position is that RISC-V gets code density through the C
>>> extension and uop density through instruction fusion.
> ...
>> Like, it is a design which helps "very low end" implementations, at the
>> expense of other "moderately low-end" implementations, which need to
>> implement "higher end" functionality to compensate for deficiencies.
>
> Yes. The question is how much extra complexity this functionality
> costs for moderately low-end implementations.
>
> One can consider this the cost of having a wide-range architecture
> rather than an architecture designed for a narrower range. In the
> case of A64 the idea probably was also that they have A32/T32 for the
> very low end, so they were able to design A64 for a narrower range.
>

Yeah. I suspect the lower end for A64 is probably in-order superscalar
machines.

If you need a microcontroller, there is Thumb.

RV32I could likely be pretty competitive with Thumb, but for a core
which is higher end than a microcontroller, but lower end than a typical
superscalar core, it is likely to hurt.



>>> If you know that you are doing misaligned accesses, composing a
>>> possibly misaligned load from two loads and a combining instruction
>>> (or a few) looks like a simpler approach.
>>>
>>
>> It can be faked easily enough, but as can be noted, the cycle cost of
>> faking it is higher than that of native hardware support.
>
> Why should that be so? The work necessary for doing it explicitly is
> the same. At least for the load case. For unaligned stores I expect
> that a write-combining store buffer can do some things that are pretty
> far from what any ISA that allows only aligned stores has done
> explicitly yet; and current (possibly unaligned) stores are probably a
> good interface to that hardware structure.
>

Faking a misaligned load is likely to require a multi-instruction
sequence, and is also likely to stall on an interlock. Doing it in
hardware can allow the load to be done in a single clock cycle.


It is for a similar reason that I eventually ended up re-adding an FMOV.S
instruction:
MOV.L + FLDCF, typically ends up taking 4 cycles.
FMOV.S, avoids an interlock, often 1 cycle.

One could argue for a case where FMOV.S is load-only, but I re-added it
on both the Load and Store paths.

In effect, it is analogous to a 32-bit Load/Store with a
Binary32<->Binary64 converter glued on. The store path is a little more
expensive here, because it also involves potentially zeroing or rounding
the mantissa (whereas the widening conversion is effectively glorified
bit-shuffling).

Though, arguably, one could make the store path cheaper by using a
truncating conversion (as is done with SIMD ops), but OTOH one can
potentially leverage the same converter used for 'FSTCF'
(Double->Single), which does need a rounding conversion. This affects
timing more than it does LUT cost.
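
Functionally, it is about this, in C terms (load path widens, store path
narrows and rounds):

  #include <string.h>

  /* Load path: 32-bit memory access plus an exact Binary32->Binary64
     widening (mostly bit-shuffling plus exponent re-bias). */
  double fmov_s_load(const void *p) {
      float f;
      memcpy(&f, p, 4);
      return (double)f;
  }

  /* Store path: Binary64->Binary32 narrowing, which must round (or
     truncate) the mantissa, then a 32-bit store. */
  void fmov_s_store(void *p, double d) {
      float f = (float)d;
      memcpy(p, &f, 4);
  }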


>> In cases where one wants unaligned access on hardware which only does
>> aligned access, this is where a keyword like "__unaligned" comes into
>> play. This allows using fast (direct) access for aligned loads/stores,
>> and a slower (faked) approach for misaligned load/store.
>
> Yes, you would need to specify the unaligned accesses in programs.
> However, given that most programs have been and are developed on
> hardware where that has not been necessary, programmers are unlikely
> to get that right, so allowing all memory accesses (exception: for
> comminicating between threads) to be unaligned is the way to go and
> has won.
>

Not quite so much, IME.

Even on hardware where unaligned access is typically the default, some
compilers have managed to make relying on it brittle enough that most C
code still tends to preserve an aligned/unaligned distinction.


>> In this case, both RISC-V and BJX2 assume unaligned load/store
>> operations for basic types, so no issue here (and there is no direct
>> equivalent of MOV.X in RISC-V).
>
> What is MOV.X?
>

A 128-bit Load/Store Pair.

It is also used for 128-bit SIMD Load/Store, since the ISA uses a single
register space for GPRs and SIMD.

From the way it is implemented, it requires an 8-byte alignment (unlike
a pair of MOV.Q instructions), and is often slightly faster if accesses
are aligned on a 16-byte boundary. It also only works on even-numbered
registers.

It can be faster than a pair of MOV.Q instructions, and is most commonly
found in function prolog/epilog sequences and similar.
Yeah, pretty much.
It depends a lot on what the callee does with the value, which affects
what the results are here.


>> C99 makes missing prototypes invalid anyways, so...
>
> This does not make programs with missing (or mismatched) prototypes go
> away.
>

Possibly true. Pretty much all of the compilers I am aware of went over
to something partway between C90 and C99 semantics, typically behaving
like C90 and then generating a warning about the missing prototype.


But they "at least stand a fighting chance" in my case.
Most basic types will (at least) end up in the correct registers, even if
the values don't match up exactly.

Main exception is when 128-bit types get involved, which as-is:
Pad up to the next even-numbered register (if needed);
Pass the value as a register pair.

Struct passing rules in my case are:
sizeof(Foo) <= 8: Pass as a single register;
sizeof(Foo) <= 16: Pass as a register pair;
Else : Pass pointer to struct.

This seemed like a reasonable compromise to me: it avoids excessively
complex/convoluted rules while still being reasonably efficient.
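
For illustration (sizes assume 64-bit 'long'):

  struct A { int x;        };  /* sizeof == 4  -> one register    */
  struct B { long x, y;    };  /* sizeof == 16 -> register pair   */
  struct C { long x, y, z; };  /* sizeof == 24 -> pass by pointer */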

If one runs out of registers, any remaining arguments are passed on the
stack, following the same pattern. Like AMD64 (and unlike Win64), there
is no dedicated space for argument spills. The callee can adjust the
stack pointer by an extra 64B if they want such a spill space though, so
in practice it likely isn't a huge issue.


Return is basically similar, either the called function returns the
struct in registers, or the caller passes a pointer to the structure
which will receive the result. Trying to call a function which returns a
struct with no destination will instead pass a pointer to a dummy struct.

MitchAlsup

Dec 27, 2021, 2:04:03 PM
VEC Ri,<>
LDB R5,[Rcs+Ri]
STB R5,[Rct+Ri]
LOOP LT,#1,Ri
>
6 BJX2 instructions versus 4 My 66000.
My 66000 ABI; effectively, lay out the argument list (and return value list)
in memory (on the stack) then put the first 8 double words in registers
(and do not create locations on the stack for argument[0..7].)

BGB

Dec 27, 2021, 7:12:17 PM
Bigger issue here is probably that the loop, as written originally, will
bottleneck at ~ 6MB/s at 50MHz:
.L0:
MOV.B (R5), R7 //3c (2c penalty)
ADD 1, R5 | MOV.B R7, (R4) //2c (1c penalty)
ADD 1, R4 | CMPQGE R5, R6 //1c
BF .L0 //2c
Or, ~ 8c per byte.

Or, no bundling:
.L0:
MOV.B (R5), R7 //2c (1c penalty)
ADD 1, R5 //1c
MOV.B R7, (R4) //1c
ADD 1, R4 //1c
CMPQGE R5, R6 //1c
BF .L0 //2c
Still 8c per byte.

50/8 => 6.25


But, probably surprising no one, the CPU is devoid of any sort of
"cleverness" here. The compiler wont do anything either, leaving it
mostly up to the programmer.

One can mostly count themselves lucky if the compiler manages to avoid
dropping a few register spills in the middle of the loop or similar, ...



Technically, one could also write:
MOV 0, R7
.L0:
MOV.B (R5, R7), R3 //3c (2c penalty)
ADD 1, R7 | MOV.B R3, (R4, R7) //1c
JCMPQLT R6, R7, .L0 //2c (*)
ADD R6, R4 | ADD R6, R5

Which is 6c, 8.3MB/s.

But, the compiler won't infer this, and it isn't really all that much better.

*: These instructions share the same underlying mechanism as the
branches in RISC-V mode, and will exist if the RVI extension is enabled.
Note that compare-and-branch instructions may not be predicated.


Slightly faster:
MOV R6, R7
CMPQGE 8, R7
BF .L1
.L0:
ADD -8, R7 | MOV.Q (R5), R3 //3c (2c penalty)
ADD 8, R5 | MOV.Q R3, (R4) //1c
ADD 8, R4 | CMPQGE 8, R7 //1c
BT .L0 //2c
.L1:
CMPQGT 0, R7
BF .L3
.L2:
ADD -1, R7 | MOV.B (R5), R7 //3c (2c penalty)
ADD 1, R5 | MOV.B R7, (R4) //1c
ADD 1, R4 | CMPQGT 0, R7 //1c
BF .L2 //2c
.L3:

Which should average ~ 67 MB/s...


Throw in another loop stage copying 32B at a time, and it could be
potentially pushed up to around 200MB/s (at least, if staying within the
L1 cache), ...
Conceptually, it is not much different in this case.

If one has 8 or fewer arguments, and they are all 64 bits or less, then
it is basically plain registers. Since a significant majority of
functions fit this pattern, they don't need to use any stack arguments.

Thomas Koenig

Dec 29, 2021, 5:32:44 AM
BGB <cr8...@gmail.com> schrieb:

> Eg:
> while(cs<ce)
> *ct++=*cs++;
> As:
> .L0:
> MOV.B (R5), R7
> ADD 1, R5 | MOV.B R7, (R4)
> ADD 1, R4 | CMPQGE R5, R6
> BF .L0

Nit: The two are only identical if there is at least one
byte to be copied.

BGB

Dec 29, 2021, 11:32:50 AM
Yeah, true...

One of my other responses contained a pretty obvious screw up with
register usage, but I didn't notice until after I had posted it (and
Usenet doesn't have any edit or undo).


I guess I could note though that my compiler tends to compile while
loops like:
if(!cond)goto .LEND;
.L0:
body
if(cond)goto .L0;
.LEND:
Rather than:
.L0:
if(!cond)goto .LEND;
body
goto .L0;
.LEND:

As the former tends to give better performance, albeit with slightly
worse code density.

...

Ivan Godard

Dec 29, 2021, 12:23:42 PM
I always used:
goto .LTEST;
.L0:
body
.LTEST
if(cond)goto .L0;
.LEND:

BGB

Dec 29, 2021, 1:09:01 PM
Hmm, yeah. Didn't think of this. I guess this has the advantage of
avoiding a need to duplicate the conditional.


There tends to also still be a label there, because this is where
"continue" lands.

The change is "fairly trivial". Testing this with a program, this
results in an ~ 0.15% reduction in overall binary size.

Program also passes the "still works" test.

Thomas Koenig

Dec 29, 2021, 1:52:08 PM
BGB <cr8...@gmail.com> schrieb:
> On 12/29/2021 4:32 AM, Thomas Koenig wrote:
>> BGB <cr8...@gmail.com> schrieb:
>>
>>> Eg:
>>> while(cs<ce)
>>> *ct++=*cs++;
>>> As:
>>> .L0:
>>> MOV.B (R5), R7
>>> ADD 1, R5 | MOV.B R7, (R4)
>>> ADD 1, R4 | CMPQGE R5, R6
>>> BF .L0
>>
>> Nit: The two are only identical if there is at least one
>> byte to be copied.
>
> Yeah, true...
>
> One of my other responses contained a pretty obvious screw up with
> register usage, but I didn't notice until after I had posted it (and
> Usenet doesn't have any edit or undo).

Usenet has Cancel and Supersede. I see from your headers that
you use Thunderbird for news; it appears to have a function to
generate a Cancel.

Marcus

Dec 30, 2021, 7:12:31 AM
On 2021-12-24, MitchAlsup wrote:
> On Friday, December 24, 2021 at 11:00:14 AM UTC-6, Anton Ertl wrote:

[snip]

>>
>> It's pretty obvious that a small implementation of RV64G is smaller
>> than a small implementation of A64, and adding the C extension to a
>> small implementation of RV64G (to turn it to RV64GC) is reported in
>> the talk IIRC (it's on the order of 700 transistors, so still cheap),
>> so you can get a small RV64GC cheaper than a small A64 implementation
>> and have similar code density.
> <
> Once you have constructed the register file(s), the integer, memory, and
> floating point units, the size of the fetcher and decoder is immaterial
> (noise).
> <

How about branch misprediction penalty? My assumption is that a more
complex decoder is likely to require more pipeline stages in the front
end, which is likely to hurt when you get branch mispredictions, right?

/Marcus

Marcus
Dec 30, 2021, 7:23:45 AM
In MRISC32 I only have support for storing zero to memory:

STW Z, [R4, #offset]

...or PC-relative address (PC +/- 4 MB):

STWPC Z, #address@pc

I added support for this in the GCC MRISC32 machine description when I
discovered that storing zero to a variable is a very common operation
(and previously it would do it as a LOAD-IMMEDIATE + STORE-REGISTER
pair).

/Marcus

Marcus
Dec 30, 2021, 8:42:15 AM
On 2021-12-24, MitchAlsup wrote:
> On Friday, December 24, 2021 at 3:20:36 PM UTC-6, BGB wrote:
>> On 12/24/2021 9:38 AM, Anton Ertl wrote:
> <snip>

<snip>

>> would probably leave out full Compare-and-Branch instructions, and
>> instead have a few "simpler" conditional branches, say:
>> BEQZ reg, label //Branch if reg==0
>> BNEZ reg, label //Branch if reg!=0
>> BGEZ reg, label //Branch if reg>=0
>> BLTZ reg, label //Branch if reg< 0
>>
>> While conceptually, this doesn't save much, it would be cheaper to
>> implement in hardware.
> <
> Having done both, I can warn you that your assumption is filled with badly
> formed misconceptions. From a purity standpoint you do have a point;
> from a gate count perspective and a cycle time perspective you do not.
> <
>> Relative compares could then use compare
>> instructions:
>> CMPx Rs, Rt, Rd
>> Where:
>> (Rs==Rt) => 0;
>> (Rs> Rt) => 1;
>> (Rs< Rt) => -1.
>>
>> Though, one issue with a plain SUB is that it would not work correctly
>> for comparing integer values the same width as the machine registers (if
>> the difference is too large, the intermediate value will overflow).
> <
> Which is why one needs CMP instructions and not to rely on SUB to do 98%
> of the work.
> <

MRISC32 has the "simple compare-to-fixed-value-and-branch", i.e:

BZ reg, label // Branch if reg == zero
BNZ reg, label // Branch if reg != zero
BS reg, label // Branch if reg == -1 (all bits Set)
BNS reg, label // Branch if reg != -1
BLT reg, label // Branch if reg < zero (signed)
BGE reg, label // Branch if reg >= zero (signed)
BLE reg, label // Branch if reg <= zero (signed)
BGT reg, label // Branch if reg > zero (signed)

...and then it has a bunch of compare instructions (actually "Set if",
which means that the target register is set to -1 if the condition is
true, or cleared to zero if it is false):

SEQ Rd, Rs, Rt/imm // Set Rd if Rs == Rt/imm
SNE Rd, Rs, Rt/imm // Set Rd if Rs != Rt/imm
SLT Rd, Rs, Rt/imm // Set Rd if Rs < Rt/imm (signed)
SLTU Rd, Rs, Rt/imm // Set Rd if Rs < Rt/imm (unsigned)
SLE Rd, Rs, Rt/imm // Set Rd if Rs <= Rt/imm (signed)
SLEU Rd, Rs, Rt/imm // Set Rd if Rs <= Rt/imm (unsigned)

...plus floating-point compare instructions (FSEQ, FSLT, ...).

The condition (EQ, LT, ...) is explicitly encoded in the comparison
instruction in order to produce a mask (all bits set or all bits
cleared), which comes in handy for vector operations. It also works well
together with the "bitwise select" instruction (SEL) for implementing
conditional select (conditional move).
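
For the scalar case, the semantics sketched in C (the function names are
mine; the point is that the compare result is a full-width mask):

#include <stdint.h>

/* what SLT produces: all ones if a < b, else all zeros */
uint32_t slt_mask(int32_t a, int32_t b)
{
    return (a < b) ? 0xFFFFFFFFu : 0u;
}

/* what SEL computes: each result bit taken from a or b per the mask */
uint32_t sel(uint32_t mask, uint32_t a, uint32_t b)
{
    return (a & mask) | (b & ~mask);
}

A conditional move is then sel(slt_mask(x, y), a, b), with no branch.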

Granted, there is some overlap between S[cc] and B[cc], and it's not
perfectly symmetric (e.g. there's only SLT/SLE, but no SGT/SGE), but
the compiler can usually sort these things out since there's ample
opportunity to transform conditions (e.g. replace SGE+BS with SLT+BNS).

/Marcus

EricP
Dec 30, 2021, 11:51:01 AM
Marcus wrote:
>
> ....and then it has a bunch of compare instructions (actually "Set if",
> which means that the target register is set to -1 if the condition is
> true, or cleared to zero if it is false):
>
> SEQ Rd, Rs, Rt/imm // Set Rd if Rs == Rt/imm
> SNE Rd, Rs, Rt/imm // Set Rd if Rs != Rt/imm
> SLT Rd, Rs, Rt/imm // Set Rd if Rs < Rt/imm (signed)
> SLTU Rd, Rs, Rt/imm // Set Rd if Rs < Rt/imm (unsigned)
> SLE Rd, Rs, Rt/imm // Set Rd if Rs <= Rt/imm (signed)
> SLEU Rd, Rs, Rt/imm // Set Rd if Rs <= Rt/imm (unsigned)
>
> ....plus floating-point compare instructions (FSEQ, FSLT, ...).
>
> The condition (EQ, LT, ...) is explicitly encoded in the comparison
> instruction in order to produce a mask (all bits set or all bits
> cleared), which comes in handy for vector operations. It also works well
> together with the "bitwise select" instruction (SEL) for implementing
> conditional select (conditional move).
>
> Granted, there is some overlap between S[cc] and B[cc], and it's not
> perfectly symmetric (e.g. there's only SLT/SLE, but no SGT/SGE), but
> the compiler can usually sort these things out since there's ample
> opportunity to transform conditions (e.g. replace SGE+BS with SLT+BNS).
>
> /Marcus
>

C, C++ and a bunch of languages explicitly define booleans as 0 or 1,
so this definition won't be optimal for those languages.
VAX Fortran used 0,-1 for LOGICAL, but I don't know if that
was defined by the language or implementation-dependent.

BGB
Dec 30, 2021, 1:05:54 PM
I suspect this isn't likely to be too much of an issue for most "sane"
encodings.

For something like x86 or 65C816 or similar, this is likely to get a
little messier (fully variable length byte-oriented ISAs).

Something M68K-like would be intermediate (instructions were generally
16/32/48 bits, with a length which can be determined from the first
instruction word). A bigger hassle would be that M68K is a Reg/Mem
ISA (like x86) with some fairly complex addressing modes.



In my case, it is possible to look at a few bits in the instruction word
and figure out the length (bits 15:12):
  0xE, 0xF: 32-bit
  0x7, 0x9: 32-bit (XGPR)
  Else: 16-bit
Bits (11:8) may also be needed to determine bundling and similar:
  F0..F3: Scalar (F0..F3 blocks, Base Encoding)
  F4..F7: WEX (Repeat F0..F3)
  F8..FB: Scalar
  FC..FD: WEX (Repeat F8..F9)
  FE..FF: Jumbo Prefix
The Ez blocks are predicated forms:
  E0..E3: (F0..F3)?T
  E4..E7: (F0..F3)?F
  E8..E9: (F8..F9)?T
  EA..EB: WEX (F0/F2)?T
  EC..ED: (F8..F9)?F
  EE..EF: WEX (F0/F2)?F

The 7 and 9 blocks repeat F0, and F2/F1, albeit with register fields
expanded to 6 bits. Bit[11] is interpreted as a WEX flag, and these
encodings lack support for predication (meanwhile, the Op64 encodings
allow predication but not WEX).

This could be "cleaner", but some parts of the ISA were later additions.
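
As a C sketch of just the 16/32-bit length rule above (the function name
is mine):

#include <stdint.h>

/* bits [15:12] of the first 16-bit word give the instruction length */
int insn_is_32bit(uint16_t w)
{
    unsigned hi = (w >> 12) & 0xFu;
    return hi == 0xE || hi == 0xF   /* baseline 32-bit blocks */
        || hi == 0x7 || hi == 0x9;  /* XGPR 32-bit blocks */
}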


While RISC-V uses the low-order bits, in principle it isn't that much
different. If the CPU is set to RISC-V mode, the flags are modified to
reflect RISC-V's encodings.


Decoding the various major blocks is handled with "selection flags"
(flags are set to indicate the major blocks to handle the instruction).

Within the IF stage, the combination of Fx (32-bit) and Wx (WEX) flags
are used to determine the overall length of the bundle:
  FxA=0: 16-bit
  FxA=1:
    WxA=0: 32-bit
    WxA=1 && (SR.WXE || Jumbo):
      FxB=0: 48-bit (Reserved)
      FxB=1:
        WxB=0: 64-bit
        WxB=1:
          FxC=0: 80-bit (Reserved)
          FxC=1: 96-bit

Jumbo prefixes are mostly special here in that they may force bundle
decoding when the "WEX Enable" flag is clear.

At present, Fetch works like:
  Gives 96 bits representing whatever was at PC;
  Gives an overall length for the instruction/bundle;
    Used to calculate the next cycle's PC.


The 96-bit bundle etc. are then used as inputs to the ID1 stage.

For example, decoding in my case looks something like:
  Unpack the register fields, various immediate types, ...
    May then be modified by the presence of a jumbo prefix.
  Use a big nested table lookup to find a few pieces of information:
    nmid: Major opcode (6b)
    ucix: Minor opcode / Data (6b)
    ucty: Pipeline Type (3b)
    ccty: Default Condition Code (3b)
    fmid: Major instruction form (4b)
    ity:  Instruction form subtype (4b)
    bty:  Data / Element Type (Ld/St) (3b)
  Do a case based on fmid and ity:
    These map instruction registers/... to the decoder ports.
    This seems to be the more expensive part.
  Additional Outputs:
    Rm, Ro, Rp, Rn (4x 7-bit)
    Imm: 33 bits
    ...

An outer level decoder will run several instances of the former decoder,
in order to deal with bundled instructions, ...

So, in effect one might have, say:
3x 32-bit decoder (BJX2 32b) (FzA/FzB/FzC)
1x 16-bit decoder (BJX2 16b) (Bz)
1x 32-bit decoder (RISC-V 32b) (RvA)
Or, 3x 32-bit decoder (RISC-V 32b) (RvA/RvB/RvC, *)
1x 16-bit decoder (RISC-V 16b) (Rz)

Based on the bundle encoding and CPU mode, it maps the instructions from
the decoders onto the CPU pipeline (Lanes 1/2/3).

*: If using a crazy hack to glue bundles and jumbo prefixes onto RISC-V,
at the expense of not being able to have both RVC and bundles encodable
at the same time (otherwise only a single RV decoder is used).
Hackery would be used to encode mode-changes into function pointers or
via the GOT. This breaks the RISC-V ISA spec in a few ways that
(probably) shouldn't affect most code in the wild (the actual rules are
a little weird; I suspect that if code relies on the ability to have a
misaligned PC with it being selectively ignored, this is likely to break).


Eg:
  if(FxA && !RVI)
    if(WxA)
      if(WxB)
        Lane3=FzA
        Lane2=FzB
        Lane1=FzC
      else
        Lane3=NOP
        Lane2=FzA
        Lane1=FzB
    else
      Lane3=NOP
      Lane2=NOP
      Lane1=FzA
  else if(!RVI)
    Lane3=NOP
    Lane2=NOP
    Lane1=Bz
  else
    Similar, but for RISC-V ops...

This is then followed by some special-case logic to spread SIMD
encodings and similar across multiple lanes, the Immediate handling for
Jumbo96 ops, ...


In BJX2, the Jumbo encodings are special, but are routed through the
32-bit decoders. They may receive an extra 26-bit field signaling a
jumbo prefix in the adjacent lane. Some additional hackery is used for
Jumbo96 encodings (uses multi-lane handling for the immediate).

The Jumbo prefix primarily modifies the decoding of the immediate
fields, or the register fields (Op64 encodings).

This whole process fits within a single clock cycle (ID1).


The next stage (ID2) is mostly related to fetching register values from
the register file and/or forwarding them from the EX stages. For
branches/PC-rel/... PC points to the end of the current bundle.

In the remaining stages (ID2 onwards), the CPU is basically 3 parallel
lanes with a 6R+3W register file.

Some units span multiple lanes, and some instructions may span multiple
lanes in the decoder.



While up-front, this seems more complicated, overall it seems cheaper
than what would be required for a superscalar decoder (which would in
effect need to use pattern-matching across the instructions and register
fields to figure out the bundle width).

One possibility could be to handle superscalar as a hardware-level
"autobundler" in the IF stage. Cost and latency seem like potential
issues though. The simplest case would be to check for possible register
collisions, and (if none exist) whether both operations are "clean"
operations (such as ALU ops or similar). This hack would at least avoid
needing any significant change to the pipeline.
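
A rough C sketch of such a pairing check (all of the names and fields
here are made up for illustration):

#include <stdbool.h>

typedef struct {
    int rd, rs1, rs2;    /* destination and source registers */
    bool simple_alu;     /* "clean" single-cycle ALU op? */
} DecOp;

/* may op b issue in the same cycle as the preceding op a? */
bool can_pair(const DecOp *a, const DecOp *b)
{
    if (!a->simple_alu || !b->simple_alu)
        return false;                        /* only pair clean ops */
    if (b->rs1 == a->rd || b->rs2 == a->rd)
        return false;                        /* RAW dependence */
    if (b->rd == a->rd)
        return false;                        /* WAW collision */
    return true;
}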

Initial thinking was to handle bundling more as a "load time" hack in
the ELF loader, or as an AOT process. Otherwise, the code would only be
able to run one instruction at a time.

It is debatable though if the whole "dual ISA" thing even really makes
sense. Can also note that direct linking across ISAs won't really work,
as the C ABIs are very different (so, in effect, BJX2 code callable from
RISC-V mode would also need to use RISC-V's C ABI, with the extra hair
of some internal register remapping, ...).


> /Marcus

MitchAlsup
Dec 30, 2021, 1:29:41 PM
The more pipeline stages one has, the less cost a decoder is.
<
The wider the pipeline, the higher the relative misprediction costs.
<
On the other hand, my team in 1990 figured out how to execute a predicted
branch in cycle k and be inserting instructions into the execution window
from the mispredicted direction in cycle k+1. You CAN make the repair
cost zero cycles; you still eat the delay until the mispredict is calculated.
>
> /Marcus

Marcus
Dec 30, 2021, 2:47:36 PM
As a software developer I'm painfully aware of this. I decided not to
care too much about it, though. Really, most software that relies on
this property of C should be frowned upon. E.g. expressions like:

a = b + (c == d);

...aren't really good programming practice.

I have seen a few places where the compiler does conversions from -1 to
+1 (but those have mostly been due to missing/bad pattern matching in
the machine description).

/Marcus

BGB
Dec 30, 2021, 3:05:30 PM
On 12/30/2021 12:29 PM, MitchAlsup wrote:
> On Thursday, December 30, 2021 at 6:12:31 AM UTC-6, Marcus wrote:
>> On 2021-12-24, MitchAlsup wrote:
>>> On Friday, December 24, 2021 at 11:00:14 AM UTC-6, Anton Ertl wrote:
>>
>> [snip]
>>
>>>>
>>>> It's pretty obvious that a small implementation of RV64G is smaller
>>>> than a small implementation of A64, and adding the C extension to a
>>>> small implementation of RV64G (to turn it to RV64GC) is reported in
>>>> the talk IIRC (it's on the order of 700 transistors, so still cheap),
>>>> so you can get a small RV64GC cheaper than a small A64 implementation
>>>> and have similar code density.
>>> <
>>> Once you have constructed the register file(s), the integer, memory, and
>>> floating point units, the size of the fetcher and decoder is immaterial
>>> (noise).
>>> <
>>
>> How about branch misprediction penalty? My assumption is that a more
>> complex decoder is likely to require more pipeline stages in the front
>> end, which is likely to hurt when you get branch mispredictions, right?
> <
> The more pipeline stages one has, the less cost a decoder is.

Probably true:
With two "decode" stages (one for unpacking, another for register
fetch), I can have a relatively complicated unpacking process.

If one tries to unpack and fetch registers in the same stage, it is a
lot harder to make it pass timing.

Things like LUT cost are less predictable:
Breaking complex logic into multiple stages will often reduce LUT cost;
But, adding extra "forwarding" stages for timing reasons will often
increase LUT cost.


> <
> The wider the pipeline, the higher the relative misprediction costs.
> <
> On the other hand, My team in 1990 figured out how to execute a predicted
> branch in cycle k and be inserting instructions into the execution window
> form the mispredicted direction in cycle k+1. You CAN make the repair
> cost zero cycles, you still eat the delay to calculation of mispredict cycles.

This is where predication is nice:
It can absorb some of the cost of "small" branches (by avoiding the
branch in the first place).

In my case, a branch takes effect ~2 cycles later.

EX1: Signal Branch
EX2: Initiate Branch (new Addr goes into PF);
EX3: First "post-branch" fetch reaches IF.

It is necessary though to invalidate everything previously in the
pipeline, so most of the cost is waiting for the old/invalidated
contents to cycle through.


Except with the branch predictor, which redirects the fetch earlier in
the pipeline. However, since it takes effect in ID1 (rather than IF),
there is still a slight delay.

And, there is a separate PF/IF stage mostly because the Block RAM
operates on a clock edge, ...


>>
>> /Marcus

MitchAlsup
Dec 30, 2021, 3:34:01 PM
IIRC Pascal uses 0 and -1
<
So, when I tackled this, I use signed (0 and -1) or unsigned (0 and +1)
bit field extracts from my CMP instructions to obtain TRUE or FALSE
depending on what the language wants. {The semi-equivalent SETGT
kind of instructions would have to double their footprint in the ISA
to cover both cases.} Having both cases available means that using
TRUE as a bit-field mask works well.
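
Roughly, in C (the bit position GT is made up here for illustration):

#include <stdint.h>

#define GT 2  /* hypothetical position of the GT bit in the CMP result */

int64_t true_unsigned(uint64_t cmp)   /* 0 or +1 */
{
    return (int64_t)((cmp >> GT) & 1);
}

int64_t true_signed(uint64_t cmp)     /* 0 or -1, usable directly as a mask */
{
    return -(int64_t)((cmp >> GT) & 1);
}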
> >
> As a software developer I'm painfully aware of this. I decided not to
> care too much about it, though. Really, most software that relies on
> this property of C should be frowned upon. E.g. expressions like:
>
> a = b + (c == d);
>
> ...aren't really good programming practice.
<
And ESPECIALLY bad in floating point where:
a) you should not be doing equality comparisons
b) you should be checking for NaNs
c) you need to write comparison code with 3 blocks (yes, no, not-comparable)

Thomas Koenig
Dec 30, 2021, 4:07:59 PM
Marcus <m.de...@this.bitsnbites.eu> schrieb:
> On 2021-12-30, EricP wrote:

>> C,C++ and a bunch of languages explicitly define booleans as 0 or 1
>> so this definition won't be optimal for those languages.
>> VAX Fortran used 0,-1 for LOGICAL but I don't know if that
>> was defined by the language or implementation dependant.

It is implementation defined.

With GNU Fortran, the choice was made to only allow 0 and 1.
Anything else, and you're likely to end up with random results
with LOGICAL variables.

One reason for that was that gcc, still a C-centric compiler, is
geared toward _Bool. Fortran has followed, because the C interop
sort of prescribes that.

> As a software developer I'm painfully aware of this. I decided not to
> care too much about it, though. Really, most software that relies on
> this property of C should be frowned upon. E.g. expressions like:
>
> a = b + (c == d);
>
> ...aren't really good programming practice.

Agreed.

> I have seen a few places where the compiler does conversions from -1 to
> +1 (but those have mostly been due to missing/bad pattern matching in
> the machine description).

What is the assembly for

_Bool foo (int a, int b)
{
    return a > b;
}

for your architecture at a reasonable optimization level?

Niklas Holsti
Dec 30, 2021, 4:14:50 PM
On 2021-12-30 22:34, MitchAlsup wrote:
> On Thursday, December 30, 2021 at 1:47:36 PM UTC-6, Marcus wrote:
>> On 2021-12-30, EricP wrote:

[snip]

>>> C,C++ and a bunch of languages explicitly define booleans as 0 or 1
>>> so this definition won't be optimal for those languages.
>>> VAX Fortran used 0,-1 for LOGICAL but I don't know if that
>>> was defined by the language or implementation dependant.
> <
> IIRC Pascal uses 0 and -1


Pascal-the-language has a "boolean" type, with values False and True. I
believe the machine representation of those values is
implementation-defined.

Ada-the-language also has a "Boolean" type, with values False and True,
but it is defined as an enumeration type with the default representation
of such types, which means that False is represented as 0 and True as 1.

Thomas Koenig
Dec 30, 2021, 4:23:10 PM
MitchAlsup <Mitch...@aol.com> schrieb:
> On Thursday, December 30, 2021 at 1:47:36 PM UTC-6, Marcus wrote:

>> As a software developer I'm painfully aware of this. I decided not to
>> care too much about it, though. Really, most software that relies on
>> this property of C should be frowned upon. E.g. expressions like:
>>
>> a = b + (c == d);
>>
>> ...aren't really good programming practice.
><
> And ESPECIALY bad in floating point where:
> a) you should not be doing equality comparisons

You can do equality comparisons if you know that the
results will be exact. For example,

a = 1.0 + 1.0
if (a /= 2.0) stop "Your adder is broken"

should never execute the STOP.

> b) you should be checking for NaNs

Depends. Sometimes NaNs are not an issue, and sometimes
it's fine to let them bubble up.

> c) you need to write comparison code with 3 blocks (yes, no, not-comparable)

Or you can use Fortran and check afterwards :-)

Here's an example taken from the Fortran 2018 standard, which
checks for different sizes first, and then for overflow during
the calculation.

MODULE DOT
  ! Module for dot product of two real arrays of rank 1.
  ! The caller needs to ensure that exceptions do not cause halting.
  USE, INTRINSIC :: IEEE_EXCEPTIONS
  LOGICAL :: MATRIX_ERROR = .FALSE.
  INTERFACE OPERATOR(.dot.)
    MODULE PROCEDURE MULT
  END INTERFACE OPERATOR(.dot.)
CONTAINS
  REAL FUNCTION MULT (A, B)
    REAL, INTENT (IN) :: A(:), B(:)
    INTEGER I
    LOGICAL OVERFLOW
    IF (SIZE(A) /= SIZE(B)) THEN
      MATRIX_ERROR = .TRUE.
      RETURN
    END IF
    ! The processor ensures that IEEE_OVERFLOW is quiet.
    MULT = 0.0
    DO I = 1, SIZE (A)
      MULT = MULT + A(I)*B(I)
    END DO
    CALL IEEE_GET_FLAG (IEEE_OVERFLOW, OVERFLOW)
    IF (OVERFLOW) MATRIX_ERROR = .TRUE.
  END FUNCTION MULT
END MODULE DOT

You can also check for IEEE_INVALID to see if any NaNs were
involved.

Quadibloc
Dec 30, 2021, 4:24:37 PM
On Friday, December 24, 2021 at 9:37:14 PM UTC-7, Ivan Godard wrote:
> On 12/24/2021 10:17 AM, MitchAlsup wrote:

> > There is NO JUSTIFIABLE reason that an instruction is not entirely
> > self-contained!

> Really? I suppose that might be true for OOO that is emulating
> sequential execution, but what about VLIW and other wide multi-issue?

> Chopping off the variable and large parts so they can be recognized in
> parallel lets the issue width no longer depend on how many constants you
> have.

This is the sort of thing I've been fooling around with, trying to find
a way to achieve it. To avoid too much complexity, my recent designs all involved
a self-contained 256-bit block which combined as many fixed-length
instructions as possible with the variable or large parts they needed.

John Savard

MitchAlsup
Dec 30, 2021, 5:37:47 PM
< O=0
ENTRY foo
foo:
CMP R1,R1,R2
EXT R1,R1,#<GT>
RET

BGB
Dec 30, 2021, 5:51:11 PM
My case:
foo:
CMPGT R5, R4
MOVT R2
RTS

CMPGT updates SR.T based on the compare.
MOVT copies SR.T to a register.
MOVNT copies !SR.T to a register.

Bill Findlay
Dec 30, 2021, 6:45:28 PM
On 30 Dec 2021, Niklas Holsti wrote
(in article <j36lq7...@mid.individual.net>):

> On 2021-12-30 22:34, MitchAlsup wrote:
> > On Thursday, December 30, 2021 at 1:47:36 PM UTC-6, Marcus wrote:
> > > On 2021-12-30, EricP wrote:
>
> [snip]
>
> > > > C,C++ and a bunch of languages explicitly define booleans as 0 or 1
> > > > so this definition won't be optimal for those languages.
> > > > VAX Fortran used 0,-1 for LOGICAL but I don't know if that
> > > > was defined by the language or implementation dependant.
> > <
> > IIRC Pascal uses 0 and -1
>
> Pascal-the-language has a "boolean" type, with values False and True. I
> believe the machine representation of those values is
> implementation-defined.

Pascal requires that false < true, that true = succ(false), that
false = pred(true), that ord(false) = 0, and that ord(true) = 1,
where succ(), pred() and ord() are equivalent to Boolean'Succ(),
Boolean'Pred() and Boolean'Pos() in Ada.
This does not enforce representations of 0 and 1, but it's a pretty
strong hint!

--
Bill Findlay


MitchAlsup
Dec 30, 2021, 7:08:41 PM
There are sorts of ring-sum arithmetics which can emulate these requirements;
just not very many that can be efficiently calculated in two's complement.

Ivan Godard
Dec 30, 2021, 11:24:26 PM
Well, sort of. You are expecting that both-paths predecode will catch
most of the misses, and that's true when the misses are from wrong
loop-counts. But it's not true for data-dependent open code, nor for a
lot of switches, or any poly code in general when the target fanout
exceeds the predecodable fanout. Short pipes help with everything.

Ivan Godard
Dec 30, 2021, 11:28:21 PM
Yeah. IIRC, that was one I had to beat Jean-Claude up about.

Ivan Godard
Dec 30, 2021, 11:39:38 PM
On 12/30/2021 1:07 PM, Thomas Koenig wrote:

> What is the assembly for
>
> _Bool foo (int a, int b)
> {
> return a > b;
> }
>
> for your architecture at a reasonable optimization level?

F("foo") %0 %1;
retn(b0 %2) ^9, // L3C2F0
gtrsb(b0 %0, b1 %1) %2 ^8;
// L3C11F0=/home/ivan/mill/specializer/src/test1.c
//0: |i v%1 v%0 |o ^%0 ^%1 |d v%2 |w ^%2

Optlevel -O0, two instructions, one bundle, one cycle.

Marcus
Dec 31, 2021, 4:40:06 AM
As a matter of fact, you can now try out MRISC32 on Compiler Explorer
(support was added about a week ago): https://godbolt.org/z/z9sK3MYeM

MRISC32:
slt r1,r2,r1
sub r1,z,r1
ret

Two instructions: compare + convert from -1 to +1.

But I'm not too concerned. MIPS and RISC-V can do that in a single
instruction, but most popular architectures require at least two
instructions.

x86:
cmp edi, esi
setg al
ret

ARMv8:
cmp w0, w1
cset w0, gt
ret

ARMv7:
mov r2, #0
cmp r0, r1
movgt r2, #1
mov r0, r2
bx lr

POWER:
subf 3,3,4
srdi 3,3,63
blr


...etc.

/Marcus

Thomas Koenig
Dec 31, 2021, 5:28:35 AM
Ivan Godard <iv...@millcomputing.com> schrieb:
> On 12/30/2021 1:07 PM, Thomas Koenig wrote:
>
>> What is the assembly for
>>
>> _Bool foo (int a, int b)
>> {
>> return a > b;
>> }
>>
>> for your architecture at a reasonable optimization level?
>
> F("foo") %0 %1;
> retn(b0 %2) ^9, // L3C2F0
> gtrsb(b0 %0, b1 %1) %2 ^8;
> // L3C11F0=/home/ivan/mill/specializer/src/test1.c
> //0: |i v%1 v%0 |o ^%0 ^%1 |d v%2 |w ^%2

Could you decipher this for the non-initialized? :-)

Thomas Koenig
Dec 31, 2021, 5:43:39 AM
Marcus <m.de...@this.bitsnbites.eu> schrieb:
> On 2021-12-30 22:07, Thomas Koenig wrote:

>> What is the assembly for
>>
>> _Bool foo (int a, int b)
>> {
>> return a > b;
>> }
>>
>> for your architecture at a reasonable optimization level?
>>
>
> As a matter of fact, you can now try out MRISC32 on Compiler Explorer
> (support was added about a week ago): https://godbolt.org/z/z9sK3MYeM

That is so cool, I think I will throw some code at it and
see how this works out :-)

>
> MRISC32:
> slt r1,r2,r1
> sub r1,z,r1
> ret
>
> Two instructions: compare + convert from -1 to +1.

Looks good.

>
> But I'm not too concerned. MIPS and RISC-V can do that in a single
> instruction, but most popular architectures require at least two
> instructions.

It is actually quite interesting to see the different approaches
the different architectures take.


> x86:
> cmp edi, esi
> setg al
> ret

Setting a register from a condition code...

> ARMv8:
> cmp w0, w1
> cset w0, gt
> ret

Same thing here.

> ARMv7:
> mov r2, #0
> cmp r0, r1
> movgt r2, #1
> mov r0, r2
> bx lr

Set the register, compare, set another register, and copy
the register to the return register.

gcc does

cmp r0, r1
ite le
movle r0, #0
movgt r0, #1
bx lr

which looks a little bit better (especially since ite seems
to be a pseudo-op in 32-bit mode, if I read that correctly).

> POWER:
> subf 3,3,4
> srdi 3,3,63
> blr

Subtract and right shift of the sign bit. That's another
way of doing this, I guess. When I saw this, I thought
"Somebody must have been feeling hackish"...

Terje Mathisen
Dec 31, 2021, 7:25:01 AM
Marcus wrote:
> On 2021-12-30, EricP wrote:
>> C,C++ and a bunch of languages explicitly define booleans as 0 or 1
>> so this definition won't be optimal for those languages.
>> VAX Fortran used 0,-1 for LOGICAL but I don't know if that
>> was defined by the language or implementation dependant.

-1 is better than 1, it can be used as a mask.

>
> As a software developer I'm painfully aware of this. I decided not to
> care too much about it, though. Really, most software that relies on
> this property of C should be frowned upon. E.g. expressions like:
>
>   a = b + (c == d);
>
> ...aren't really good programming practice.

No! Please tell me it ain't so!

I use that type of construct all over the place when writing branchless
code/doing table lookups etc.

The Dec 24th Advent of Code puzzle had a miniature ALU with no MOV, so
it used "x MUL 0" to zero out x, then x ADD N to actually set it to a
value. X EQL W returned 1/0, which was subsequently used in
multiplications to stand in for branches.
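
In C, that multiply-as-select trick looks something like this (the names
are mine):

/* branch-free "if (x == w) z = z_new;" in the style of the puzzle ALU */
long select_eq(long x, long w, long z_old, long z_new)
{
    long eq = (x == w);                   /* comparison reified as 0 or 1 */
    return z_old * (1 - eq) + z_new * eq;
}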

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Marcus
Dec 31, 2021, 7:37:07 AM
On 2021-12-31, Terje Mathisen wrote:
> Marcus wrote:
>> On 2021-12-30, EricP wrote:
>>> C,C++ and a bunch of languages explicitly define booleans as 0 or 1
>>> so this definition won't be optimal for those languages.
>>> VAX Fortran used 0,-1 for LOGICAL but I don't know if that
>>> was defined by the language or implementation dependant.
>
> -1 is better than 1, it can be used as a mask.
>
>>
>> As a software developer I'm painfully aware of this. I decided not to
>> care too much about it, though. Really, most software that relies on
>> this property of C should be frowned upon. E.g. expressions like:
>>
>>    a = b + (c == d);
>>
>> ...aren't really good programming practice.
>
> No! Please tell me it ain't so!

:-)

>
> I use that type of constructs all over the place when writing branchless
> code/doing table lookups etc.

I think that you'll find that the following code produces the exact same
result:

    int a;
    if (c == d) {
        a = b + 1;
    } else {
        a = b;
    }

It too is completely branchless.

My main gripe with the former version is the implicit type conversion
(boolean to integer), and that I don't like to see logical operands and
arithmetic operands mixed in the same expression. Another slightly more
readable version, IMO, would be:

a = b + (c == d ? 1 : 0);

See: https://godbolt.org/z/K7ndn57T1

Thomas Koenig
Dec 31, 2021, 7:37:43 AM
Marcus <m.de...@this.bitsnbites.eu> schrieb:

> As a matter of fact, you can now try out MRISC32 on Compiler Explorer
> (support was added about a week ago): https://godbolt.org/z/z9sK3MYeM

I could not help but start experimenting with that :-)

I see no vectorization for

    void foo(int * const restrict a, int * const restrict b, int * restrict c, int n)
    {
        for (int i=0; i<n; i++)
            c[i] = a[i]+b[i];
    }

and -fopt-info-vec-missed tells me, at https://godbolt.org/z/vPM3KTrnz ,
<source>:3:20: missed: couldn't vectorize loop
<source>:3:20: missed: not vectorized: unsupported data-type

so it seems some more delving into gcc's secret internals
will be required...

Marcus
Dec 31, 2021, 7:43:25 AM
Yup! Vectorization is not at all implemented (the compiler doesn't even
know how to use the vector registers), so it's one of the more
interesting improvements to be done to the MRISC32 GCC machine
description.

Since I'm not a compiler guy (at all), I'm pretty happy that I've gotten
this far. Auto-vectorization is not on the near-term roadmap,
unfortunately (things like better C++ support are higher up on the prio
list). Also, the vector part of the ISA is not really finalized yet.

Inline assembler is the solution for MRISC32 vector code. E.g:

https://github.com/mbitsnbites/mc1-doom/blob/master/src/r_draw.c#L89

/Marcus

a...@littlepinkcloud.invalid
Dec 31, 2021, 7:43:28 AM
Thomas Koenig <tko...@netcologne.de> wrote:
> Marcus <m.de...@this.bitsnbites.eu> schrieb:
>> On 2021-12-30 22:07, Thomas Koenig wrote:
>
>>> What is the assembly for
>>>
>>> _Bool foo (int a, int b)
>>> {
>>> return a > b;
>>> }
>>>
>>> for your architecture at a reasonable optimization level?
>>>
>>
>> As a matter of fact, you can now try out MRISC32 on Compiler Explorer
>> (support was added about a week ago): https://godbolt.org/z/z9sK3MYeM
>
>> POWER:
>> subf 3,3,4
>> srdi 3,3,63
>> blr
>
> Subtract and right shift of the sign bit. That's another
> way of doing this, I guess. When I saw this, I thought
> "Somebody must have been feeling hackish"...

Is it even correct? Surely that's (b - a < 0), not (a > b). I think
it'd return true for foo(-0x7fffffff, 0x7fffffff). But I don't know
the POWER ISA at all well, so pls ignore if this is cleverer than I
think. :-).

Andrew.

Ivan Godard
Dec 31, 2021, 7:48:30 AM
Sorry, it has all the annotations turned on, so the reader can see the
belt flow. A key:
L3C1F0 etc: source position (line # etc)
bN: belt reference to the Nth most recent drop
%N: SSA # for drop
^N: instruction number in ebb
N: cycle number in ebb
|X: phase X (i: immPhase, o: opPhase, d: dropPhase,
w: writerPhase)
v%N: a drop of %N
^%N: an eval of %N

Removing all the annotations, you get:
F("foo");
retn(b0),gtrsb(b0, b1);

Michael S
Dec 31, 2021, 8:17:58 AM
It's about ABI rather than ISA.
If the ABI specifies that 32-bit signed integer parameters passed in registers are sign-extended by the caller, then it is correct.

Terje Mathisen
Dec 31, 2021, 8:27:09 AM
In my use of these constructs, the code has typically been written in
asm, with no helpful peephole optimizer to return it to the form I
intended. :-)

I use another form in branchless conversion of julian day number to
y/m/d, i.e. subtraction and arithmetic shift right to convert the result
into a 0/-1 mask.
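
A minimal C sketch of that subtract-and-shift trick (assuming arithmetic
right shift on signed types, which essentially all targets provide; the
widening subtract avoids overflow on extreme inputs):

#include <stdint.h>

/* (a < b) as a 0/-1 mask: widen, subtract, shift the sign bit down */
int32_t lt_mask(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a - (int64_t)b) >> 63);
}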

>
> My main gripe with the former version is the implicit type conversion
> (boolean to integer), and that I don't like to see logical operands and

bool is a newfangled construct, it all used to be just zero/non-zero ints...

> arithmetic operands mixed in the same expression. Another slightly more
> readable version, IMO, would be:
>
>     a = b + (c == d ? 1 : 0);
>
> See: https://godbolt.org/z/K7ndn57T1

I do like the way ternaries keep all the logic in a single assignment.

David Brown
Dec 31, 2021, 10:04:24 AM
Given a good enough compiler, it is branchless. I am a big fan of using
good optimising compilers, so that you can write your source code in the
way you think is clearest and best, and let the compiler worry about the
details of the low-level code optimisations. But it is still useful to
remember that you can't always assume a good compiler.

Also note that on some processors, branched code can be faster than
branchless equivalents.


>
> My main gripe with the former version is the implicit type conversion
> (boolean to integer), and that I don't like to see logical operands and
> arithmetic operands mixed in the same expression.

Some languages have booleans as an integral part of the language and
distinct from arithmetic types. C is not such a language. The
relational operators are defined to give "int" results, and guarantee
that "c == d" will evaluate to 0 or 1. There is no type conversion
here, implicit or otherwise.

C does have a "bool" type (or, more precisely, "_Bool"), but it was
added late (C99) and changing the operators would be a breaking change.

So "a = b + (c == d);" is not, in fact, mixing logical operands and
arithmetic operands in C - it's all arithmetic as far as the language is
concerned.

(You can well argue that the programmer's intention, or the appearance
of the code, is mixing logical and arithmetic operands. And I'd agree -
it's not a style that I feel comfortably and clearly expresses the
programmer's intent. It might be appropriate in some circumstances.)

C++ has "bool" built-in, and relational operators (other than the new
spaceship operator) return bool. But for compatibility with C, bool
expressions are implicitly converted to int as needed, and the value is
guaranteed.


(Similarly, the result of "a && b" is an "int" in C, and a "bool" in C++.)
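
A quick way to see the C rule in code (using C11's _Static_assert; in
C++ the operand would be a bool instead):

_Static_assert(sizeof(1 > 0) == sizeof(int),
               "relational operators yield int in C");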

robf...@gmail.com
Dec 31, 2021, 10:41:23 AM
I figured I'd throw Thor cc64 compiler output into the mix:

int foo (register int a, register int b)
{
return a > b;
}

Gives:
_foo:
enter 0
slt a0,a1,a0
leave 0

As can be seen, I have not got the compiler removing the ENTER and
LEAVE yet, which could be done when only argument registers are used
in the function.

EricP
Dec 31, 2021, 11:05:12 AM
Terje Mathisen wrote:
> Marcus wrote:
>> On 2021-12-30, EricP wrote:
>>> C,C++ and a bunch of languages explicitly define booleans as 0 or 1
>>> so this definition won't be optimal for those languages.
>>> VAX Fortran used 0,-1 for LOGICAL but I don't know if that
>>> was defined by the language or implementation dependant.
>
> -1 is better than 1, it can be used as a mask.

One or the other usage has to do a negate.
The question is which usage, boolean or mask, is more common?
In general it is boolean, so it is the mask user that should do the negate.

There is a lot of C code that expects that a = b < c; produces 0 or 1.
So if CMP produces a -1 then a C compiler would always generate a NEG too.

For boolean expressions in IF statements the NEG is unnecessary
if the branch tests zero/non-zero and so it might be optimized away.
But that requires the compiler code gen knowing that boolean expressions
in IF statements are not quite the same as other boolean expressions
(and we give a special dispensation to & and | operators on booleans).
A similar rationale applies to optimizing CMOV conditionals.

It seems easier to have the mask user pay for the NEG.


MitchAlsup
Dec 31, 2021, 12:17:34 PM
On Friday, December 31, 2021 at 10:05:12 AM UTC-6, EricP wrote:
> Terje Mathisen wrote:
> > Marcus wrote:
> >> On 2021-12-30, EricP wrote:
> >>> C,C++ and a bunch of languages explicitly define booleans as 0 or 1
> >>> so this definition won't be optimal for those languages.
> >>> VAX Fortran used 0,-1 for LOGICAL but I don't know if that
> >>> was defined by the language or implementation dependant.
> >
> > -1 is better than 1, it can be used as a mask.
> One or the other usage has to do a negate.
<
Which is why it is best for the ISA to have the ability to produce both
0:-1 and 0:+1 and let each language decide for itself which sequence
is best.
<
> The question is which usage, boolean or mask, is more common?
> In general it is boolean so it is mask user that should do the negate.
<
My 66000 even has the instruction to use the mask as a conditional operator
giving an effective bit-by-bit multiplexer ((a&c)|(b&~c)) along with the form
that uses the condition as a selector ((a&!!c)|(b&!c)).
>
> There is a lot of C code that expects that a = b < c; produces 0 or 1.
> So if CMP produces a -1 then a C compiler would always generate a NEG too.
<
s/always/mostly/
<
>
> For boolean expressions in IF statements the NEG is unnecessary
> if the branch tests zero/non-zero and so it might be optimized away.
<
For floating point, one has to give careful consideration to NaNs so that
even after condition inversion the NaNs go into the else-clause.

Terje Mathisen
Dec 31, 2021, 12:28:10 PM
It is of course _far_ too late to change it in C (or most existing
languages/compilers), but please notice that all the SIMD vector
instruction sets I have seen tend to use 0/-1, specifically because that
is usable as a mask.

x86 also has a movemask* opcode which extracts the top bit from each
element and packs them together into an integer register; this makes it
suitable for verifying "Any matching element?" after first doing a SIMD compare.

BGB
Dec 31, 2021, 12:48:37 PM
On 12/31/2021 7:27 AM, Terje Mathisen wrote:
> Marcus wrote:
>> On 2021-12-31, Terje Mathisen wrote:
>>> Marcus wrote:
>>>> On 2021-12-30, EricP wrote:
>>>>> C,C++ and a bunch of languages explicitly define booleans as 0 or 1
>>>>> so this definition won't be optimal for those languages.
>>>>> VAX Fortran used 0,-1 for LOGICAL but I don't know if that
>>>>> was defined by the language or implementation dependant.
>>>
>>> -1 is better than 1, it can be used as a mask.
>>>

(!x-1) //maps nonzero x to -1, and 0 to 0

>>>>
>>>> As a software developer I'm painfully aware of this. I decided not to
>>>> care too much about it, though. Really, most software that relies on
>>>> this property of C should be frowned upon. E.g. expressions like:
>>>>
>>>>    a = b + (c == d);
>>>>
>>>> ...aren't really good programming practice.
>>>
>>> No! Please tell me it ain't so!
>>
>> :-)
>>
>>>
>>> I use that type of constructs all over the place when writing
>>> branchless code/doing table lookups etc.
>>
>> I think that you'll find that the following code produces the exact same
>> result:
>>
>>      int a;
>>      if (c == d) {
>>          a = b + 1;
>>      } else {
>>          a = b;
>>      }
>>
>> It too is completely branchless.
>
> In my use of these constructs, the code has typically been written in
> asm, with no helpful peephole optimizer to return it to the form I
> intended. :-)
>

The use of an "if()" branch this way, while (likely) branchless in my
case, could be less efficient in many cases due to implementing both
branches via predicated ops rather than by putting the value into a
register.

Say:
a = b + (c == d);

Could be handled as:
CMPEQ R5, R6
MOVT R7
ADD R4, R7, R2

( Using ADC here is possible, but BGBCC will not do so. Something like
"ADC R4, 0, R2" would allow doing it in 2 instructions, but no encoding
exists for this. )

If Branch:
CMPEQ R5, R6
ADD?T R4, 1, R2
MOV?F R4, R2

But, not much difference in this case in terms of clock cycles (it would
be a little worse if a branch were used, but this is part of why I ended
up adding predicated ops in the first place...).

One can't nest them or use complex conditionals, but this isn't usually
a big issue.


Granted, predicated ops are a relative minority (say, only ~ 6% of
constant loads end up predicated, fewer in other categories, ...), but
they can save more cycles than implied by their fairly limited range of
use-cases.


> I use another form in branchless conversion of julian day number to
> y/m/d, i.e. subtraction and arithmetic shift right to convert the result
> into a 0/-1 mask.
>

We need to convince the world to go over to defining units of time in
terms of megaseconds and gigaseconds.

Like, one could describe themselves as an ancient vampire who has been
on this Earth for a little over 1 gigasecond...


>>
>> My main gripe with the former version is the implicit type conversion
>> (boolean to integer), and that I don't like to see logical operands and
>
> bool is a newfangled construct, it all used to be just zero/non-zero
> ints...
>

Yeah, _Bool can be defined as:
Integer type whose only defined values are 0 or 1.
Storage size is whatever is both small and convenient.

In BGBCC, the _Bool type basically reduces to an alias for 'unsigned
char', differing in a few contexts, but generally equivalent.

Using 1 bit storage could be more compact, but in many cases trying to
bit-pack these values would be slower than using bytes (and for
bit-packed flags one is probably much better off using
"if(flags&FOO_FLAG)" or similar).


IIRC, there is special case logic in the compiler for casting to _Bool.
So, (_Bool)(expr) => ((expr)!=0)

Thomas Koenig
Dec 31, 2021, 12:57:54 PM
Marcus <m.de...@this.bitsnbites.eu> schrieb:
> On 2021-12-31, Thomas Koenig wrote:
>> Marcus <m.de...@this.bitsnbites.eu> schrieb:
>>
>>> As a matter of fact, you can now try out MRISC32 on Compiler Explorer
>>> (support was added about a week ago): https://godbolt.org/z/z9sK3MYeM
>>
>> I could not help but start experimenting with that :-)
>>
>> I seen no vectorization for
>>
>> void foo(int * const restrict a, int * const restrict b, int * restrict c, int n)
>> {
>> for (int i=0; i<n; i++)
>> c[i] = a[i]+b[i];
>> }
>>
>> and -fopt-info-vec-missed tells me, at https://godbolt.org/z/vPM3KTrnz ,
>> <source>:3:20: missed: couldn't vectorize loop
>> <source>:3:20: missed: not vectorized: unsupported data-type
>>
>> so it seems some more delving into gcc's secret internals
>> will be required...
>>
>
>
> Yup! Vectorization is not at all implemented (the compiler doesn't even
> know how to use the vector registers), so it's one of the more
> interesting improvements to be done to the MRISC32 GCC machine
> description.
>
> Since I'm not a compiler guy (at all), I'm pretty happy that I've gotten
> this far.

And pretty impressive it is, too. I've never even gotten close to
the machine description part of gcc.

> Auto-vectorization is not on the near term roadmap,
> unfortunately (things like better C++ support is higher up on the prio
> list). Also, the vector part of the ISA is not really finalized yet.
>
> Inline assembler is the solution for MRISC32 vector code. E.g:
>
> https://github.com/mbitsnbites/mc1-doom/blob/master/src/r_draw.c#L89

It might be possible to upgrade this to __builtin; that way it would be
possible to leave at least the register allocation part to gcc.

Hmm... the only thing that comes to mind for vectorizing with
size-agnostic vectors is the SVE feature of aarch64, so it might
be possible to pick up something from that. However, this is a _big_
ISA, and I am not sure that poking around there will lead to anything
useful in a reasonable amount of time.

Marcus
Dec 31, 2021, 1:45:07 PM
MRISC32 is a vector-first ISA. Producing masks is the natural choice
when working with vectors.

I was mostly happy to discover that it worked pretty much equally well
for scalars & branches.

So far it seems that using -1 (mask) instead of +1 gives me roughly the
same instruction count in most situations as for architectures that use
a flags register:

* Cond. branch: S[cc] + BS ~= CMP + B[cc]
* Cond. select: S[cc] + SEL ~= CMP + CSEL[cc]/MOV[cc]
* Cond. to 0/1: S[cc] + NEG ~= CMP + CSET[cc]/MOV[cc]

OTOH it's a very good fit when dealing with vectors and even packed
data (SIMD). E.g. the following sequence does bytewise selection for
each vector element:

sltu.b v1, v2, v3
sel v1, v4, v5

SLTU.B does "set if less than (unsigned)" for each byte of a 32-bit
word, and since the instruction operands are vector registers, it will
perform the operation for each vector element (each element is 32 bits
wide).

SEL does "bitwise select", i.e. a = (b & a) | (c & ~a)

One of the key design principles of the MRISC32 ISA is that the same
instructions are used for scalar and for vector operands, with the same
semantics.

/Marcus

Anton Ertl
Dec 31, 2021, 3:04:34 PM
EricP <ThatWould...@thevillage.com> writes:
>Terje Mathisen wrote:
>> Marcus wrote:
>>> On 2021-12-30, EricP wrote:
>>>> C,C++ and a bunch of languages explicitly define booleans as 0 or 1
>>>> so this definition won't be optimal for those languages.
>>>> VAX Fortran used 0,-1 for LOGICAL but I don't know if that
>>>> was defined by the language or implementation dependant.
>>
>> -1 is better than 1, it can be used as a mask.
>
>One or the other usage has to do a negate.

Not necessarily, there are architectures that can do both.

E.g., on Aarch64, the Forth word < (which produces 0 or -1) looks as
follows:

55555843A8: ldr x0, [x25,#0x8]! #load second stack item
55555843AC: add x26, x26, #0x8 #update instruction pointer
55555843B0: subs xzr, x0, x27 #compare
55555843B4: csinv x27, xzr, xzr, ge #and reify the ge condition as 0/-1
55555843B8: ldur x1, [x26,#-0x8] #dispatch next primitive
55555843BC: br x1 #dispatch next primitive

>The question is which usage, boolean or mask, is more common?

You can represent a boolean true as -1.

The issue is not "boolean", but C.

>There is a lot of C code that expects that a = b < c; produces 0 or 1.

Not just that, the language guarantees it.

>So if CMP produces a -1 then a C compiler would always generate a NEG too.

Only if the flag needs to be reified (not that common in C, with
branching operators like && and ||).

>For boolean expressions in IF statements the NEG is unnecessary
>if the branch tests zero/non-zero and so it might be optimized away.
>But that requires the compiler code gen knowing that boolean expressions
>in IF statements are not quite the same as other boolean expressions

This is easy to implement in a compiler. E.g., in tree-parsing
instruction selection you would have stuff like:

grammar rule # generated code
root: Condbranch(cond) # branch on the comparison result
cond: Lt(reg,reg) # generate a comparison
reg: cond # reify the cond as 0/1

and as long as the conds are consumed only by condbranches, you get no
reification and not negation.

>(and we give a special dispensation to & and | operators on booleans).

Not sure what you mean by that. You can also optimize & and | with
flags as parameters; assuming a cond is 0/-1 in a register (and not
just something in the condition codes; in that case I would introduce
additional nonterminals):

znz: cond # no code
cond: And(cond,cond) # and
znz: And(cond,reg) # and
znz: And(reg,cond) # and
cond: Or(cond,cond) # or
znz: Or(znz,reg) # or
znz: Or(reg,znz) # or

znz represents a flag as zero/nonzero value, which is useful in
combination with branch on zero instructions (i.e., if the result is
used by such a branch).

>It seems easier to have the mask user pay for the NEG.

Certainly MIPS, Alpha, and RISC-V designers thought so. But of course
it is possible to use instruction combining to eliminate the latency
cost of the negation.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

MitchAlsup
Dec 31, 2021, 3:30:04 PM
On Friday, December 31, 2021 at 2:04:34 PM UTC-6, Anton Ertl wrote:
> EricP <ThatWould...@thevillage.com> writes:
> >Terje Mathisen wrote:
> >> Marcus wrote:
> >>> On 2021-12-30, EricP wrote:
> >>>> C,C++ and a bunch of languages explicitly define booleans as 0 or 1
> >>>> so this definition won't be optimal for those languages.
> >>>> VAX Fortran used 0,-1 for LOGICAL but I don't know if that
> >>>> was defined by the language or implementation dependant.
> >>
> >> -1 is better than 1, it can be used as a mask.
> >
> >One or the other usage has to do a negate.
> Not necessarily, there are architectures that can do both.
<
And architectures which provide both--this is my recommendation--
provide both.
>
> E.g., on Aarch64, the Forth word < (which produces 0 or -1) looks as
> follows:
>
> 55555843A8: ldr x0, [x25,#0x8]! #load second stack item
> 55555843AC: add x26, x26, #0x8 #update instruction pointer
> 55555843B0: subs xzr, x0, x27 #compare
> 55555843B4: csinv x27, xzr, xzr, ge #and reify the ge condition as 0/-1
> 55555843B8: ldur x1, [x26,#-0x8] #dispatch next primitive
> 55555843BC: br x1 #dispatch next primitive
> >The question is which usage, boolean or mask, is more common?
> You can represent a boolean true as -1.
>
> The issue is not "boolean", but C.
> >There is a lot of C code that expects that a = b < c; produces 0 or 1.
> Not just that, the language guarantees it.
> >So if CMP produces a -1 then a C compiler would always generate a NEG too.
> Only if the flag needs to be reified (not that common in C, with
> branching operators like && and ||).
<
And then there is My 66000 which has sign control bits in integer calculation
instructions::
<
Rd = -Rs1 - Rs2
<
to avoid 90%± of negation instructions; these pair with the constant attachment
encoding to decrease entropy of the encoding.
<
> >For boolean expressions in IF statements the NEG is unnecessary
> >if the branch tests zero/non-zero and so it might be optimized away.
> >But that requires the compiler code gen knowing that boolean expressions
> >in IF statements are not quite the same as other boolean expressions
> This is easy to implement in a compiler. E.g., in tree-parsing
> instruction selection you would have stuff like:
>
> grammar rule # generated code
> root: Condbranch(cond) # branch on the comparison result
> cond: Lt(reg,reg) # generate a comparison
> reg: cond # reify the cond as 0/1
>
> and as long as the conds are consumed only by condbranches, you get no
> reification and not negation.
<
In almost every reasonable(tm) ISA there is a way to CMP and BC without
having to jump through hoops of NEGs.
<
> >(and we give a special dispensation to & and | operators on booleans).
> Not sure what you mean by that. You can also optimize & and | with
> flags as parameters; assuming a cond is 0/-1 in a register (and not
> just something in the condition codes; in that case I would introduce
> additional nonterminals):
>
> znz: cond # no code
> cond: And(cond,cond) # and
> znz: And(cond,reg) # and
> znz: And(reg,cond) # and
> cond: Or(cond,cond) # or
> znz: Or(znz,reg) # or
> znz: Or(reg,znz) # or
>
> znz represents a flag as zero/nonzero value, which is useful in
> combination with branch on zero instructions (i.e., if the result is
> used by such a branch).
> >It seems easier to have the mask user pay for the NEG.
<
Mask consumers of conditions are less than 10% of consumers of conditions;
so, it makes sense to make them pay.
<
> Certainly MIPS, Alpha, and RISC-V designers thought so. But of course
> it is possible to use instruction combining to eliminate the latency
> cost of the negation.
<
It is even possible to make 90%+ of NEGs disappear by embedding negation/inversion
into the semantics of ISA encoding.

Quadibloc
Dec 31, 2021, 3:51:00 PM
On Thursday, December 30, 2021 at 9:51:01 AM UTC-7, EricP wrote:

> VAX Fortran used 0,-1 for LOGICAL but I don't know if that
> was defined by the language or implementation dependant.

I believe it would have been implementation dependent, since
IBM mainframe FORTRAN used 0 and 1 for .FALSE. and .TRUE.
respectively.

John Savard

Quadibloc
Dec 31, 2021, 4:28:04 PM
On Thursday, December 30, 2021 at 2:24:37 PM UTC-7, Quadibloc wrote:
> On Friday, December 24, 2021 at 9:37:14 PM UTC-7, Ivan Godard wrote:
> > On 12/24/2021 10:17 AM, MitchAlsup wrote:
>
> > > There is NO JUSTIFIABLE reason that an instruction is not entirely
> > > self-contained!
>
> > Really? I suppose that might be true for OOO that is emulating
> > sequential execution, but what about VLIW and other wide multi-issue?
>
> > Chopping off the variable and large parts so they can be recognized in
> > parallel lets the issue width no longer depend on how many constants you
> > have.
>
> This is the sort of thing I've been fooling around with finding a way to
> achieve. To avoid too much complexity, my recent designs all involved
> a self-contained 256-bit block which combined as many fixed-length
> instructions as possible with the variable or large parts they needed.

I've turned my attention to Concertina II again, having thought of a
way to simplify one aspect of that design considerably.

There would just be one instruction format, and either blocks would
have no header, or just one kind of header.

Unfortunately, that turns out to have constrained available opcode space
severely enough to require some serious compromises, so it will take
some time to make those livable. Essentially, I have only 30 bits for the
32-bit instructions, since I include a break bit in the instruction itself,
and use half the opcode space for pairs of 15-bit instructions.

John Savard

Thomas Koenig
Jan 1, 2022, 6:13:25 AM
a...@littlepinkcloud.invalid <a...@littlepinkcloud.invalid> schrieb:
POWER uses 64-bit data only (but it generates a 32-bit carry,
just in case), so this is indeed correct.

The code emitted for

_Bool foo (long a, long b)
{
    return a > b;
}

is far less pretty and depends on the selected CPU and compiler.

gcc 4.8.5:

cmpd cr7,r3,r4
mfocrf r3,1
rlwinm r3,r3,30,31,31
blr

This moves a condition code into a register and extracts
the bit field.

gcc 12 with -mcpu=power8 (the current default):

sradi r10,r4,63
rldicl r9,r3,1,63
subfc r3,r3,r4
adde r3,r9,r10
xori r3,r3,1
clrlwi r3,r3,24
blr

That is weird. Two shifts and masks, one subtraction, one
addition, one xor, and one instruction that clears bits.

gcc 12 with -mcpu=power9:

li r9,0
li r10,1
cmpd r3,r4
iselgt r3,r10,r9
blr

Load constants into two registers and select based on a compare
instruction.

xlc:

cmpd r3,r4
li r3,1
bgt 14 <foo+0x14>
li r3,0
ori r2,r2,0
blr

Load 1, and conditionally branch around the load of 0. The or
with immediate is a no-op and some sort of hint to the linker, IIRC.

Clang:

sradi r5,r4,63
rldicl r6,r3,1,63
subfc r3,r3,r4
adde r3,r6,r5
xori r3,r3,1
blr

Same as gcc 12 for POWER8 above, but without the final clear.

At least going by instruction count, the oldest solution looks
to be the best.

robf...@gmail.com
Jan 1, 2022, 7:05:59 AM
The PowerPC code demonstrates one of the issues with using condition
registers instead of GPRs for comparison results: they end up being
transferred to GPRs anyway. One would think that the compiler should
return a value in a condition register for a function returning a bool,
which would be much more efficient. There are eight condition registers
available. That way the condition register could be specified in a
subsequent conditional statement without needing to convert the GPR
value to a condition code. The equals flag of the condition register
could be used to store the Boolean value. The compiler would have to
track values in condition registers. There should be a ‘set’ instruction
for the condition registers allowing the bits to be transferred to the
equals bit.

Compiled code should look something like:
SLT cr0,r3,r4
RET

Quadibloc
Jan 1, 2022, 8:28:41 AM
On Friday, December 31, 2021 at 2:28:04 PM UTC-7, Quadibloc wrote:

> I've turned my attention to Concertina II again, having thought of a
> way to simplify one aspect of that design considerably.
>
> There would just be one instruction format, and either blocks would
> have no header, or just one kind of header.
>
> Unfortunately, that turns out to have constrained available opcode space
> severely enough to require some serious compromises, so it will take
> some time to make those livable. Essentially, I have only 30 bits for the
> 32-bit instructions, since I include a break bit in the instruction itself,
> and use half the opcode space for pairs of 15-bit instructions.

I've come up with something that should work. The sacrifice is that
16-bit instructions will only be available within code blocks with headers -
and they cannot be predicated.

John Savard

Ivan Godard

unread,
Jan 1, 2022, 9:09:09 AM1/1/22
to
This is very similar in effect to having different regfiles for integer
and FP, only in your proposal it's integer (and FP?) and bool. Splitting
regs by type does offer some code (and encoding) advantages, but also
some drawbacks. Think about how you would do VARARGS when bools are
passed in the flags :-)
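
To make the VARARGS point concrete, here is a minimal sketch in plain
ISO C (the function is invented for illustration): va_arg can only
fetch arguments from the ordinary integer argument area, so a bool
that lived in a condition register would have to be copied into a GPR
at every varargs call site anyway.

#include <stdarg.h>

/* counts how many of the n following bool-valued arguments are true */
int count_true(int n, ...)
{
    va_list ap;
    int i, k = 0;

    va_start(ap, n);
    for (i = 0; i < n; i++)
        k += va_arg(ap, int) != 0; /* _Bool args arrive promoted to int */
    va_end(ap);
    return k;
}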

Terje Mathisen

unread,
Jan 1, 2022, 9:27:50 AM1/1/22
to
Just like x86 condition codes, POWER compilers will probably do a better
job if they can inline bool functions, so that the condition code can be
used directly instead of first having to be reified as an int, then
tested again in the calling function.

Thomas Koenig

unread,
Jan 1, 2022, 10:01:12 AM1/1/22
to
Terje Mathisen <terje.m...@tmsw.no> wrote:

> Just like x86 condition codes, POWER compilers will probably do a better
> job if they can inline bool functions, so that the condition code can be
> used directly instead of first having to be reified as an int, then
> tested again in the calling function.

Very much so.

_Bool gt (long int a, long int b)
{
return a > b;
}

long int mymax(long int a,long int b)
{
return gt(a,b) ? a : b;
}

will give you, with -O3 on a current trunk,

cmpd r3,r4
isellt r3,r4,r3
blr

for mymax.

EricP

unread,
Jan 1, 2022, 10:41:33 AM1/1/22
to
Anton Ertl wrote:
> EricP <ThatWould...@thevillage.com> writes:
>
>> So if CMP produces a -1 then a C compiler would always generate a NEG too.
>
> Only if the flag needs to be reified (not that common in C, with
> branching operators like && and ||).

What I was getting at was that the code gen point would likely be
so far down in the compiler that it would have lost the context necessary
to know whether this expression was a non-reified bool or integer;
it would have to assume integer and be forced to emit a NEG
(as required by the C standard).

That left it with the much harder job of trying to optimize NEG away later.

But I see below that you show how to retain the necessary context.

>> For boolean expressions in IF statements the NEG is unnecessary
>> if the branch tests zero/non-zero and so it might be optimized away.
>> But that requires the compiler code gen knowing that boolean expressions
>> in IF statements are not quite the same as other boolean expressions
>
> This is easy to implement in a compiler. E.g., in tree-parsing
> instruction selection you would have stuff like:
>
> grammar rule # generated code
> root: Condbranch(cond) # branch on the comparison result
> cond: Lt(reg,reg) # generate a comparison
> reg: cond # reify the cond as 0/1
>
> and as long as the conds are consumed only by condbranches, you get no
> reification and no negation.
>
>> (and we give a special dispensation to & and | operators on booleans).
>
> Not sure what you mean by that. You can also optimize & and | with
> flags as parameters; assuming a cond is 0/-1 in a register (and not
> just something in the condition codes; in that case I would introduce
> additional nonterminals):

The & and | operators normally act on integral data types and return
the same types. It would need special handling so they operate on your
cond type and don't trigger reification.

As you show below.

> znz: cond # no code
> cond: And(cond,cond) # and
> znz: And(cond,reg) # and
> znz: And(reg,cond) # and
> cond: Or(cond,cond) # or
> znz: Or(znz,reg) # or
> znz: Or(reg,znz) # or
>
> znz represents a flag as zero/nonzero value, which is useful in
> combination with branch on zero instructions (i.e., if the result is
> used by such a branch).

Yes, this overloading for & and | would do.

The compiler would have to be built to propagate cond as a distinct type.
I was assuming that none of this would be in a C compiler already
(why would it be, since C defines these all as integer expressions),
which is why I expected that a port to this ISA would always generate
the NEG and then have to try to optimize it away.
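
To make the two cases concrete, a minimal C illustration (nothing
beyond ISO C assumed):

_Bool reified(long a, long b)
{
    return a < b; /* the 0/1 value must exist in a register; on an
                     ISA whose compare yields 0/-1 this costs a NEG */
}

long branched(long a, long b)
{
    if (a < b) /* the flag feeds the branch directly, so there is
                  no reification and hence no NEG */
        return a;
    return b;
}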


Anton Ertl

unread,
Jan 1, 2022, 11:44:49 AM1/1/22
to
EricP <ThatWould...@thevillage.com> writes:
>Anton Ertl wrote:
>What I was getting at was that the code gen point would likely be
>so far down in the compiler that it would have lost the context necessary
>to know whether this expression was a non-reified bool or integer,

For that you would have to hide the data flow from the comparison to
the branch. Even relatively simple compilers will see this for most
code, and more sophisticated compilers keep track of data flow quite
far; admittedly, it gets a little tricky if there is a phi function in
between (but that's pretty rare for the stuff we are discussing here).
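
A minimal example of that tricky case in plain C: both assignments to
t are comparisons, but the branch at the bottom consumes phi(t1,t2)
rather than a compare directly.

int f(long a, long b, int which)
{
    int t;

    if (which)
        t = a < b; /* t1 */
    else
        t = b < a; /* t2 */
    if (t) /* the consumer sees phi(t1,t2), not a comparison */
        return 1;
    return 0;
}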

>> This is easy to implement in a compiler. E.g., in tree-parsing
>> instruction selection you would have stuff like:
>>
>> grammar rule # generated code
>> root: Condbranch(cond) # branch on the comparison result
>> cond: Lt(reg,reg) # generate a comparison
>> reg: cond # reify the cond as 0/1
>>
>> and as long as the conds are consumed only by condbranches, you get no
>> reification and no negation.
>>
>>> (and we give a special dispensation to & and | operators on booleans).
>>
>> Not sure what you mean by that. You can also optimize & and | with
>> flags as parameters; assuming a cond is 0/-1 in a register (and not
>> just something in the condition codes; in that case I would introduce
>> additional nonterminals):
>
>The & and | operators normally act on integral data types and return
>the same types. It would need special handling so they operate on your
>cond type and don't trigger reification.

cond is a nonterminal of the tree grammar, not a type. The trick with
tree grammars is that they typically are ambiguous, and the costs (not
shown here) of the code generation rules are used to select the
lowest-cost tree parse. In the present context, reifying will usually
be more expensive and will not be selected if an alternative exists.
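
To illustrate, the same rules again with costs attached (the numbers
are invented for the example; a real generator lets the back-end
author supply them per rule):

grammar rule # cost # generated code
root: Condbranch(cond) # 1 # branch on the comparison result
cond: Lt(reg,reg) # 1 # generate a comparison
reg: cond # 2 # reify the cond as 0/1

With these costs, a compare that feeds a branch parses without the
reg: cond rule, so the reification code is selected only when the
value is genuinely needed in a register.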

>As you show below.
>
>> znz: cond # no code
>> cond: And(cond,cond) # and
>> znz: And(cond,reg) # and
>> znz: And(reg,cond) # and
>> cond: Or(cond,cond) # or
>> znz: Or(znz,reg) # or
>> znz: Or(reg,znz) # or
>>
>> znz represents a flag as zero/nonzero value, which is useful in
>> combination with branch on zero instructions (i.e., if the result is
>> used by such a branch).
>
>Yes, this overloading for & and | would do.
>
>The compiler would have to be built to propagate cond as a distinct type.

As mentioned, cond is a nonterminal, not a type (and root, reg, and
znz are also nonterminals). The rest of the compiler knows nothing
about these nonterminals, and does not need to concern itself with
them. It delivers a fully instantiated tree (i.e., only operators and
terminals) to the tree parser, and the tree parser selects the optimal
(wrt the tree grammar) parse (and thus code) for the tree. E.g.,

if (a<b) ...

could become the tree

Condbranch(Lt(Reg,Reg))

(note that the terminal "Reg" is different from the nonterminal "reg".)

Tree parsing is just one instruction selection technology, but others
(e.g., Davidson-Fraser) are just as capable of avoiding reification of
flags in the common cases. In Davidson-Fraser the conceptual process
might be along the lines of first generating the reification, and then
eliminating it through peephole optimization.
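
A conceptual sketch of that flow (the RTL-ish notation is made up,
not any particular compiler's IR):

# naive expansion of "if (a < b) goto L":
t = lt(a,b) # reify the flag as 0/1
if t != 0 goto L # test it again

# after the peephole combiner merges the adjacent instructions:
if lt(a,b) goto L # reification eliminated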

I guess that this is a major reason why we have not yet seen, in the
condition handling area, the kind of standardization on one particular
architectural style that we have seen in other areas (e.g., register
machines won quite a while ago, 2s-complement won a long time ago,
little-endian won a short time ago, etc.).

In particular, (general-purpose) register machines have won because
compilers are so bad at making good use of special-purpose registers,
yet the MIPS or 88k approaches (both of which use general-purpose
registers for conditions) have not won out over the condition-code
register approach (AMD64, ARM A32, ARM A64).

Marcus

unread,
Jan 1, 2022, 12:35:13 PM1/1/22
to
The MRISC32 version, https://godbolt.org/z/r6rTj5aWv

mymax:
max r1,r1,r2
ret

;-)

Guillaume

unread,
Jan 2, 2022, 1:55:45 PM1/2/22
to
For RISCV, it's:
mymax:
bge a0,a1,.L4
mv a0,a1
.L4:
ret

So, requires a conditional branch...
But, for floating point (if supported), there is the 'fmax' instruction.

EricP

unread,
Jan 2, 2022, 3:29:05 PM1/2/22
to
I'm looking at some papers on tree parsing peephole optimizers,
built using an LR(0) parser with extra bits that permit shift-reduce
and reduce-reduce conflicts, so that multiple matches can be recognized.
Where the parse can follow multiple paths, it forks and follows all of them.

If I was doing this I would add an option to allow a semantic check to be
performed at each fork so you can guide it to choose particular forks
and prune unnecessary forks as soon as possible.

This breadth-first, cost-based forking search is similar to how I did
syntax error repair. I also once built a lexical scanner that could
search for many (as in tens of thousands of) search strings at once,
forking to follow multiple paths (I was thinking it might be useful
as a virus scanner).


MitchAlsup

unread,
Jan 2, 2022, 5:38:00 PM1/2/22
to
Why is there not an IMAX instruction in every modern ISA ??

Marcus

unread,
Jan 3, 2022, 2:44:30 AM1/3/22
to
Agree.

Actually, RISC-V has MIN, MAX, MINU, MAXU (just like MRISC32) in the
bitmanip extension (a.k.a. the "add all instructions that should have
been part of the base ISA"-extension ;-) )
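
With Zbb enabled, the mymax example from earlier in the thread becomes
a single instruction plus return (a sketch; assuming something like
-march=rv64gc_zbb on a reasonably recent gcc or clang):

mymax:
max a0,a0,a1
ret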

It also has VMIN, VMAX, VMINU, VMAXU in the V extension, acting on
vector registers.

...and VFMIN, VFMAX in the V extension, acting on vector registers.

So in summary, RISC-V has two sets of integer MIN/MAX instructions and
two sets of floating-point MIN/MAX instructions. Different instructions
for different register files, and none of them part of the base
instruction set.

I also note that RISC-V goes against the trend and has *three* register
files (scalar integer + scalar floating-point + vector), with different
instruction sets for each register file. x86 history repeating?

/Marcus

Anton Ertl

unread,
Jan 3, 2022, 4:07:15 AM1/3/22
to
EricP <ThatWould...@thevillage.com> writes:
>I'm looking at some papers on tree parsing peephole optimizers,
>built using an LR(0) parser with extra bits to allow shift-reduce
>and reduce-reduce conflicts to allow multiple matches.

LR parsers are string parsers. The Graham-Glanville approach to code
generation used an LR parser on a prefix representation of the tree as
a poor-man's tree parser; of course they did not think about it that
way at the time, but once tree parser generators were developed, the
Graham-Glanville approach vanished.

An early paper on tree parsing is:

@InProceedings{emmelmann+89,
author = {Helmut Emmelmann and Friedrich-Wilhelm Schr\"oer and
Rudolf Landwehr},
title = {{BEG} -- a Generator for Efficient Back Ends},
crossref = "sigplan89",
pages = "227--237"
}

@Proceedings{sigplan89,
key = "SIGPLAN~'89",
booktitle = "SIGPLAN~'89 Conference on
Programming Language Design and Implementation",
title = "SIGPLAN~'89 Conference on
Programming Language Design and Implementation",
year = "1989",
}

Or you can read Section 2 of
<http://www.complang.tuwien.ac.at/papers/thier+18.pdf>, which provides
some background on this technology.

>If I was doing this I would add an option to allow a semantic check to be
>performed at each fork so you can guide it to choose particular forks
>and prune unnecessary forks as soon as possible.

For best code generation speed, the tree grammar is transformed into a
tree-parsing automaton (e.g., with the tool "burg"), where the code
generation speed is independent of the number of rules (and number of
alternative trees) involved.

Concerning semantics checks, in BEG and lburg costs can depend on
semantic things (i.e., things beyond the tree grammar), such as the
values of constants; Thier's constraints disable rules based on
semantic things (again, like the values of constants). But both
features are used for improved modeling of instructions (e.g., the
MIPS addiu instruction allows only a certain range of immediate
values), not pruning forks.
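
A sketch of what such a semantically constrained rule might look like,
in the same informal notation as above (the predicate syntax is
invented for illustration):

reg: Add(reg,Const) # addiu, only if -32768 <= Const.value <= 32767
reg: Add(reg,reg) # addu

A constant outside the 16-bit signed immediate range cannot use the
first rule, so it has to be parsed into a register first (e.g., via
lui/ori).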

Ivan Godard

unread,
Jan 3, 2022, 5:01:36 AM1/3/22
to
because you don't need it when you have "?:".

gtr(b0, b1), pick(b0, b1, b2), retn(b0);

one bundle, one cycle.

MitchAlsup

unread,
Jan 3, 2022, 12:28:30 PM1/3/22
to
and at least 4 times as much transport energy.

Guillaume

unread,
Jan 3, 2022, 1:21:44 PM1/3/22
to
On 02/01/2022 at 23:37, MitchAlsup wrote:
> Why is there not an IMAX instruction in every modern ISA ??

Well, I can't say for every modern ISA, but for RISC-V, as Marcus said,
MIN/MAX instructions (and much more) are available in the Bitmanip
extension, which was ratified recently if I'm not mistaken.

The reason is that the RISC-V ISA is very modular, so they have chosen
to keep the base ISA minimal, and then extend it.

It has benefits of course - you can design cores that are as minimal or
as featureful as needed, while still being compliant - but it also has
drawbacks: it will tend to lead to fragmentation. RISC-V has mitigated
that by defining "supersets" such as RV64GC that contain a number of
extensions, and that are meant for more general-purpose computing, but
fragmentation can still be an issue.

As to why they chose to put the MIN/MAX instructions in the "Bitmanip"
extension, I'm not sure. It looks like this extension contains A LOT of
common instructions that were non-essential, and missing in the base
ISA. Like a big bag of instructions, not all of them being really bit
manipulation-related IMHO. Oh, and the Bitmanip extension is itself
sliced into 4 sub-extensions =). Good luck if you were to implement the
whole extension in its entirety: it's very large.

Don't get me wrong, I do like RISC-V and have designed a RV32/64 core.
The more I've worked with this ISA, the more I've come to understand
the choices that were made. But there's still this point
about fragmentation that I'm concerned with.

John Dallman

unread,
Jan 3, 2022, 2:28:54 PM1/3/22
to
In article <sqverm$adp$1...@gioia.aioe.org>, mes...@bottle.org (Guillaume)
wrote:

> Don't get me wrong, I do like RISC-V and have designed a RV32/64
> core. The more I've worked with this ISA and the more I've gotten
> to understand the choices that were made. But, there's still this
> point about fragmentation that I'm concerned with.

It's a matter of the usage cases. Small embedded systems, which don't get
applications installed on them by end-users, often want the smallest
practical processor core. General-purpose systems (smartphones to
supercomputers) can usually afford some generality in their ISA and the
associated costs.

The proliferation of RISC-V extensions is somewhat analogous to the
many-fold ISA variations of ARMv5 through v7, although much better
structured. ARMv8 is far more prescriptive, and the ARMv8.0 ISA is
adequate for many purposes.

I expect that if general-purpose OSes become popular on RISC-V, each OS
will specify a minimum set of extensions that must be available.

John

Marcus

unread,
Jan 3, 2022, 2:33:33 PM1/3/22
to
On 2022-01-03, Guillaume wrote:
> Le 02/01/2022 à 23:37, MitchAlsup a écrit :
>> Why is there not an IMAX instruction in every modern ISA ??
>
> Well, I can't say for every modern ISA, but for RISC-V, as Marcus said,
> MIN/MAX instructions (and much more) are available in the Bitmanip
> extension, which was ratified recently if I'm not mistaken.
>
> The reason is that the RISC-V ISA is very modular, so they have chosen
> to keep the base ISA minimal, and then extend it.
>
> It has benefits of course - you can design cores that are as minimal or
> as featureful as needed, while still being compliant - but it also has
> drawbacks: it will tend to lead to fragmentation. RISC-V has mitigated
> that by defining "supersets" such as RV64GC that contain a number of
> extensions, and that are meant for more general-purpose computing, but
> fragmentation can still be an issue.
>
> As to why they chose to put the MIN/MAX instructions in the "Bitmanip"
> extension, I'm not sure. It looks like this extension contains A LOT of
> common instructions that were non-essential, and missing in the base
> ISA. Like a big bag of instructions, not all of them being really bit
> manipulation-related IMHO. Oh, and the Bitmanip extension is itself
> sliced into 4 sub-extensions =). Good luck if you were to implement the
> whole extension in its entirety: it's very large.

Funny thing: I recently discovered that they seem to have dropped the
bit-field instructions from bitmanip (!). I thought that it was one of
the key features of that extension. So I guess we're looking at more
bit-field extensions in the future, adding even more to the fragmentation.

https://github.com/riscv/riscv-bitmanip/issues/169

>
> Don't get me wrong, I do like RISC-V and have designed a RV32/64 core.
> The more I've worked with this ISA and the more I've gotten to
> understand the choices that were made. But, there's still this point
> about fragmentation that I'm concerned with.

I too like RISC-V (it's open, it works and it kind of scales), and at
the same time I dislike it (mostly the over-catering for ultra-small,
leading to some sub-optimal design decisions, and the fragmentation
part).

/Marcus