
Compare-and-branch vs PIC


Russell Wallace

Dec 7, 2022, 10:26:11 AM
I've been trying to better understand the reasoning behind some of the design decisions that went into historical computer architectures. For example, I note that ARM-1 is not a purist RISC, but has a number of features designed to get more work done in each instruction. But one thing it does not have is compare-and-branch in one instruction; you have to compare separately, then branch on condition code. Yet compare-and-branch was not entirely unprecedented at the time. There doesn't seem to be any clear evidence that it would have added an extra cycle of delay:
https://retrocomputing.stackexchange.com/questions/25785/would-compare-and-branch-have-added-an-extra-cycle-on-arm-1

A possibility that occurs to me, though: maybe you can't use the ALU to compare operands and also add a displacement to the instruction pointer, in the same cycle. Maybe if you want compare-and-branch, you have to choose between an extra cycle of delay, vs the simpler model of branch instruction that provides bits that simply replace the low bits of the instruction pointer. Which is potentially awkward if you were hoping for position independent code.

Is that actually the case, assuming you are tightly constrained in die area (either historically on a micron-scale process, or today on a small embedded core)? Or am I missing the mark?

Theo

Dec 7, 2022, 11:42:18 AM
Russell Wallace <russell...@gmail.com> wrote:
> I've been trying to better understand the reasoning behind some of the design decisions that went into historical computer architectures. For example, I note that ARM-1 is not a purist RISC, but has a number of features designed to get more work done in each instruction. But one thing it does not have is compare-and-branch in one instruction; you have to compare separately, then branch on condition code. Yet compare-and-branch was not entirely unprecedented at the time. There doesn't seem to be any clear evidence that it would have added an extra cycle of delay:
> https://retrocomputing.stackexchange.com/questions/25785/would-compare-and-branch-have-added-an-extra-cycle-on-arm-1
>
> A possibility that occurs to me, though: maybe you can't use the ALU to
> compare operands and also add a displacement to the instruction pointer,
> in the same cycle. Maybe if you want compare-and-branch, you have to
> choose between an extra cycle of delay, vs the simpler model of branch
> instruction that provides bits that simply replace the low bits of the
> instruction pointer. Which is potentially awkward if you were hoping for
> position independent code.

You can do PIC on 32-bit ARM, you just have to do:

ADDEQ pc,pc,#offset
(or whatever condition code)

instead of a branch (where #offset is calculated by your assembler). If the
range of an immediate constant is too short, you have to use a register:

MOV rN,#(offset & 0x3FC00) << 10
ORR rN,rN,#(offset & 0x3FC)
ADDEQ pc,pc,rN

which can also be generated by your assembler. (you know the bottom two
bits of the offset are going to be zero so don't need to set them)

If you want to do compare and relative-branch you need to use the ALU twice,
once for the compare and once for adding to the PC. It would have been
possible to do a compare with absolute branch
(PC <= cond ? immediate : PC+4)
without using the ALU, but that's not very useful.
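As a rough C sketch of those two next-PC computations (my reading, ignoring the
ARM pipeline's PC-ahead offset): the relative form needs an add on the taken path,
while the absolute form only needs to select the immediate.

#include <stdint.h>

/* Relative conditional branch: the taken path needs an adder. */
uint32_t next_pc_relative(uint32_t pc, int cond, int32_t disp) {
    return cond ? pc + (uint32_t)disp : pc + 4;
}

/* Absolute conditional branch: the taken path is just a mux on the immediate. */
uint32_t next_pc_absolute(uint32_t pc, int cond, uint32_t target) {
    return cond ? target : pc + 4;
}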

> Is that actually the case, assuming you are tightly constrained in die
> area (either historically on a micron-scale process, or today on a small
> embedded core)? Or am I missing the mark?

These days addition is cheap, so pc <= pc + N isn't very costly. Back then,
it was.

Theo

Theo

Dec 7, 2022, 11:44:32 AM
Theo <theom...@chiark.greenend.org.uk> wrote:
> MOV rN,#(offset & 0x3FC00) << 10
> ORR rN,rN,#(offset & 0x3FC)
> ADDEQ pc,pc,rN

should be:

MOV rN,#(offset & 0x3FC00)
ORR rN,rN,#(offset & 0x3FC)
ADDEQ pc,pc,rN

(the <<10 is inferred by the assembler)
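For reference, a small C sketch of how such an offset splits into the two
immediates (the bounds are my assumption: a word-aligned offset below 0x40000,
so each chunk is 8 contiguous bits at an even position and encodes as an ARM
immediate):

#include <stdint.h>

/* Split a word-aligned PIC offset per the sequence above. */
void split_offset(uint32_t offset, uint32_t *hi, uint32_t *lo) {
    *hi = offset & 0x3FC00;  /* MOV rN,#...    (assembler picks the rotation) */
    *lo = offset & 0x3FC;    /* ORR rN,rN,#...                                */
    /* the bottom two bits are known to be zero, so they are never encoded    */
}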

John Dallman

Dec 7, 2022, 11:53:15 AM
In article <b6260f7b-fbeb-4549...@googlegroups.com>,
russell...@gmail.com (Russell Wallace) wrote:

> For example, I note that ARM-1 is not a purist RISC, but has a
> number of features designed to get more work done in each
> instruction. But one thing it does not have is compare-and-branch
> in one instruction; you have to compare separately, then branch on
> condition code.

I think the design intent was that you didn't branch unless absolutely
necessary, but used the conditional (predicated) instructions as much as
possible. Those have proved hard for compilers to use efficiently, but
are quite easy for humans writing assembler code. Acorn Computers
primarily designed ARM for use in their own machines, and wrote their
early operating systems in assembly language.
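A tiny C illustration of the kind of short if/else those predicated instructions
absorb (the ARM mnemonics in the comment are the obvious textbook ones, not taken
from Acorn's code):

/* max(a, b) with no taken branch: roughly CMP a,b ; MOVLT r,b on ARM,
   i.e. the "else" arm becomes a single predicated move. */
int max_int(int a, int b) {
    int r = a;
    if (a < b)
        r = b;
    return r;
}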

John

MitchAlsup

Dec 7, 2022, 1:06:55 PM
On Wednesday, December 7, 2022 at 9:26:11 AM UTC-6, Russell Wallace wrote:
> I've been trying to better understand the reasoning behind some of the design decisions that went into historical computer architectures. For example, I note that ARM-1 is not a purist RISC, but has a number of features designed to get more work done in each instruction. But one thing it does not have is compare-and-branch in one instruction; you have to compare separately, then branch on condition code. Yet compare-and-branch was not entirely unprecedented at the time. There doesn't seem to be any clear evidence that it would have added an extra cycle of delay:
> https://retrocomputing.stackexchange.com/questions/25785/would-compare-and-branch-have-added-an-extra-cycle-on-arm-1
>
> A possibility that occurs to me, though: maybe you can't use the ALU to compare operands and also add a displacement to the instruction pointer, in the same cycle. Maybe if you want compare-and-branch, you have to choose between an extra cycle of delay, vs the simpler model of branch instruction that provides bits that simply replace the low bits of the instruction pointer. Which is potentially awkward if you were hoping for position independent code.
<
Even early RISC machines had a separate adder for branching and for integer calculations.
These are in different stages of the pipeline and have to be that way.
>
> Is that actually the case, assuming you are tightly constrained in die area (either historically on a micron-scale process, or today on a small embedded core)? Or am I missing the mark?
<
At about the 1.5 micron level, you get to choose to put the TLB on die or the FPU on die.
At 1 micron, you can put both on die.
At 0.7 micron a significant portion of cache comes on die.
At 0.5 micron, you can put reservation stations and make it GBOoO.

Stephen Fuld

Dec 7, 2022, 1:26:02 PM
On 12/7/2022 7:26 AM, Russell Wallace wrote:
> I've been trying to better understand the reasoning behind some of the design decisions that went into historical computer architectures. For example, I note that ARM-1 is not a purist RISC, but has a number of features designed to get more work done in each instruction. But one thing it does not have is compare-and-branch in one instruction; you have to compare separately, then branch on condition code. Yet compare-and-branch was not entirely unprecedented at the time. There doesn't seem to be any clear evidence that it would have added an extra cycle of delay:
> https://retrocomputing.stackexchange.com/questions/25785/would-compare-and-branch-have-added-an-extra-cycle-on-arm-1

Unless you are talking about a compare to some fixed value, expressed in
the op code (e.g. a compare to zero and branch instruction) it would
have required an extra field in the instruction, i.e. two values to
compare and a third to specify the branch target address. Given the
fixed 32 bit instructions, this would have been "a challenge". :-)


--
- Stephen Fuld
(e-mail address disguised to prevent spam)

BGB

Dec 7, 2022, 2:00:13 PM
Doing a general compare + branch has a risk of additional latency cost,
because now one has the "initiate a branch" logic depending directly on
the ALU compare logic.


In my case, I went with a single status bit (a True/False bit) that is
used (mostly) to predicate stuff.

So, say:
OP ... //Always Execute
OP?T ... //Execute Op if T is Set
OP?F ... //Execute Op if F is Clear

The BT/BF instructions (Branch if True or False), are currently encoded
as if they were:
BRA?T
BRA?F

There were some older dedicated BT/BF encodings, but they are
deprecated/dropped (in the 32-bit forms), and the space may later be
reclaimed for more 3R ops or similar.

The use of a single status bit also allows fitting:
OP ... //Always, Normal
OP ... || //Always, Bundle
OP?T ... //If True
OP?F ... //If False

Into roughly 2 bits of encoding entropy.


There were a few special ops:
LDIx Imm24, R0
That only exist in scalar form, so:
LDIx Imm24, R0 //Scalar Form
Jumbo Ext24 //Jumbo Prefixes
OP?T ... || //If True + Bundle (ISA subset)
OP?F ... || //If False + Bundle (ISA subset)

This stays within the two bits of encoding entropy, though internally
3-bits are used in the pipeline.



If I were not using any status bits, such as for a more RISC-V like ISA
design, I would likely go instead with hard-wired compare-with-zero ops.

In this case, it is nearly as capable, but the checks are significantly
cheaper to perform (and one can use the extra bits from not having a
second register, to maybe have a slightly bigger displacement).

Say:
BZ Rn, Disp17 //Branch if Rn==0
BNZ Rn, Disp17 //Branch if Rn!=0
BLT Rn, Disp17 //Branch if Rn< 0 (Bit 63 is Set)
BGE Rn, Disp17 //Branch if Rn>=0 (Bit 63 is Clear)

In this case, one can have tristate outputs on the compare ops, say:
-1: A< B
0: A==B
1: A> B
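(Sketch of that three-way compare in C, assuming 64-bit signed operands:)

#include <stdint.h>

/* Three-way compare: -1, 0, or +1, as in the table above. */
int cmp3(int64_t a, int64_t b) {
    if (a < b)  return -1;
    if (a == b) return  0;
    return 1;
}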

I would also get rid of the user-specified Link Register, as:
This is rarely useful;
It adds a non-trivial cost to the "branch-with-link" mechanism.

While a second link register "can" be useful for prolog compression
and/or "millicode" functions, the relative savings (vs, say, manually
copying the link register to a different register and/or saving it to
the stack) are small enough to be mostly ignored.

An intermediate option would be to instead allow encodings for two
hard-wired link registers.

Say, for a 3b Branch space:
000 B Disp22
001 -
010 BL Disp22 //Branch with Link, Save Primary LR
011 BL2 Disp22 //Branch with Link, Save Alternate LR
100 BZ Rn, Disp17
101 BNZ Rn, Disp17
110 BLT Rn, Disp17
111 BGE Rn, Disp17

This could be fit into the same amount of encoding space as the original
RISC-V JAL instruction (if using a similar encoding scheme to RISC-V).


I would not consider this design a win over my current ISA though, as
while it could fit more easily into a slightly smaller core, to get
similar performance would require more advanced hardware-level support
(such as superscalar and special logic to detect and handle short
forward branches as predication).

Similarly, there isn't a whole lot of point in trying to face off
against RISC-V with an ISA that is "nearly the same, just with slightly
cheaper branches and similar".

Though, beyond the base ISA (into the "extensions") space, would likely
do some more significant redesign (mostly to try to keep things more
consistent and for more aggressive cost minimization).


...


BGB

Dec 7, 2022, 2:10:33 PM
RISC-V compares an arbitrary pair of registers.

IMHO, this is not worth the cost (given compare-with-zero would have
been "nearly as effective", allow larger branch displacements, and would
have also been cheaper...).



For something more like ARM, one would need a fairly small branch
displacement, and it would likely do little to help over the use of
condition-codes.

Burning 4 bits on the condition code is a little steep, but they
didn't need to deal with variable-length instructions (so, they gain
back a few bits by not needing to be able to encode 16-bit operations or
similar).

MitchAlsup

Dec 7, 2022, 3:03:48 PM
On Wednesday, December 7, 2022 at 1:10:33 PM UTC-6, BGB wrote:
> On 12/7/2022 12:25 PM, Stephen Fuld wrote:

> > Unless you are talking about a compare to some fixed value, expressed in
> > the op code (e.g. a compare to zero and branch instruction) it would
> > have required an extra field in the instruction, i.e. two values to
> > compare and a third to specify the branch target address. Given the
> > fixed 32 bit instructions, this would have been "a challenge". :-)
> >
> RISC-V compares an arbitrary pair of registers.
<
Yes, It does.
>
> IMHO, this is not worth the cost (given compare-with-zero would have
> been "nearly as effective", allow larger branch displacements, and would
> have also been cheaper...).
>
My 66000 does not have a Compare-Branch; it instead uses a compare-to-zero-branch
and a Comparison instruction followed by a branch-on-bit. I also have predication
on bit and on condition. There are some codes where RISC-V wins;
there are cases where we tie, and there are cases where My 66000 wins.
Overall in integer codes, it looks to be break even--no winner either way--it
is just different.
<
Since RISC-V does not allow constants in its compare-branch instruction,
all comparisons against a constant that is not zero cost an extra instruction.
In FP comparisons, RISC-V has to use a SET-T/F followed by a branch,
and in many cases that SET is a comparison with a constant, which also
adds to the instruction stream. Most of the FP compare codes are better
in My 66000 than in RISC-V--but there are outliers on both sides.
<
Many times My 66000 can "keep" a comparison as a persistent value
(in a register) and use that value again:: Say we compared x and y and
put the result of this comparison in R7. As long as R7 is not damaged,
we can extract any of the relations between x and y from the bit pattern
stored in R7 {==, !=, s>, s>=, s<, s<=, u>, u>=, u<, u<=, 0<x<y, 0<=x<y, 0<x<=y,
0<=x<=y} and a few more esoteric relations. Thus, My 66000 has as many
"condition codes" as the compiler has available registers. Any of the relations
can be used to start a predicated string, qualify a branch, or be extracted
into {0,+1} <unsigned> or {-1,0} <signed> -- without having SET instructions
in the instruction set, nor be limited with SET instructions in the instruction
set.
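A rough C model of the "comparison kept as a persistent bit vector of relations"
idea; the bit assignments here are invented for the sketch and are not the actual
My 66000 encoding:

#include <stdbool.h>
#include <stdint.h>

/* One compare sets a bit per relation; later code just tests a bit
   (the branch-on-bit case), as long as the register is not clobbered. */
enum { R_EQ = 1u<<0, R_NE = 1u<<1, R_SLT = 1u<<2, R_SGE = 1u<<3,
       R_ULT = 1u<<4, R_UGE = 1u<<5 };

uint32_t cmp_relations(int64_t x, int64_t y) {
    uint32_t r = 0;
    r |= (x == y) ? R_EQ  : R_NE;
    r |= (x <  y) ? R_SLT : R_SGE;
    r |= ((uint64_t)x < (uint64_t)y) ? R_ULT : R_UGE;
    return r;
}

bool rel(uint32_t r, uint32_t bit) { return (r & bit) != 0; }  /* "branch on bit" */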
>
>
> For something more like ARM, one would need a fairly small branch
> displacement, and it would likely do little to help over the use of
> condition-codes.
>
> Though, burning 4-bits on the condition code is a little steep, but they
> didn't need to deal with variable-length instructions (so, they gain
> back a few bits by not needing to be able to encode 16-bit operations or
> similar).
<
The only proper number of bits in the condition codes is 0.

MitchAlsup

Dec 7, 2022, 3:10:41 PM
I put this into 1-bit of entropy, and did not charge the consuming instruction
for that bit. That is, normal instructions do not waste a bit encoding a possibility
that occurs less than 20% of the time; instead, I move a group of these bits
over into a PRED instruction, which casts an execute/don't-execute shadow across a
number of instructions.
>
>
> There were a few special ops:
> LDIx Imm24, R0
> That only exist in scalar form, so:
> LDIx Imm24, R0 //Scalar Form
> Jumbo Ext24 //Jumbo Prefixes
> OP?T ... || //If True + Bundle (ISA subset)
> OP?F ... || //If False + Bundle (ISA subset)
>
> This stays within the two bits of encoding entropy, though internally
> 3-bits are used in the pipeline.
>
1-bit per instruction in the pipeline.
>
>
> If I were not using any status bits, such as for a more RISC-V like ISA
> design, I would likely go instead with hard-wired compare-with-zero ops.
>
> In this case, it is nearly as capable, but the checks are significantly
> cheaper to perform (and one can use the extra bits from not having a
> second register, to maybe have a slightly bigger displacement).
>
> Say:
> BZ Rn, Disp17 //Branch if Rn==0
> BNZ Rn, Disp17 //Branch if Rn!=0
> BLT Rn, Disp17 //Branch if Rn< 0 (Bit 63 is Set)
> BGE Rn, Disp17 //Branch if Rn>=0 (Bit 63 is Clear)
>
> In this case, one can have tristate outputs on the compare ops, say:
> -1: A< B
> 0: A==B
> 1: A> B
>
> I would also get rid of the user-specified Link Register, as:
> This is rarely useful;
> It adds a non-trivial cost to the "branch-with-link" mechanism.
>
> While a second link register "can" be useful for prolog compression
> and/or "millicode" functions, the relative savings (vs, say, manually
> copying the link register to a different register and/or saving it to
> the stack) are small enough to be mostly ignored.
<
You "could" add ENTER and EXIT instructions to the ISA and dispense
with the fake code density, along with the alterations of the control
flow (power and cycles).
>

robf...@gmail.com

Dec 7, 2022, 3:26:20 PM
I think deciding whether to have compare-and-branch as one instruction may have to
do with the size of the branch displacement possible in an instruction format.
It takes more bits to represent the instruction if there are two register fields in
it, and that may limit the displacement size. If instructions are 16 bits, specifying
two registers in the instruction is probably not practical due to the small
displacement.

It is often not necessary to do the compare: in some processors the condition
codes are set by many instructions, and a branch can take place without needing
an explicit compare. In that case the instructions are just as compact as doing a
compare-and-branch in one instruction.

RISC-V has only a 12-bit displacement field, which is adequate most of the time,
but no doubt on rare occasions extra code needs to be inserted to branch further
away.

Another consideration is that compare instructions are needed in the instruction
set anyway, so it may be less hardware to use separate compare and branch
instructions. Compares are needed for expressions like X = a < b, which are
sometimes recorded instead of branched on.

My latest core stores the results in condition code registers like the PowerPC,
rather than GPRs. It is just a bit different. I then manage to get a 21-bit branch
displacement. For branch-to-subroutine, 24-bit displacements are possible.

BGB

Dec 7, 2022, 5:18:41 PM
If I were to do stuff differently here, it would probably be to spend
less of the encoding space on 16-bit ops.

I might choose to stick with 32 GPRs, since the number of cases that
benefit from 64 GPRs is fairly modest.


If 64 GPRs is supported, it likely makes sense to treat these as a
special case encoding (ISA subset or 64-bit ops), since otherwise 6-bit
register fields would still be too much of an issue for encoding space
(and only sometimes bring much benefit).

Though kinda ugly, BJX2 sort of manages 64 GPRs "semi-effectively" with
5-bit register fields. But, not sure I would do it again...



Say, possible re-imagined encoding space:
zzzz-zzzz-zzzz-zzz0 //16 bit
zzzz-zzzz-zzzz-zzzz zzzz-zzzz-zzzz-zpp1 //32+ bit

Then, say, pp:
00: Always
01: WEX
10: ?T
11: ?F

At this point, assuming I stick with 5b register fields and modest
Imm/Disp fields, I would be ahead of where I started out with for BJX2
(which burns 3-bits on the 16/32 split).


Possible Schemes, 16b:
zzzz-mmmm-nnnn-zzz0 //16b 2R (16 regs)
zzzz-iiii-nnnn-zzz0 //16b 2RI (Imm4/Disp4)
zzzm-mmmm-nnnn-nzz0 //16b 2R (32 regs)
zzzi-iiii-nnnn-nzz0 //16b 2RI (Imm5/Disp5, 32 regs)
zzzi-iiii-iiii-izz0 //16b Imm10

Would likely need to provide:
SP-relative Load/Store (R0..R31)
Constant Load (R0..R31)
MOV 2R (R0..R31)
CMPxx-2R and 2RI (R0..R31)
A few basic ALU ops (ADD/SUB);
Short-form Branch Instructions;
...


Could be more minimal than in BJX2, since there is little need for
16-bit ops to actually be able to run the system (they exist primarily
so that one can save space).

3R or 3RI cases can be skipped as these don't really have enough of a
hit-rate to justify the level of hair they add.


For 32-bit ops, might choose to remain with primarily 9-bit
immediate and displacement fields, since with scaling, these are
reasonably effective.

Possible 32-bit layout:
zzzt-tttt-zzzz-ssss szzn-nnnn-zzzz-zpp1 //32b, 3R
iiii-iiii-izzz-ssss szzn-nnnn-zzzz-zpp1 //32b, 3RI Imm9
iiii-iiii-iiii-iiii izzn-nnnn-zzzz-zpp1 //32b, 2RI Imm17
iiii-iiii-iiii-iiii iiii-zzzz-zzzz-zpp1 //32b, Imm20 / Branch
iiii-iiii-iiii-iiii iiii-iiii-zzzz-zpp1 //32b, Imm24 (Jumbo)

In this case, the Branch-Op would not be allowed as WEX, and this would
encode the Jumbo block.

Possible:
zzzt-tttt-zzzz-ssss szzn-nnnn-zzz0-0pp1 //32b, 3R
iiii-iiii-izzz-ssss szzn-nnnn-zzz0-1pp1 //32b, 3RI Imm9

iiii-iiii-iiii-iiii izzn-nnnn-z011-1pp1 //32b, 2RI Imm17

Imm16/Imm17 Block:
iiii-iiii-iiii-iiii 000n-nnnn-0011-1pp1 MOV Imm16u, Rn
iiii-iiii-iiii-iiii 100n-nnnn-0011-1pp1 MOV Imm16n, Rn
iiii-iiii-iiii-iiii 001n-nnnn-0011-1pp1 ADD Imm16u, Rn
iiii-iiii-iiii-iiii 101n-nnnn-0011-1pp1 ADD Imm16n, Rn
iiii-iiii-iiii-iiii 010n-nnnn-0011-1pp1 LDSH Imm16u, Rn
iiii-iiii-iiii-iiii 110n-nnnn-0011-1pp1 FLDCH Imm16u, Rn
iiii-iiii-iiii-iiii 011n-nnnn-0011-1pp1 -
iiii-iiii-iiii-iiii 111n-nnnn-0011-1pp1 -

iiii-iiii-iiii-iiii 0ttn-nnnn-1011-1pp1 ? MOV.x Rn, (GBR, Disp16u)
iiii-iiii-iiii-iiii 1ttn-nnnn-1011-1pp1 ? MOV.x (GBR, Disp16u), Rn
00=L, 01=Q, 10=UL, 11=X
Scaled by 4 or 8 (First 256K or 512K of data/bss).

Branch Block:
iiii-iiii-iiii-iiii iiii-ii00-0111-1001 BRA Disp22
iiii-iiii-iiii-iiii iiii-ii01-0111-1001 MOV Imm22u, Rtmp
iiii-iiii-iiii-iiii iiii-ii10-0111-1001 BSR Disp22
iiii-iiii-iiii-iiii iiii-ii11-0111-1001 MOV Imm22n, Rtmp
iiii-iiii-iiii-iiii izzn-nnnn-1111-1001 Bcc Disp17

iiii-iiii-iiii-iiii iiii-iiii-0111-1011 Jumbo-Imm
iiii-iiii-iiii-iiii iiii-iiii-1111-1011 Jumbo-Op
iiii-iiii-iiii-iiii iiii-ii00-0111-1101 BT Disp22
iiii-iiii-iiii-iiii iiii-ii00-0111-1111 BF Disp22

Where, Rtmp is a fixed temporary register (likely the return-value
register).

64-bit:
iiii-iiii-iiii-iiii iiii-iiii-0111-1011 -
iiii-iiii-iiii-iiii iiii-iiii-0111-1001 BRA Abs48

96-bit:
iiii-iiii-iiii-iiii iiii-iiii-0111-1011 -
iiii-iiii-iiii-iiii iiii-iiii-0111-1011 -
iiii-iiii-iiii-iiii 000n-nnnn-0011-1pp1 MOV Imm64, Rn

iiii-iiii-iiii-iiii iiii-iiii-0111-1011 -
iiii-iiii-iiii-iiii iiii-iiii-0111-1011 -
iiii-iiii-iiii-iiii 001n-nnnn-0011-1pp1 ADD Imm64, Rn

...


Might still need some more thinking here...


>>
>>
>> There were a few special ops:
>> LDIx Imm24, R0
>> That only exist in scalar form, so:
>> LDIx Imm24, R0 //Scalar Form
>> Jumbo Ext24 //Jumbo Prefixes
>> OP?T ... || //If True + Bundle (ISA subset)
>> OP?F ... || //If False + Bundle (ISA subset)
>>
>> This stays within the two bits of encoding entropy, though internally
>> 3-bits are used in the pipeline.
>>
> 1-bit per instruction in the pipeline.

Possibly, but my scheme allows executing both the Then and Else branches
in parallel in many cases, and does not need an explicit "PRED" op, ...

Also doesn't lead to extra state that would still need to be saved
somehow if an interrupt occurs.


>>
>>
>> If I were not using any status bits, such as for a more RISC-V like ISA
>> design, I would likely go instead with hard-wired compare-with-zero ops.
>>
>> In this case, it is nearly as capable, but the checks are significantly
>> cheaper to perform (and one can use the extra bits from not having a
>> second register, to maybe have a slightly bigger displacement).
>>
>> Say:
>> BZ Rn, Disp17 //Branch if Rn==0
>> BNZ Rn, Disp17 //Branch if Rn!=0
>> BLT Rn, Disp17 //Branch if Rn< 0 (Bit 63 is Set)
>> BGE Rn, Disp17 //Branch if Rn>=0 (Bit 63 is Clear)
>>
>> In this case, one can have tristate outputs on the compare ops, say:
>> -1: A< B
>> 0: A==B
>> 1: A> B
>>
>> I would also get rid of the user-specified Link Register, as:
>> This is rarely useful;
>> It adds a non-trivial cost to the "branch-with-link" mechanism.
>>
>> While a second link register "can" be useful for prolog compression
>> and/or "millicode" functions, the relative savings (vs, say, manually
>> copying the link register to a different register and/or saving it to
>> the stack) are small enough to be mostly ignored.
> <
> You "could" add ENTER and EXIT instructions to the ISA and dispense
> with the fake code density, along with the alterations of the control
> flow (power and cycles).


These would require some sort of state machine, and/or a mechanism to
break apart ops into multiple u-ops in the pipeline. Similar issues go
for ARM style LDM/STM.

Something like millicode allows the (cheap) mechanism of just using an
unrolled sequence of normal Load/Store instructions (may be kept inline
in the function in cases where N is small).


BGB

Dec 7, 2022, 6:00:15 PM
On 12/7/2022 2:26 PM, robf...@gmail.com wrote:
> I think deciding between compare-and-branch as one instruction may have to
> do with the size of the branch displacement possible in an instruction format.
> It takes more bits to represent the instruction if there are two register fields in
> it and that may limit the displacement size. If instructions are 16-bits specifying
> two registers in the instruction probably is not practical due to the small
> displacement.
>
> It is often not necessary to do the compare, in some processors the condition
> codes are set by many instructions and a branch can take place without needing
> an explicit compare. In that case the instructions are just as compact as doing a
> compare-and-branch in one instruction.
>

As noted, I still like the 1-bit True/False scheme. It has some of the
advantages of CCs, mostly without the same sorts of drawbacks.

Failing this, fixed compare-with-zero is the second-place option (nearly
as powerful as the full-compare case, but cheaper and with fewer issues).


> RISCV has only a 12-bit displacement field which is adequate most of the time,
> but no doubt on rare occasions extra code needs to be inserted to branch further
> away.
>

Such is the issue.

The 12-bit displacement falls into the category of "usually but not always sufficient".

For larger functions, the compiler would need to do something like:
Bcc Rs, Rt, .L0
JAL X0, lbl
.L0:

With 17 bits, this would be far less likely.

With 20 or 22 bits, it would drop essentially to zero (no functions are
this large).


OTOH, with 8 or 9 bits, it is more a case of "well, it is often sufficient"
(but much of the time, it is not).


> Another consideration is that compare instructions are needed in the instruction
> set anyway, it may be less hardware then to use separate compare and branch
> instructions. Compares are needed for expressions like: X = a < b. which are
> sometimes recorded instead of branched on.
>

Yeah.

RISC-V also has a "SLT" instruction for these cases:
SLT Rn, Rs, Rt
Does effectively:
Rn = Rs < Rt;
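The two uses of a comparison being contrasted in this subthread, in C terms:
branched on directly, or materialized into a register (which is what SLT, or the
MOVT mentioned below, does). The function itself is just an illustration:

#include <stdint.h>

int64_t clamp_low(int64_t x, int64_t lo) {
    int below = (x < lo);   /* materialized, SLT-style: result sits in a register */
    if (below)              /* ...or the same comparison drives a branch          */
        return lo;
    return x;
}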


In my case, there are "MOVT Rn" / "MOVNT Rn", which copy the T bit into
a register (as-is, or inverted).



> My latest core stores the results in condition code registers like the PowerPC
> rather than gprs. It is just a bit different. I manage then to get a 21-bit branch
> displacement. For branch-to-subroutine 24-bit displacements are possible.
>


In a few cases, I considered options for using a small set of P-bit
registers, say:
000= P0 (P0=1)
001=!P0 (P0=1)
010= P1 (T)
011=!P1 (T)
100= P2 (S)
101=!P2 (S)
110= P3 (U)
111=!P3 (U)

Where P0 is hard-wired as 1, and !P0 is a special case (such as WEX).

In this case, things like compare ops would be directed into one of the
P-bit registers.

But, the need for more than a single predicate bit is obscure enough to
make it not really worthwhile to spend an extra encoding bit on it
(needing to deal with multiple predicated branches or similar at the
same time being "fairly uncommon").

Had also experimented with a mechanism to treat P-bits as a small stack
machine, but there wasn't any good way to make use of this in my
compiler (enough to offset the cost of the mechanism to do so).

This could have allowed effectively handling modest-size if/else trees
using predication, but is "much less straightforward" for a compiler
(and both less versatile and harder to work with in the backend than
having a few free-form predicate-bit registers).

...

Russell Wallace

Dec 8, 2022, 2:02:06 AM
On Wednesday, December 7, 2022 at 6:26:02 PM UTC, Stephen Fuld wrote:
> Unless you are talking about a compare to some fixed value, expressed in
> the op code (e.g. a compare to zero and branch instruction) it would
> have required an extra field in the instruction, i.e. two values to
> compare and a third to specify the branch target address. Given the
> fixed 32 bit instructions, this would have been "a challenge". :-)

10 bits operands, 1 bit direction, 1 bit yes/no, 2 bits for kind of comparison (==, signed/uns <, and a <= b ⇔ !(b < a)), 2 bits for opcode, leaves 16 bits of displacement, which should be more than plenty. I think 2 bits for opcode is justified given the importance of this instruction, but could bump it all the way to something like 6 bits, and still get a reasonably ample 12 bits of displacement.
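(A quick check of that bit budget; the field order is arbitrary and the layout is
only the hypothetical one described above:)

/* Hypothetical 32-bit compare-and-branch word. */
enum {
    BITS_OPCODE = 2, BITS_KIND = 2, BITS_DIR = 1, BITS_SENSE = 1,
    BITS_REGS   = 10,                    /* two 5-bit register fields */
    BITS_DISP   = 32 - (BITS_OPCODE + BITS_KIND + BITS_DIR +
                        BITS_SENSE + BITS_REGS)       /* = 16 */
};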

MitchAlsup

Dec 8, 2022, 2:52:44 PM
On Wednesday, December 7, 2022 at 2:26:20 PM UTC-6, robf...@gmail.com wrote:
> I think deciding between compare-and-branch as one instruction may have to
> do with the size of the branch displacement possible in an instruction format.
> It takes more bits to represent the instruction if there are two register fields in
> it and that may limit the displacement size. If instructions are 16-bits specifying
> two registers in the instruction probably is not practical due to the small
> displacement.
<
With the Compare instruction and then branch model, there is 1 branch instruction
with a myriad of sub instructions {Br ==, Br !=, Br s>, Br s>=, Br s<, Br s<=, Br u<,
Br u<=, Br u>, Br u>=, and more} all as 1 single Major OpCode (front 6 bits). I
use the bits where the destination register would be to encode the condition.
<
In contrast, the Compare-Branch model only has a few compares-and-branches
due to not wanting to waste too much OpCode space of the compare-branch.
<
My compare then branch model uses a bit vector condition, so branch on bit
falls out for free. Over on the compare-to-zero-and-branch side, I have a complete
set of compare-to-zero-branch instructions, some with the ability to test for
situations outside of normal programmatics--for example: test if an ATOMIC
sequence has suffered from interference.
>
> It is often not necessary to do the compare, in some processors the condition
> codes are set by many instructions and a branch can take place without needing
> an explicit compare. In that case the instructions are just as compact as doing a
> compare-and-branch in one instruction.
<
The SPARC experience indicates that 80% of this free work is non-useful.
>
> RISCV has only a 12-bit displacement field which is adequate most of the time,
> but no doubt on rare occasions extra code needs to be inserted to branch further
> away.
<
My 66000 has 16 bits of word displacement in the OpCode, which is inadequate
a whole lot less often than RISC-V's. You would need a subroutine with more than 32767
instructions.
>
> Another consideration is that compare instructions are needed in the instruction
> set anyway, it may be less hardware then to use separate compare and branch
> instructions. Compares are needed for expressions like: X = a < b. which are
> sometimes recorded instead of branched on.
<
My 66000 compare instructions build a bit-vector of relations. Then my shift
instructions can extract the signed set {-1,0} or the unsigned set {0,+1}, so
different languages can access what is their definition of the right result.
>
> My latest core stores the results in condition code registers like the PowerPC
> rather than gprs. It is just a bit different. I manage then to get a 21-bit branch
> displacement. For branch-to-subroutine 24-bit displacements are possible.
<
How often is 21-bits sufficient (where 16-bits would have been insufficient) ??
Do you know of a subroutine that is bigger than 1,000,000 instructions ?
<
As for direct BR and CALL one has 26-bits

Scott Lurndal

Dec 8, 2022, 3:07:31 PM
For assembly code, as opposed to more constrained compiler
generated subroutines; particularly OS/VM level code, I can
see cases where the virtual address space is fragmented such that
21 bit offsets (or even 32-bit offsets) wouldn't be sufficient;
particularly in a 64-bit address space.

MitchAlsup

Dec 8, 2022, 3:46:15 PM
Fair enough::
<
But when the Branch/call space is insufficient, I have access to both 32-bit
and 64-bit (using my constant extension). So, if the ASM programmer
writes::
<
BR/CALL Some_GuestOS_Common_Entry_Point
<
The assembler leaves it in 64-bit form. Later, when the linker resolves the
linkages and finds this is 32-bit or 26-bit capable, it (the linker) can shrink the
module.
<
When ASLR randomizes the VASs, different images may find different entry
points using different sized "constants" than the previous image or later image.
<
I should also note that should the linker leave the linkage uniformly 64-bit,
this takes no more cycles to execute (98% confidence) than the smaller
forms--all you are saving is code footprint, not cycles. The 1-word BR address
takes the same cycle count as the 2-word BR and as the 3-word BR. So, the only
gain in compression of potentially large displacements is code footprint not
cycles.

BGB

Dec 8, 2022, 4:58:22 PM
On 12/8/2022 1:52 PM, MitchAlsup wrote:
> On Wednesday, December 7, 2022 at 2:26:20 PM UTC-6, robf...@gmail.com wrote:
>> I think deciding between compare-and-branch as one instruction may have to
>> do with the size of the branch displacement possible in an instruction format.
>> It takes more bits to represent the instruction if there are two register fields in
>> it and that may limit the displacement size. If instructions are 16-bits specifying
>> two registers in the instruction probably is not practical due to the small
>> displacement.
> <
> With the Compare instruction and then branch model, there is 1 branch instruction
> with a myriad of sub instructions {Br ==, Br !=, Br s>, BR s>=, Br s<, BR s<=, Br u<
> Br u<=, Br u>, BR u>=, and more} all as 1 single Major OpCode (front 6 bits). I
> use the bits where the destination register would be to encode the condition.
> <
> In contrast, the Compare-Branch model only has a few compares-and-branches
> due to not wanting to waste too much OpCode space of the compare-branch.
> <
> My compare then branch model uses a bit vector condition, so branch on bit
> falls out for free. Over on the compare-to-zero-and-branch side, I have a complete
> set of compare-to-zero-branch instructions, some with the ability to test for
> situations outside of normal programmatics--for example: test if an ATOMIC
> sequence has suffered from interference.


I ended up defining some RISC-V style compare-and-branch instructions.
Only real reason they are enabled is because the RISC-V mode needs them.

They end up taking up ~ 1/4 of the F1 block (mostly used for Load/Store
Disp9 ops).


IIRC, I did add compiler support for them, but they are still "pretty
meh": "if(a<b)" as a single op is not that big of a win when the target
has to be within +/- 256 bytes.

Ironically, "if(a<0)" tends to end up being more useful (these ended up
existing as separate ops in BJX2 mostly because of the lack of an
architectural zero register).


In the case of integer compare ops, the majority of cases tend to be
comparisons with a constant, which is on average better served, say, by
"CMPxx Imm10, Rn; BT/BF lbl"


>>
>> It is often not necessary to do the compare, in some processors the condition
>> codes are set by many instructions and a branch can take place without needing
>> an explicit compare. In that case the instructions are just as compact as doing a
>> compare-and-branch in one instruction.
> <
> The SPARC experience indicates that 80% of this free work is non-useful.

Most status flag updating is noise.

If one does status-flags like on x86, then far more often the status
flags will end up being stomped before one can do anything with the results.

Status flags are more useful, ironically, when only a very limited
selection of instructions modify them.


Similar for True/False bit predication:
being useful for Then/Else blocks implies that no instruction
inside the Then/Else block may update this bit.

>>
>> RISCV has only a 12-bit displacement field which is adequate most of the time,
>> but no doubt on rare occasions extra code needs to be inserted to branch further
>> away.
> <
> My 66000 has 16-bits of word displacement in the OpCode which is inadequate
> a whole lot less than RISC-V. You would need a subroutine with more than 32767
> instructions.

In my case, BT/BF are 20-bit, which is almost always sufficient.

But, yeah, a branch within +/- 32K words will hit a lot more often than
+/- 2K words (which will hit more often than +/- 256 words).


>>
>> Another consideration is that compare instructions are needed in the instruction
>> set anyway, it may be less hardware then to use separate compare and branch
>> instructions. Compares are needed for expressions like: X = a < b. which are
>> sometimes recorded instead of branched on.
> <
> My 66000 compare instructions build a bit-vector of relations. Then my shift
> instructions can extract the signed set {-1,0} or the unsigned set {0,+1}, so
> different languages can access what is their definition of the right result.

In my case, they come in "flavors":
EQ
GT, GE (Signed)
HI, HS (Unsigned)

BT/BF choice may be used to build the other possibilities.
Strictly speaking, one only needs EQ, GT, and HI; but this would leave
immediate-form instructions unable to directly express some cases (LT
and GE; and if these cases need a separate constant-load, this is lame).

So, one needs, say:
CMPEQ Reg, Reg
CMPEQ Imm, Reg
CMPGT Reg, Reg
CMPGT Imm, Reg
CMPGE Imm, Reg

CMPHI Reg, Reg
CMPHI Imm, Reg
CMPHS Imm, Reg
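Sketch of how the remaining relations fall out of the BT/BF choice, using C only
for the semantics (the pairings are my reading of the list above):

#include <stdbool.h>
#include <stdint.h>

/* "a <= b" via the GT compare with the branch sense flipped: !(a > b). */
bool le_signed(int64_t a, int64_t b)     { return !(a > b); }
/* unsigned "a < b" via the HS (unsigned >=) compare, sense flipped: !(a >= b). */
bool lt_unsigned(uint64_t a, uint64_t b) { return !(a >= b); }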


>>
>> My latest core stores the results in condition code registers like the PowerPC
>> rather than gprs. It is just a bit different. I manage then to get a 21-bit branch
>> displacement. For branch-to-subroutine 24-bit displacements are possible.
> <
> How often is 21-bits sufficient (where 16-bits would have been insufficient) ??
> Do you know of a subroutine that is bigger than 1,000,000 instructions ?
> <
> As for direct BR and CALL one has 26-bits

Thus far, none of my test programs have exceeded the existing 20-bit
limit (1MB), for the entire ".text" section.

When this happens, things like function calls will start needing to use
a bigger encoding (thus far mostly untested; and will not currently be
predicted by the branch predictor). As-is, there are both Disp33s and
Abs48 encodings as possible fallback cases.

For most sane functions, for internal branches, this is likely in
"probably not going to happen" territory.


Closest to this limit thus far is ROTT, which weighs in at a (relatively
massive) 754K of ".text" (for reference, the x86-64 builds are nearly 2MB).

Though, part of this seems to be that the people at Apogee
used excessive amounts of macros and copy/paste.


Something like a 22-bit branch might be better in the sense of "yeah,
4MB .text section is not implausible".

Though, 26 bits seems a little overkill.


Non-local calls and branches need not use relative branch instructions
though, so this case does not matter.




In other news, my experiment for Huffman coding + LZ with 4 interleaved
bitstreams was a moderate success (on my Ryzen 2700X, MSVC):
Decode speeds ~ 400 to 700 MB/sec with Huffman coding;
Decode speeds ~ 600 to 1200 MB/sec with fixed-length (byte) symbols;
Compression ratios also slightly better than Deflate.

GCC still seems to be doing its "approx 40% faster" thing, so with GCC,
in a few cases, I was getting Huffman-coded decode speeds of around 1.0 GB/sec.

General approach should also map over to BJX2.

The LZ format is a little wonky though on account of the use of parallel
buffers for encoding/decoding stuff. Other than the parallel buffer
wonkiness, the design is vaguely similar to that of LZ4 (just with tags,
literal, distance, and extra bytes, being spread into 4 parallel buffers).
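For concreteness, a toy C decode loop with the four streams split out; this is a
guess at a workable layout in that spirit (LZ4-like token nibbles, 16-bit
distances, one extra length byte), not the actual format being described:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Four parallel input streams: tags, literals, distances, extra length bytes.
   Assumes well-formed input whose output exactly fills dst (no bounds checks). */
typedef struct {
    const uint8_t *tags, *lits, *dists, *extra;
} lz_streams;

static size_t get_len(unsigned nib, const uint8_t **extra) {
    size_t n = nib;
    if (nib == 15)
        n += *(*extra)++;              /* a single extra byte is enough for the toy */
    return n;
}

size_t lz_decode(lz_streams *s, uint8_t *dst, size_t dst_len) {
    uint8_t *out = dst, *end = dst + dst_len;
    while (out < end) {
        unsigned tag  = *s->tags++;
        size_t   nlit = get_len(tag >> 4, &s->extra);
        size_t   nmat = get_len(tag & 15, &s->extra) + 4;

        memcpy(out, s->lits, nlit);              /* literals from their own stream */
        s->lits += nlit;
        out     += nlit;
        if (out >= end)
            break;                               /* final token carries no match   */

        size_t dist = (size_t)s->dists[0] | ((size_t)s->dists[1] << 8);
        s->dists += 2;
        for (size_t i = 0; i < nmat; i++, out++) /* byte copy keeps overlap correct */
            *out = out[-(ptrdiff_t)dist];
    }
    return (size_t)(out - dst);
}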


This is also with a 12-bit symbol length limit (can do 13 bit, which
helps compression at the cost of decoding speed).


Still technically only a moderate improvement over a codec based on a
normal (symbol at a time) decoding with a 12-bit symbol length limit
though (vs, say, the speed improvement of a 12-bit limit vs 15 bits).

But, at least, it is not being particularly bottle-necked by the entropy
coder (in the profiler), which is something (also spends a lot of time
in handling the LZ format, and in the match copying function).


Had also designed the format to be able to deal with deltas between
buffers (partly inspired by Sierra's video format). Though this part was
not being tested in this case.


So, would at least appear to be "kinda fast-ish", at least as far as
entropy-encoded designs go.

...

robf...@gmail.com

Dec 10, 2022, 3:51:40 AM
Found I can use the condition registers as complicated predicate registers
with the use of the PRED instruction modifier. I am liking the My66000 PRED
modifier idea. The original Thor core had 16 predicate registers which were
basically condition code registers that were coupled with a four-bit condition
specifier. Thus, eight bits were tacked onto the start of every instruction to
specify a predicate. I found most of the time the predicate was not used so a
byte of memory was wasted for many instructions. Eventually decided to
shelve the core. But now with My66000 style PREDs there is one extra
instruction for the PRED, 32 bits, but it can be applied for up to the following
seven instructions (in my core). And the bytes are used only when a predicate
is needed. So, now there are eight condition (predicate) registers instead of
16. But the number of predicate conditions has expanded to allow a variety of
floating-point conditions. So, a six-bit condition field is used. This is nine bits
total, but it only must be present in the PRED instruction which has lots of bits
available.
One complication is dealing with external interrupts that may happen. My first
thought is to simply not allow them in the shadow of a PRED instruction. It
would be delaying interrupt processing by seven or eight instructions max.
Otherwise, the PRED state would need to be saved somehow and restored after
the interrupt. Possibly by saving a copy of the interrupted instruction plus
predicate.
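A toy C model of the shadow, mostly to make the interrupt question concrete: the
(condition, remaining mask, remaining count) triple is exactly the state that
would have to be saved and restored. The encoding is invented for the sketch:

#include <stdbool.h>
#include <stdint.h>

/* PRED-style shadow: mask bit i = 1 means the i-th following instruction
   belongs to the then-clause, 0 means the else-clause; up to 8 or so covered. */
typedef struct { bool cond; uint8_t mask; int remaining; } pred_state;

/* One decode step: does the next instruction execute or get nullified? */
bool pred_step(pred_state *p) {
    bool run = (((p->mask & 1) != 0) == p->cond);
    p->mask >>= 1;
    p->remaining--;
    return run;
}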

My core uses condition registers to store a bit vector of comparison flags in a
manner analogous to the My66000 use of a GPR.
I am hoping to have more GPRs available for other use by offloading the compare
results to dedicated registers. The branch instruction then also need only specify
three bits for the condition register.

EricP

Dec 10, 2022, 10:41:08 AM
robf...@gmail.com wrote:
> One complication is dealing with external interrupts that may happen. My first
> thought is to simply not allow them in the shadow of a PRED instruction. It
> would be delaying interrupt processing by seven or eight instructions max.
> Otherwise, the PRED state would need to be saved somehow and restored after
> the interrupt. Possibly by saving a copy of the interrupted instruction plus
> predicate.

An alternative is a PRED prefix that only applies to a single instruction.

Of course it won't win awards on code size but decode sees the pairing
as a single instruction and the shadow complications do go away.
Internally it still has many of the complications of predicated uOps,
but the dependency chain is simplified because it doesn't interact
with the shadow mask and its invisible global state -
it is just an optional extra source operand on each uOp.



BGB

Dec 10, 2022, 1:55:20 PM
And, burning 2 or 3 bits in the encoding means no space (or cycles) are
wasted on a prefix.

Granted, one could argue that the predicate field and the existence of 16-bit
ops are what decide the practicality of 6-bit register fields.

Say, 6b:
zztt-tttt-zzzz-ssss ssnn-nnnn-zzzz-pp11 //10b for opcode
zztt-tttt-zzzz-ssss ssnn-nnnn-zzzz-zpp1 //11b for opcode
zztt-tttt-zzzz-ssss ssnn-nnnn-zzzz-zzzz //14b for opcode

Vs, 5b:
zzzt-tttt-zzzz-ssss szzn-nnnn-zzzz-pp11 //13b for opcode
zzzt-tttt-zzzz-ssss szzn-nnnn-zzzz-zpp1 //14b for opcode
zzzt-tttt-zzzz-ssss szzn-nnnn-zzzz-zzzz //17b for opcode



With my existing ISA (in the above scheme):
zzzz-znst-tttt-zzzz 111p-zzpz-nnnn-ssss //12b for opcode

Have noted, one doesn't really want to have much less opcode than this.

Had hacked on some 32b encodings with 6b fields, say:
zzzz-znst-tttt-zzzz 0111-wnst-nnnn-ssss //9b for opcode
zzzz-znsi-iiii-iiii 1001-wnsz-nnnn-ssss //6b for opcode

Which lack predication and only cover a subset of the ISA, and then
stuff gets more hacky...

There is the Op64 encoding, which supports both 6b register fields and
predication, but not bundling, and needs 64b to encode.

There is also 2x Op40, which supports both (in a bundle), as a 96-bit
encoding, but is a bit of a hack (in the decoder, it fakes it as if
there were a pair of op64 encodings; just with a lot of the bits
hard-wired to 0).



With 6b fields, a "break even" would be to give up entirely on 16-bit
ops but keep a 2b pp field, say:
zztt-tttt-zzzz-ssss ssnn-nnnn-zzzz-zzpp //12b for opcode
Or, the RISC-V style notation of breaking up by field:
zz-tttttt-zzzz-ssssss-nnnnnn-zzzzzz-pp //12b for opcode

Code density could suffer slightly, but it could allow both uniform
6-bit register access and per-instruction predication (but not multiple
predicate bits).

Another option being to just start with 11b and try a little harder to
optimize the use of the encoding space (will still need to keep Jumbo
encodings or similar as an "escape hatch", or for less-common instructions).

Though, 16b encodings won't go very far if trying for 6b fields:
zznn-nnnn-ssss-ssz0 //3b opcode... yeah, won't go far.

Almost makes more sense to not bother if going directly for 64 GPRs, and
instead skip out on 16b ops (gaining an extra opcode bit).


Well, and/or treat the low 3 bits as a combined field:
000: 32b
001: 32b
010: 16b (Opt, 1)
011: 32b
10z: 32b
11z: 32b

1: So, there would be half the encoding space which would effectively be
disallowed from being used outside of Lane 1. This encoding space could
be left to 16b ops, which will necessarily need to operate on a subset
of the register space.

...


MitchAlsup

Dec 10, 2022, 2:57:48 PM
signed {=, !=, >, >=, <, <=} unsigned {>, >=, <, <=} float {=, !=, >, >=, <, <=, OR, UN} and
double {=, !=, >, >=, <, <=, OR, UN}
<
I count 26 which fits in 5-bits.
<
> One complication is dealing with external interrupts that may happen. My first
> thought is to simply not allow them in the shadow of a PRED instruction. It
> would be delaying interrupt processing by seven or eight instructions max.
> Otherwise, the PRED state would need to be saved somehow and restored after
> the interrupt. Possibly by saving a copy of the interrupted instruction plus
> predicate.
>
> My core uses condition registers to store a bit vector of comparison flags in a
> manner analogous to the My66000 use of a GPR.
<
Another convert........

John Levine

Dec 10, 2022, 4:37:58 PM
According to EricP <ThatWould...@thevillage.com>:
>robf...@gmail.com wrote:
>> One complication is dealing with external interrupts that may happen. My first
>> thought is to simply not allow them in the shadow of a PRED instruction. It
>> would be delaying interrupt processing by seven or eight instructions max.
>> Otherwise, the PRED state would need to be saved somehow and restored after
>> the interrupt. Possibly by saving a copy of the interrupted instruction plus
>> predicate.
>
>An alternative is a PRED prefix that only applies to a single instruction.

Fifty years ago we called those skip instructions.

On a PDP-6, to put the larger of A and B into accumulator AC:

MOVE AC,A ; put A into AC
CAMGE AC,B ; skip if AC >= B
MOVE AC,B ; put B into AC

--
Regards,
John Levine, jo...@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

robf...@gmail.com

Dec 10, 2022, 5:07:25 PM
>signed {=, !=, >, >=, <, <=} unsigned {>, >=, <, <=} float {=, !=, >, >=, <, <=, OR, UN} and
>double {=, !=, >, >=, <, <=, OR, UN}
><
>I count 26 which fits in 5-bits.

I followed the 68k FPU's float tests of which there are 32 just for one precision. They
went a bit crazy testing for Nan's in combination with other relations. Then there are
the 16 integer conditions based on the flags nf,vf,cf,zf. That leave 16 conditions to be
defined.
I was thinking of including magnitude greater than and less than too, which would take
the abs() of the numbers first. I have a low bar for including ops in the instruction set,
as a lot can be supported. Somewhere around 0.1% used.

Just added triple precision decimal-float to my 68k core. Primarily using the 68k core
to test the decimal-float instructions. They use separate compare and branch
instructions. Found an error in recent docs for the 68881/68882 FPU.



MitchAlsup

Dec 10, 2022, 7:09:40 PM
On Saturday, December 10, 2022 at 3:37:58 PM UTC-6, John Levine wrote:
> According to EricP <ThatWould...@thevillage.com>:
> >robf...@gmail.com wrote:
> >> One complication is dealing with external interrupts that may happen. My first
> >> thought is to simply not allow them in the shadow of a PRED instruction. It
> >> would be delaying interrupt processing by seven or eight instructions max.
> >> Otherwise, the PRED state would need to be saved somehow and restored after
> >> the interrupt. Possibly by saving a copy of the interrupted instruction plus
> >> predicate.
> >
> >An alternative is a PRED prefix that only applies to a single instruction.
> Fifty years ago we called those skip instructions.
<
Which comes back to the important point about predication ::
<
You do not want to predicate any farther than is already covered by your FETCH
momentum. Predication is entirely present to avoid disrupting the current FETCH
stream--and no farther. If you have to (or want to) disrupt the FETCH point, then
branch (and friends) is the proper tool.
>
> On a PDP-6, to put the larger of A and B into accumulator AC:
>
> MOVE AC,A ; put A into AC
> CAMGE AC,B ; skip if AC >= B
> MOVE AC,B ; put B into AC
>
There were a lot of cool things about the PDP-6 and the PDP-10 follow ons.

EricP

Dec 11, 2022, 11:25:47 AM
John Levine wrote:
> According to EricP <ThatWould...@thevillage.com>:
>> robf...@gmail.com wrote:
>>> One complication is dealing with external interrupts that may happen. My first
>>> thought is to simply not allow them in the shadow of a PRED instruction. It
>>> would be delaying interrupt processing by seven or eight instructions max.
>>> Otherwise, the PRED state would need to be saved somehow and restored after
>>> the interrupt. Possibly by saving a copy of the interrupted instruction plus
>>> predicate.
>> An alternative is a PRED prefix that only applies to a single instruction.
>
> Fifty years ago we called those skip instructions.
>
> On a PDP-6, to put the larger of A and B into accumulator AC:
>
> MOVE AC,A ; put A into AC
> CAMGE AC,B ; skip if AC >= B
> MOVE AC,B ; put B into AC
>

Not really - skip is a very short branch, forward only.

Skip/Branch removes intervening instructions from the pipeline
and no resources are assigned for them.
(Consider how skip would behave in an in-order pipeline without branch
prediction - it would stall at fetch until its conditional resolved.)

Predication keeps the pipeline running, feeding the predicated
instructions into the pipeline and tentatively assigns resources
to them in case they do execute.
Later if they do not execute it must patch the result environment
so it looks like the disabled instructions never existed,
which can possibly include propagating values to different locations
and dynamically editing dependency chains accordingly,
then clean up and recover the assigned resources.
This must be done in such a way that it can take an interrupt/exception
at any point and resolve to a predictable and restartable state.

BTW, on the simple cpu designs in days of yore, the advantage of
a conditional skip (over an unconditional branch) vs a conditional branch
is that for skip the HW instruction execute state machine always has the same
state sequence and just changes the value added to IP, whereas BRcc has to
change its whole state sequence depending on CC to perform/not-perform
the offset fetch and add. So BRcc is more complex to implement than SKIPcc,
and when you are paying per-gate for the cpu the cost difference matters.



MitchAlsup

Dec 11, 2022, 1:10:54 PM
On Sunday, December 11, 2022 at 10:25:47 AM UTC-6, EricP wrote:
> John Levine wrote:
> > According to EricP <ThatWould...@thevillage.com>:
> >> robf...@gmail.com wrote:
> >>> One complication is dealing with external interrupts that may happen. My first
> >>> thought is to simply not allow them in the shadow of a PRED instruction. It
> >>> would be delaying interrupt processing by seven or eight instructions max.
> >>> Otherwise, the PRED state would need to be saved somehow and restored after
> >>> the interrupt. Possibly by saving a copy of the interrupted instruction plus
> >>> predicate.
> >> An alternative is a PRED prefix that only applies to a single instruction.
> >
> > Fifty years ago we called those skip instructions.
> >
> > On a PDP-6, to put the larger of A and B into accumulator AC:
> >
> > MOVE AC,A ; put A into AC
> > CAMGE AC,B ; skip if AC >= B
> > MOVE AC,B ; put B into AC
> >
> Not really - skip is a very short branch, forward only.
>
> Skip/Branch removes intervening instructions from the pipeline
> and no resources are assigned for them.
> (Consider how skip would behave in an in-order pipeline without branch
> prediction - it would stall at fetch until its conditional resolved.)
>
> Predication keeps the pipeline running, feeding the predicated
> instructions into the pipeline and tentatively assigns resources
> to them in case they do execute.
<
There may be a short range (like 1 cycle) where the front end has
to guess whether the then-clause or the else-clause is to be executed;
by the following cycle it has resolved. And with the mask available
in the PRED instruction, it is easy for the front end to skip forward
over the then-clause to arrive at the else-clause without executing
any (but 1 might get nullified) instruction in the then-clause. So,
you don't put all the instructions in the pipeline.
<
Plus fixing up the rename table is harder than skipping over
the un-executed clause.
<
> Later if they do not execute it must patch the result environment
> so it looks like the disabled instructions never existed,
> which can possibly include propagating values to different locations
> and dynamically editing dependency chains accordingly,
> then clean up and recover the assigned resources.
> This must be done in such a way that it can take an interrupt/exception
> at any point and resolve to a predictable and restartable state.
>
> BTW on the simple cpu designs in days of yore, the advantage of
> skip conditional over an unconditional branch vs a conditional branch
> is for skip the HW instruction execute state machine always has the same
> state sequence and just changes the value added to IP, whereas BRcc has to
> change its whole state sequence depending on CC to perform/not-perform
> the offset fetch and add. So BRcc is more complex to implement than SKIPcc,
> and when you are paying per-gate for the cpu the cost difference matters.
<
The Front-End has to treat branches as resets--take this address and
fetch a bunch of instructions, whereas Predication tells the Front-End
Just continue using your current fetch strategy.

John Levine

Dec 11, 2022, 2:07:35 PM
According to EricP <ThatWould...@thevillage.com>:
>> Fifty years ago we called those skip instructions.

>Not really - skip is a very short branch, forward only.
>
>Skip/Branch removes intervening instructions from the pipeline
>and no resources are assigned for them.
>(Consider how skip would behave in an in-order pipeline without branch
>prediction - it would stall at fetch until its conditional resolved.)
>
>Predication keeps the pipeline running, feeding the predicated
>instructions into the pipeline and tentatively assigns resources
>to them in case they do execute.

Seems to me that's more of an implementation decision. When it sees
the skip why couldn't it treat that as a predicate on the next instruction?
I realize that when we designed machines with skip instructions, they
didn't have pipelines.

>BTW on the simple cpu designs in days of yore, the advantage of
>skip conditional over an unconditional branch vs a conditional branch
>is for skip the HW instruction execute state machine always has the same
>state sequence and just changes the value added to IP, whereas BRcc has to
>change its whole state sequence depending on CC to perform/not-perform
>the offset fetch and add. So BRcc is more complex to implement than SKIPcc,
>and when you are paying per-gate for the cpu the cost difference matters.

I don't ever recall seeing a machine that had condition codes and skip
instructions. The skip instructions all did some kind of comparison or
bit test. If you wanted the effect of a conditional jump you'd reverse
the test and skip over the jump. The point of the skip was that the
memory address in the skip instruction was the location to test or
(on PDP-6/10) sometimes an immediate operand or bit mask. There
wasn't room for a jump address.

MitchAlsup

unread,
Dec 11, 2022, 2:14:37 PM12/11/22
to
On Sunday, December 11, 2022 at 10:25:47 AM UTC-6, EricP wrote:
> John Levine wrote:
> > According to EricP <ThatWould...@thevillage.com>:
> >> robf...@gmail.com wrote:
> >>> One complication is dealing with external interrupts that may happen. My first
> >>> thought is to simply not allow them in the shadow of a PRED instruction. It
> >>> would be delaying interrupt processing by seven or eight instructions max.
> >>> Otherwise, the PRED state would need to be saved somehow and restored after
> >>> the interrupt. Possibly by saving a copy of the interrupted instruction plus
> >>> predicate.
> >> An alternative is a PRED prefix that only applies to a single instruction.
> >
> > Fifty years ago we called those skip instructions.
> >
> > On a PDP-6, to put the larger of A and B into accumulator AC:
> >
> > MOVE AC,A ; put A into AC
> > CAMGE AC,B ; skip if AC >= B
> > MOVE AC,B ; put B into AC
> >
> Not really - skip is a very short branch, forward only.
>
> Skip/Branch removes intervening instructions from the pipeline
> and no resources are assigned for them.
> (Consider how skip would behave in an in-order pipeline without branch
> prediction - it would stall at fetch until its conditional resolved.)
>
> Predication keeps the pipeline running, feeding the predicated
> instructions into the pipeline and tentatively assigns resources
> to them in case they do execute.
<
This might be a reasonable model if every instruction has its own
condition and maybe a unique condition code. It is not a reasonable
model where one condition casts a then-shadow and an else-shadow
over a number of instructions.
<
> Later if they do not execute it must patch the result environment
> so it looks like the disabled instructions never existed,
<
Then there are the cases where the same registers are used differently
in the then and else clauses. Patching this stuff up is "non trivial".

Stephen Fuld

unread,
Dec 12, 2022, 11:34:23 AM12/12/22
to
I didn't know much about the PDP-6/10 except that they were 36 bit
systems, so I spent a little time researching them. I was struck by the
similarity of the instruction formats, etc. to the Univac 1100 series
(as opposed to other 36 bit systems such as the IBM 704 or the Honeywell
DPS6).

That is a preface to saying that the 1108 had skip instructions (they
were called "Test" instructions in 1100 parlance), and conditional jump
instructions. It was pretty much as you say, conditional jump for things
like testing a register for zero, etc., but as operations like comparing
two operands required both operands, there were Test instructions as
there was not room in the instruction format for the two operands and a
jump target address.

As for condition codes, there were none for "operand compares", but
there were conditional jumps for "conditions" like overflow, FP
underflow, etc., which were kept as bits in the CPU's Processor Status
Register, which was not generally accessible by user programs.

The relative timings were interesting. Most "typical" instructions took
750ns. Note that this was before caches and basic memory access time was
also 750 ns. (Fetch next instruction was normally overlapped with
executing the current instruction.) A conditional jump instruction took
750 ns if not taken and 1,500 ns if taken (Required extra time to fetch
the instruction at the target address) A test instruction took 875 ns
if the next instruction was executed, but 1,625 ns if it was skipped. I
have forgotten why the "extra" time for test instructions.

EricP

unread,
Dec 12, 2022, 12:26:18 PM12/12/22
to
John Levine wrote:
> According to EricP <ThatWould...@thevillage.com>:
>>> Fifty years ago we called those skip instructions.
>
>> Not really - skip is a very short branch, forward only.
>>
>> Skip/Branch removes intervening instructions from the pipeline
>> and no resources are assigned for them.
>> (Consider how skip would behave in an in-order pipeline without branch
>> prediction - it would stall at fetch until its conditional resolved.)
>>
>> Predication keeps the pipeline running, feeding the predicated
>> instructions into the pipeline and tentatively assigns resources
>> to them in case they do execute.
>
> Seems to me that's more of an implementation decision. When it sees
> the skip why couldn't it treat that as a predicate on the next instruction?
> I realize that when we designed machines with skip instructions, they
> didn't have pipelines.

It's cheapest to do a conditional increment on the IP.

>> BTW on the simple cpu designs in days of yore, the advantage of
>> skip conditional over an unconditional branch vs a conditional branch
>> is for skip the HW instruction execute state machine always has the same
>> state sequence and just changes the value added to IP, whereas BRcc has to
>> change its whole state sequence depending on CC to perform/not-perform
>> the offset fetch and add. So BRcc is more complex to implement than SKIPcc,
>> and when you are paying per-gate for the cpu the cost difference matters.
>
> I don't ever recall seeing a machine that had condition codes and skip
> instructions. The skip instructions all did some kind of comparison or
> bit test. If you wanted the effect of a conditional jump you'd reverse
> the test and skip over the jump. The point of the skip was that the
> memory address in the skip instruction was the location to test or
> (on PDP-6/10) sometimes an immediate operand or bit mask. There
> wasn't room for a jump address.

Skip on condition code is what I've seen.

PDP-8 has a skip instruction for various combination of zero & carry.
It also has increment memory and skip EQ/NE zero.

DG Nova has a 3-bit skip field on ALU instructions for various
carry and zero tests on the result.

RCA 1802 8-bit microprocessor has skip over 1 or 2 bytes instructions
on various carry and zero test. It also has branch conditional
instructions which is strange as skip should make them unnecessary.
But 1802 does lots of strange things so that is par for the course.
Like the short and long branches are 2 or 3 bytes but the skip is
over 1 or 2 bytes, so it really could only skip over a short branch.
They could have made it a conditional skip over 1, 2 or 3 bytes, but instead
they added two whole sets of short and long conditional branches.
And the branches are really jumps to short or long absolute addresses.
Ah, the 1970's.



EricP

unread,
Dec 12, 2022, 12:26:18 PM12/12/22
to
MitchAlsup wrote:
> On Sunday, December 11, 2022 at 10:25:47 AM UTC-6, EricP wrote:
>>
>> Predication keeps the pipeline running, feeding the predicated
>> instructions into the pipeline and tentatively assigns resources
>> to them in case they do execute.
> <
> This might be a reasonable model if every instruction has its own
> condition and maybe a unique condition code. It is not a reasonable
> model where one condition casts a then-shadow and an else-shadow
> over a number of instructions.

An architecture with predication must have an internal execution mechanism
for enabling/disabling the various uOps based on matching a predicate
condition to a predicate value, and selecting what enable/disable actions
each uOp takes. That execution mechanism should be independent of how the
predicate conditions are distributed to the uOps.

One simple but size-costly distribution method would be a PRED prefix
for each macro instruction.

From that point of view the PRED shadow mask is a code space saving
method for mass distribution of the predicate condition to multiple uOps.
But the back end execution mechanism should remain the same.

> <
>> Later if they do not execute it must patch the result environment
>> so it looks like the disabled instructions never existed,
> <
> Then there are the cases where the same registers are used differently
> in the then and else clauses. Patching this stuff up is "non trivial".

As I see it, the only really viable predicate execute mechanism
(because the alternatives have exponential complexity expansion)
is to treat most uOps similar to a CMOV, where each has two actions:

- if the uOp is enabled it operates on its source registers
and writes its result as usual
- if the uOp is disabled it either (for in-order) patches the scoreboard
dependencies or (for OoO) copies the original dest register value.

Once the predicate value has resolved and is forwarded to all its
dependent uOps, those uOps can prune out source register dependencies
that are no longer relevant so they don't block execution.

The net result of the above is that all predicated instructions
always execute an action, usually either operate or copy.
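
A minimal C sketch of the "operate or copy" rule sketched above, for a single register-writing uOp; the uop_t layout and names are invented for illustration, not any particular implementation:

#include <stdint.h>
#include <stdbool.h>

typedef struct {
    int  dst, src1, src2;       /* register numbers                       */
    bool predicated;            /* carries a predicate condition          */
    bool pred_needed;           /* predicate value required to enable it  */
} uop_t;

void execute_predicated_add(const uop_t *u, uint64_t regs[], bool pred_value)
{
    if (!u->predicated || pred_value == u->pred_needed) {
        /* enabled: operate on the sources and write the result as usual */
        regs[u->dst] = regs[u->src1] + regs[u->src2];
    } else {
        /* disabled: behave like a CMOV whose condition is false --
           re-deposit the old destination value so later consumers still
           see a defined register (for OoO, a copy from the previous
           physical register into the newly allocated one) */
        regs[u->dst] = regs[u->dst];
    }
}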

John Levine

unread,
Dec 12, 2022, 12:42:33 PM12/12/22
to
According to EricP <ThatWould...@thevillage.com>:
>Skip on condition code is what I've seen.
>
>PDP-8 has a skip instruction for various combination of zero & carry.
>It also has increment memory and skip EQ/NE zero.

I also spent a lot of time programming a PDP-8 and I can promise you it
did not have condition codes. It had a single accumulator, and a "link"
bit which was more or less a carry flag. The skip instructions looked
at the AC and link, e.g., skip on negative AC, skip on zero link.

ISZ increment and skip if zero was how you did loop counters. There
was no ISNZ.
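
The loop-counter idiom that makes ISZ sufficient, expressed in C purely for illustration (preload the counter with the negated count; ISZ at the loop bottom increments it and skips the jump-back when it reaches zero):

void isz_style_loop(int n, void (*body)(void))
{
    if (n <= 0)
        return;
    int count = -n;              /* counter starts negative           */
    do {
        body();                  /* loop body                         */
    } while (++count != 0);      /* ISZ COUNT / JMP LOOP equivalent   */
}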

>DG Nova has a 3-bit skip field on ALU instructions for various
>carry and zero tests on the result.

The Nova's skip instructions were similar to the PDP-8, not surprising
since the same guy designed them. Skip on carry, skip on zero register.
No condition codes either.

>RCA 1802 8-bit microprocessor has skip over 1 or 2 bytes instructions
>on various carry and zero test. It also has branch conditional
>instructions which is strange as skip should make them unnecessary.

Didn't use that one.

I did use the Varian 620i which had one- and two-word instructions,
but the skips only skipped one word. That led to some amusing bugs.

John Levine

unread,
Dec 12, 2022, 1:06:24 PM12/12/22
to
According to Stephen Fuld <sf...@alumni.cmu.edu.invalid>:
>I didn't know much about the PDP-6/10 except that they were 36 bit
>systems, so I spent a little time researching them. I was struck by the
>similarity of the instruction formats, etc. to the Univac 1100 series
>(as opposed to other 36 bit systems such as the IBM 704 or the Honeywell
>DPS6).

The 704 was the ur-36 bit machine in 1954 that inspired everything
else. (The 701 was sort of the beta version and the 704 fixed many of
its mistakes like half-word addressing and Williams tube memory.) The
704 architecture was carried forward to the 709, 7090, and 7094, then
killed in favor of S/360 but not without a fight within IBM.

The DPS6 was a 16/32 bit machine so I'm guessing you mean the 6000
series which was the 36-bit GE 600 series, on which DTSS and Multics
ran. It had extremely complicated addressing but only a single AC and
MQ which in retrospect wasn't a great choice. It did have condition
codes, zero, negative, carry, overflow, and conditional branches on
them.

All of the 36 bit machines ran out of address bits. Some had extended
addressing hacks which didn't help enough.

>As for condition codes, there were none for "operand compares", but
>there were conditional jumps for "conditions" like overflow, FP
>underflow, etc., which were kept as bits in the CPU's Processor Status
>Register, which was not generally accessible by user programs.

We called them flags and didn't use them much. They weren't much like
condition codes, they were set when something like overflow happened
and stayed on until tested and reset by a JFCL or overwritten by a
JRST instruction. In all the time I programmed a PDP-10 I don't recall
writing a JFCL other than as a no-op and a JRST other than as an
unconditional jump. JSP AC,.+1 saved the flags in a register if you
wanted for some reason to test them without resetting them. Never did
that either.

>The relative timings were interesting. Most "typical" instructions took
>750ns. Note that this was before caches and basic memory access time was
>also 750 ns. (Fetch next instruction was normally overlapped with
>executing the current instruction.) A conditional jump instruction took
>750 ns if not taken and 1,500 ns if taken (required extra time to fetch
>the instruction at the target address). A test instruction took 875 ns
>if the next instruction was executed, but 1,625 ns if it was skipped. I
>have forgotten why the "extra" time for test instructions.

Probably the same as for jump, it had to refill the prefetch buffer when
it didn't execute the skipped instruction.

Stephen Fuld

unread,
Dec 12, 2022, 1:44:05 PM12/12/22
to
On 12/12/2022 10:06 AM, John Levine wrote:
> According to Stephen Fuld <sf...@alumni.cmu.edu.invalid>:
>> I didn't know much about the PDP-6/10 except that they were 36 bit
>> systems, so I spent a little time researching them. I was struck by the
>> similarity of the instruction formats, etc. to the Univac 1100 series
>> (as opposed to other 36 bit systems such as the IBM 704 or the Honeywell
>> DPS6).
>
> The 704 was the ur-36 bit machine in 1954 that inspired everything
> else.

The Univac 1103 was a 36 bit machine introduced in 1953, though it was
not compatible with later 1100 series machines.

https://en.wikipedia.org/wiki/UNIVAC_1100/2200_series



> (The 701 was sort of the beta version and the 704 fixed many of
> its mistakes like half-word addressing and Williams tube memory.) The
> 704 architecture was carried forward to the 709, 7090, and 7094, then
> killed in favor of S/360 but not without a fight within IBM.
>
> The DPS6 was a 16/32 bit machine so I'm guessing you mean the 6000
> series which was the 36-bit GE 600 series, on which DTSS and Multics
> ran.

Yes. My sloppy research. :-(


> It had extremely complicated addressing but only a single AC and
> MQ which in retrospect wasn't a great choice. It did have condition
> codes, zero, negative, carry, overflow, and conditional branches on
> them.
>
> All of the 36 bit machines ran out of address bits. Some had extended
> addressing hacks which didn't help enough.

I think it is fair to say that *all* computers introduced before say the
1980s, that survived, ran out of address bits. The 1100 series
introduced their Extended Mode, that survives (albeit in emulation) to
this day. I don't think that lack of address bits was the primary
reason for its decline.

I have often said that a reason for its decline was their marketing
department's utter failure to convince the world that 36 was an integral
power of 2. :-)

BGB

unread,
Dec 13, 2022, 4:57:10 AM12/13/22
to
On 12/12/2022 11:25 AM, EricP wrote:
> MitchAlsup wrote:
>> On Sunday, December 11, 2022 at 10:25:47 AM UTC-6, EricP wrote:
>>>
>>> Predication keeps the pipeline running, feeding the predicated
>>> instructions into the pipeline and tentatively assigns resources to
>>> them in case they do execute.
>> <
>> This might be a reasonable model if every instruction has its own
>> condition and maybe a unique condition code. It is not a reasonable
>> model where one condition casts a then-shadow and an else-shadow
>> over a number of instructions.
>
> An architecture with predication must have an internal execution mechanism
> for enabling/disabling the various uOps based on matching a predicate
> condition to a predicate value, and selecting what enable/disable actions
> each uOp takes. That execution mechanism should be independent of how the
> predicate conditions are distributed to the uOps.
>

In my case, the predication mode travels along with the opcode, and once
it moves into EX1, it gets remapped if needed.

Most normal operations turn into a NOP when predicated off. Branches split into
BRA and BRA_NB (Branch Non-Branch). The main purpose of BRA_NB is that
if the branch-predictor had predicted a branch as taken, then BRA_NB
will initiate a branch to the following instruction (essentially a "BRA 0").
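
A loose C model of that EX1 remap, as described; the enum names are illustrative, not the actual BJX2 internals:

typedef enum { OP_NORMAL, OP_BRANCH } op_kind_t;
typedef enum { EX_RUN, EX_NOP, EX_BRA, EX_BRA_NB } ex_action_t;

/* Remap a predicated instruction as it enters EX1, once the predicate
   value is known. */
ex_action_t remap_at_ex1(op_kind_t kind, int pred_enabled)
{
    if (kind == OP_BRANCH)
        return pred_enabled ? EX_BRA : EX_BRA_NB; /* BRA_NB re-steers to the
                                                     next instruction only if
                                                     the predictor had guessed
                                                     "taken"                 */
    return pred_enabled ? EX_RUN : EX_NOP;        /* normal ops become NOPs  */
}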


> One simple but size-costly distribution method would be a PRED prefix
> for each macro instruction.
>
> From that point of view the PRED shadow mask is a code space saving
> method for mass distribution of the predicate condition to multiple uOps.
> But the back end execution mechanism should remain the same.
>

As noted, I went with essentially dedicating 2 bits to this in every
instruction.


This is a little less than 32-bit ARM, albeit there are 3 bits spent on
encoding the 16/32 split.


Well, apart from 7xxx and 9xxx, which are more a case of these spaces
having been "mostly" unused in the 16-bit opcode map (whereas most other
options would have broken binary compatibility).


I had originally almost considered making XGPR a special operating
mode that would have essentially reclaimed all of the 16-bit space, say:
wZnm-ZeoZ
wZnm-Zeii

Where the high E/F has the '111' bits replaced by the (inverted) high
bits of the 3 register fields.

This encoding would have mostly avoided the orthogonality issues, but I
ended up choosing the "more conservative" option.

But, faced with some other issues, I am almost left to wonder if going
that route might have been better in the long run.


It is possible that I could still consider this option.

Pros: Regains some amount of orthogonality.
Cons: Operating mode hassle, complete loss of 16-bit ops in this mode.


So, say:
ADD?T R17, 0x123, R37 | SUB?F R17, 0x145, R37

Could then be encoded as two 32-bit instructions.


Handling would be otherwise similar to that of RISC-V mode; may need a
different PE/COFF machine-ID so that the PE loader knows to start the
program in this mode. Calling between modes would be slightly less of an
issue than BJX2 <-> RISC-V, since at least both sides will (more or
less) agree on the same ABI.

Basically, it is similar to the ARM32<->Thumb situation.


A mode switch being needed because (for fairly obvious reasons), this
would otherwise break binary compatibility in a pretty major way.


>> <
>>> Later if they do not execute it must patch the result environment so
>>> it looks like the disabled instructions never existed,
>> <
>> Then there are the cases where the same registers are used differently
>> in the then and else clauses. Patching this stuff up is "non trivial".
>
> As I see it, the only really viable predicate execute mechanism
> (because the alternatives have exponential complexity expansion)
> is to treat most uOps similar to a CMOV, where each has two actions:
>
> - if the uOp is enabled it operates on its source registers
>   and writes its result as usual
> - if the uOp is disabled it either (for in-order) patches the scoreboard
>   dependencies or (for OoO) copies the original dest register value.
>
> Once the predicate value has resolved and is forwarded to all its
> dependent uOps, those uOps can prune out source register dependencies
> that are no longer relevant so they don't block execution.
>
> The net result of the above is that all predicated instructions
> always execute an action, usually either operate or copy.
>

Probably depends on the design of the core.

In my design, simply turning them into NOPs is sufficient.

Granted, my core is strictly in-order, and there is no scoreboard. So,
the NOPs are just "empty spots" in the pipeline.

As with NOPs, these instructions may still eat clock-cycles.
Because interlocking happens before they enter the EX stage, they may
also still trigger interlock stalls on other instructions.

EricP

unread,
Dec 13, 2022, 2:08:39 PM12/13/22
to
WRT interlocking, yes that is exactly my point - they are not really
NOPs because in-order still has to track RAW and WAW register
dependencies and diddle the tracking flip-flops.

Say I have a predicate prefix that tests if the lsb of a register
matches a predicate condition. Eg
ET rN: Execute True prefix enables if rN[0] == 1
EF rN: Execute False prefix enables if rN[0] == 0

#1 CMP r0 = r9 > 1234 // set/clear r0 lsb
#2 ET r0: ADD r3 = r2 + r1 // enable if r0[0] == 1
#3 ADD r5 = r4 + r3

In the above instruction #2 the prefix tests lsb of r0 and
enables if it matches the T==1 or F==0 predicate condition.
#2 is RAW dependent on r0 for the predicate value
and may additionally be RAW on r2 and r1 if enabled.
#3 may be RAW dependent on r3 until r0 is resolved.

Either (a) the pipeline stalls #2 at Reg Read stage until r0 resolves,
or (b) it reads r2 and r1 anyway and marks r3 as Write Pending.
If it chooses (b) then #3 will RAW stall on r3, and if later r0 resolves
and disables #2 then #2 has to reset the Write Pending flag on r3
allowing #3 to continue execution.

In both cases instruction #2 either stalls for the r0 value or
proceeds to perform work like register read and then does housekeeping
on tracking flags to patch up after.
In both cases a disabled instruction still has visible side effects
so is not a NOP.
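
A toy C model of option (b) above: issue the predicated ADD before its predicate resolves, mark r3 Write Pending, then either complete the write or just release the flag once r0's lsb arrives. All names and the scoreboard layout are made up for illustration:

#include <stdbool.h>
#include <stdint.h>

#define NREGS 32

typedef struct {
    uint64_t val[NREGS];
    bool     write_pending[NREGS];   /* RAW consumers stall while set */
} scoreboard_t;

/* instruction #2, "ET r0: ADD r3 = r2 + r1", issued speculatively */
void issue_pred_add(scoreboard_t *sb)
{
    sb->write_pending[3] = true;     /* instruction #3 now stalls on r3 */
}

/* later, when r0's lsb (the predicate) resolves */
void resolve_pred_add(scoreboard_t *sb, bool pred_lsb)
{
    if (pred_lsb)
        sb->val[3] = sb->val[2] + sb->val[1];  /* enabled: perform the ADD */
    /* enabled or not, the pending flag must be cleared so #3 can proceed;
       this housekeeping is the visible side effect that makes a disabled
       instruction something other than a true NOP */
    sb->write_pending[3] = false;
}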


BGB

unread,
Dec 13, 2022, 2:54:45 PM12/13/22
to
Sorta, I guess...

It is a NOP in terms of "high level behavior" or "How EX stages see it".

Or, not necessarily a NOP if one demands that "Every NOP eats exactly 1
clock cycle". These may eat multiple clock cycles for any register
dependencies that may have otherwise occurred (since, this part happens
before it enters EX1, and thus before it can be known whether or not the
instruction will execute).


In theory, it could be possible to shortcut the predication during
ID1->ID2 by looking forward and verifying that nothing in the ID2 or EX1
stages can touch SR.T, and if so using the prior value of SR.T; however,
this isn't likely to save enough clock cycles to make it worth the extra
cost.

...

MitchAlsup

unread,
Dec 13, 2022, 3:53:18 PM12/13/22
to
On a 10-wide machine, does a NoOp eat exactly 1 cycle ?? and what does
1 cycle mean on a 10-wide machine ??
<
> These may eat multiple clock cycles for any register
> dependencies that may have otherwise occurred (since, this part happens
> before it enters EX1, and thus before it can be known whether or not the
> instruction will execute).
<
On GBOoO machines, many to most instructions do not appear to add any
latency to the rest of the executing instructions !! This makes the above
definition intractable.
>
>
> In theory, it could be possible to shortcut the predication during
> ID1->ID2 by looking forward and verifying that nothing in the ID2 or EX1
> stages can touch SR.T, and if so using the prior value of SR.T, however,
> this isn't likely to save enough clock cycles to make it worth the extra
> cost.
<
The way predication is defined in My 66000 is that the front end is
free to skip all instructions that are not supposed to be executed--but
this comes with a restriction that all of the instructions can be categorized
into exactly 1 of 2 groups (then-clause and else-clause). I started out by
allowing instructions to be common, but the compiler could not figure
out how to use it, so we simplified the encoding--I may further simplify
the encoding from here........
>
> ...

BGB

unread,
Dec 13, 2022, 8:52:10 PM12/13/22
to
This was a response to the assertion that a NOP from a predicated-away
instruction was not a true NOP because it still has to deal with
register dependencies and similar (so, more like a phantom version of
the instruction, rather than a "true NOP").

The only real alternative then, is to assume that a NOP has a known
latency (say, 1 cycle).

But, yeah, 0 cycles is another possible interpretation, but 0 cycle NOPs
aren't really a workable definition for an in-order machine.


>> These may eat multiple clock cycles for any register
>> dependencies that may have otherwise occurred (since, this part happens
>> before it enters EX1, and thus before it can be known whether or not the
>> instruction will execute).
> <
> On GBOoO machines, many to most instructions do not appear to add any
> latency to the rest of the executing instructions !! This makes the above
> definition intractable.

Possibly, I was thinking for a simple in-order machine.

Instruction latency on GBOoO is more a "throw hands in the air and say
'who knows'?" thing.


Then again, I guess it was interesting to observe that some types of
optimizations which help on BJX2 also help on x86-64, despite x86-64
having not enough registers to actually make use of the optimization as
expressed (x86-64 seems to be able to pretend as-if the stack
spills/fills were registers).


Meanwhile, I am left needing to use lots of registers, and occasional
optimizations which try to side-step a few (store then load) cases, then
recently needing to fix a bug where one of these optimizations broke due
to a type-handling issue in my compiler (was using the type of the value
stored to an array, rather than the type of the value as loaded from the
array).

Well, along with other issues, like local variables having their storage
culled while missing that some of these variables were stored to but
not loaded from, resulting in them having a wild-card offset relative to
the stack-pointer (thus leading to some annoying stack-corruption bugs).

...


>>
>>
>> In theory, it could be possible to shortcut the predication during
>> ID1->ID2 by looking forward and verifying that nothing in the ID2 or EX1
>> stages can touch SR.T, and if so using the prior value of SR.T, however,
>> this isn't likely to save enough clock cycles to make it worth the extra
>> cost.
> <
> The way predication is defined in My 66000 is that the front end is
> free to skip all instructions that are not supposed to be executed--but
> this comes with a restriction that all of the instructions can be categorized
> into exactly 1 of 2 groups (then-clause and else-clause). I started out by
> allowing instructions to be common, but the compiler could not figure
> out how to use it, so we simplified the encoding--I may further simplify
> the encoding from here........

Nothing requires that they are not skipped in my case either, just that
this is beyond what my current implementation can manage.

Rules are mostly similar to normal instructions, just with the
difference that ?T and ?F instructions are allowed to have register
conflicts with each other in bundles on account of them being mutually
exclusive.



BTW: I put up a vote on Twitter wanting to see what the general
sentiment was on possible ways to "resolve" the encoding orthogonality
issues with R32..R63:
https://twitter.com/cr88192/status/1602801136590782466


Partly, because "add a new CPU mode and sort of half-way address the
issue" is kind of, not exactly a clear win...

The new encoding would at least make R32..R63 able to be used
in all the same contexts as R0..R31, but still can't address some of the
other orthogonality issues, mostly because there is basically no way
possible to fit everything I would want into a 32 bit instruction word.


I have already experimented with adding some of the features needed for
the new mode to the Verilog, and everything seems to survive (mostly
involves some minor tweaks to the handling of the Link Register and
similar, increasing the number of CPU modes from 4 to 16).


Say:
0: BJX2 + Scalar (16/32)
1: BJX2 + WEX (16/32/64/96)
2: RISC-V (Generic 16/32)
3: RISC-V + WEX (TBD)
4: BJX2_XGP2 + Scalar (32)
5: BJX2_XGP2 + WEX (32/64/96)
6..15: Reserved

This will come at the cost that the status-flag bits for SIMD comparison
(PQRO) will no longer be preserved across function calls (replaced by
additional CPU-mode bits).

Was also tempted to replace '3', noting that this mode is probably DOA
(in favor of handling RISC-V as superscalar). Mode-3 would have required
a load-time bundle conversion, which in turn requires additional
metadata to do safely.


They will also have new Machine IDs in PEL4, say:
B232: Start in Mode 0|1 (32-bit pointers)
B264: Start in Mode 0|1 (64-bit pointers)
B296: Start in Mode 0|1 (but uses 128-bit pointers)
B265: Start in Mode 4|5 (64-bit pointers)
B297: Start in Mode 4|5 (128-bit pointers)

5032: RISC-V (32-bit), N/A on BJX2
5064: RISC-V (64-bit), Starts in Mode 2

These modes would also apply to COFF objects (if COFF were used).


Mode 0/1 differ mostly in that 0 executes instructions like a scalar
RISC machine, whereas 1 will execute instructions as bundles (code may
jump between these modes based on the result of the WEXMD instruction).

Similar would apply to 4/5.


Jumps between other modes would be accomplished via setting up bits in
function pointers (or, implicitly in the case of the Link Register;
which will implicitly return the mode to that used by the caller).

Could in theory direct-call between BJX2 and RISC-V, but doing so in
practice would require some mechanism to auto-generate glue thunks to
gloss over the (relatively significant) differences in the C ABI.

Though, thus far, running RISC-V code on BJX2 still hasn't really been
tested much beyond trivial test cases.

...


MitchAlsup

unread,
Dec 13, 2022, 11:19:48 PM12/13/22
to
On Tuesday, December 13, 2022 at 7:52:10 PM UTC-6, BGB wrote:
> On 12/13/2022 2:53 PM, MitchAlsup wrote:

> Then again, I guess it was interesting to observe that some types of
> optimizations which help on BJX2 also help on x86-64, despite x86-64
> having not enough registers to actually make use of the optimization as
> expressed (x86-64 seems to be able to pretend as-if the stack
> spills/fills were registers).
>
The stack on x86 is defined to always be aligned to the size of the GPRs.
>
> Meanwhile, I am left needing to use lots of registers, and occasional
> optimizations which try to side-step a few (store then load) cases, then
> recently needing to fix a bug where one of these optimizations broke due
> to a type-handling issue in my compiler (was using the type of the value
> stored to an array, rather than the type of the value as loaded from the
> array).
<
If there is only 1 set of registers, this problem vanishes.
> <snip>
> Nothing requires that they are not skipped in my case either, just that
> this is beyond what my current implementation can manage.
>
This was a discussion about NoOps and the necessity of having them in ISA.
If they are present they have to have some defined meaning, EricP and I
are trying to ascertain exactly what the prescription needs to be. Being able
to be skipped fails several of the assertions in this thread.
<
> Rules are mostly similar to normal instructions, just with the
> difference that ?T and ?F instructions are allowed to have register
> conflicts with each other in bundles on account of them being mutually
> exclusive.
>
>
>
> BTW: I put up a vote on Twitter wanting to see what the general
> sentiment was on possible ways to "resolve" the encoding orthogonality
> issues with R32..R63:
> https://twitter.com/cr88192/status/1602801136590782466
>
There was a recent poll on some site and 56% of Americans do not think
that Arabic numerals should be taught in schools, too; and 15% don't have
an opinion.
https://www.snopes.com/fact-check/teaching-arabic-numerals/
We are well on the way to Idiocracy.
>

BGB

unread,
Dec 14, 2022, 2:54:24 AM12/14/22
to
On 12/13/2022 10:19 PM, MitchAlsup wrote:
> On Tuesday, December 13, 2022 at 7:52:10 PM UTC-6, BGB wrote:
>> On 12/13/2022 2:53 PM, MitchAlsup wrote:
>
>> Then again, I guess it was interesting to observe that some types of
>> optimizations which help on BJX2 also help on x86-64, despite x86-64
>> having not enough registers to actually make use of the optimization as
>> expressed (x86-64 seems to be able to pretend as-if the stack
>> spills/fills were registers).
>>
> The stack on x86 is defined to always be aligned to the size of the GPRs.

Yeah. As noted, current ABI rules generally enforce a 16B alignment for
the stack on BJX2 as well (with stack items generally following their
"natural alignment"). CPU isn't smart enough to treat memory as
registers though.


>>
>> Meanwhile, I am left needing to use lots of registers, and occasional
>> optimizations which try to side-step a few (store then load) cases, then
>> recently needing to fix a bug where one of these optimizations broke due
>> to a type-handling issue in my compiler (was using the type of the value
>> stored to an array, rather than the type of the value as loaded from the
>> array).
> <
> If there is only 1 set of registers, this problem vanishes.


There is only one type of register in my case, but this particular issue
was closer to the C level rather than the ISA level.

When storing something to an array, and then loading it back again, it
was reported as the type of the value that had been stored rather than
the type of the array element.

So, say, if one tries storing "int" to a "short*" array, or "void*" to a
"Foo*" array, everything is fine. But, if a load returns 'int' or
'void*' rather than the intended type, this may lead to other problems.

Options are one of:
Types match exactly, allow it through as-is;
Types are "compatible", quietly coerce the type;
Quietly convert it into a type-cast operation;
Actually do the load.
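
An illustration of the underlying issue in plain C (not BGBCC internals): after an int is stored into a short array, the reload has to be typed as short, otherwise bits the store truncated appear to survive.

#include <stdio.h>

int main(void)
{
    short a[1];
    int   v = 0x12345678;

    a[0] = v;                    /* the store truncates to 16 bits        */
    int reloaded = a[0];         /* correct: load as short, then widen    */
    int wrong    = v;            /* what "remembering" the stored value's
                                    type as int effectively does          */

    printf("%x vs %x\n", (unsigned)reloaded, (unsigned)wrong);  /* 5678 vs 12345678 */
    return 0;
}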


>> <snip>
>> Nothing requires that they are not skipped in my case either, just that
>> this is beyond what my current implementation can manage.
>>
> This was a discussion about NoOps and the necessity of having them in ISA.
> If they are present they have to have some defined meaning, EricP and I
> are trying to ascertain exactly what the prescription needs to be. Being able
> to be skipped fails several of the assertions in this thread.

Fair enough.

Probably depends some on the role and context:
As passing (in the ISA);
For burning clock-cycles;
Or, as an empty-spot within the pipeline (resulting either from an
interlock or not otherwise having an instruction to run in that location
at that time).

Within my pipeline, major-opcode 6'h00 is regarded as NOP.
There is another class of NOP expressed as essentially "CONV.MOV ZR, ZR".


> <
>> Rules are mostly similar to normal instructions, just with the
>> difference that ?T and ?F instructions are allowed to have register
>> conflicts with each other in bundles on account of them being mutually
>> exclusive.
>>
>>
>>
>> BTW: I put up a vote on Twitter wanting to see what the general
>> sentiment was on possible ways to "resolve" the encoding orthogonality
>> issues with R32..R63:
>> https://twitter.com/cr88192/status/1602801136590782466
>>
> There was a recent poll on some site and 56% of Americans do not think
> that Arabic numerals should be taught in schools, too; and 15% don't have
> an opinion.
> https://www.snopes.com/fact-check/teaching-arabic-numerals/
> We are well on the way to Idiocracy.


"Oh those squiggles, what do they mean. Radix-10 positional arithmetic,
what sorcery is this! We all know true numbers look like MCMXCIX!", then
proceeds to lose their crap if someone tries to bring up zero or
negative numbers...


Thus far, the dominant response seems to be that people are against
having 64 GPRs. Would have preferred better prompts, but each was
limited to 25 characters, which isn't really enough for this.

I was more just wondering what the general sentiment was, rather than
committing to follow with whatever option wins.



But, yeah, the "least effort" option is "leave everything as-is", where:
BGBCC does not use R32..R63 by default unless enabled via a command-line
option (except for the 128-bit ABI).

If not enabled, as far as BGBCC and the assembler are concerned, these
registers do not exist:
-fxgpr: Allows ASM to use these registers.
-fxgpr_ena: Allow BGBCC to use them for register allocation.
-fxgpr_abi: Allow ABI to use them for argument passing
Increases the number of register arguments to 16 (*).
But, also increases ABI's register spill space to 128 bytes.

*: Increasing the number of register arguments from 8 to 16 seems to
increase the number of function calls which fit entirely in registers
from around 80% to around 98%.


Using R32..R63 can help slightly for things like TKRA-GL and JPEG
decoding, but is slightly detrimental in many other cases (globally
enabling them is slightly detrimental to performance and code density
with the existing encoding scheme).

Part of the issue is likely due to the orthogonality issue:
Prevents cases where instructions could have been bundled;
Forces using 64-bit Op64 encodings in many cases (as a fallback);
...


This seems to be enough to offset the (arguably small) reduction in the
number of stack spill-and-fill (spill-and-fill is bad; but 64-bit
instruction encodings being a roadblock to the WEXifier is also bad...).

By design, this would eliminate cases where Op64 encodings are needed to
deal with XGPR (but where the op could otherwise fit into a 32-bit
encoding).


But, it does allow the "majority of all functions" to switch almost or
entirely to static-assigning all of the local variables to registers.
This mostly applies to non-leaf functions in this case; as the majority
of leaf functions are already able to go full-static with 32 GPRs, but
non-leaf functions can't use scratch registers for static assignment.

Say, 32 GPRs:
Leaf function: Has ~ 26 registers it can use in this case.
Non-Leaf: Has ~ 14 registers it can use in this case.

With 64 GPRs:
Leaf function: Has ~ 58 registers it can use.
Non-Leaf: Has ~ 30 registers it can use.

Partial reason being, in a leaf function, nothing is going to stomp the
scratch registers, but a non-leaf functions need to deal with these
registers getting stomped during function calls.

On average, functions seems to have roughly 17 variables in total (in a
roughly Gaussian distribution). With the number of function arguments
seemingly following a geometric distribution.


Though, using fixed-length 32-bit ops would be bad for size-optimized
code. So, for example, the new encoding mode would be a bad idea for the
Boot ROM.


Would need to finish implementing and then test it though.

Compared to other possibilities, the current scheme was chosen to be
"fairly conservative" (trying to minimize changes needed both to my
Verilog code and to BGBCC needed to add support for it).

It will basically reuse nearly all of the existing instruction decoding
as-is, with mostly a few tweaks related to instruction-length and the
WN/WM/WI bits and similar.

It will still have Jumbo encodings, just now with the possibility of
there being 16 unique Jumbo prefixes (rather than 2). Not sure yet what
I will do with the additional Jumbo prefixes.

Implicitly, a lot of the Imm5u fields will become Imm6s (as with the
existing XGPR encodings).


Stuff in BGBCC is still going to be a mess, but nothing new here... This
is where I currently expect most of the "fight" will be.

Changes needed to my emulator should also be fairly modest.


But, yeah, an obvious drawback is that it would fracture the ISA into
two variants which will not be binary compatible with each other (hence
why different CPU modes and MachineID values are needed for this).


The new mode did require making some semantic changes to the handling of
the Link Register, but my existing code doesn't notice the change.
Mostly changing some of the high-order bits in LR, and changing it to
*always* be encoded as the Inter-ISA encoding (which will now be the
standard encoding for LR).


Terje Mathisen

unread,
Dec 14, 2022, 3:59:42 AM12/14/22
to
MitchAlsup wrote:
> On Tuesday, December 13, 2022 at 7:52:10 PM UTC-6, BGB wrote:
>> On 12/13/2022 2:53 PM, MitchAlsup wrote:
>
>> Then again, I guess it was interesting to observe that some types of
>> optimizations which help on BJX2 also help on x86-64, despite x86-64
>> having not enough registers to actually make use of the optimization as
>> expressed (x86-64 seems to be able to pretend as-if the stack
>> spills/fills were registers).
>>
> The stack on x86 is defined to always be aligned to the size of the GPRs.

That is an OS/software convention afaik?

I.e. at least in real (16-bit) and 32-bit mode you _can_ misalign your
stack pointer as long as you don't enable the check_alignment control
word flag?

My mime-ascii executable text starts by pushing a word of 0000h and a
word of 0ffffh on the stack, then incrementing the stack pointer so that
a pop will return 00ffh.
It is much worse in the US than here, but that might just be a matter of
time?

Terje


--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

John Dallman

unread,
Dec 14, 2022, 5:34:05 AM12/14/22
to
In article <tnc39r$1uf0$1...@gioia.aioe.org>, terje.m...@tmsw.no (Terje
Mathisen) wrote:

> MitchAlsup wrote:
> > The stack on x86 is defined to always be aligned to the size of
> > the GPRs.
>
> That is an OS/software convention afaik?
>
> I.e. at least in real (16-bit) and 32-bit mode you _can_ misalign
> your stack pointer as long as you don't enable the check_alignment
> control word flag?

There are SSE2 instructions that deal with aligned pairs of doubles, such
as <https://en.wikipedia.org/wiki/MOVAPD>. Those trap on addresses that
aren't 16-byte aligned, even if alignment checking is turned off. They
produce a General Protection trap, not an alignment trap. There are
versions for unaligned addresses, but they're slower.

I first noticed them with a Microsoft 64-bit compiler that was confused
about the alignment requirements, and ended up submitting about ten
compiler bugs in a week. On x86-64, the Microsoft compiler keeps the
stack 16-byte aligned; on x86-32, the stack is only 4-byte aligned. It
appears that Microsoft don't use MOVAPD and relatives in 32-bit code.
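
A small C example of the distinction, using the SSE2 intrinsics that map to these instructions (_mm_load_pd compiles to MOVAPD and faults on a non-16-byte-aligned address; _mm_loadu_pd / MOVUPD accepts any address):

#include <emmintrin.h>
#include <stdio.h>

int main(void)
{
    _Alignas(16) double buf[4] = { 1.0, 2.0, 3.0, 4.0 };

    __m128d a = _mm_load_pd(&buf[0]);    /* 16-byte aligned: MOVAPD is fine */
    __m128d u = _mm_loadu_pd(&buf[1]);   /* 8-byte offset: must use MOVUPD  */
    /* _mm_load_pd(&buf[1]) would raise a general-protection fault here     */

    double out[2], out2[2];
    _mm_storeu_pd(out,  a);
    _mm_storeu_pd(out2, u);
    printf("%g %g / %g %g\n", out[0], out[1], out2[0], out2[1]);
    return 0;
}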

John

Anton Ertl

unread,
Dec 14, 2022, 8:38:57 AM12/14/22
to
j...@cix.co.uk (John Dallman) writes:
>There are SSE2 instructions that deal with aligned pairs of doubles, such
>as <https://en.wikipedia.org/wiki/MOVAPD>. Those trap on addresses that
>aren't 16-byte aligned, even if alignment checking is turned off. They
>produce a General Protection trap, not an alignment trap.

And there is vmovapd that moves from/to a zmm register and that traps
if the address is not 64-byte aligned.

One difference between SSE and AVX is that the implicit memory
accesses of the load-and-op instructions of SSE require 16-byte
alignment, while the implicit memory accesses of such instructions in
AVX and later extensions do not require alignment.

>There are
>versions for unaligned addresses, but they're slower.

A common myth, and it was brought up for justifying that gcc's
auto-vectorization breaks a program by compiling movdqa instead of
movdqu, so I measured it for the routine for which the bug was
reported <http://www.complang.tuwien.ac.at/anton/autovectors/>.

It turned out that even on the K10 and the Core 2, which were
explicitly mentioned as affected CPUs, the movdqa version (16-byte
alignment required) was either the same speed (K10) as the movdqu
version (no alignment required, but present in the benchmark) or was
slower (Core 2) than the unvectorized version written in
standard-compliant C (see below). On the Core 2 the movdqa version
was faster than the movdqu version by a factor of 1.0014; on the others
they were essentially the same speed.

[Side note: Rewriting the program in standard-compliant C (by using
memcpy() instead of dereferencing pointers to 64-bit values) resulted
in the program not being auto-vectorized on the same compiler and was
significantly slower on 4 out of 5 CPUs (in particular, by a factor of
1.7 on Haswell and Skylake).]
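
The memcpy idiom referred to looks roughly like this (a sketch, not the actual routine from the benchmark page); compilers normally turn the fixed-size memcpy into a single load or store, but it stays legal regardless of alignment and effective type:

#include <stdint.h>
#include <string.h>

static inline uint64_t load_u64(const void *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);   /* usually compiled to one 64-bit load */
    return v;
}

static inline void store_u64(void *p, uint64_t v)
{
    memcpy(p, &v, sizeof v);   /* usually compiled to one 64-bit store */
}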

Given the defaults for the implicit loads of AVX instructions, I
expect instructions that do not require alignment are at least as fast
as those that require alignment on CPUs that support AVX. And indeed,
an Intel performance guy writes
<https://software.intel.com/comment/1470256#comment-1470256> that
vmovdqu vs. vmovdqa is neutral for performance.

One interesting thing about the auto-vectorized code was that it uses
movdqa (requiring 16-byte alignment) for loads and movups for stores.
That is probably because even if you assume (based on the pointer
type) that both pointers are 8-byte aligned, if you load two items
aligned to a 16-byte boundary, the corresponding store may be
misaligned wrt the 16-byte requirement of movdqa/movaps/movapd.

One strange thing here is that one supposedly should not mix types:
Movdqu and movups are architecturally the same instruction, but movdqu
is for integers, while movups is for 32-bit floats, and there are
supposedly performance penalties for mixing types.

The other issue is: For code like this that performs the same number
of loads and stores, I would have chosen to align the store pointer to
16 bytes rather than the load pointer, because most CPUs can perform
more loads than stores per cycle, and the idea is that it can then
perform an unaligned load by performing two aligned loads in one
cycle.
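
A sketch of what "align the store pointer" looks like for a simple copy loop (illustrative only, not the benchmark's actual code): peel scalar iterations until the destination is 16-byte aligned, then use aligned stores and possibly-unaligned loads in the vector body.

#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

void copy_doubles(double *dst, const double *src, size_t n)
{
    size_t i = 0;

    /* scalar prologue: advance until the *store* side is 16-byte aligned */
    while (i < n && ((uintptr_t)&dst[i] & 15) != 0) {
        dst[i] = src[i];
        i++;
    }

    for (; i + 2 <= n; i += 2) {
        __m128d v = _mm_loadu_pd(&src[i]);   /* loads may be unaligned */
        _mm_store_pd(&dst[i], v);            /* stores are now aligned */
    }

    for (; i < n; i++)                       /* scalar epilogue */
        dst[i] = src[i];
}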

I wondered if maybe the store buffer is used to make adjacent
unaligned stores as cheap as aligned stores, but another
microbenchmark
<http://www.complang.tuwien.ac.at/anton/unaligned-stores/> refutes
this theory.

Nevertheless, in the "other results" in
<http://www.complang.tuwien.ac.at/anton/autovectors/>, the (actually)
unaligned store variants were as fast on K10, Sandy Bridge, and
Skylake as the same benchmarks with actually aligned store pointers.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Scott Lurndal

unread,
Dec 14, 2022, 9:45:16 AM12/14/22
to
3000 responses out of 330 million americans. How representative is
that survey?

Stephen Fuld

unread,
Dec 14, 2022, 11:05:29 AM12/14/22
to
The SNOPES article shows the margin of error to be 3%. So, assuming it
was a non-biased sample, and the firm doing the survey seems to be
reputable, pretty good. Would you feel better if they said it was 53%?

But it is interesting that they were trying to get a bias, i.e. anti
Arab, and picked a question where they expected few to realize the
origin (actually the origin was Indian, but brought to the western world
by Arabs) of the numerals. As it is a specific piece of "trivia", I am
not sure how significant it is.

MitchAlsup

unread,
Dec 14, 2022, 1:46:47 PM12/14/22
to
Had your base port been from LLVM instead of GCC this likely would not
have happened.
GCC is more of a K&R (anything goes) while LLVM is more of a C-done-right
(Ada-style).
>
> Options are one of:
> Types match exactly, allow it through as-is;
> Types are "compatible", quietly coerce the type;
> Quietly convert it into a type-cast operation;
> Actually do the load.
<
What if the int was 32-bits and short* was 64-bits ? Somehow, 32-bits must
get invented prior to the store.
>
> >> <snip>
> >> BTW: I put up a vote on Twitter wanting to see what the general
> >> sentiment was on possible ways to "resolve" the encoding orthogonality
> >> issues with R32..R63:
> >> https://twitter.com/cr88192/status/1602801136590782466
> >>
> > There was a recent poll on some site and 56% of Americans do not think
> > that Arabic numerals should be taught in schools, too; and 15% don't have
> > an opinion.
> > https://www.snopes.com/fact-check/teaching-arabic-numerals/
> > We are well on the way to Idiocracy.
> "Oh those squiggles, what do they mean. Radix-10 positional arithmetic,
> what sorcery is this! We all know true numbers look like MCMXCIX!", then
> proceeds to lose their crap if someone tries to bring up zero or
> negative numbers...
<
I had a 7-th grade algebra teacher state that you cannot do Multiplication and
Division in Roman Numerals. It took but a single day to show her the fallacy of
her ways.
>
>
> Thus far, the dominant response seems to be that people are against
> having 64 GPRs. Would have preferred better prompts, but each was
> limited to 25 characters, which isn't really enough for this.
<
More than 60% of the subroutines from the LLVM front end, EMBench, and
CoreMark can be compiled <essentially> optimally with the 16 temporary
registers my ABI provides. Of the rest only 2 subroutines (from 1000+)
do any stack push/pops of temporary values (not associated with
subroutine calling), and this is without a FP RF ! and only 32 total registers.
<
I am not against 64 registers, I just don't see the "value add" of consuming
that many more instruction bits for "that fewer" instructions. That is: does
64 registers buy anything. The old data was 16->32 registers bought 15%
while 32->64 bought only 3% and may constrain the OpCode layout. So,
is it worth it:: does it buy more than it costs? .....
<
One thing I will note is that having constants universally available is like
having 3-5 more registers in your file. Sort of like how LD-Ops make the ISA
as efficient as if it were a LD-only machine with 3-6 more registers.
>
> I was more just wondering what the general sentiment was, rather than
> committing to follow with whatever option wins.
>
>
>
> But, yeah, the "least effort" option is "leave everything as-is", where:
> BGBCC does not use R32..R63 by default unless enabled via a command-line
> option (except for the 128-bit ABI).
>
> If not enabled, as far as BGBCC and the assembler are concerned, these
> registers do not exist:
> -fxgpr: Allows ASM to use these registers.
> -fxgpr_ena: Allow BGBCC to use them for register allocation.
> -fxgpr_abi: Allow ABI to use them for argument passing
> Increases the number of register arguments to 16 (*).
> But, also increases ABI's register spill space to 128 bytes.
>
> *: Increasing the number of register arguments from 8 to 16 seems to
> increase the number of function calls which fit entirely in registers
> from around 80% to around 98%.
>
Interesting data point, thanks. Any idea as to where 15 would fall ??
>
> Using R32..R63 can help slightly for things like TKRA-GL and JPEG
> decoding, but is slightly detrimental in many other cases (globally
> enabling them is slightly detrimental to performance and code density
> with the existing encoding scheme).
<
And therein lies the rub.
>
> Part of the issue is likely due to the orthogonality issue:
> Prevents cases where instructions could have been bundled;
> Forces using 64-bit Op64 encodings in many cases (as a fallback);
> ...
rub with liniment.
>
>
> This seems to be enough to offset the (arguably small) reduction in the
> number of stack spill-and-fill (spill-and-fill is bad; but 64-bit
> instruction encodings being a roadblock to the WEXifier is also bad...).
>
> By design, this would eliminate cases where Op64 encodings are needed to
> deal with XGPR (but where the op could otherwise fit into a 32-bit
> encoding).
>
>
> But, it does allow the "majority of all functions" to switch almost or
> entirely to static-assigning all of the local variables to registers.
> This mostly applies to non-leaf functions in this case; as the majority
> of leaf functions are already able to go full-static with 32 GPRs, but
> non-leaf functions can't use scratch registers for static assignment.
<
I am seeing (on different applications and not as many of them) that
90%-ish of all leaf functions are happy with 16 registers--no prologue
or epilogue (and no spills/fills because the compiler inserts prologue
if spills or fills would be required and then uses as many of the 32
registers as needed).
>
> Say, 32 GPRs:
> Leaf function: Has ~ 26 registers it can use in this case.
<
Leaf function (not using a FP) has 30 registers it can use.
<
> Non-Leaf: Has ~ 14 registers it can use in this case.
<
Non-leaf has 15 (no FP) or 14 (FP) it can preserve across subroutine
calls.
>
> With 64 GPRs:
> Leaf function: Has ~ 58 registers it can use.
> Non-Leaf: Has ~ 30 registers it can use.
>
> Partial reason being, in a leaf function, nothing is going to stomp the
> scratch registers, but a non-leaf functions need to deal with these
> registers getting stomped during function calls.
<
There is no reason a leaf subroutine cannot dump the preserved registers
to the stack and use as many as it needs. The contract is that these must
be restored before returning.

MitchAlsup

unread,
Dec 14, 2022, 1:51:12 PM12/14/22
to
On Wednesday, December 14, 2022 at 2:59:42 AM UTC-6, Terje Mathisen wrote:
> MitchAlsup wrote:
> > On Tuesday, December 13, 2022 at 7:52:10 PM UTC-6, BGB wrote:
> >> On 12/13/2022 2:53 PM, MitchAlsup wrote:
> >
> >> Then again, I guess it was interesting to observe that some types of
> >> optimizations which help on BJX2 also help on x86-64, despite x86-64
> >> having not enough registers to actually make use of the optimization as
> >> expressed (x86-64 seems to be able to pretend as-if the stack
> >> spills/fills were registers).
> >>
> > The stack on x86 is defined to always be aligned to the size of the GPRs.
> That is an OS/software convention afaik?
<
A push of a byte-sized immediate still adjusts the stack by a full GPR-sized slot.
Thus, it is partially enforced by HW !
>
> I.e. at least in real (16-bit) and 32-bit mode you _can_ misalign your
> stack pointer as long as you don't enable the check_alignment control
> word flag?
<
I am talking about x86-64, not that stuff that should have died in 1983.
But the actual HW probably has 87 different ways depending on
what mode the CPU is currently in.
>
> My mime-ascii executable text starts by pushing a word of 0000h and a
> word of 0ffffh on the stack, then incrementing the stack pointer so that
> a pop will return 00ffh.
<
Then you are on your own.
> > There was a recent poll on some site and 56% of Americans do not think
> > that Arabic numerals should be taught in schools, too; and 15% don't have
> > an opinion.
> > https://www.snopes.com/fact-check/teaching-arabic-numerals/
> > We are well on the way to Idiocracy.
> It is much worse in the US than here, but that might just be a matter of
> time?
<
You have a population that still values an education, we have a population
that values an edumacation.

MitchAlsup

unread,
Dec 14, 2022, 1:56:12 PM12/14/22
to
Exactly as representative as "any gun question asked on any media outlet"
which then is picked up by 999 of 1000 gun forums around the net, and
they, in turn send out messages to all their members asking them to
participate in the "survey" while the other 65% of people get no notice
at all and simply stumble across the survey at some kind of normal pace.
<
So, polling by the net is not representative of anything.
<
Penn and Teller did one on di-hydrogen-monoxide that is hilarious. They
got a very large majority of people to sign a petition to ban H2O.

John Levine

unread,
Dec 14, 2022, 3:27:04 PM12/14/22
to
According to MitchAlsup <Mitch...@aol.com>:
>So, polling by the net is not representative of anything.
><
>Penn and Teller did on on di-hydrogen-monoxide that is hilarious. They
>got a very large majority of people to sign a petition to ban H2O.

But that's different. Surely you are aware that DHMO has been
scientifically shown to be the major cause of drowning, is found in
many dangerous substances including Nitric Acid, and that prolonged
exposure to its solid form causes tissue damage.

Further info here: https://www.dhmo.org/

(Please note Followup-To:)

MitchAlsup

unread,
Dec 14, 2022, 4:19:56 PM12/14/22
to
On Wednesday, December 14, 2022 at 2:27:04 PM UTC-6, John Levine wrote:
> According to MitchAlsup <Mitch...@aol.com>:
> >So, polling by the net is not representative of anything.
> ><
> >Penn and Teller did on on di-hydrogen-monoxide that is hilarious. They
> >got a very large majority of people to sign a petition to ban H2O.
> But that's different. Surely you are aware that DHMO has been
> scientifically shown to be the major cause of drowning, is found in
> many dangerous substances including Nitric Acid, and that prolonged
> exposure to its solid form causes tissue damage.
<
Just imagine how safe all of us would be if we lived in a world where there
was no di-hydrogen-monoxide:: no drownings, no freezing in snow, no ...
>
> Further info here: https://www.dhmo.org/
>
> (Please note Followup-To:)
> --
> Regards,
> John Levine, jo...@taugh.com, Primary Perpetrator of "The Internet for Dummies",
> Please consider the environment before reading this e-mail. https://jl.ly
<
Oh Wait !

BGB

unread,
Dec 14, 2022, 9:02:21 PM12/14/22
to
I am using BGBCC, where:

Much of the front-end was hacked together when I was in my early 20s,
based off of a fork of my BGBScript interpreter, which was originally a
JavaScript clone; the fork had modified it to accept C syntax and use a
C like typesystem (with the original wonkiness that the ASTs in the
original BGBScript VM were implemented on top of a repurposed XML DOM
API; slightly later versions had switched over to cons-cell based
S-Expressions, *).

*: One needs to clarify this because someone else had also gone and used
the term "S-Expressions" for something not based on Lisp/Scheme or the
use of cons-cells (with only a vague superficial similarity to Lisp
style syntax).


I couldn't debug it at the time, so basically shelved it for around a
decade (apart from some use as a tool for mining stuff from C headers
for the BGBScript VM).


Later, I started working on BJX1 (back when I was still in my early 30s,
yeah...), and needed a compiler for this.

I had also looked at LCC, which would have still required writing a
backend, and it seemed like less effort to revive BGBCC as a base, than
to try to write a backend for LCC.


I had by this point already gained a bit more experience working with
code-generation and similar, mostly from continuing to work on the
BGBScript VMs (which had by this point mutated in a more Java/C# style
direction).

Well, and also tried making a "C-Aux" VM/language, which was intended to
be a sort of C/C# hybrid (but turns out "looks like C but can't compile
non-trivial C code" is "kinda lame"). Some bits of the C-Aux effort were
used as a basis for parts of my compiler middle and backend though
(noting that there wasn't that huge of a jump from a Dalvik-inspired VM to a
RISC-style ISA; just that the latter can be made to run on an FPGA).

...


Nevermind if my first choice here ended up trying to base things around
SH-4 as the base ISA, sorta lazily copying the hardware interfaces and
memory map from the SEGA Dreamcast and similar.

Where this mutated into BJX1, which was then soft-rebooted into BJX2
(mostly in an attempt to clean up the awful mess that had resulted, and
try to rework it into something that could be more viable to implement
on an FPGA).

Things might have gone differently had I started out with RISC-V rather
than SH-4 (might have been less tempted to pile on extensions in an
attempt to "make it not suck").



>>
>> Options are one of:
>> Types match exactly, allow it through as-is;
>> Types are "compatible", quietly coerce the type;
>> Quietly convert it into a type-cast operation;
>> Actually do the load.
> <
> What if the int was 32-bits and short* was 64-bits ? Somehow, 32-bits must
> get invented prior to the store.

Values in my case are normally stored in a sign-or-zero extended form.

But, moving from 'int' to 'short' normally requires sticking in a sign
extension op.

If the expression is 'magically' int, and still has bits that should
have been cut off by the store (and reload), this isn't quite right, and
can lead to bugs. In this case, one will have to sign or zero extend the
value so that it looks as if it had gotten stored to memory (and
truncated in the process), and then reloaded as the intended type (well,
and also that the type is reported as 'short' rather than 'int').
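A small, self-contained illustration of the behavior that has to be preserved
(typical two's-complement target assumed):

#include <stdio.h>

int main(void)
{
    int   v = 0x18000;   /* does not fit in 16 bits                  */
    short s = (short)v;  /* the "store" truncates to 0x8000          */
    int   r = s;         /* the "reload" sign-extends, giving -32768 */
    printf("%d\n", r);   /* prints -32768, not 98304                 */
    return 0;
}
/* If the store/reload pair is optimized away and 'v' stays in a register,
   an explicit sign-extension has to be inserted so 'r' still ends up as
   -32768 rather than 0x18000. */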

Similar also applies to storing and reloading pointers...
It was stuff breaking hard on a pointer-type mismatch which pointed out
the issue (it was using the type for the stored pointer rather than from
the array).


>>
>>>> <snip>
>>>> BTW: I put up a vote on Twitter wanting to see what the general
>>>> sentiment was on possible ways to "resolve" the encoding orthogonality
>>>> issues with R32..R63:
>>>> https://twitter.com/cr88192/status/1602801136590782466
>>>>
>>> There was a recent poll on some site and 56% of Americans do not think
>>> that Arabic numerals should be taught in schools, too; and 15% don't have
>>> an opinion.
>>> https://www.snopes.com/fact-check/teaching-arabic-numerals/
>>> We are well on the way to Idiocracy.
>> "Oh those squiggles, what do they mean. Radix-10 positional arithmetic,
>> what sorcery is this! We all know true numbers look like MCMXCIX!", then
>> proceeds to lose their crap if someone tries to bring up zero or
>> negative numbers...
> <
> I had a 7-th grade algebra teacher state that you cannot do Multiplication and
> Division in Roman Numerals. It took but a single day to show her the fallacy of
> her ways.

Could be possible, but probably isn't worthwhile.


In some ways, it would almost be nicer if everyone went over to
hexadecimal, as the rules are more consistent.


Yet, for whatever reason, there are a lot more people pushing for
duodecimal than for hexadecimal, which I don't really understand.

Like, with hexadecimal, all arithmetic can be carried straight down to
the level of boolean operators and similar if need be.


Granted, in some contexts it might be "better" if there were some
semantic difference between the hexadecimal digits A..F and the letters
A..F. I suspect the usual "math person" solution would be to use a fancy
font, and then rely on "semantically relevant information conveyed via
the choice of font."


>>
>>
>> Thus far, the dominant response seems to be that people are against
>> having 64 GPRs. Would have preferred better prompts, but each was
>> limited to 25 characters, which isn't really enough for this.
> <
> More than 60% of the subroutines from the LLVM front end, EMBench, and
> CoreMark can be compiled <essentially> optimally with the 16-temporary
> registers my ABI provides. Of the rest only 2 subroutines (from 1000+)
> do any stack push/pops of temporaries values (not associated with
> subroutine calling), and this is without a FP RF ! and only 32 total registers.
> <
> I am not against 64 registers, I just don't see the "value add" of consuming
> that many more instruction bits for "that fewer" instructions. That is: does
> 64 registers buy anything. The old data was 16->32 registers bought 15%
> while 32->64 bought only 3% and may constrain the OpCode layout. So,
> is it worth it:: does it buy more than it costs? .....
> <
> One thing I will note is that having constants universally available is like
> having 3-5 more registers in your file. Sort-of-like LD-Ops make the ISA
> as efficient as if it were a LD-only machine with 3-6 more registers.

It goes from "a majority of the leaf functions can static assign
everything" to "nearly all of the functions can static-assign everything".

In non-leaf functions, there are fewer registers for static assignment,
so a big chunk of non-leaf functions end up with only a subset of
variables being static-assigned (with the rest being dynamically assigned).


In the case of BJX2, the base encoding already burns 3 bits on the 16/32
split, so the 'XG2' encoding (as I am now calling it) effectively
reclaims these bits.

For speed optimized code, the loss of 16-bit instructions seems to add a
roughly 7% penalty (which seems pretty close to my theoretical estimate).


>>
>> I was more just wondering what the general sentiment was, rather than
>> committing to follow with whatever option wins.
>>
>>
>>
>> But, yeah, the "least effort" option is "leave everything as-is", where:
>> BGBCC does not use R32..R63 by default unless enabled via a command-line
>> option (except for the 128-bit ABI).
>>
>> If not enabled, as far as BGBCC and the assembler are concerned, these
>> registers do not exist:
>> -fxgpr: Allows ASM to use these registers.
>> -fxgpr_ena: Allow BGBCC to use them for register allocation.
>> -fxgpr_abi: Allow ABI to use them for argument passing
>> Increases the number of register arguments to 16 (*).
>> But, also increases ABI's register spill space to 128 bytes.
>>
>> *: Increasing the number of register arguments from 8 to 16 seems to
>> increase the number of function calls which fit entirely in registers
>> from around 80% to around 98%.
>>
> Interesting data point, thanks. Any idea as to where 15 would fall ??

15 function arguments? Probably right up near 16...

Correction, minor screw-up:
~ 80% turns out to have been for 4 arguments, not 8.
My initial metric was off by a factor of 2 (*).

*: The value had been previously scaled up by 2 "for reasons", need to
divide by 2 again when building stats.


So, it looks like (looking at a few different programs):
4: ~79.2%
8: ~98.6%
15: ~99.9%
16: ~99.9%
18: 100%

As noted, the BJX2 ABI is 8 arguments with 32 GPRs, and 16 arguments
with 64 GPRs.

The length of argument lists seems to follow a nearly geometric
distribution, hitting a peak at around 2 or 3 arguments, then drops off
rapidly (there are only a few random stragglers much past 12 arguments).

In both programs: the sum of 2 + 3 arguments is nearly half of the total
for all argument lists (with 0 + 1 + 4 making up another 35%).


>>
>> Using R32..R63 can help slightly for things like TKRA-GL and JPEG
>> decoding, but is slightly detrimental in many other cases (globally
>> enabling them is slightly detrimental to performance and code density
>> with the existing encoding scheme).
> <
> And therein lies the rub.

With the base ISA, it is a case of "sometimes help, sometimes hurt"; the
compiler ends up needing to figure out where to enable these registers for
best effect.


>>
>> Part of the issue is likely due to the orthogonality issue:
>> Prevents cases where instructions could have been bundled;
>> Forces using 64-bit Op64 encodings in many cases (as a fallback);
>> ...
> rub with liniment.

But, I can at least make it "slightly less bad" via 'XG2'...


Still comes with a 7% penalty in terms of code-density though.

In this case any "loss" (in terms of speed) would be more likely to be
in terms of things like making function prologs/epilogs bigger, rather
than things like using these registers forcing the compiler to fall back
to Op64 encodings (and thus being unable to shuffle or bundle these
instructions).


For example, in the baseline ISA:
MOV 0x1234, R18 //Can be encoded in 32 bits
But:
MOV 0x1234, R36 //Needs a 64-bit encoding (...)

Whereas, 'XG2' will allow both to use a 32-bit encoding.


>>
>>
>> This seems to be enough to offset the (arguably small) reduction in the
>> number of stack spill-and-fill (spill-and-fill is bad; but 64-bit
>> instruction encodings being a roadblock to the WEXifier is also bad...).
>>
>> By design, this would eliminate cases where Op64 encodings are needed to
>> deal with XGPR (but where the op could otherwise fit into a 32-bit
>> encoding).
>>
>>
>> But, it does allow the "majority of all functions" to switch almost or
>> entirely to static-assigning all of the local variables to registers.
>> This mostly applies to non-leaf functions in this case; as the majority
>> of leaf functions are already able to go full-static with 32 GPRs, but
>> non-leaf functions can't use scratch registers for static assignment.
> <
> I am seeing (on different applications and not as many of them) that
> 90%-ish of all leaf functions are happy with 16-registers--no prologue
> or epilogue (and no spills/fills because the compiler inserts prologue
> if spills or fills would be required and then uses as many of the 32
> registers as needed.).

It is roughly similar in my case...

>>
>> Say, 32 GPRs:
>> Leaf function: Has ~ 26 registers it can use in this case.
> <
> Leaf function (not using a FP) has 30 registers it can use.
> <

There are a few registers which can't really be used for register
assignment.

It is 11 if only counting scratch registers (partly because some of the
scratch registers are not allowed to hold variables, mostly for
compiler-related reasons).

As can be noted, the majority of leaf functions can already go "full
static".



>> Non-Leaf: Has ~ 14 registers it can use in this case.
> <
> Non-leaf has 15 (no FP) or 14 (FP) it can preserve across subroutine
> calls.

Similar.
The stack pointer is at R15, which cuts out one of the callee-save
registers.

So, theoretically:
R8..R14, R24..R31
R40..R47, R56..R63 (XGPR)


A smart compiler (or ASM code) can use all 15 registers here (within the
base 32).

Also, hand-written ASM can use "the power of the human mind" to figure
out how to effectively use all 16 scratch registers without getting
trapped in a corner (or noting that a few of the registers have slightly
wonky rules for how they can be used).

Like, humans can better deal with occasional non-orthogonality, whereas
the compiler has to be conservative or else it will blindly stumble into
these cases and be like "oh crap" and then die.


Well, and a human can figure out how to assign multiple non-overlapping
temporaries to the same CPU register, ...


Sometimes, it seems like it is asking a bit much even to expect the
compiled program to behave correctly with neither the program nor the
compiler crashing in the process.

...


Or, I sorta get the new ISA mode implemented, but now have to figure out
why BGBCC is generating some "obviously mangled" instructions in this case.

But, alas... I sorta expected this much...


>>
>> With 64 GPRs:
>> Leaf function: Has ~ 58 registers it can use.
>> Non-Leaf: Has ~ 30 registers it can use.
>>
>> Partial reason being, in a leaf function, nothing is going to stomp the
>> scratch registers, but a non-leaf functions need to deal with these
>> registers getting stomped during function calls.
> <
> There is no reason a leaf subroutine cannot dump the preserved registers
> to the stack and use as many as it needs. The contract is that these must
> be restored before returning.

Granted, but if one uses more than can be static assigned, this means
spill and fill.


Also, BGBCC's register allocation is kinda stupid if compared with some
other compilers.

For example, GCC will realize that two temporaries will have
non-overlapping lifespan, and can assign the same register to both
temporaries.

For "static assign everything", BGBCC needs to be able to give every
temporary its own register; thus increasing register pressure for the
"full static" case.

MitchAlsup

unread,
Dec 14, 2022, 10:54:21 PM12/14/22
to
On Wednesday, December 14, 2022 at 8:02:21 PM UTC-6, BGB wrote:
> On 12/14/2022 12:46 PM, MitchAlsup wrote:
> > On Wednesday, December 14, 2022 at 1:54:24 AM UTC-6, BGB wrote:

> So, it looks like (looking at a few different programs):
> 4: ~79.2%
> 8: ~98.6%
> 15: ~99.9%
> 16: ~99.9%
> 18: 100%
>
Thanks for the data.

Thomas Koenig

unread,
Dec 15, 2022, 2:00:21 AM12/15/22
to
MitchAlsup <Mitch...@aol.com> schrieb:
> On Tuesday, December 13, 2022 at 1:54:45 PM UTC-6, BGB wrote:
>> Or, not necessarily a NOP if one demands that "Every NOP eats exactly 1
>> clock cycle".
><
> On a 10-wide machine, does a NoOp eat exactly 1 cycle ?? and what does
> 1 cycle mean on a 10-wide machine ??

#ifdef SIDE_REMARK

Generally, I would expect a reasonable high-performance OoO
implementation to remove NOPs - they need not be scheduled.
On the other hand, for measuring performance, it might actually
be desirable to have a no-op which is not removed.

POWER actually has both. "ori 0,0,0" is a NOP which is removed without
taking up execution resources. "xori 0,0,0" will never be removed,
and will consume the resources that a normal xori would use.

If you want to define a variable-length NOP in My66000 for alignment,
you have the different formats of "ori r0, r0, #...." to play with.
You could follow the same convention as POWER and remove these
before execution. But the assembler would have to be changed to
not choose the smallest version available :-)

#endif

Of course, "taking up one cycle" is a hazy concept at best on an
OoO-machine. "Introducing one cycle of latency WRT the following
instruction" might be doable, if anybody sees a need for this.
I have doubts that it would be useful.

Thomas Koenig

unread,
Dec 15, 2022, 2:07:19 AM12/15/22
to
BGB <cr8...@gmail.com> schrieb:

(argument list)

> So, it looks like (looking at a few different programs):
> 4: ~79.2%
> 8: ~98.6%
> 15: ~99.9%
> 16: ~99.9%
> 18: 100%

Did you ever take a look at LAPACK, or CLAPACK? Those are _long_
argument lists...

For example, dgerfsx has 22 arguments (unless I miscounted) plus
two hidden character length arguments.

Terje Mathisen

unread,
Dec 15, 2022, 4:21:56 AM12/15/22
to
MitchAlsup wrote:
> On Wednesday, December 14, 2022 at 1:54:24 AM UTC-6, BGB wrote:
>> On 12/13/2022 10:19 PM, MitchAlsup wrote:
>>> There was a recent poll on some site and 56% of Americans do not think
>>> that Arabic numerals should be taught in schools, too; and 15% don't have
>>> an opinion.
>>> https://www.snopes.com/fact-check/teaching-arabic-numerals/
>>> We are well on the way to Idiocracy.
>> "Oh those squiggles, what do they mean. Radix-10 positional arithmetic,
>> what sorcery is this! We all know true numbers look like MCMXCIX!", then
>> proceeds to lose their crap if someone tries to bring up zero or
>> negative numbers...
> <
> I had a 7-th grade algebra teacher state that you cannot do Multiplication and
> Division in Roman Numerals. It took but a single day to show her the fallacy of
> her ways.

Roman is just a notation system, you can of course do any kind of math
in it just like you can do in binary, decimal or hex.

IX * VII
=
X - I +
X - I +
L - V
= LXX - VII
= LXIII

:-)

If I had to implement a program to do so, without ever converting to
binary or decimal, I would start by generating small tables, like for
multiplication:

      I    V     X    L     C   ...
I     I    V     X    L     C
V     V    XXV   L    CCL   D (?500)
X     X    L     C    D

Next would be to extract any subtraction terms and handle them separately.
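FWIW, a minimal C sketch along those lines (purely additive numerals, I/V/X
only; subtractive forms and products involving the larger symbols are left
out, and all names here are just made up for the example):

#include <stdio.h>
#include <string.h>

static const char *product(char a, char b)   /* small pairwise table */
{
    if (a > b) { char t = a; a = b; b = t; }   /* 'I' < 'V' < 'X' in ASCII */
    if (a == 'I') return (b == 'I') ? "I" : (b == 'V') ? "V" : "X";
    if (a == 'V') return (b == 'V') ? "XXV" : "L";   /* V*V=XXV, V*X=L */
    return "C";                                      /* X*X=C          */
}

static int rank(char c)       /* ordering only; never converts to a value */
{
    static const char order[] = "IVXLCDM";
    return (int)(strchr(order, c) - order);
}

static void sort_desc(char *s)          /* highest-valued symbols first */
{
    for (size_t i = 1; s[i]; i++) {
        char c = s[i];
        size_t j = i;
        while (j > 0 && rank(s[j - 1]) < rank(c)) { s[j] = s[j - 1]; j--; }
        s[j] = c;
    }
}

static int replace_once(char *s, const char *pat, const char *rep)
{
    char *p = strstr(s, pat);
    if (!p) return 0;
    size_t pl = strlen(pat), rl = strlen(rep);
    memmove(p + rl, p + pl, strlen(p + pl) + 1);
    memcpy(p, rep, rl);
    return 1;
}

static void normalise(char *s)   /* IIIII->V, VV->X, XXXXX->L, LL->C, ... */
{
    static const char *pat[] = { "IIIII", "VV", "XXXXX", "LL", "CCCCC", "DD" };
    static const char *rep[] = { "V",     "X",  "L",     "C",  "D",     "M"  };
    int changed = 1;
    while (changed) {
        changed = 0;
        for (int i = 0; i < 6; i++)
            while (replace_once(s, pat[i], rep[i]))
                changed = 1;
    }
}

static void rmul(const char *a, const char *b, char *out)
{
    out[0] = '\0';
    for (const char *p = a; *p; p++)       /* every symbol of a times    */
        for (const char *q = b; *q; q++)   /* every symbol of b          */
            strcat(out, product(*p, *q));
    sort_desc(out);      /* group equal symbols so the rewrite rules match */
    normalise(out);
}

int main(void)
{
    char res[256];
    rmul("VIIII", "VII", res);   /* 9 * 7 */
    printf("%s\n", res);         /* prints LXIII (63) */
    return 0;
}

For the VIIII * VII case this concatenates to XXVVVVVVVIIIIIIII after sorting,
which the rewrite rules then collapse to LXIII.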

Anton Ertl

unread,
Dec 15, 2022, 5:28:56 AM12/15/22
to
MitchAlsup <Mitch...@aol.com> writes:
>Had your base port been from LLVM instead of GCC this likely would not
>have happened.
>GCC is more of a K&R (anything goes) while LLVM is more of a C-done-right
>(ADA-style)

On what evidence do you base this claim?

And why would "C-done-right" be "ADA-style"? My take on C done right
is that a compiler maintains backwards compatibility with earlier
versions, while others think that C done right punishes the sinners
who write code that performs undefined behaviour as much as possible
by compiling it to unintended code (best even if there is no
performance advantage to that). I don't have Ada experience, but the
general image is compile-time errors and run-time errors; I would not
expect the latter from GCC or Clang.

What I have experienced from Clang is that a common bit-rotation idiom
is compiled to 0 for rot(x,0) (for variable x) without compile-time or
run-time warning or error, because it exercises undefined behaviour
for a rotation count of 0, while gcc produced the intended behaviour.
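For reference, the idiom and its usual UB-free rewrite look roughly like this
(a sketch, not necessarily the exact code in question):

#include <stdint.h>

/* The classic rotate-left idiom: undefined behaviour for n == 0, because
   x >> 32 is an out-of-range shift on a 32-bit type. */
static uint32_t rot_naive(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}

/* The usual UB-free form: masking keeps both shift counts in range, and
   most compilers recognise it as a single rotate instruction. */
static uint32_t rot_safe(uint32_t x, unsigned n)
{
    return (x << (n & 31)) | (x >> ((32 - n) & 31));
}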

>I had a 7-th grade algebra teacher state that you cannot do Multiplication and
>Division in Roman Numerals. It took but a single day to show her the fallacy of
>her ways.

Not bad. I heard that in the middle ages being able to divide earned
you a doctoral degree, because dividing numbers written in roman
numerals is so hard.

David Brown

unread,
Dec 15, 2022, 5:54:02 AM12/15/22
to
You could also switch to Roman-style Roman numerals, rather than
classical European style Roman numerals. The Romans wrote 9 as "VIIII",
not "IX", and only very occasionally used abbreviated forms. Drop the
abbreviations, and you've basically got an alternating 5,2 radix number
system.

Tim Rentsch

unread,
Dec 15, 2022, 9:57:23 AM12/15/22
to
MitchAlsup <Mitch...@aol.com> writes:

[...]

> There was a recent poll on some site and 56% of Americans do not
> think that Arabic numerals should be taught in schools, too; and
> 15% don't have an opinion.
> https://www.snopes.com/fact-check/teaching-arabic-numerals/
> We are well on the way to Idiocracy.

That isn't idiocy; it's illiteracy. Those aren't the same thing.
Don't make the mistake of thinking someone is stupid when all they
are is ignorant.

Anton Ertl

unread,
Dec 15, 2022, 9:57:37 AM12/15/22
to
MitchAlsup <Mitch...@aol.com> writes:
>On Tuesday, December 13, 2022 at 1:54:45 PM UTC-6, BGB wrote:
>> It is a NOP in terms of "high level behavior" or "How EX stages see it".
>>
>> Or, not necessarily a NOP if one demands that "Every NOP eats exactly 1
>> clock cycle".

The NOPs I know make no such guarantees. And the typical use (code
padding) is best served if the NOPs are as cheap as possible.

>On a 10-wide machine, does a NoOp eat exactly 1 cycle ??

Why should it?

>and what does
>1 cycle mean on a 10-wide machine ??

It's a time unit that does not depend on the width of the machine.

On a modern OoO machine the front end tends to deal with NOPs, and the
front end usually runs far ahead of the execution engine (and latency
(combined with OoO limits) or resource limitations in the execution
engine are often the bottleneck), so in the usual case a NOP will cost
nothing. A NOP after a branch misprediction may cost, however.

If you want a delay-1-cycle instruction, this should be possible even
in an OoO machine: when that instruction is processed, the clock is
not gated through to the rest of the execution engine for one cycle.
Of course, with an OoO engine, you cannot really predict which
instructions are affected by the additional cycle. The question is:
Why would anyone use such an instruction?

Terje Mathisen

unread,
Dec 15, 2022, 10:32:33 AM12/15/22
to
I was going to agree with you, but realized that "willful ignorance"
cannot be as easily (if ever?) forgiven.

"I just prefer to believe whatever I want, and will actively ignore any
attempts to learn/study anything which could disagree with those beliefs."

BGB

unread,
Dec 15, 2022, 12:03:17 PM12/15/22
to
On 12/15/2022 4:13 AM, Anton Ertl wrote:
> MitchAlsup <Mitch...@aol.com> writes:
>> Had your base port been from LLVM instead of GCC this likely would not
>> have happened.
>> GCC is more of a K&R (anything goes) while LLVM is more of a C-done-right
>> (ADA-style)
>
> On what evidence do you base this claim?
>
> And why would "C-done-right" be "ADA-style"? My take on C done right
> is that a compiler maintains backwards compatibility with earlier
> versions, while others think that C done right punishes the sinners
> who write code that performs undefined behaviour as much as possible
> by compiling it to unintended code (best even if there is no
> performance advantage to that). I don't have Ada experience, but the
> general image is compile-time errors and run-time errors; I would not
> expect the latter from GCC or Clang.
>
> What I have experienced from Clang is that a common bit-rotation idiom
> is compiled to 0 for rot(x,0) (for variable x) without compile-time or
> run-time warning or error, because it exercises undefined behaviour
> for a rotation count of 0, while gcc produced the intended behaviour.
>

Yeah. Both GCC and LLVM are playing fast and loose with C semantics.
For BGBCC, I was more trying to follow after MSVC in some areas, tending
to be "a little more conservative".


Though, there are cases where there are bugs, and some tension between
trying to generate efficient code, and keeping traditional coding
practices working.

Also, some of the ways one needs to write code to make it "actually
work" with GCC, will tend to perform poorly with BGBCC.

...


>> I had a 7-th grade algebra teacher state that you cannot do Multiplication and
>> Division in Roman Numerals. It took but a single day to show her the fallacy of
>> her ways.
>
> Not bad. I heard that in the middle ages being able to divide earned
> you a doctoral degree, because dividing numbers written in roman
> numerals is so hard.
>

Well, also apparently math in general was treated as arcane secret
knowledge to be passed on from masters to pupils.

Apparently, their whole system started to break apart (and with some
conflicts) when some people started printing and distributing
information about the subject via the printing press.

Well, that along with things like commoners being able to get their own
copies of the Bible, rather than it being kept as something that only
priests and similar were allowed to read.

All of this sort of disrupted everything in a world before the powers that be
could issue DMCA takedowns or similar against anything they don't
approve of.

...


> - anton

David Brown

unread,
Dec 15, 2022, 2:21:26 PM12/15/22
to
On 15/12/2022 16:32, Terje Mathisen wrote:
> Tim Rentsch wrote:
>> MitchAlsup <Mitch...@aol.com> writes:
>>
>> [...]
>>
>>> There was a recent poll on some site and 56% of Americans do not
>>> think that Arabic numerals should be taught in schools, too;  and
>>> 15% don't have an opinion.
>>> https://www.snopes.com/fact-check/teaching-arabic-numerals/
>>> We are well on the way to Idiocracy.
>>
>> That isn't idiocy;  it's illiteracy.  Those aren't the same thing.
>> Don't make the mistake of thinking someone is stupid when all they
>> are is ignorant.
>
> I was going to agree with you, but realized that "willful ignorance"
> cannot be as easily (if ever?) forgiven.
>
> "I just prefer to believe whatever I want, and will actively ignore any
> attempts to learn/study anything which could disagree with those beliefs."
>

There is also a difference between being ignorant, and ignoring your
ignorance.

If someone asks you "Do you support teaching of Arabic numerals?", or
"Should schools teach the creation theories of the Catholic priest
Georges Lemaître?", and you don't know what these mean, you should ask
for clarification or say "I don't know". You should not hold opinions
on things you know nothing about.

I do not imagine anyone who answered "no" to teaching Arabic numerals
knew what they were. (Maybe some knew what Lemaître's theories were,
and still didn't want them taught, but I expect very few people knew
what they were.)

I am not sure what the right term is for people who hold definite
opinions on things they know nothing about. I don't think "idiocy",
"ignorance" or "illiteracy" are right. Clearly bigotry is part of the
picture, and there is a definite lack of critical thinking and
self-awareness.



MitchAlsup

unread,
Dec 15, 2022, 2:30:11 PM12/15/22
to
On Thursday, December 15, 2022 at 4:28:56 AM UTC-6, Anton Ertl wrote:
> MitchAlsup <Mitch...@aol.com> writes:
> >Had your base port been from LLVM instead of GCC this likely would not
> >have happened.
> >GCC is more of a K&R (anything goes) while LLVM is more of a C-done-right
> >(ADA-style)
> On what evidence do you base this claim?
>
> And why would "C-done-right" be "ADA-style"? My take on C done right
> is that a compiler maintains backwards compatibility with earlier
> versions, while others think that C done right punishes the sinners
> who write code that performs undefined behaviour as much as possible
> by compiling it to unintended code (best even if there is no
> performance advantage to that). I don't have Ada experience, but the
> general image is compile-time errors and run-time errors; I would not
> expect the latter from GCC or Clang.
<
Me, I can see both sides of the issue:
a) side 1 wants the fastest code
b) side 2 wants the most compatible code
<
LLVM goes out of its way to prevent a variable assigned to a register from
ever containing a value that its memory container could not hold. Thus,
if a register contained an unknown value that arrived as an argument to
a subroutine, and you add 1 to it, LLVM will generate code to "smash" the
value so that the value-range of the defined type is not exceeded.
<
Depending on the architecture, this smash can use Rn&(~(~0<<k)) for
unsigned and ((Rn<<(64-(1<<k)))>>(64-(1<<k))) for signed (or, if available,
various sign/zero extension OpCodes).
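In C terms, with k taken here to mean the container width in bits, the smash
amounts to something like this (a sketch; typical two's-complement target
assumed, not LLVM's actual code):

#include <stdint.h>

/* Re-narrowing a 64-bit register value to a k-bit container, 0 < k < 64;
   the signed case assumes the usual arithmetic right shift. */
static uint64_t smash_unsigned(uint64_t r, unsigned k)
{
    return r & ~(~(uint64_t)0 << k);              /* clear bits k..63  */
}

static int64_t smash_signed(uint64_t r, unsigned k)
{
    return (int64_t)(r << (64 - k)) >> (64 - k);  /* replicate bit k-1 */
}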
<
This comes with performance disadvantages, but comes with cross
calling advantages where the recipient is not able to deal with out of
container sized values.
<
It is annoying that LLVM puts a smash on both caller and callee sides
of argument passing and value returning.
<
K&R C allows for variables {"register char a;"} to have values
which they could never contain if not allocated to a register.
>
> What I have experienced from Clang is that a common bit-rotation idiom
> is compiled to 0 for rot(x,0) (for variable x) without compile-time or
> run-time warning or error, because it exercises undefined behaviour
> for a rotation count of 0, while gcc produced the intended behaviour.
<
I have not found this to be the case from the code coming out of Brian's
compiler.
<
Clang's internal VM defines rot(x,y) as a funnel shift of ({x,x} >> y): an
input of double width, with both halves containing the same bit pattern,
shifted by the specified amount. And there is likely some restriction
on the value space of y (like: y <= 8×sizeof(x)).
<
Even machines without a ROT instruction find this definition quite
useable. But I am speaking with light knowledge of this as Brian
just fixed this up a couple of weeks ago, getting rid of 30% of the
instructions in sha-256 in EMBench.
<
> >I had a 7-th grade algebra teacher state that you cannot do Multiplication and
> >Division in Roman Numerals. It took but a single day to show her the fallacy of
> >her ways.
> Not bad. I heard that in the middle ages being able to divide earned
> you a doctoral degree, because dividing numbers written in roman
> numerals is so hard.
<
Realistically, Roman numerals are base {2,5} and if you understand how to
do division base {10} by hand digit by digit, doing it by base {2,5} is just a
few sequences you have to know.
<
Later, I even got SQRTs to work by hand.

BGB

unread,
Dec 15, 2022, 2:35:44 PM12/15/22
to
None of my code has quite that many.
The longest argument list currently is in my GLQuake port, mostly due to
an 18 argument function being used in TKRA-GL.

But, I guess one could debate if there are many codebases which deviate
strongly from this, having a lot more long argument lists than short ones.

Between Doom, ROTT, Quake, etc, specifics vary slightly, but the overall
pattern remains pretty similar.

So, it seems like:
8 will be sufficient "most of the time";
16 is slightly better;
24 or 32 would be needed to entirely eliminate the existence of stack
arguments.


It is similar to the issue that, say, most functions are fine with a
relatively small limit of local variables, but this doesn't mean one
won't occasionally run into functions with 200 or more local variables...

In most cases, it is preferable to have more registers than one has
local variables for "the vast majority case".


With 32 GPRs and BGBCC, this is generally true of leaf functions, but
less true of non-leaf functions (which are mostly limited to only using
callee-save registers for statically-assigned local variables)

Well, and BGBCC's dynamic allocation scheme is basically:
Load into a register from the stack;
Spill at the end of the basic block.

Which tends to be a bit less efficient than being able to statically
assign a variable to a register.

Temporary assignment of variables across basic-block boundaries would be
better, but would require mechanisms which I haven't had much success
in making work.



Also, sort of gradually making progress getting the new ISA mode
working. It is in a "sorta working, but still highly unstable" state.

Have noted that, unlike with the baseline ISA variant, enabling XGPR
causes binaries to shrink in this case (rather than expand).


But, without XGPR enabled, XG2 is effectively equivalent to Fix32.
Still TBD what happens with the extra bits in instructions which lack
the relevant register fields. For now they will be required to be set to
1 (in some cases, they could potentially serve as additional opcode bits).

For now, only the original Jumbo prefix forms are allowed.


Some instability at present exists with running programs via the
TestKern shell, where the shell/kernel runs in the original ISA mode,
but then the program is running in XG2 mode.

Though, in some sense, this is also working as a test case for the
mechanisms needed to run RISC-V code on BJX2 (since both XG2 mode and
RISC-V mode are built on the same basic mechanism).

Also, in this case, trying to jump (incorrectly) from one mode to
another will cause stuff to crash relatively quickly due to encoding
differences.

Did find some crashes in my emulator due to uninitialized variables and
similar; it seems MSVC has a similar limitation in where it can detect and
warn about uninitialized variables as BGBCC does:
Both seem to fail to detect or warn about variables that are used which
are initialized on one code path but not on another (whereas GCC and
Clang seem to detect and warn about uninitialized variables along
multiple paths).

Sorta makes me wonder if MSVC is also using a naive "warn if a local
variable is read before its first assignment is encountered" algorithm.

...

Stephen Fuld

unread,
Dec 15, 2022, 2:44:07 PM12/15/22
to
On 12/15/2022 11:35 AM, BGB wrote:
> On 12/15/2022 1:07 AM, Thomas Koenig wrote:
>> BGB <cr8...@gmail.com> schrieb:
>>
>> (argument list)
>>
>>> So, it looks like (looking at a few different programs):
>>>     4: ~79.2%
>>>     8: ~98.6%
>>>    15: ~99.9%
>>>    16: ~99.9%
>>>    18: 100%
>>
>> Did you ever take a look at LAPACK, or CLAPACK?  Those are _long_
>> argument lists...
>>
>> For example, dgerfsx has 22 arguments (unless I miscounted) plus
>> two hidden character length arguments.
>
> None of my code has quite that many.
> The longest argument list currently is in my GLQuake port, mostly due to
> an 18 argument function being used in TKRA-GL.
>
> But, I guess one could debate if there are many codebases which deviate
> strongly from this, having a lot more long argument lists than short ones.
>
> Between Doom, ROTT, Quake, etc, specifics vary slightly, but the overall
> pattern remains pretty similar.

This reminds me of the floating point constants discussion we had
recently. Your sample seems to be at least mostly games, whereas Thomas
was talking about math/science libraries. To me, it is not unreasonable
that they two code bases have very different profiles. For a general
purpose CPU, I think it is important to get as wide a coverage of the
program space as is reasonable.

MitchAlsup

unread,
Dec 15, 2022, 3:51:13 PM12/15/22
to
In the early 2000s, I had the simulation team (x86 K9 simulator) convert
from a number of arguments into a pointer to a data structure for
argument passing. This sped things up markedly. 4,5,6,7 argument
functions became 1 and 2 argument functions--and at the same time
the simulator went from being able to support 1 CPU to as many as
we wanted to run that day........
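Roughly this kind of transformation (the names here are invented, not the
actual simulator code):

/* Before: half a dozen scalar arguments per call. */
void step_cpu(int cpu_id, long pc, long *regs, int mode, int trace, long cycles);

/* After: one pointer to a per-CPU state block. */
struct cpu_state {
    int   cpu_id;
    long  pc;
    long *regs;
    int   mode;
    int   trace;
    long  cycles;
};
void step_cpu2(struct cpu_state *s);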

BGB

unread,
Dec 15, 2022, 4:42:42 PM12/15/22
to
Yeah. Most of my ported software is games. Mostly ID Software, Raven,
and Apogee; as these guys released source, whereas pretty much everyone
else did not.


Trying to port Linux or Unix had generally looked like a much bigger
undertaking.

Generally easier to port things with few external dependencies, and
which don't make too many assumptions about the underlying platform or
build system (say, stuff which uses a lot of 3rd party libraries and
assumes using GNU autoconf and similar is basically no-go at present).

Though, I guess one possible option could be to try to beat BGBCC into a
form that autotools can accept, or possibly make a "new" C compiler that
can better mimic GCC's interface and behavior (say, so that autotools
can use it as a cross-compiler).

Eg:
bjx2-pel-gcc
bjx2-pel-as
bjx2-pel-ld
...

TBD if it keeps using RIL (as a trivially modified BGBCC; probably doing
all the codegen in 'ld' and faking the different tools depending on how
it is called), or if I should go and implement a "proper" assembler and
linker (probably using COFF objects). Would assume probably sticking
with PEL (PE/COFF + LZ, *), rather than go and try to figure out how to
effectively map BJX2 onto ELF (things like a GOT/PLT are not really a
thing ATM).

*: My loaders will still generally accept non LZ compressed PE/COFF as
well though (with or without the MZ header in the uncompressed case;
BGBCC emits a variant which omits the MZ header; and the MZ header is
always omitted for PEL, as it served no real purpose in my case).


I guess technically there is also a.out ... but ... no.

Some types of optimizations that BGBCC does would not be possible with
(actual) separate compilation and linking though. But, there are
possible merits to having these as well.

AFAIK, LLVM can mimic GCC's interface here, just using bitcode files
rather than bare COFF or ELF objects.

Still seems like it would be an uphill battle to try to port a userland
or libraries though (either with an actual Unix or Linux kernel, or
TestKern pretending to be a Unix-style kernel).

...


Anton Ertl

unread,
Dec 16, 2022, 8:00:15 AM12/16/22
to
MitchAlsup <Mitch...@aol.com> writes:
>LLVM goes out of its way to prevent a variable assigned to a register from
>ever containing a value that its memory container count not hold. Thus,
>if a register contained an unknown value that arrived as an argument to
>a subroutine, and y0u add 1 to it, LLVM will generate code to "smash" the
>value range so that the value-range of the defined type in to exceeded.

I wanted to check this claim, so I tried to recreate an example where
recent C compilers eliminate a sign extension by assuming that
overflow does not happen. The example is:

long foo(long a[], int n1, int n2)
{
int i;
long r=0;
for (i=n1; i!=n2; i++)
r += a[i];
return r;
}

I compiled with -O -Wall, with clang 11.0.1 or gcc 10.2.1 and without
or with -fwrapv, and decompiled the result:

for i in clang gcc; do for j in "" -fwrapv; do echo $i $j; $i -O -Wall $j -c xxx.c; objdump -d xxx.o; done; done

What I found is the following variants (loop only):

clang                        gcc
add    (%rcx,%rsi,8),%rax    add    (%rax),%rdx
add    $0x1,%rsi             add    $0x8,%rax
cmp    %esi,%edx             cmp    %rcx,%rax
jne    20 <foo+0x20>         jne    1d <foo+0x1d>

clang -fwrapv                gcc -fwrapv
movslq %esi,%rsi             movslq %esi,%rdx
add    (%rdi,%rsi,8),%rax    add    (%rdi,%rdx,8),%rax
add    $0x1,%esi             add    $0x1,%esi
cmp    %esi,%edx             cmp    %esi,%ecx
jne    10 <foo+0x10>         jne    b <foo+0xb>

So we see that without -fwrapv, clang does not prevent i/%rsi from
holding a value outside the int range; instead, it just assumes that
this does not happen. You need to add -fwrapv, and then both
compilers perform a 32-bit add and, in preparation for the 64-bit add
in the effective-address computation, a sign extension (because the
architecture zero-extends in the 32-bit add).

>It is annoying that LLVN puts a smash on both caller and callee sides
>of argument passing and value returning.

So that would predict that, for

extern int flip(int n);

int flop(int n)
{
return flip(n);
}

clang would sign-extend n before passing it on to flip(), and
sign-extend the result of flip() before passing it on. What I see
with the same parameters as above is:

clang                        gcc
push   %rax                  sub    $0x8,%rsp
callq  6 <flop+0x6>          callq  9 <flop+0x9>
pop    %rcx                  add    $0x8,%rsp
retq                         retq

So, no sign or zero extension by clang. The ABI should specify
whether the caller must pass an int in zero-extended or sign-extended
form or whether it is allowed to pass it in garbage-extended form; the
callee then may want to sign-extend the value (zero-extension does not
appear useful for int).

BTW, this all is fallout from making ints shorter than machine words.

Thomas Koenig

unread,
Dec 16, 2022, 10:56:32 AM12/16/22
to
BGB <cr8...@gmail.com> schrieb:

> Then again, I guess it was interesting to observe that some types of
> optimizations which help on BJX2 also help on x86-64, despite x86-64
> having not enough registers to actually make use of the optimization as
> expressed (x86-64 seems to be able to pretend as-if the stack
> spills/fills were registers).

This is as old as optimizing compilers. The very first FORTRAN
compiler assumed that there was an infinite amount of index
registers, which it then later on assigned to the three that were
actually available.

> Meanwhile, I am left needing to use lots of registers, and occasional
> optimizations which try to side-step a few (store then load) cases, then
> recently needing to fix a bug where one of these optimizations broke due
> to a type-handling issue in my compiler (was using the type of the value
> stored to an array, rather than the type of the value as loaded from the
> array).

Did you actually look at optimizing register allocation algorithms?
http://web.cecs.pdx.edu/~mperkows/temp/register-allocation.pdf is one
of the first search results that Google (well, Startpage) returns.
You should really look into graph coloring.

Thomas Koenig

unread,
Dec 16, 2022, 11:00:03 AM12/16/22
to
BGB <cr8...@gmail.com> schrieb:
You don't have experience with other people using your compiler,
I presume.

If somebody puts an artificial limitation like that into a widely
used ABI or compiler, then somebody out there will run across that
limitation and post a bug report.

Some people fuzz compilers as a hobby.

Scott Lurndal

unread,
Dec 16, 2022, 11:01:33 AM12/16/22
to
I'd recommend starting with the Dragon book, which discusses register
allocation fairly extensively, in multiple contexts.

Thomas Koenig

unread,
Dec 16, 2022, 11:01:53 AM12/16/22
to
BGB <cr8...@gmail.com> schrieb:

> Though, I guess one possible option could be to try to beat BGBCC into a
> form that autotools can accept, or possibly make a "new" C compiler that
> can better mimic GCC's interface and behavior (say, so that autotools
> can use it as a cross-compiler).

A better bet would be to write a specific back end for gcc or LLVM.
There are people in this group who have done this (Marcus and
Brian, respectively).

BGB

unread,
Dec 16, 2022, 11:23:00 AM12/16/22
to
There is no hard limit on the number of arguments in BGBCC.


More, it is a case of: avoiding stack arguments (entirely, if possible)
is better for performance on a strictly in-order machine.

As for the ABI, it is sort of like:
First 8 (or 16) arguments are passed in registers;
Any remaining arguments are passed on the stack.

There is also a spill space for the register arguments (in the 64-bit
ABI), with the remaining arguments being after the spill space.

Though, other ABI variants exist where there is no spill space (32-bit
ABI doesn't use it, as the "no spill space" case is a little better for
size optimization).


Granted, there are still a lot of bugs.

Debugging random issues in BGBCC and similar has been probably where a
bulk of the effort in this project has ended up going...


MitchAlsup

unread,
Dec 16, 2022, 11:43:22 AM12/16/22
to
In My 66000 ABI there is a limit of 8 register passed arguments (and results)
the rest are passed on the stack. An unconcerned programmer does not
need to even see that there is a limit. The performance aware programmer
will see higher performance with 8 or fewer arguments.
>
> More, it is a case of avoiding stack arguments (entirely, if-possible)
> is better for performance in a strictly in-order machine.
>
> As for the ABI, it is sort of like:
> First 8 (or 16) arguments are passed in registers;
> Any remaining arguments are passed on the stack.
>
> There is also a spill space for the register arguments (in the 64-bit
> ABI), with the remaining arguments being after the spill space.
<
I worked out positions for the register and memory based arguments
such that the stack has "no excess space allocated" and "you can concatenate"
the register and memory spaces using ENTER--and indeed, this is how
varargs works.

BGB

unread,
Dec 16, 2022, 12:34:01 PM12/16/22
to
The number of register arguments is limited by the ABI mode in question.
The total maximum number of arguments is unbounded.


I haven't yet settled on which ABI I will make the official ABI for XG2
mode.


While arguably "more register arguments is better", "more spill space is
not better". Burning 256 bytes of the stack (rather than 128 bytes) for
every non-leaf function seems to offset any gains from passing more
arguments in registers (since, as noted, more than 8 is uncommon; at
least for code using 64-bit pointers).

This almost makes a stronger case for not providing any spill space.


>>
>> More, it is a case of avoiding stack arguments (entirely, if-possible)
>> is better for performance in a strictly in-order machine.
>>
>> As for the ABI, it is sort of like:
>> First 8 (or 16) arguments are passed in registers;
>> Any remaining arguments are passed on the stack.
>>
>> There is also a spill space for the register arguments (in the 64-bit
>> ABI), with the remaining arguments being after the spill space.
> <
> I worked out positions for the register and memory based arguments
> such that the stack has "no excess space allocated" and "you can concatenate"
> the register and memory spaces using ENTER--and indeed, this is how
> varargs works.

I wanted to go over to a "you can naively spill the registers to the
stack and get back to a linear argument list like on x86" model.

The original ABI did not do this, so the "va_list" structure is more of
a snapshot of all the argument registers followed by a pointer to the
first non-register argument (as was needed for the original ABI).
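Roughly, the original layout looks something like this (a sketch with
invented names, not the actual BGBCC declarations):

/* Register-snapshot style va_list: the prologue dumps the argument
   registers into reg_save[], and anything past them is read through
   stack_args.  In the "naive spill" model the registers are instead
   stored directly in front of the stack arguments, so a single pointer
   (plus an offset) can walk the whole list. */
typedef struct {
    unsigned long long reg_save[8];  /* copies of the argument registers */
    int                next_arg;     /* index of the next register slot  */
    char              *stack_args;   /* first non-register argument      */
} my_va_list;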


As noted, the design for my ABI was somewhat influenced by the Windows
X64 ABI (well, and also the Windows CE SH-4 ABI, from which it was also
partially derived).

There was also the Hitachi ABI, and GCC's SH-4 ABI, where ironically all
were "similar but different", as even with the same ISA, seemingly no
two parties can agree as to what exactly the C ABI rules should be...


Would hope it would stay more consistent in my case, but seemingly I
can't agree with myself as to what exactly the C ABI rules should be in
some cases (hence why, annoyingly, the ABI has fragmented into
sub-variants).

MitchAlsup

unread,
Dec 16, 2022, 3:28:37 PM12/16/22
to
The amount of excess space on My 66000 stack is 0 bytes. The amount of
space in local_data_area is "whatever the compiler wants". The compiler
allocates 0 bytes that are not used.
<
As to "more is always better"; I am not so sure. With 8 registers used to pass
arguments, and 16 (including FP and SP) preserved registers, there are 8
freely used temporaries for leaf functions to "do stuff" without overhead.
If you increased the register arguments to 16 without addressing the free
temporaries, you end up having to save/restore some registers in order
to "do stuff", and based on your statistics, these account for only 1.4%
of subroutines.
<
>
> This almost makes a stronger case for not providing any spill space.
<
Which I do not.
> >>
> >> More, it is a case of avoiding stack arguments (entirely, if-possible)
> >> is better for performance in a strictly in-order machine.
> >>
> >> As for the ABI, it is sort of like:
> >> First 8 (or 16) arguments are passed in registers;
> >> Any remaining arguments are passed on the stack.
> >>
> >> There is also a spill space for the register arguments (in the 64-bit
> >> ABI), with the remaining arguments being after the spill space.
> > <
> > I worked out positions for the register and memory based arguments
> > such that the stack has "no excess space allocated" and "you can concatenate"
> > the register and memory spaces using ENTER--and indeed, this is how
> > varargs works.
<
> I wanted to go over to a "you can naively spill the registers to the
> stack and get back to a linear argument list like on x86" model.
<
While you can't do it naïvely, there is a straightforward way to achieve
this goal of a vector of arguments for varargs--and in My 66000 you
can achieve this at the cost of 0 extra instructions !
>
> The original ABI did not do this, so the "va_list" structure is more of
> a snapshot of all the argument registers followed by a pointer to the
> first non-register argument (as was needed for the original ABI).
>
It is even worse with multiple register files.
>
> As noted, the design for my ABI was somewhat influenced by the Windows
> X64 ABI (well, and also the Windows CE SH-4 ABI, from which it was also
> partially derived).
>
> There was also the Hitachi ABI, and GCC's SH-4 ABI, where ironically all
> were "similar but different", as even with the same ISA, seemingly no
> two parties can agree as to what exactly the C ABI rules should be...
>
>
> Would hope it would stay more consistent in my case, but seemingly I
> can't agree with myself as to what exactly the C ABI rules should be in
> some cases (hence why, annoyingly, the ABI has fragmented into
> sub-variants).
<
Auld lang syne ?

BGB

unread,
Dec 16, 2022, 9:22:22 PM12/16/22
to
Probably true.

Though, should have been 128 vs 64, not 256 vs 128.


It would appear that 8 may be a better option than 16, even with 64 GPRs
(and ~ 30 usable scratch registers).

So, avoiding a small extra cost for 1.5% of functions (those needing to use
stack arguments past the first 8) may not be worth a relatively small cost
being paid by every other function (mostly in terms of bigger stack
frames reducing the effectiveness of the 9-bit load/store displacement).

This is since addressing is relative to the bottom of the stack, and
thus has to address over the callee's spill space and similar.


If the stack frame exceeds the range of a 9-bit displacement, this also
means a penalty for accessing any variables stored in the spill space,
since these may now require using a bigger displacement to cross the
stack frame (vs had they been stored along with the other local
variables and similar, in my case typically closer to the bottom of the
stack frame).

>>
>> This almost makes a stronger case for not providing any spill space.
> <
> Which I do not.

It varies a bit.

There seem to be two different camps for ABI design:

SysV style ABIs:
Typically no spill space;
Structures are passed by copying them in registers or on the stack;
(They often decompose structures into registers, ...)
...

Microsoft Style:
Typically provide a spill space,
usually equal to the number of register arguments;
Typically pass and return structures by reference;
...


I had noted that the official RISC-V ABI seems to follow SysV patterns.

MS seems to list a Machine ID for RISC-V in PE/COFF, but no ABI spec to
be found, so it is uncertain if they adopted the RISC-V ABI as-is or
have their own.

I have found exceptions to both patterns though:
SysV MIPS ABI has a spill space for arguments;
The MS ARMv8 ABI seems to follow SysV patterns;
( I am guessing they adopted ARM's ABI as-is?... )
...



I can note that the BJX2 ABI mostly follows patterns similar to the
Microsoft style ABIs.

Eg:
Spill space (in 64b/128b variants);
Structures passed by reference (if > 16B);
Struct return by copying to caller-supplied pointer (if > 16B);
...


>>>>
>>>> More, it is a case of avoiding stack arguments (entirely, if-possible)
>>>> is better for performance in a strictly in-order machine.
>>>>
>>>> As for the ABI, it is sort of like:
>>>> First 8 (or 16) arguments are passed in registers;
>>>> Any remaining arguments are passed on the stack.
>>>>
>>>> There is also a spill space for the register arguments (in the 64-bit
>>>> ABI), with the remaining arguments being after the spill space.
>>> <
>>> I worked out positions for the register and memory based arguments
>>> such that the stack has "no excess space allocated" and "you can concatenate"
>>> the register and memory spaces using ENTER--and indeed, this is how
>>> varargs works.
> <
>> I wanted to go over to a "you can naively spill the registers to the
>> stack and get back to a linear argument list like on x86" model.
> <
> While you can't do it naïvely, there is a straightforward way to achieve
> this goal of a vector of arguments for varargs--and in My 66000 you
> can achieve this at the cost of 0 extra instructions !

In the 8-arg case, it can be done with 4 MOV.X instructions.


>>
>> The original ABI did not do this, so the "va_list" structure is more of
>> a snapshot of all the argument registers followed by a pointer to the
>> first non-register argument (as was needed for the original ABI).
>>
> It is even worse with multiple register files.

Yeah.


>>
>> As noted, the design for my ABI was somewhat influenced by the Windows
>> X64 ABI (well, and also the Windows CE SH-4 ABI, from which it was also
>> partially derived).
>>
>> There was also the Hitachi ABI, and GCC's SH-4 ABI, where ironically all
>> were "similar but different", as even with the same ISA, seemingly no
>> two parties can agree as to what exactly the C ABI rules should be...
>>
>>
>> Would hope it would stay more consistent in my case, but seemingly I
>> can't agree with myself as to what exactly the C ABI rules should be in
>> some cases (hence why, annoyingly, the ABI has fragmented into
>> sub-variants).
> <
> Auld lang syne ?

Had to look this up.
But, I don't know.


It seems though that in many cases, things are built around things that
happened in the past, either their own lives, or by interpreting various
texts and deriving rules from them, looking to various sources for
precedent on which to base their decisions, ...


But, when there is a line of development, then some patterns are likely
to exist here. Like, if one tries to design some of their tech initially
by trying to imitate stuff from, say, "Windows CE on a Sega Dreamcast",
then things will tend to follow patterns.

If one takes the ISA, and at first doubles, and then later doubles
again, the number of CPU registers, the patterns may still hold.

Say:
R4..R7
Became:
R4..R7, R20..R23
Becomes:
R4..R7, R20..R23, R36..R39, R52..R55

But, the stack pointer still remains awkwardly stuck at R15, as to move
it basically causes everything else to fall apart.


Much the same as the format for binaries having been based on PE/COFF, ...



Otherwise, it is partly a local optimization question.


Doesn't really seem like there is much information for the "why" of
various design choices, and in my case I just sorta keep beating on
stuff looking for a local optimum.


Some choices seem to be a benefit in some cases at the expense of others
(such as the optimal number of registers, or whether it is better to
have or not have a dedicated register spill space for arguments, ...).


It is possible I might have designed stuff differently if it had been
"blue sky" rather than "imitate something, and then proceed to beat the
design almost beyond recognition..." (just with the occasional vestiges
of the thing it once was).


MitchAlsup

unread,
Dec 16, 2022, 9:58:14 PM12/16/22
to
On Friday, December 16, 2022 at 8:22:22 PM UTC-6, BGB wrote:
> On 12/16/2022 2:28 PM, MitchAlsup wrote:
>> There seem to be two different camps for ABI design:
>
> SysV style ABIs:
> Typically no spill space;
> Structures are passed by copying them in registers or on the stack;
> (They often decompose structures into registers, ...)
> ...
>
> Microsoft Style:
> Typically provide a spill space,
> usually equal to the number of register arguments;
> Typically pass and return structures by reference;
> ...
>
>
> I had noted that the official RISC-V ABI seems to follow SysV patterns.
<
MIPS has a spill space............And I see now that RISC-V has corrected that.......
>
snip
> > While you can't do it naïvely, there is a straightforward way to achieve
> > this goal of a vector of arguments for varargs--and in My 66000 you
> > can achieve this at the cost of 0 extra instructions !
<
> In the 8-arg case, it can be done with 4 MOV.X instructions.
<
Preserving some registers, and allocating a stack costs me 1 instruction.
<
ENTER R26,R0,#local_data_area_size
<
Pushing the register arguments, preserving some registers, and allocating
a stack still costs only 1 instruction !
<
ENTER R26,R9,#local_data_area_size
<
You do it in 4 more instructions, I do it in an already existing instruction.
>
>
> It seems though that in many cases, things are built around things that
> happened in the past, either their own lives, or by interpreting various
> texts and deriving rules from them, looking to various sources for
> precedent on which to base their decisions, ...
>
>
> But, when there is a line of development, then some patterns are likely
> to exist here. Like, if one tries to design some of their tech initially
> by trying to imitate stuff from, say, "Windows CE on a Sega Dreamcast",
> then things will tend to follow patterns.
>
> If one takes the ISA, and at first doubles, and then later doubles
> again, the number of CPU registers, the patterns may still hold.
>
> Say:
> R4..R7
> Became:
> R4..R7, R20..R23
> Becomes:
> R4..R7, R20..R23, R36..R39, R52..R55
<
This is why you have to get it right the first time.
>
> But, the stack pointer still remains awkwardly stuck at R15, as to move
> it basically causes everything else to fall apart.
>
Because they did not get it right the first time.
>


BGB

unread,
Dec 16, 2022, 10:26:52 PM12/16/22
to
On 12/16/2022 10:01 AM, Scott Lurndal wrote:
> Thomas Koenig <tko...@netcologne.de> writes:
>> BGB <cr8...@gmail.com> schrieb:
>>
>>> Then again, I guess it was interesting to observe that some types of
>>> optimizations which help on BJX2 also help on x86-64, despite x86-64
>>> having not enough registers to actually make use of the optimization as
>>> expressed (x86-64 seems to be able to pretend as-if the stack
>>> spills/fills were registers).
>>
>> This is as old as optimizing compilers. The very first FORTRAN
>> compiler assumed that there was an infinite amount of index
>> registers, which it then later on assigned to the three that were
>> actually available.
>>

I was thinking more in the sense that it seems to behave as-if it could
load from memory into a register in 1 cycle, even if realistically this
should take ~ 4 cycles on x86-64.


Say, on x86-64, one can do a version of a piece of logic:
2 operations at a time, using 10 variables;
4 operations at a time, using 20 variables.

x86-64 has 16 registers, so one would expect the 10 variable version to
be faster than the 20 variable version. But, one may observe the 20
variable case is still faster.

By contrast, a lot of 90s-era code seems to use small loops with as few
variables as possible, processing one item at a time. This tends to be
slower (on both x86-64 and BJX2) than aggressively unrolling the loop
and doing a bunch of items in parallel, which often uses significantly
more working variables and produces C code that reads almost like ASM.


>>> Meanwhile, I am left needing to use lots of registers, and occasional
>>> optimizations which try to side-step a few (store then load) cases, then
>>> recently needing to fix a bug where one of these optimizations broke due
>>> to a type-handling issue in my compiler (was using the type of the value
>>> stored to an array, rather than the type of the value as loaded from the
>>> array).
>>
>> Did you actually look at optimizing register allocation algorithms?
>> http://web.cecs.pdx.edu/~mperkows/temp/register-allocation.pdf is one
>> of the first search results that Google (well, Startpage) returns.
>> You should really look into graph coloring.
>
> I'd recommend starting with the Dragon book, which discusses register
> allocation fairly extensively, in multiple contexts.

I had just sort of come up with something over time.

Might need to look into some more stuff...



In my compilers, I mostly used a fairly simple/naive strategy:
Maintain a list of "live" variables;
If a variable is accessed, and not in the list, add it to list and load
it from memory (if used as a source variable);
If no free spots in the list, find something to evict (LRU style);
Write all non-statically-assigned dirty variables back to memory at the
end of the basic block (and set their entries back to empty).

So, variables will typically be loaded each time when they are first
used, with a blob of stack spills at the end of the block.
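
A rough sketch of this strategy in generic C (names, sizes, and the
emit_* stubs here are hypothetical, not the actual BGBCC code):

/* Rough sketch of a block-local, LRU-style allocator. */
#include <stdio.h>

#define NREG 4                  /* dynamically allocatable registers      */

typedef struct {
    int var;                    /* variable ID in this slot, -1 = free    */
    int dirty;                  /* must be written back at block end      */
    int last_use;               /* timestamp for LRU eviction             */
} RegSlot;

static RegSlot regs[NREG];
static int     tick;

static void emit_load (int r, int v) { printf("  load  R%d <- [var%d]\n", r, v); }
static void emit_store(int r, int v) { printf("  store R%d -> [var%d]\n", r, v); }

static void reset_block(void)
{
    for (int i = 0; i < NREG; i++) { regs[i].var = -1; regs[i].dirty = 0; }
}

/* Get a register for 'var'; demand-load it if it is used as a source. */
static int get_reg(int var, int is_source)
{
    int victim = 0;
    for (int i = 0; i < NREG; i++)              /* already live?          */
        if (regs[i].var == var) { regs[i].last_use = ++tick; return i; }
    for (int i = 1; i < NREG; i++)              /* prefer free, else LRU  */
        if (regs[i].var < 0 ||
            (regs[victim].var >= 0 && regs[i].last_use < regs[victim].last_use))
            victim = i;
    if (regs[victim].var >= 0 && regs[victim].dirty)
        emit_store(victim, regs[victim].var);   /* evict                  */
    regs[victim] = (RegSlot){ var, 0, ++tick };
    if (is_source)
        emit_load(victim, var);
    return victim;
}

/* The "spill blob": flush dirty variables at the end of the basic block. */
static void end_of_block(void)
{
    for (int i = 0; i < NREG; i++)
        if (regs[i].var >= 0 && regs[i].dirty)
            emit_store(i, regs[i].var);
    reset_block();
}

int main(void)
{
    reset_block();
    int rd = get_reg(3, 0);                     /* var3 is a destination  */
    int ra = get_reg(1, 1);                     /* var1, var2 are sources */
    int rb = get_reg(2, 1);
    printf("  add   R%d <- R%d, R%d\n", rd, ra, rb);
    regs[rd].dirty = 1;                         /* var3 = var1 + var2;    */
    end_of_block();
    return 0;
}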


The demand-loaded variables often come with a 3-cycle or so penalty (to
avoid this requires loading them before they are used).

Though, not particularly optimal, attempts to improve on this have not
generally been successful (either making things worse, or causing the
program to break).


A lot of my JITs also used this strategy, just working on "traces".


Then, with a few options:
The N most-used variables get assigned to the first N spots in the list
for the entire function;
If we have more registers than variables (both locals and temporaries),
everything gets static-assigned.

Performance is generally better when more of the variables can be
static-assigned.


Say, if you have 30 callee-saved registers and 20 variables, well then,
every variable can get its own register for the scope of the function,
and the spill-and-fill disappears (at least for local variables and
similar).


Even if not perfect, it still seems better than the "-O0" or "/O0"
strategies in GCC and MSVC, which do everything as "Load; Load; Op; Store".

But, yeah, by "-O1" or "-O2" GCC's code generation tends to somewhat
beat my compiler on this front (or even MSVC's).

But, yeah, I have noted that other compilers tend to be more effective here.

The WEXifier can try to help, by shuffling the instructions around
after-the-fact in machine-code form, but its effectiveness is in turn
limited. Having lots of registers and giving every variable its own
register makes the WEXifier's job easier than trying to reuse registers
within the basic block, though avoiding register reuse within a block
does increase register pressure slightly.



There are also high level transforms, like say:
foo->x = a;
foo->y = b;
...
c = foo->x + foo->y;
One might be tempted to try to transform the last expression to:
c = a + b;

But, a bunch of extra checks are needed to try to exclude cases where
this is not valid (such as a depended-on variable being modified, ...).

And, as noted, the 'foo->x' still needs to return the type of 'foo->x'
and not the type of 'a'.

This transform was generally being done in the front-end in this case
(where bugs here were changing the types "as seen by the C code").

This mapping was being done with a sort of lookup table operating mostly
at the level of local variables (and "reducing" expressions in terms of
AST nodes).
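
A rough sketch of what such a table amounts to (hypothetical names; the
real thing keys on AST/IR nodes and types rather than strings, and has
many more invalidation rules: calls, stores through unknown pointers,
end of basic block, volatile, ...):

#include <stdio.h>
#include <string.h>

#define NFWD 64

typedef struct { char base[32], field[32], value[32]; } FwdEntry;

static FwdEntry fwd[NFWD];
static int      nfwd;

/* Record "base->field = value;". */
static void fwd_store(const char *base, const char *field, const char *value)
{
    if (nfwd < NFWD) {
        snprintf(fwd[nfwd].base,  32, "%s", base);
        snprintf(fwd[nfwd].field, 32, "%s", field);
        snprintf(fwd[nfwd].value, 32, "%s", value);
        nfwd++;
    }
}

/* If "base->field" still holds a known local, return that local's name. */
static const char *fwd_lookup(const char *base, const char *field)
{
    for (int i = nfwd - 1; i >= 0; i--)
        if (!strcmp(fwd[i].base, base) && !strcmp(fwd[i].field, field))
            return fwd[i].value;
    return NULL;
}

/* Any assignment to 'var' kills entries that depend on it, either as the
 * pointer being dereferenced or as the cached value. */
static void fwd_kill_var(const char *var)
{
    for (int i = 0; i < nfwd; i++)
        if (!strcmp(fwd[i].base, var) || !strcmp(fwd[i].value, var)) {
            fwd[i] = fwd[--nfwd];    /* swap-remove */
            i--;
        }
}

int main(void)
{
    fwd_store("foo", "x", "a");
    fwd_store("foo", "y", "b");
    /* c = foo->x + foo->y;  -> c = a + b; (types still those of foo->x/y) */
    printf("foo->x -> %s, foo->y -> %s\n",
           fwd_lookup("foo", "x"), fwd_lookup("foo", "y"));
    fwd_kill_var("a");               /* 'a' reassigned: forwarding invalid */
    printf("foo->x -> %s\n", fwd_lookup("foo", "x") ? fwd_lookup("foo", "x")
                                                    : "(unknown)");
    return 0;
}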

In a few of my past languages (namely the original BGBScript), a similar
strategy was also used for type-inference (before later switching
primarily to explicit declared types); and also for constant-propagation.

Say:
a=3;
b=4;
c=a+b;
And the compiler realizes it can rewrite the last expression to:
c=7;


This strategy was also generally limited to a single basic-block at a
time (the contents of the table being "forgotten" at the end of each block).
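
As a rough sketch (again hypothetical structures, not the actual
implementation), block-local constant propagation/folding over a toy
3AC form:

#include <stdio.h>

enum { OP_LDC, OP_ADD };                     /* load-constant, add        */

typedef struct { int op, dst, a, b; } Ins;   /* dst = a OP b (vars by ID) */

#define NVAR 16
static int is_const[NVAR], const_val[NVAR];  /* per-block constant table  */

static void fold_block(Ins *ins, int n)
{
    for (int i = 0; i < n; i++) {
        Ins *p = &ins[i];
        if (p->op == OP_LDC) {                        /* a = 3;           */
            is_const[p->dst] = 1;
            const_val[p->dst] = p->a;
        } else if (p->op == OP_ADD &&
                   is_const[p->a] && is_const[p->b]) {/* c = a + b;       */
            p->op = OP_LDC;                           /* rewrite: c = 7;  */
            p->a  = const_val[p->a] + const_val[p->b];
            is_const[p->dst] = 1;
            const_val[p->dst] = p->a;
        } else {
            is_const[p->dst] = 0;                     /* result unknown   */
        }
    }
}

int main(void)
{
    Ins blk[] = { {OP_LDC,1,3,0}, {OP_LDC,2,4,0}, {OP_ADD,3,1,2} };
    fold_block(blk, 3);
    printf("last op: %s dst=%d imm=%d\n",
           blk[2].op == OP_LDC ? "LDC" : "ADD", blk[2].dst, blk[2].a);
    return 0;   /* prints: last op: LDC dst=3 imm=7 */
}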

...


BGB

unread,
Dec 16, 2022, 11:45:39 PM12/16/22
to
On 12/16/2022 8:58 PM, MitchAlsup wrote:
> On Friday, December 16, 2022 at 8:22:22 PM UTC-6, BGB wrote:
>> On 12/16/2022 2:28 PM, MitchAlsup wrote:
>>> There seem to be two different camps for ABI design:
>>
>> SysV style ABIs:
>> Typically no spill space;
>> Structures are passed by copying them in registers or on the stack;
>> (They often decompose structures into registers, ...)
>> ...
>>
>> Microsoft Style:
>> Typically provide a spill space,
>> usually equal to the number of register arguments;
>> Typically pass and return structures by reference;
>> ...
>>
>>
>> I had noted that the official RISC-V ABI seems to follow SysV patterns.
> <
> MIPS has a spill space............And I see now that RISC-V has corrected that.......

But, I have not found any clear information as to why a spill space is
either good or bad (apart from the issue of it not scaling up very well,
by putting more strain on the displacement distance).

Effect on performance either way seems to be fairly small, though it
does seem to affect code density.

For things like trying to wrangle function argument lists across an RPC
mechanism, the spill-space makes things easier.

Some descriptions of the Windows X64 ABI also made a case for why
spill-space was a good thing.


I have ABI variants both with and without a spill-space.
Was using the "no spill space" variant for code built with
"sizeof(void*)==4".




>>
> snip
>>> While you can't do it naïvely, there is a straightforward way to achieve
>>> this goal of a vector of arguments for varargs--and in My 66000 you
>>> can achieve this at the cost of 0 extra instructions !
> <
>> In the 8-arg case, it can be done with 4 MOV.X instructions.
> <
> Preserving some registers, and allocating a stack costs me 1 instruction.
> <
> ENTER R26,R0,#local_data_area_size
> <
> Pushing the register arguments, preserving some registers, and allocating
> a stack still costs only 1 instruction !
> <
> ENTER R26,R9,#local_data_area_size
> <
> You do it in 4 more instructions, I do it in an already existing instruction.

I don't have fancy instructions here.

I have MOV.X and ADD and similar...

Say:
MOV LR, R18
MOV GBR, R19
ADD -128, SP
MOV.X R18, (SP, 112)
MOV.X R8, (SP, 64)
MOV.X R10, (SP, 80)
MOV.X R12, (SP, 96)
...

If I were doing it again, might want LR and GBR in GPR space, so that I
could use MOV.X to store and reload them directly.


>>
>>
>> It seems though that in many cases, things are built around things that
>> happened in the past, either their own lives, or by interpreting various
>> texts and deriving rules from them, looking to various sources for
>> precedent on which to base their decisions, ...
>>
>>
>> But, when there is a line of development, then some patterns are likely
>> to exist here. Like, if one tries to design some of their tech initially
>> by trying to imitate stuff from, say, "Windows CE on a Sega Dreamcast",
>> then things will tend to follow patterns.
>>
>> If one takes the ISA, and at first doubles, and then later doubles
>> again, the number of CPU registers, the patterns may still hold.
>>
>> Say:
>> R4..R7
>> Became:
>> R4..R7, R20..R23
>> Becomes:
>> R4..R7, R20..R23, R36..R39, R52..R55
> <
> This is why you have to get it right the first time.

Whole thing started out with 16 registers, then doubled it twice.
No obvious "better" option than to repeat the existing pattern 2 or 4 times.

The strategy does allow one to determine if a register is callee saved
fairly easily:
if(regID&8)
{ callee-saved }
else
{ scratch }

Except for a few special cases.

With argument registers following a pattern:
zz01zz

A lot of other ISA's will require using ad-hoc lookup tables here.
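
As a sketch, going only by the rules as stated here (special-case
registers and SPRs ignored): bit 3 set means callee-saved, and the
zz01zz pattern picks out the argument registers.

#include <stdio.h>

static int is_callee_saved(int regID) { return (regID & 8) != 0; }

/* zz01zz => bits 3..2 are 01: R4..R7, R20..R23, R36..R39, R52..R55 */
static int is_argument_reg(int regID) { return (regID & 0x0C) == 0x04; }

int main(void)
{
    for (int r = 0; r < 64; r++)
        if (is_argument_reg(r))
            printf("R%-2d argument (callee-saved=%d)\n",
                   r, is_callee_saved(r));
    return 0;
}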



>>
>> But, the stack pointer still remains awkwardly stuck at R15, as to move
>> it basically causes everything else to fall apart.
>>
> Because they did not get it right the first time.
>>
>

Yeah, there are reasons a lot of my considered full-ISA redesigns had
things like, say:
R0=ZR/PC (ZR|PC depends on context)
R1=LR
R2=SP
R3=GBR


But, XG2 was not a sufficient redesign for me to consider this, so XG2
keeps using the existing BJX2 register map (it is essentially being
treated more like an alt-mode encoding scheme rather than an entirely
separate ISA).

Though, strictly speaking, there is not a whole lot left in XG2 (at the
level of encoding or semantics) that actually cares which register is
being used as the stack pointer (most of this was in the 16-bit ops with
hard-wired implicit registers, which are N/E in XG2 mode).

Though, in the near term, this is unlikely to change short of other
architectural changes.

...

BGB

unread,
Dec 17, 2022, 12:23:44 AM12/17/22
to
Looked some at the linked PDF.

Yeah, BGBCC doesn't do anything like this.

Only two levels:
Within a given basic-block;
Or, global for the whole function.

Would have to figure out how to try to glue on something like graph
coloring, seems like a more advanced type of optimization than most of
the stuff I had been doing...


And, well, the "maybe stupid" assumption that if one typically has more
registers in the ISA than local variables and temporaries in the
function, they can static-assign everything and mostly sweep the
register-allocation issues under the carpet.

Well, pros/cons...


There are also issues, like trying to weed out the sources that are
generating a bunch of spurious (and frivolous) RegReg MOVs (one can fix
them in one case, and find more frivolous MOVs somewhere else, or a
bunch of otherwise unnecessary sign or zero extension operations, ...).


Then again, I guess it remains as an open question of what performance
would look like if my optimizer didn't suck...


Then again, I guess no one is claiming that writing a "good quality" C
compiler is particularly fast or easy (getting something that "sorta
works" hacked together isn't too bad; getting it to generate decent
quality output and relatively bug-free is a whole different issue).


...

robf...@gmail.com

unread,
Dec 17, 2022, 3:49:46 AM12/17/22
to
Yes. I came to realize that compilers are quite an undertaking. I only work
on the compiler periodically now, preferring to use VBCC. Had a look at
GCC and LLVM.

Here is another link on graph coloring. I spent quite a bit of time reading and
re-reading this until I was able to come up with a graph-coloring allocator for
my compiler, which I have yet to get working properly. I think there are issues
with register spills and reloads. Then I found out that if there are enough
registers (> 16) maybe graph coloring is not so necessary.

https://www.cs.utexas.edu/users/mckinley/380C/lecs/briggs-thesis-1992.pdf

I believe a linear scan allocator is another common allocator.

I have the code for the allocator in my GitHub account. It is quite complex,
written in C++.

The allocator I currently use simply counts register usages, giving more counts
for registers used in a loop, and prioritizes register allocations that way.

robf...@gmail.com

unread,
Dec 17, 2022, 3:58:47 AM12/17/22
to
That should be expression usages.

Tim Rentsch

unread,
Dec 17, 2022, 9:53:04 AM12/17/22
to
Terje Mathisen <terje.m...@tmsw.no> writes:

> Tim Rentsch wrote:
>
>> MitchAlsup <Mitch...@aol.com> writes:
>>
>> [...]
>>
>>> There was a recent poll on some site and 56% of Americans do not
>>> think that Arabic numerals should be taught in schools, too; and
>>> 15% don't have an opinion.
>>> https://www.snopes.com/fact-check/teaching-arabic-numerals/
>>> We are well on the way to Idiocracy.
>>
>> That isn't idiocy; it's illiteracy. Those aren't the same thing.
>> Don't make the mistake of thinking someone is stupid when all they
>> are is ignorant.
>
> I was going to agree with you, but realized that "willful ignorance"
> cannot be as easily (if ever?) forgiven.

My comment is about people who happen to be ignorant, not about
people who choose to remain ignorant.

Furthermore, it is still the case that one shouldn't _assume_ that
someone is stupid just because they are ignorant. If later
information shows signs of willful ignorance, the question of
intelligence can always be revisited (or even just visited).

Note by the way that most of the people responding to this survey
probably didn't choose to be ignorant of the essential fact that
our number symbols are indeed Arabic numerals. And it isn't
obvious that including that bit of knowledge in the educational
curriculum would yield any significant benefit. It's more
important to teach people how to think than to fill their heads
with zillions of miscellaneous facts (and it's disappointing that
this basic truth is not more widely recognized).

Tim Rentsch

unread,
Dec 17, 2022, 10:46:41 AM12/17/22
to
BGB <cr8...@gmail.com> writes:

> On 12/15/2022 4:13 AM, Anton Ertl wrote:
>
>> MitchAlsup <Mitch...@aol.com> writes:
>>
>>> Had your base port been from LLVM instead of GCC this likely would not
>>> have happened.
>>> GCC is more of a K&R (anything goes) while LLVM is more of a C-done-right
>>> (ADA-style)
>>
>> On what evidence do you base this claim?
>>
>> And why would "C-done-right" be "ADA-style"? My take on C done right
>> is that a compiler maintains backwards compatibility with earlier
>> versions, while others think that C done right punishes the sinners
>> who write code that performs undefined behaviour as much as possible
>> by compiling it to unintended code (best even if there is no
>> performance advantage to that). I don't have Ada experience, but the
>> general image is compile-time errors and run-time errors; I would not
>> expect the latter from GCC or Clang.
>>
>> What I have experienced from Clang is that a common bit-rotation idiom
>> is compiled to 0 for rot(x,0) (for variable x) without compile-time or
>> run-time warning or error, because it exercises undefined behaviour
>> for a rotation count of 0, while gcc produced the intended behaviour.
>
> Yeah. Both GCC and LLVM are playing fast and loose with C semantics.

The construct in question is undefined behavior. You may not
like what GCC and LLVM do with it, but they are not in any
way shape or form violating C semantics.
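
For reference, the idiom in question is usually written something like
the sketch below (generic C, not taken from any particular codebase).
The first form shifts a 32-bit value by 32 when the count is zero, which
is undefined; the masked form is defined for all counts and is generally
still recognized as a single rotate by compilers.

#include <stdint.h>
#include <stdio.h>

/* Classic idiom: undefined when n == 0 (x >> 32 is UB for a 32-bit type). */
static inline uint32_t rotl_ub(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}

/* Well-defined for all n, including n == 0; still compiles to a rotate
 * on most targets. */
static inline uint32_t rotl_ok(uint32_t x, unsigned n)
{
    return (x << (n & 31)) | (x >> (-n & 31));
}

int main(void)
{
    printf("%08x %08x\n", rotl_ok(0x80000001u, 0), rotl_ok(0x80000001u, 4));
    return 0;   /* prints: 80000001 00000018 */
}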

MitchAlsup

unread,
Dec 17, 2022, 10:53:21 AM12/17/22
to
On Friday, December 16, 2022 at 9:26:52 PM UTC-6, BGB wrote:
> On 12/16/2022 10:01 AM, Scott Lurndal wrote:
> > Thomas Koenig <tko...@netcologne.de> writes:
> >> BGB <cr8...@gmail.com> schrieb:
> >>
> >>> Then again, I guess it was interesting to observe that some types of
> >>> optimizations which help on BJX2 also help on x86-64, despite x86-64
> >>> having not enough registers to actually make use of the optimization as
> >>> expressed (x86-64 seems to be able to pretend as-if the stack
> >>> spills/fills were registers).
> >>
> >> This is as old as optimizing compilers. The very first FORTRAN
> >> compiler assumed that there was an infinite amount of index
> >> registers, which it then later on assigned to the three that were
> >> actually available.
> >>
> I was thinking more in the sense that it seems to behave as-if it could
> load from memory into a register in 1 cycle, even if realistically this
> should take ~ 4 cycles on x86-64.
>
One can organize a pipeline such that inbound memory access appears
to take no cycles at all: Fetch-Decode-AGEN-LDAlign-Calculate-WriteBack.
Memory reference instructions are presented to the cache (or memory
pipeline) in order; memory is accessed, aligned, and then made available
to some calculation instruction as an operand; writeback can send the result
of the calculation into the cache (ST) as required.
<
I have called this the 360 pipeline <organization> in the past. One could call
it the LD-OP pipeline just as easily. The 360/91 pipeline is just an extreme
version (without cache).
<
On a machine such as Gould 32/<and> various interlocks were applied
between calculate and AGEN such that the correct AGENs were performed.
And, here, inbound memory references that hit in the cache actually do take
ZERO extra cycles.
>
> Say, on x86-64, one can do a version of a piece of logic:
> 2 operations at a time, using 10 variables;
> 4 operations at a time, using 20 variables.
>
> x86-64 has 16 registers, so one would expect the 10 variable version to
> be faster than the 20 variable version. But, one may observe the 20
> variable case is still faster.
<
The architecture of x86 is so fucked up at this point in time that budding
architects should avoid being influenced by any of its features.
>

MitchAlsup

unread,
Dec 17, 2022, 10:59:41 AM12/17/22
to
In my best Ivan voice: "Fix it" !
> snip
> > <
> > This is why you have to get it right the first time.
> Whole thing started out with 16 registers, then doubled it twice.
> No obvious "better" option than to repeat the existing pattern 2 or 4 times.
<
So, you admit you got it wrong the first time, and now you are patching and
patching.......
>
> The strategy does allow one to determine if a register is callee saved
> fairly easily:
> if(regID&8)
> { callee-saved }
> else
> { scratch }
<
if(regID & 16)
{ preserved }
else
{ argument or temporary }
<
It is also easier to remember that R16..R31 are all preserved than R8..R15 and
R24..R31 (for the human)
>
> Except for a few special cases.
<
No special cases for me......
>

BGB

unread,
Dec 17, 2022, 1:06:33 PM12/17/22
to
IIRC, had read somewhere before that many x86's had handled things like:
ADD EAX, [ECX]
By breaking them into multiple uOps, one for the Load and another for
the Op, and then using a more RISC-like pipeline internally.

This does not seem true at least for the Ryzen.


For Piledriver, would have been more believable, as performance-related
behavior did seem pretty close to what I would expect (theoretically)
from a 3-wide in-order superscalar.

But, Ryzen seems to have changed something, and got a pretty big speedup
here (the estimated model would be something like a 4-wide superscalar
where nearly every operation takes ~ 1 clock cycle).


For some tasks (mostly stuff written in ASM), my core has beaten my Ryzen in
terms of work per clock-cycle, mostly because a lot of tasks seem to fit
my ISA design better than they fit into x86-64 / SSE.

Partly this is due to things like 3R ops, ability to load FP values as
immediate, or even "simple things" like less awkward vector shuffles, ...


For C code, or Dhrystone, it is a different matter. I seem to be taking
roughly twice as many instructions for each run through Dhrystone when
compared with GCC + RISC-V.

Where all the extra instructions are going, I don't know.

>>
>> Say, on x86-64, one can do a version of a piece of logic:
>> 2 operations at a time, using 10 variables;
>> 4 operations at a time, using 20 variables.
>>
>> x86-64 has 16 registers, so one would expect the 10 variable version to
>> be faster than the 20 variable version. But, one may observe the 20
>> variable case is still faster.
> <
> The architecture of x86 is so fucked up at this point in time that budding
> architects should avoid being influenced by any of its features.

Probably true.



Also tested, and (sadly) my "make ISA variant that natively does 64
GPRs, enable 64 GPRs everywhere" strategy doesn't seem to be working out
as well as I had hoped.

In particular, it has a hump where, say:
Selectively enable R32..R63, slight code-size reduction and speed increase;
Enable R32..R63 for every function, binary size gets bigger, performance
gets worse.


It appears the issue with this strategy isn't just about the encoding
orthogonality (since using R32..R63 should no longer be as much of an
issue with XG2).

I suspect the added overhead of saving/restoring more registers in the
prolog/epilog sequences is offsetting the gains in many cases.

John Levine

unread,
Dec 17, 2022, 1:40:38 PM12/17/22
to
According to Tim Rentsch <tr.1...@z991.linuxsc.com>:
>>>> There was a recent poll on some site and 56% of Americans do not
>>>> think that Arabic numerals should be taught in schools, too; and
>>>> 15% don't have an opinion. ...

>Note by the way that most of the people responding to this survey
>probably didn't choose to be ignorant of the essential fact that
>our number symbols are indeed Arabic numerals.

They are and they aren't. While we call our familar digits Arabic
numerals, there is a quite different set of Arabic digits used in
written Arabic. You can see them in this code chart, U+0660 to U+0669:

https://en.wikipedia.org/wiki/Arabic_(Unicode_block)

I was in Abu Dhabi for a conference a few years ago and indeed the
numbers on the speed limit signs used those digits.

While I agree it's pretty lame that apparently a majority of Americans
don't know that we call our digits Arabic numerals, I can have a fair
amount of sympathy that there's no need to teach the Arabic digits
that Arabs currently use.

--
Regards,
John Levine, jo...@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

MitchAlsup

unread,
Dec 17, 2022, 1:48:15 PM12/17/22
to
On Saturday, December 17, 2022 at 12:40:38 PM UTC-6, John Levine wrote:
> According to Tim Rentsch <tr.1...@z991.linuxsc.com>:
> >>>> There was a recent poll on some site and 56% of Americans do not
> >>>> think that Arabic numerals should be taught in schools, too; and
> >>>> 15% don't have an opinion. ...
<
> >Note by the way that most of the people responding to this survey
> >probably didn't choose to be ignorant of the essential fact that
> >our number symbols are indeed Arabic numerals.
<
I was taught in 3rd, 4th, and 5th grades that the 0..9 digits we used were Arabic
numerals. So, at least for my generation, you could not get out of grade
school without being told this fact. Most quickly forgot....................sigh

BGB

unread,
Dec 17, 2022, 2:08:02 PM12/17/22
to
OK. Had not heard of or looked at VBCC.

As noted, I am using BGBCC for my project, which was written by me,
originally back in the late 2000s (when I was still taking college
classes, still early 20s at the time).

First version kinda "really sucked" and didn't go much beyond fairly
trivial test programs (at the time it was mostly beyond my debugging
skills).

Also this first version was trying to generate x86 code directly from
the stack IR, which was a bad idea. It wasn't until the 2010s that most
of my VM projects would switch over to translating from a stack model to
a 3AC IR, and then either interpreting or JIT'ing based on this (the 3AC
model giving a big speedup over trying to run a stack-machine model
directly).
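
A minimal sketch of that stack-to-3AC step, with a toy stack IR
(hypothetical opcodes and structures, not the real BGBCC or BGBScript
IR): the operand stack is simulated at translation time, with each
pushed intermediate value becoming a numbered temporary.

#include <stdio.h>

enum { S_PUSHVAR, S_PUSHCONST, S_ADD, S_STOREVAR };  /* stack IR ops    */
enum { T_LDC, T_MOV, T_ADD };                        /* 3AC ops         */

typedef struct { int op, arg; }        SOp;          /* stack-IR op     */
typedef struct { int op, dst, a, b; }  TOp;          /* 3AC: dst=a op b */

static int vstack[64], vsp, ntemp = 100;             /* temps: t100...  */

static int to_3ac(const SOp *in, int n, TOp *out)
{
    int m = 0;
    for (int i = 0; i < n; i++) {
        switch (in[i].op) {
        case S_PUSHVAR:   vstack[vsp++] = in[i].arg; break;
        case S_PUSHCONST: out[m] = (TOp){ T_LDC, ntemp, in[i].arg, 0 };
                          vstack[vsp++] = ntemp++; m++; break;
        case S_ADD: {     int b = vstack[--vsp], a = vstack[--vsp];
                          out[m] = (TOp){ T_ADD, ntemp, a, b };
                          vstack[vsp++] = ntemp++; m++; break; }
        case S_STOREVAR:  out[m++] = (TOp){ T_MOV, in[i].arg, vstack[--vsp], 0 };
                          break;
        }
    }
    return m;
}

int main(void)
{
    /* c = a + b;  with a=1, b=2, c=3 as variable IDs */
    SOp prog[] = { {S_PUSHVAR,1}, {S_PUSHVAR,2}, {S_ADD,0}, {S_STOREVAR,3} };
    TOp out[16];
    int m = to_3ac(prog, 4, out);
    for (int i = 0; i < m; i++)
        printf("op=%d dst=%d a=%d b=%d\n", out[i].op, out[i].dst, out[i].a, out[i].b);
    return 0;
}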


> Here is another link on color-graphing. I spent quite a bit of time reading and
> re-reading this until I was able to come up with a yet to get working properly
> color graphing allocator for my compiler. I think there are issues with register
> spills and reloads. Then I found out that if there are enough registers > 16
> maybe color graphing is not so necessary.
>
> https://www.cs.utexas.edu/users/mckinley/380C/lecs/briggs-thesis-1992.pdf
>
> I believe a linear scan allocator is another common allocator.
>
> I have the code for the allocator in my Github account. It is quite complex,
> written in c++.
>
> The allocator I currently use simply counts register usages, giving more counts
> for registers used in a loop. And prioritizes register allocations that way.

Similar in my case.


In the "frame layout" step, it goes and looks at all the IR ops, counts
usage count, with a 1+x^2 multiplier based on loop nesting level
(recorded in the IL/IR for each label).

So:
a=0;  //multiplier = 1 (level=0)
for(i=0; i<10; i++)
{
    b=2;  //multiplier = 2 (level=1)
    for(j=0; j<10; j++)
    {
        c=3;  //multiplier = 5 (level=2)
    }
}

The variables in the function are ranked based on their total counts,
with the "top N" statically assigned to registers.

The N is variable here, where N has a theoretical hard-limit (based on
the number of available registers, minus a certain minimum).

Subtracted minimum is, say:
4 if no SIMD ops;
7 if SIMD ops.

Basically, need to keep at least enough registers free to be able to
load the working values for a 3R operation (else the compiler may end up
backed into a corner where it is unable to allocate dynamic registers
for things, and has no option other than to abort).


As noted:
R8..R14: Limit=3 or 0 (*)
R8..R14, R24..R31: Limit=11 or 8.
R8..R14, R24..R31, R40..R47, R56..R63: Limit=27 or 24.

With the limit being increased for leaf functions (where one can also
allocate variables into scratch registers).


*: For 16 GPR machines, it may make sense to use only dynamic
allocation. This is what my code-generators on x86-64 and ARM had
typically done (works well enough on x86, but my attempts to generate
code for ARM have tended to result in performance that is "decidedly
awful").


The value for N is determined based on finding the point where the sum
of the current and preceding variables crosses 1/16 of the total count
for all variables (*).

Except where the hard-limit is greater than the total variables, where
it then sets N to the number of variables in the function.


*: I am looking at my code, but the logic of how it works is not making
sense to me at the moment (seems suspect).

Intuitively would have made more sense to compare the sum of
current-and-preceding against the sum of all future variables, say:
if(current>((total-current)*magicScale)) { N=i; break; }

Presumably I knew what I was doing at the time, but seems like for
whatever reason, the logic escapes me at the moment.
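
For illustration, a rough sketch of the ranking and top-N selection
(hypothetical structures; this uses the simpler "sum so far vs.
remainder" test, which may or may not match what BGBCC actually does):

#include <stdio.h>
#include <stdlib.h>

typedef struct { int id, count; } VarUse;

static int by_count_desc(const void *a, const void *b)
{
    return ((const VarUse *)b)->count - ((const VarUse *)a)->count;
}

static int loop_weight(int level) { return 1 + level * level; } /* 1, 2, 5, ... */

/* Decide how many of the most-used variables get static registers. */
static int pick_static_n(VarUse *v, int nvar, int hard_limit)
{
    int total = 0, run = 0, magic_scale = 1;    /* placeholder scale factor */
    if (nvar <= hard_limit)
        return nvar;                            /* everything fits          */
    qsort(v, nvar, sizeof *v, by_count_desc);
    for (int i = 0; i < nvar; i++)
        total += v[i].count;
    for (int i = 0; i < hard_limit; i++) {
        run += v[i].count;
        if (run > (total - run) * magic_scale)
            return i + 1;
    }
    return hard_limit;
}

int main(void)
{
    /* a at level 0, b at level 1, c at level 2, as in the earlier example */
    VarUse v[] = { {0, 1 * loop_weight(0)},
                   {1, 10 * loop_weight(1)},
                   {2, 50 * loop_weight(2)} };
    int n = pick_static_n(v, 3, 2);
    printf("statically assign the top %d variable(s); best is var %d\n",
           n, v[0].id);
    return 0;
}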



Paul A. Clayton

unread,
Dec 17, 2022, 2:33:31 PM12/17/22
to
MitchAlsup wrote:
[snip]
> One can organize a pipeline such that inbound memory access appears
> to take no cycles at all. Fetch-Decode-AGEN-LDAlign-Calculate-WriteBack.
> Memory reference instructions are presented to the cache (or memory
> pipeline in order) memory is accessed, aligned and then made available
> to some calculation instruction as an operand, writeback can send the result
> of the calculation into the cache (ST) as required.

Of course, that does not apply to loaded values used for
addressing other values in memory (pointers and indexes).

For some special cases one can avoid/defer address generation by
addressing the cache by the offset (this even allows hoisting the
load operation ahead of register read). E.g., Knapsack cache
(indexed by offset from global pointer), signature cache (object
base pointer address is base of cache), or possibly stack cache
(most proposals do use address generation but one could avoid that
at the cost of utilization, e.g., inverting offsets for every
other stack frame would allow two (possibly partial) frames to be
trivially addressed in a direct-mapped stack cache).

(Special casing base address value access to provide the least
significant bits — even excluding three or more least significant
bits if objects, stack, etc. pointers have 8+ byte alignment —
which might shave a little latency at the cost of area and power.)

Summed-address memory is another technique for reducing cache
access latency. (Of course, Itanium provided only post-increment
loads, so address generation did not add latency — not that such
is an example of a good design choice.)

ORing (or XORing) the offset with the base address has been
proposed, but that seems likely to be a poor choice.

(Interestingly, Intel already penalized carry-out past a certain
point in address generation — or so I seem to recall reading.)

> I have called this the 360 pipeline <organization> in the past. One could call
> it the LD-OP pipeline just as easily. The 360/91 pipeline is just an extreme
> version (without cache).

The term I read (which applies generally to delaying some
operations relative to others) is "skewed pipeline".

> On a machine such as Gould 32/<and> various interlocks were applied
> between calculate and AGEN such that the correct AGENs were performed.
> And, here, inbound memory references that hit in the cache actually do take
> ZERO extra cycles.

While the specific trick of skewed-pipelining is probably simple
enough to be reinvented, it still seems sad that so much
cleverness of the past is not remembered (documented for easy
reference).

Paul A. Clayton

unread,
Dec 17, 2022, 2:34:01 PM12/17/22
to
Anton Ertl wrote:
[snip]
> If you want a delay-1-cycle instruction, this should be possible even
> in an OoO machine: when that instruction is processed, the clock is
> not gated through to the rest of the execution engine for one cycle.
> Of course, with an OoO engine, you cannot really predict which
> instructions are affected by the additional cycle. The question is:
> Why would anyone use such an instruction?

MIPS defined a superscalar nop. MIPS64™ Architecture For
Programmers, Volume II: The MIPS64™ Instruction Set, Revision 2.00
(June 9, 2003) describes Superscalar No Operation (SSNOP) thus:

Purpose:
Break superscalar issue on a superscalar processor.

Description:
SSNOP is the assembly idiom used to denote superscalar no
operation. The actual instruction is interpreted by the hardware
as SLL r0, r0, 1.

This instruction alters the instruction issue behavior on a
superscalar processor by forcing the SSNOP instruction to single-
issue. The processor must then end the current instruction issue
between the instruction previous to the SSNOP and the SSNOP. The
SSNOP then issues alone in the next issue slot.

On a single-issue processor, this instruction is a NOP that takes
an issue slot.

...

Programming Notes:
SSNOP is intended for use primarily to allow the programmer
control over CP0 hazards by converting instructions into cycles in
a superscalar processor. For example, to insert at least two
cycles between an MTC0 and an ERET, one would use the following
sequence:

mtc0 x,y
ssnop
ssnop
eret

Based on the normal issues rules of the processor, the MTC0 issues
in cycle T. Because the SSNOP instructions must issue alone, they
may issue no earlier than cycle T+1 and cycle T+2, respectively.
Finally, the ERET issues no earlier than cycle T+3. Note that
although the instruction after an SSNOP may issue no earlier than
the cycle after the SSNOP is issued, that instruction may issue
later. This is because other implementation-dependent issue rules
may apply that prevent an issue in the next cycle. Processors
should not introduce any unnecessary delay in issuing SSNOP
instructions.

A less code bloating alternative was introduced later: Execution
Hazard Barrier (EHB):

Purpose:
To stop instruction execution until all execution hazards have
been cleared.

Description:
EHB is the assembly idiom used to denote execution hazard barrier.
The actual instruction is interpreted by the hardware as SLL r0,
r0, 3.

This instruction alters the instruction issue behavior on a
pipelined processor by stopping execution until all execution
hazards have been cleared. Other than those that might be created
as a consequence of setting StatusCU0, there are no execution
hazards visible to an unprivileged program running in User Mode.
All execution hazards created by previous instructions are cleared
for instructions executed immediately following the EHB, even if
the EHB is executed in the delay slot of a branch or jump. The EHB
instruction does not clear instruction hazards - such hazards are
cleared by the JALR.HB, JR.HB, and ERET instructions.

...

Programming Notes:
In MIPS64 Release 2 implementations, this instruction resolves all
execution hazards. On a superscalar processor, EHB alters the
instruction issue behavior in a manner identical to SSNOP. For
backward compatibility with Release 1 implementations, the last of
a sequence of SSNOPs can be replaced by an EHB. In Release 1
implementations, the EHB will be treated as an SSNOP, thereby
preserving the semantics of the sequence. In Release 2
implementations, replacing the final SSNOP with an EHB should have
no performance effect because a properly sized sequence of SSNOPs
will have already cleared the hazard. As EHB becomes the standard
in MIPS implementations, the previous SSNOPs can be removed,
leaving only the EHB.

BGB

unread,
Dec 17, 2022, 2:41:55 PM12/17/22
to
On 12/17/2022 8:53 AM, Tim Rentsch wrote:
> Terje Mathisen <terje.m...@tmsw.no> writes:
>
>> Tim Rentsch wrote:
>>
>>> MitchAlsup <Mitch...@aol.com> writes:
>>>
>>> [...]
>>>
>>>> There was a recent poll on some site and 56% of Americans do not
>>>> think that Arabic numerals should be taught in schools, too; and
>>>> 15% don't have an opinion.
>>>> https://www.snopes.com/fact-check/teaching-arabic-numerals/
>>>> We are well on the way to Idiocracy.
>>>
>>> That isn't idiocy; it's illiteracy. Those aren't the same thing.
>>> Don't make the mistake of thinking someone is stupid when all they
>>> are is ignorant.
>>
>> I was going to agree with you, but realized that "willful ignorance"
>> cannot be as easily (if ever?) forgiven.
>
> My comment is about people who happen to be ignorant, not about
> people who choose to remain ignorant.
>
> Furthermore, it is still the case that one shouldn't _assume_ that
> someone is stupid just because they are ignorant. If later
> information shows signs of willful ignorance, the question of
> intelligence can always be revisited (or even just visited).
>

And then there is also...

Whatever, exactly, is going on in the minds of YECs (Young Earth
Creationists).
Like, all the sort of extreme contortions they go through, to try to
make the world (and the universe as a whole) try to fit into a 6K year
timeline...

Like, this can be ruled out mostly on the basis of plausibility.


Well, then again, there are also groups set on the idea that there was
once a technologically advanced civilization among the Native Americans,
but that they had a war and (somehow forgot?) and reverted back to a
hunter/gatherer lifestyle.

...


> Note by the way that most of the people responding to this survey
> probably didn't choose to be ignorant of the essential fact that
> our number symbols are indeed Arabic numerals. And it isn't
> obvious that including that bit of knowledge in the educational
> curriculum would yield any significant benefit. It's more
> important to teach people how to think than to fill their heads
> with zillions of miscellaneous facts (and it's disappointing that
> this basic truth is not more widely recognized).

Yeah.

School system is seemingly more about making people do busywork,
memorize random trivia, and then accept the superiority of being a "busy
worker bee" for the corporate overlords, than anything else...


I was left as someone who had generally far too much anxiety to really
put much effort into all the meaningless busywork, and had doubts about
the "inherent goodness" of a world based on the control of a smallish
number of large corporations.

Had also sort of believed at the time that society was probably going to
collapse relatively soon anyways.


Granted, the "2008 financial crisis" happened a few years after I got
out of high-school, but this wasn't quite the level of collapse I had
imagined. Even as such, things still don't look entirely stable (say,
what will people do if/when inflation goes out of control and the USD
becomes essentially worthless?...).

Though, the shake-up could either lead to:
Many corporations die off;
The "overlord" situation gets worse, if those who remain largely take
over many of the roles traditionally held by the US government (their
ability to retain authority or control in many areas being less certain
in the case of a full economic collapse).

Like, say, what if much of the western US is no longer "US" per se, but
instead mostly split up between areas ruled over by Apple and
Microsoft?... ( Well, say, with UT/NV/AZ/... becoming part of an LDS
theocracy? ... ).

...


MitchAlsup

unread,
Dec 17, 2022, 4:08:14 PM12/17/22
to
The real shame is that none of the new trainees bother to look the old
stuff up.
<
Those who do not remember history are doomed to repeat it.
<
Apparently this holds double for Computer Science.................

BGB

unread,
Dec 17, 2022, 5:02:28 PM12/17/22
to
Saying something is "undefined behavior" is kind of a weak argument IMO
when it has otherwise "perfectly reasonable" behavior that "most older
compilers supported, and most C code from that era accepted as canonical".

Undefined Behavior makes more sense as something for the programmer to
take note of, when there might be implementation differences between
compilers. Compilers treating UB as license to do whatever is not ideal,
except in cases where the depended-on semantics are "clearly unreasonable".


robf...@gmail.com

unread,
Dec 17, 2022, 5:13:20 PM12/17/22
to
>The real shame is that none of the new trainees to bother to look the old
>stuff up.
><
>Those who do not remember history are doomed to repeat it.
><
It is ego I think. Maybe lacking humility? It takes a bit to recognize that
maybe someone else could have done a better job than oneself. Personally,
I like to find out if someone else has done the thing I am after, then just
copy the ideas as a starting point. Trying to be innovative. The world is
bigger than just me.

Even those who do remember history are also likely doomed to repeat it.

Thomas Koenig

unread,
Dec 18, 2022, 4:38:21 AM12/18/22
to
BGB <cr8...@gmail.com> schrieb:
> On 12/17/2022 9:46 AM, Tim Rentsch wrote:

>> The construct in question is undefined behavior. You may not
>> like what GCC and LLVM do with it, but they are not in any
>> way shape or form violating C semantics.
>
> Saying something is "undefined behavior" is kind of a weak argument IMO
> when it has otherwise "perfectly reasonable" behavior that "most older
> compilers supported, and most C code from that era accepted as canonical".

I don't suppose you have ever written code to a specification, have you?

A specification, in this case the C standard, is binding on both sides:
The compiler writer has to adhere to it (or else it's a compiler bug),
and the user has to adhere to it (or else it's a bug in the program).

The C standard allows extensions. You can rely on these _for
that particular compiler and version_ if, and only if, they are
described in the compiler's documentation. For example, a compiler
may document that int uses two's complement representation and is four bytes long.

If it is _not_ described in the compiler's documentation and given
as "undefined behavior" by the standard, well... undefined behavior
is what you get, which means that anything can happen.

> Undefined Behavior more makes sense for the programmer to take note of,
> when there might be implementation differences between compilers.
> Compilers treating UB as license to do whatever, is not ideal, except in
> cases where the depended on semantics are "clearly unreasonable".

Again, try to understand what a specification is.