
Compare-and-branch vs PIC


Russell Wallace

Dec 7, 2022, 10:26:11 AM
I've been trying to better understand the reasoning behind some of the design decisions that went into historical computer architectures. For example, I note that ARM-1 is not a purist RISC, but has a number of features designed to get more work done in each instruction. But one thing it does not have is compare-and-branch in one instruction; you have to compare separately, then branch on condition code. Yet compare-and-branch was not entirely unprecedented at the time. There doesn't seem to be any clear evidence that it would have added an extra cycle of delay:
https://retrocomputing.stackexchange.com/questions/25785/would-compare-and-branch-have-added-an-extra-cycle-on-arm-1

A possibility that occurs to me, though: maybe you can't use the ALU to compare operands and also add a displacement to the instruction pointer, in the same cycle. Maybe if you want compare-and-branch, you have to choose between an extra cycle of delay, vs the simpler model of branch instruction that provides bits that simply replace the low bits of the instruction pointer. Which is potentially awkward if you were hoping for position independent code.

Is that actually the case, assuming you are tightly constrained in die area (either historically on a micron-scale process, or today on a small embedded core)? Or am I missing the mark?

Theo

Dec 7, 2022, 11:42:18 AM
Russell Wallace <russell...@gmail.com> wrote:
> I've been trying to better understand the reasoning behind some of the design decisions that went into historical computer architectures. For example, I note that ARM-1 is not a purist RISC, but has a number of features designed to get more work done in each instruction. But one thing it does not have is compare-and-branch in one instruction; you have to compare separately, then branch on condition code. Yet compare-and-branch was not entirely unprecedented at the time. There doesn't seem to be any clear evidence that it would have added an extra cycle of delay:
> https://retrocomputing.stackexchange.com/questions/25785/would-compare-and-branch-have-added-an-extra-cycle-on-arm-1
>
> A possibility that occurs to me, though: maybe you can't use the ALU to
> compare operands and also add a displacement to the instruction pointer,
> in the same cycle. Maybe if you want compare-and-branch, you have to
> choose between an extra cycle of delay, vs the simpler model of branch
> instruction that provides bits that simply replace the low bits of the
> instruction pointer. Which is potentially awkward if you were hoping for
> position independent code.

You can do PIC on 32-bit ARM, you just have to do:

ADDEQ pc,pc,#offset
(or whatever condition code)

instead of a branch (where #offset is calculated by your assembler). If the
range of an immediate constant is too short, you have to use a register:

MOV rN,#(offset & 0x3FC00) << 10
ORR rN,rN,#(offset & 0x3FC)
ADDEQ pc,pc,rN

which can also be generated by your assembler. (you know the bottom two
bits of the offset are going to be zero so don't need to set them)
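For concreteness, here is a sketch (Python standing in for the assembler) of splitting a word-aligned offset into the two immediates used above; the masks come from Theo's example, while the helper name and the 18-bit range assumption are mine.

```python
def split_arm_offset(offset):
    """Split a word-aligned offset into two ARM data-processing
    immediates (each an 8-bit value at an even rotation), as an
    assembler might for the MOV/ORR/ADDEQ sequence above.
    Assumes 0 <= offset < 2**18 and offset % 4 == 0."""
    hi = offset & 0x3FC00  # bits 10..17 -> MOV rN, #hi
    lo = offset & 0x003FC  # bits 2..9   -> ORR rN, rN, #lo
    assert hi | lo == offset, "offset outside the encodable range"
    return hi, lo
```

Both masks select a contiguous 8-bit field at an even shift, so each is a legal ARM rotated-immediate.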

If you want to do compare and relative-branch you need to use the ALU twice,
once for the compare and once for adding to the PC. It would have been
possible to do a compare with absolute branch
(PC <= cond ? immediate : PC+4)
without using the ALU, but that's not very useful.

> Is that actually the case, assuming you are tightly constrained in die
> area (either historically on a micron-scale process, or today on a small
> embedded core)? Or am I missing the mark?

These days addition is cheap, so pc <= pc + N isn't very costly. Back then,
it was.

Theo

Theo

Dec 7, 2022, 11:44:32 AM
Theo <theom...@chiark.greenend.org.uk> wrote:
> MOV rN,#(offset & 0x3FC00) << 10
> ORR rN,rN,#(offset & 0x3FC)
> ADDEQ pc,pc,rN

should be:

MOV rN,#(offset & 0x3FC00)
ORR rN,rN,#(offset & 0x3FC)
ADDEQ pc,pc,rN

(the <<10 is inferred by the assembler)

John Dallman

Dec 7, 2022, 11:53:15 AM
In article <b6260f7b-fbeb-4549...@googlegroups.com>,
russell...@gmail.com (Russell Wallace) wrote:

> For example, I note that ARM-1 is not a purist RISC, but has a
> number of features designed to get more work done in each
> instruction. But one thing it does not have is compare-and-branch
> in one instruction; you have to compare separately, then branch on
> condition code.

I think the design intent was that you didn't branch unless absolutely
necessary, but used the conditional (predicated) instructions as much as
possible. Those have proved hard for compilers to use efficiently, but
are quite easy for humans writing assembler code. Acorn Computers
primarily designed ARM for use in their own machines, and wrote their
early operating systems in assembly language.

John

MitchAlsup

Dec 7, 2022, 1:06:55 PM
On Wednesday, December 7, 2022 at 9:26:11 AM UTC-6, Russell Wallace wrote:
> I've been trying to better understand the reasoning behind some of the design decisions that went into historical computer architectures. For example, I note that ARM-1 is not a purist RISC, but has a number of features designed to get more work done in each instruction. But one thing it does not have is compare-and-branch in one instruction; you have to compare separately, then branch on condition code. Yet compare-and-branch was not entirely unprecedented at the time. There doesn't seem to be any clear evidence that it would have added an extra cycle of delay:
> https://retrocomputing.stackexchange.com/questions/25785/would-compare-and-branch-have-added-an-extra-cycle-on-arm-1
>
> A possibility that occurs to me, though: maybe you can't use the ALU to compare operands and also add a displacement to the instruction pointer, in the same cycle. Maybe if you want compare-and-branch, you have to choose between an extra cycle of delay, vs the simpler model of branch instruction that provides bits that simply replace the low bits of the instruction pointer. Which is potentially awkward if you were hoping for position independent code.
<
Even early RISC machines had a separate adder for branching and for integer calculations.
These are in different stages of the pipeline and have to be that way.
>
> Is that actually the case, assuming you are tightly constrained in die area (either historically on a micron-scale process, or today on a small embedded core)? Or am I missing the mark?
<
At about the 1.5 micron level, you get to choose to put the TLB on die or the FPU on die.
At 1 micron, you can put both on die.
At 0.7 micron a significant portion of cache comes on die.
At 0.5 micron, you can put reservation stations and make it GBOoO.

Stephen Fuld

Dec 7, 2022, 1:26:02 PM
On 12/7/2022 7:26 AM, Russell Wallace wrote:
> I've been trying to better understand the reasoning behind some of the design decisions that went into historical computer architectures. For example, I note that ARM-1 is not a purist RISC, but has a number of features designed to get more work done in each instruction. But one thing it does not have is compare-and-branch in one instruction; you have to compare separately, then branch on condition code. Yet compare-and-branch was not entirely unprecedented at the time. There doesn't seem to be any clear evidence that it would have added an extra cycle of delay:
> https://retrocomputing.stackexchange.com/questions/25785/would-compare-and-branch-have-added-an-extra-cycle-on-arm-1

Unless you are talking about a compare to some fixed value, expressed in
the op code (e.g. a compare to zero and branch instruction) it would
have required an extra field in the instruction, i.e. two values to
compare and a third to specify the branch target address. Given the
fixed 32 bit instructions, this would have been "a challenge". :-)


--
- Stephen Fuld
(e-mail address disguised to prevent spam)

BGB

Dec 7, 2022, 2:00:13 PM
Doing a general compare + branch has a risk of additional latency cost,
because now one has the "initiate a branch" logic depending directly on
the ALU compare logic.


In my case, I went with a single status bit (a True/False bit) that is
used (mostly) to predicate stuff.

So, say:
OP ... //Always Execute
OP?T ... //Execute Op if T is Set
OP?F ... //Execute Op if F is Clear

The BT/BF instructions (Branch if True or False), are currently encoded
as if they were:
BRA?T
BRA?F

There were some older dedicated BT/BF encodings, but they are
deprecated/dropped (in the 32-bit forms), and the space may later be
reclaimed for more 3R ops or similar.

The use of a single status bit also allows fitting:
OP ... //Always, Normal
OP ... || //Always, Bundle
OP?T ... //If True
OP?F ... //If False

Into roughly 2 bits of encoding entropy.


There were a few special ops:
LDIx Imm24, R0
That only exist in scalar form, so:
LDIx Imm24, R0 //Scalar Form
Jumbo Ext24 //Jumbo Prefixes
OP?T ... || //If True + Bundle (ISA subset)
OP?F ... || //If False + Bundle (ISA subset)

This stays within the two bits of encoding entropy, though internally
3-bits are used in the pipeline.



If I were not using any status bits, such as for a more RISC-V like ISA
design, I would likely go instead with hard-wired compare-with-zero ops.

In this case, it is nearly as capable, but the checks are significantly
cheaper to perform (and one can use the extra bits from not having a
second register, to maybe have a slightly bigger displacement).

Say:
BZ Rn, Disp17 //Branch if Rn==0
BNZ Rn, Disp17 //Branch if Rn!=0
BLT Rn, Disp17 //Branch if Rn< 0 (Bit 63 is Set)
BGE Rn, Disp17 //Branch if Rn>=0 (Bit 63 is Clear)

In this case, one can have tristate outputs on the compare ops, say:
-1: A< B
0: A==B
1: A> B
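That tristate output is the usual signed three-way comparison; a minimal sketch (function name is mine):

```python
def cmp3(a, b):
    # Tristate compare output: -1 if a < b, 0 if a == b, +1 if a > b.
    # Any single branch condition is then just a test on this value.
    return (a > b) - (a < b)
```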

I would also get rid of the user-specified Link Register, as:
This is rarely useful;
It adds a non-trivial cost to the "branch-with-link" mechanism.

While a second link register "can" be useful for prolog compression
and/or "millicode" functions, the relative savings (vs, say, manually
copying the link register to a different register and/or saving it to
the stack) are small enough to be mostly ignored.

An intermediate option would be to instead allow encodings for two
hard-wired link registers.

Say, for a 3b Branch space:
000 B Disp22
001 -
010 BL Disp22 //Branch with Link, Save Primary LR
011 BL2 Disp22 //Branch with Link, Save Alternate LR
100 BZ Rn, Disp17
101 BNZ Rn, Disp17
110 BLT Rn, Disp17
111 BGE Rn, Disp17

This could be fit into the same amount of encoding space as the original
RISC-V JAL instruction (if using a similar encoding scheme to RISC-V).


I would not consider this design a win over my current ISA though, as
while it could fit more easily into a slightly smaller core, to get
similar performance would require more advanced hardware-level support
(such as superscalar and special logic to detect and handle short
forward branches as predication).

Similarly, there isn't a whole lot of point in trying to face off
against RISC-V with an ISA that is "nearly the same, just with slightly
cheaper branches and similar".

Though, beyond the base ISA (into the "extensions" space), I would likely
do some more significant redesign (mostly to try to keep things more
consistent, and for more aggressive cost minimization).


...


BGB

Dec 7, 2022, 2:10:33 PM
RISC-V compares an arbitrary pair of registers.

IMHO, this is not worth the cost (given compare-with-zero would have
been "nearly as effective", allow larger branch displacements, and would
have also been cheaper...).



For something more like ARM, one would need a fairly small branch
displacement, and it would likely do little to help over the use of
condition-codes.

Burning 4 bits on the condition code is a little steep, though they
didn't need to deal with variable-length instructions (so, they gain
back a few bits by not needing to be able to encode 16-bit operations or
similar).

MitchAlsup

Dec 7, 2022, 3:03:48 PM
On Wednesday, December 7, 2022 at 1:10:33 PM UTC-6, BGB wrote:
> On 12/7/2022 12:25 PM, Stephen Fuld wrote:

> > Unless you are talking about a compare to some fixed value, expressed in
> > the op code (e.g. a compare to zero and branch instruction) it would
> > have required an extra field in the instruction, i.e. two values to
> > compare and a third to specify the branch target address. Given the
> > fixed 32 bit instructions, this would have been "a challenge". :-)
> >
> RISC-V compares an arbitrary pair of registers.
<
Yes, it does.
>
> IMHO, this is not worth the cost (given compare-with-zero would have
> been "nearly as effective", allow larger branch displacements, and would
> have also been cheaper...).
>
My 66000 does not have a Compare-Branch; it uses a compare-to-zero-branch,
and a Comparison instruction followed by a branch-on-bit. I also have
Predication on bit and on condition. There are some codes where RISC-V wins;
there are cases where we tie, and there are cases where My 66000 wins.
Overall in integer codes it looks to be break even--no winner either way--it
is just different.
<
Since RISC-V does not allow constants in its compare-branch instruction,
all comparisons against a constant that is not-zero costs an instruction.
In FP comparisons, RISC-V has to use a SET-T/F followed by a branch
and in many cases that SET is a comparison with a constant which also
adds to the instruction stream. Most of the FP compare codes are better
in My 66000 than in RISC-V--but there are outliers on both sides.
<
Many times My 66000 can "keep" a comparison as a persistent value
(in a register) and use that value again:: Say we compared x and y and
put the result of this comparison in R7. As long as R7 is not damaged,
we can extract any of the relations between x and y from the bit pattern
stored in R7 {==, !=, s>, s>=, s<, s<=, u>, u>=, u<, u<=, 0<x<y, 0<=x<y, 0<x<=y,
0<=x<=y} and a few more esoteric relations. Thus, My 66000 has as many
"condition codes" as the compiler has available registers. Any of the relations
can be used to start a predicated string, qualify a branch, or be extracted
into {0,+1} <unsigned> or {-1,0} <signed> -- without having SET instructions
in the instruction set, nor being limited by SET instructions in the instruction
set.
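The persistent-comparison idea can be sketched as a compare that produces a bit vector of relations, with a later branch-on-bit testing any one of them; the bit positions below are made up for illustration, not the real My 66000 encoding.

```python
# Hypothetical relation-to-bit assignment (illustrative only).
REL = {"eq": 0, "ne": 1, "slt": 2, "sle": 3, "sgt": 4, "sge": 5}

def cmp_bits(x, y):
    # Compare once; record every relation between x and y as one bit.
    bits = 0
    if x == y: bits |= 1 << REL["eq"]
    if x != y: bits |= 1 << REL["ne"]
    if x < y:  bits |= 1 << REL["slt"]
    if x <= y: bits |= 1 << REL["sle"]
    if x > y:  bits |= 1 << REL["sgt"]
    if x >= y: bits |= 1 << REL["sge"]
    return bits

def branch_on_bit(bits, rel):
    # A later branch-on-bit just tests one bit of the stored result,
    # so the register holds as many "condition codes" as you like.
    return bool(bits & (1 << REL[rel]))
```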
>
>
> For something more like ARM, one would need a fairly small branch
> displacement, and it would likely do little to help over the use of
> condition-codes.
>
> Though, burning 4-bits on the condition code is a little steep, but they
> didn't need to deal with variable-length instructions (so, they gain
> back a few bits by not needing to be able to encode 16-bit operations or
> similar).
<
The only proper number of bits in a condition code is 0.

MitchAlsup

Dec 7, 2022, 3:10:41 PM
I put this into 1 bit of entropy, and did not charge the consuming instruction
for that bit. That is, normal instructions do not waste a bit encoding a possibility
that occurs less than 20% of the time; instead, I move a group of these bits
over into a PRED instruction, which casts an execute/don't-execute shadow across a
number of instructions.
>
>
> There were a few special ops:
> LDIx Imm24, R0
> That only exist in scalar form, so:
> LDIx Imm24, R0 //Scalar Form
> Jumbo Ext24 //Jumbo Prefixes
> OP?T ... || //If True + Bundle (ISA subset)
> OP?F ... || //If False + Bundle (ISA subset)
>
> This stays within the two bits of encoding entropy, though internally
> 3-bits are used in the pipeline.
>
1-bit per instruction in the pipeline.
>
>
> If I were not using any status bits, such as for a more RISC-V like ISA
> design, I would likely go instead with hard-wired compare-with-zero ops.
>
> In this case, it is nearly as capable, but the checks are significantly
> cheaper to perform (and one can use the extra bits from not having a
> second register, to maybe have a slightly bigger displacement).
>
> Say:
> BZ Rn, Disp17 //Branch if Rn==0
> BNZ Rn, Disp17 //Branch if Rn!=0
> BLT Rn, Disp17 //Branch if Rn< 0 (Bit 63 is Set)
> BGE Rn, Disp17 //Branch if Rn>=0 (Bit 63 is Clear)
>
> In this case, one can have tristate outputs on the compare ops, say:
> -1: A< B
> 0: A==B
> 1: A> B
>
> I would also get rid of the user-specified Link Register, as:
> This is rarely useful;
> It adds a non-trivial cost to the "branch-with-link" mechanism.
>
> While a second link register "can" be useful for prolog compression
> and/or "millicode" functions, the relative savings (vs, say, manually
> copying the link register to a different register and/or saving it to
> the stack) are small enough to be mostly ignored.
<
You "could" add ENTER and EXIT instructions to the ISA and dispense
with the fake code density, along with the alterations of the control
flow (power and cycles).
>

robf...@gmail.com

Dec 7, 2022, 3:26:20 PM
I think deciding whether to have compare-and-branch as one instruction may have to
do with the size of the branch displacement possible in an instruction format.
It takes more bits to represent the instruction if there are two register fields in
it, and that may limit the displacement size. If instructions are 16 bits, specifying
two registers in the instruction is probably not practical due to the small
displacement.

It is often not necessary to do the compare: in some processors the condition
codes are set by many instructions, and a branch can take place without needing
an explicit compare. In that case the instructions are just as compact as doing a
compare-and-branch in one instruction.

RISC-V has only a 12-bit displacement field, which is adequate most of the time,
but no doubt on rare occasions extra code needs to be inserted to branch further
away.

Another consideration is that compare instructions are needed in the instruction
set anyway; it may be less hardware, then, to use separate compare and branch
instructions. Compares are needed for expressions like X = a < b, which are
sometimes recorded instead of branched on.

My latest core stores the results in condition-code registers, like the PowerPC,
rather than GPRs. It is just a bit different. I then manage to get a 21-bit branch
displacement. For branch-to-subroutine, 24-bit displacements are possible.

BGB

Dec 7, 2022, 5:18:41 PM
If I were to do stuff differently here, it would probably be to spend
less of the encoding space on 16-bit ops.

I might choose to stick with 32 GPRs, since the number of cases that
benefit from 64 GPRs is fairly modest.


If 64 GPRs is supported, it likely makes sense to treat these as a
special case encoding (ISA subset or 64-bit ops), since otherwise 6-bit
register fields would still be too much of an issue for encoding space
(and only sometimes bring much benefit).

Though kinda ugly, BJX2 sort of manages 64 GPRs "semi-effectively" with
5-bit register fields. But, not sure I would do it again...



Say, possible re-imagined encoding space:
zzzz-zzzz-zzzz-zzz0 //16 bit
zzzz-zzzz-zzzz-zzzz zzzz-zzzz-zzzz-zpp1 //32+ bit

Then, say, pp:
00: Always
01: WEX
10: ?T
11: ?F

At this point, assuming I stick with 5b register fields and modest
Imm/Disp fields, I would be ahead of where I started out with for BJX2
(which burns 3-bits on the 16/32 split).


Possible Schemes, 16b:
zzzz-mmmm-nnnn-zzz0 //16b 2R (16 regs)
zzzz-iiii-nnnn-zzz0 //16b 2RI (Imm4/Disp4)
zzzm-mmmm-nnnn-nzz0 //16b 2R (32 regs)
zzzi-iiii-nnnn-nzz0 //16b 2RI (Imm5/Disp5, 32 regs)
zzzi-iiii-iiii-izz0 //16b Imm10

Would likely need to provide:
SP-relative Load/Store (R0..R31)
Constant Load (R0..R31)
MOV 2R (R0..R31)
CMPxx-2R and 2RI (R0..R31)
A few basic ALU ops (ADD/SUB);
Short-form Branch Instructions;
...


Could be more minimal than in BJX2, since there is little need for
16-bit ops to actually be able to run the system (but, exist primarily
so that one can save space).

3R or 3RI cases can be skipped as these don't really have enough of a
hit-rate to justify the level of hair they add.


For 32-bit ops, might choose to remain with primarily 9-bit and
immediate and displacement fields, since with scaling, these are
reasonably effective.

Possible 32-bit layout:
zzzt-tttt-zzzz-ssss szzn-nnnn-zzzz-zpp1 //32b, 3R
iiii-iiii-izzz-ssss szzn-nnnn-zzzz-zpp1 //32b, 3RI Imm9
iiii-iiii-iiii-iiii izzn-nnnn-zzzz-zpp1 //32b, 2RI Imm17
iiii-iiii-iiii-iiii iiii-zzzz-zzzz-zpp1 //32b, Imm20 / Branch
iiii-iiii-iiii-iiii iiii-iiii-zzzz-zpp1 //32b, Imm24 (Jumbo)

In this case, the Branch-Op would not be allowed as WEX, and this would
encode the Jumbo block.

Possible:
zzzt-tttt-zzzz-ssss szzn-nnnn-zzz0-0pp1 //32b, 3R
iiii-iiii-izzz-ssss szzn-nnnn-zzz0-1pp1 //32b, 3RI Imm9

iiii-iiii-iiii-iiii izzn-nnnn-z011-1pp1 //32b, 2RI Imm17

Imm16/Imm17 Block:
iiii-iiii-iiii-iiii 000n-nnnn-0011-1pp1 MOV Imm16u, Rn
iiii-iiii-iiii-iiii 100n-nnnn-0011-1pp1 MOV Imm16n, Rn
iiii-iiii-iiii-iiii 001n-nnnn-0011-1pp1 ADD Imm16u, Rn
iiii-iiii-iiii-iiii 101n-nnnn-0011-1pp1 ADD Imm16n, Rn
iiii-iiii-iiii-iiii 010n-nnnn-0011-1pp1 LDSH Imm16u, Rn
iiii-iiii-iiii-iiii 110n-nnnn-0011-1pp1 FLDCH Imm16u, Rn
iiii-iiii-iiii-iiii 011n-nnnn-0011-1pp1 -
iiii-iiii-iiii-iiii 111n-nnnn-0011-1pp1 -

iiii-iiii-iiii-iiii 0ttn-nnnn-1011-1pp1 ? MOV.x Rn, (GBR, Disp16u)
iiii-iiii-iiii-iiii 1ttn-nnnn-1011-1pp1 ? MOV.x (GBR, Disp16u), Rn
00=L, 01=Q, 10=UL, 11=X
Scaled by 4 or 8 (First 256K or 512K of data/bss).

Branch Block:
iiii-iiii-iiii-iiii iiii-ii00-0111-1001 BRA Disp22
iiii-iiii-iiii-iiii iiii-ii01-0111-1001 MOV Imm22u, Rtmp
iiii-iiii-iiii-iiii iiii-ii10-0111-1001 BSR Disp22
iiii-iiii-iiii-iiii iiii-ii11-0111-1001 MOV Imm22n, Rtmp
iiii-iiii-iiii-iiii izzn-nnnn-1111-1001 Bcc Disp17

iiii-iiii-iiii-iiii iiii-iiii-0111-1011 Jumbo-Imm
iiii-iiii-iiii-iiii iiii-iiii-1111-1011 Jumbo-Op
iiii-iiii-iiii-iiii iiii-ii00-0111-1101 BT Disp22
iiii-iiii-iiii-iiii iiii-ii00-0111-1111 BF Disp22

Where, Rtmp is a fixed temporary register (likely the return-value
register).

64-bit:
iiii-iiii-iiii-iiii iiii-iiii-0111-1011 -
iiii-iiii-iiii-iiii iiii-iiii-0111-1001 BRA Abs48

96-bit:
iiii-iiii-iiii-iiii iiii-iiii-0111-1011 -
iiii-iiii-iiii-iiii iiii-iiii-0111-1011 -
iiii-iiii-iiii-iiii 000n-nnnn-0011-1pp1 MOV Imm64, Rn

iiii-iiii-iiii-iiii iiii-iiii-0111-1011 -
iiii-iiii-iiii-iiii iiii-iiii-0111-1011 -
iiii-iiii-iiii-iiii 001n-nnnn-0011-1pp1 ADD Imm64, Rn

...


Might still need some more thinking here...


>>
>>
>> There were a few special ops:
>> LDIx Imm24, R0
>> That only exist in scalar form, so:
>> LDIx Imm24, R0 //Scalar Form
>> Jumbo Ext24 //Jumbo Prefixes
>> OP?T ... || //If True + Bundle (ISA subset)
>> OP?F ... || //If False + Bundle (ISA subset)
>>
>> This stays within the two bits of encoding entropy, though internally
>> 3-bits are used in the pipeline.
>>
> 1-bit per instruction in the pipeline.

Possibly, but my scheme allows executing both the Then and Else branches
in parallel in many cases, and does not need an explicit "PRED" op, ...

Also doesn't lead to extra state that would still need to be saved
somehow if an interrupt occurs.


>>
>>
>> If I were not using any status bits, such as for a more RISC-V like ISA
>> design, I would likely go instead with hard-wired compare-with-zero ops.
>>
>> In this case, it is nearly as capable, but the checks are significantly
>> cheaper to perform (and one can use the extra bits from not having a
>> second register, to maybe have a slightly bigger displacement).
>>
>> Say:
>> BZ Rn, Disp17 //Branch if Rn==0
>> BNZ Rn, Disp17 //Branch if Rn!=0
>> BLT Rn, Disp17 //Branch if Rn< 0 (Bit 63 is Set)
>> BGE Rn, Disp17 //Branch if Rn>=0 (Bit 63 is Clear)
>>
>> In this case, one can have tristate outputs on the compare ops, say:
>> -1: A< B
>> 0: A==B
>> 1: A> B
>>
>> I would also get rid of the user-specified Link Register, as:
>> This is rarely useful;
>> It adds a non-trivial cost to the "branch-with-link" mechanism.
>>
>> While a second link register "can" be useful for prolog compression
>> and/or "millicode" functions, the relative savings (vs, say, manually
>> copying the link register to a different register and/or saving it to
>> the stack) are small enough to be mostly ignored.
> <
> You "could" add ENTER and EXIT instructions to the ISA and dispense
> with the fake code density, along with the alterations of the control
> flow (power and cycles).


These would require some sort of state machine, and/or a mechanism to
break apart ops into multiple u-ops in the pipeline. Similar issues go
for ARM style LDM/STM.

Something like millicode allows the (cheap) mechanism of just using an
unrolled sequence of normal Load/Store instructions (may be kept inline
in the function in cases where N is small).


BGB

Dec 7, 2022, 6:00:15 PM
On 12/7/2022 2:26 PM, robf...@gmail.com wrote:
> I think deciding between compare-and-branch as one instruction may have to
> do with the size of the branch displacement possible in an instruction format.
> It takes more bits to represent the instruction if there are two register fields in
> it and that may limit the displacement size. If instructions are 16-bits specifying
> two registers in the instruction probably is not practical due to the small
> displacement.
>
> It is often not necessary to do the compare, in some processors the condition
> codes are set by many instructions and a branch can take place without needing
> an explicit compare. In that case the instructions are just as compact as doing a
> compare-and-branch in one instruction.
>

As noted, I still like the 1-bit True/False scheme. It has some of the
advantages of CCs, mostly without the same sorts of drawbacks.

Failing this, fixed compare-with-zero is the second-place option (nearly
as powerful as the full-compare case, but cheaper and with fewer issues).


> RISCV has only a 12-bit displacement field which is adequate most of the time,
> but no doubt on rare occasions extra code needs to be inserted to branch further
> away.
>

Such is the issue.

The 12 bits fall into the category of "usually but not always sufficient".

For larger functions, the compiler would need to do something like:
Bcc Rs, Rt, .L0
JAL X0, lbl
.L0:
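That workaround can be sketched as an assembler/compiler relaxation step; the 12-bit default matches RISC-V's conditional-branch immediate, and the mnemonic strings are placeholders, not real syntax.

```python
def fits_signed(disp, bits):
    # True if disp is representable in a signed field of width `bits`.
    return -(1 << (bits - 1)) <= disp < (1 << (bits - 1))

def expand_branch(disp, bits=12):
    # Short displacement: a single conditional branch suffices.
    if fits_signed(disp, bits):
        return ["Bcc   Rs, Rt, target"]
    # Long displacement: invert the condition to skip over an
    # unconditional long-range jump (the Bcc/JAL pattern above).
    return ["B!cc  Rs, Rt, .L0", "JAL   X0, target", ".L0:"]
```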

With 17 bits, this would be far less likely.

With 20 or 22 bits, it would drop essentially to zero (no functions are
this large).


OTOH, with 8 or 9 bits, it is more a case of "well, it is often sufficient"
(but much of the time, it is not).


> Another consideration is that compare instructions are needed in the instruction
> set anyway, it may be less hardware then to use separate compare and branch
> instructions. Compares are needed for expressions like: X = a < b. which are
> sometimes recorded instead of branched on.
>

Yeah.

RISC-V also has a "SLT" instruction for these cases:
SLT Rn, Rs, Rt
Does effectively:
Rn = Rs < Rt;


In my case, there are "MOVT Rn" / "MOVNT Rn", which copy the T bit into
a register (as-is, or inverted).
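Side by side, the two ways of materializing a comparison result might be sketched like this; the function names are mine, modeling the instructions' effects rather than their encodings.

```python
def slt(rs, rt):
    # RISC-V SLT: rd = 1 if rs < rt (signed), else 0.
    return int(rs < rt)

def movt(t, invert=False):
    # BGB-style MOVT/MOVNT: copy the T bit into a register,
    # as-is or inverted.
    return int(t) ^ int(invert)
```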



> My latest core stores the results in condition code registers like the PowerPC
> rather than gprs. It is just a bit different. I manage then to get a 21-bit branch
> displacement. For branch-to-subroutine 24-bit displacements are possible.
>


In a few cases, I considered options for using a small set of P-bit
registers, say:
000= P0 (P0=1)
001=!P0 (P0=1)
010= P1 (T)
011=!P1 (T)
100= P2 (S)
101=!P2 (S)
110= P3 (U)
111=!P3 (U)

Where P0 is hard-wired as 1, and !P0 is a special case (such as WEX).

In this case, things like compare ops would be directed into one of the
P-bit registers.

But, the need for more than a single predicate bit is obscure enough to
make it not really worthwhile to spend an extra encoding bit on it
(needing to deal with multiple predicated branches or similar at the
same time being "fairly uncommon").

Had also experimented with a mechanism to treat P-bits as a small stack
machine, but there wasn't any good way to make use of this in my
compiler (enough to offset the cost of the mechanism to do so).

This could have allowed effectively handling modest-size if/else trees
using predication, but is "much less straightforward" for a compiler
(and both less versatile and harder to work with in the backend than
having a few free-form predicate-bit registers).

...

Russell Wallace

Dec 8, 2022, 2:02:06 AM
On Wednesday, December 7, 2022 at 6:26:02 PM UTC, Stephen Fuld wrote:
> Unless you are talking about a compare to some fixed value, expressed in
> the op code (e.g. a compare to zero and branch instruction) it would
> have required an extra field in the instruction, i.e. two values to
> compare and a third to specify the branch target address. Given the
> fixed 32 bit instructions, this would have been "a challenge". :-)

10 bits operands, 1 bit direction, 1 bit yes/no, 2 bits for kind of comparison (==, signed/uns <, and a <= b ⇔ !(b < a)), 2 bits for opcode, leaves 16 bits of displacement, which should be more than plenty. I think 2 bits for opcode is justified given the importance of this instruction, but could bump it all the way to something like 6 bits, and still get a reasonably ample 12 bits of displacement.
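As a sanity check on that arithmetic (the field names are just labels for the proposal above):

```python
# Bit budget for the proposed 32-bit compare-and-branch encoding.
fields = {"operands": 10, "direction": 1, "yes_no": 1,
          "cmp_kind": 2, "opcode": 2}
disp_bits = 32 - sum(fields.values())  # 16 bits left for displacement
```

Bumping the opcode field to 6 bits would leave 12 bits of displacement, as the post notes.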

MitchAlsup

Dec 8, 2022, 2:52:44 PM
On Wednesday, December 7, 2022 at 2:26:20 PM UTC-6, robf...@gmail.com wrote:
> I think deciding between compare-and-branch as one instruction may have to
> do with the size of the branch displacement possible in an instruction format.
> It takes more bits to represent the instruction if there are two register fields in
> it and that may limit the displacement size. If instructions are 16-bits specifying
> two registers in the instruction probably is not practical due to the small
> displacement.
<
With the Compare instruction and then branch model, there is 1 branch instruction
with a myriad of sub instructions {Br ==, Br !=, Br s>, BR s>=, Br s<, BR s<=, Br u<
Br u<=, Br u>, BR u>=, and more} all as 1 single Major OpCode (front 6 bits). I
use the bits where the destination register would be to encode the condition.
<
In contrast, the Compare-Branch model only has a few compares-and-branches,
due to not wanting to waste too much OpCode space on the compare-branch.
<
My compare then branch model uses a bit vector condition, so branch on bit
falls out for free. Over on the compare-to-zero-and-branch side, I have a complete
set of compare-to-zero-branch instructions, some with the ability to test for
situations outside of normal programmatics--for example: test if an ATOMIC
sequence has suffered from interference.
>
> It is often not necessary to do the compare, in some processors the condition
> codes are set by many instructions and a branch can take place without needing
> an explicit compare. In that case the instructions are just as compact as doing a
> compare-and-branch in one instruction.
<
The SPARC experience indicates that 80% of this free work is non-useful.
>
> RISCV has only a 12-bit displacement field which is adequate most of the time,
> but no doubt on rare occasions extra code needs to be inserted to branch further
> away.
<
My 66000 has 16 bits of word displacement in the OpCode, which is inadequate
far less often than RISC-V's: you would need a subroutine with more than 32,767
instructions.
>
> Another consideration is that compare instructions are needed in the instruction
> set anyway, it may be less hardware then to use separate compare and branch
> instructions. Compares are needed for expressions like: X = a < b. which are
> sometimes recorded instead of branched on.
<
My 66000 compare instructions build a bit-vector of relations. Then my shift
instructions can extract the signed set {-1,0} or the unsigned set {0,+1}, so
different languages can access what is their definition of the right result.
>
> My latest core stores the results in condition code registers like the PowerPC
> rather than gprs. It is just a bit different. I manage then to get a 21-bit branch
> displacement. For branch-to-subroutine 24-bit displacements are possible.
<
How often is 21-bits sufficient (where 16-bits would have been insufficient) ??
Do you know of a subroutine that is bigger than 1,000,000 instructions ?
<
As for direct BR and CALL one has 26-bits

Scott Lurndal

Dec 8, 2022, 3:07:31 PM
For assembly code, as opposed to more constrained compiler
generated subroutines; particularly OS/VM level code, I can
see cases where the virtual address space is fragmented such that
21 bit offsets (or even 32-bit offsets) wouldn't be sufficient;
particularly in a 64-bit address space.

MitchAlsup

Dec 8, 2022, 3:46:15 PM
Fair enough::
<
But when the Branch/call space is insufficient, I have access to both 32-bit
and 64-bit (using my constant extension). So, if the ASM programmer
writes::
<
BR/CALL Some_GuestOS_Common_Entry_Point
<
The assembler leaves it in 64-bit form. Later, when the linker resolves the
linkages and find this is 32-bit or 26-bit capable, it (linker) can shrink the
module.
<
When ASLR randomizes the VASs, different images may find different entry
points using different sized "constants" than the previous image or later image.
<
I should also notice that should the linker leave the linkage uniformly 64-bits
that this takes no more cycles to execute (98% confidence) than the smaller
forms--all you are saving is code footprint not cycles. The 1-word BR address
takes the same cycle count as the 2-word BR and as the 3-word BR. So, the only
gain in compression of potentially large displacements is code footprint not
cycles.

BGB

Dec 8, 2022, 4:58:22 PM
On 12/8/2022 1:52 PM, MitchAlsup wrote:
> On Wednesday, December 7, 2022 at 2:26:20 PM UTC-6, robf...@gmail.com wrote:
>> I think deciding between compare-and-branch as one instruction may have to
>> do with the size of the branch displacement possible in an instruction format.
>> It takes more bits to represent the instruction if there are two register fields in
>> it and that may limit the displacement size. If instructions are 16-bits specifying
>> two registers in the instruction probably is not practical due to the small
>> displacement.
> <
> With the Compare instruction and then branch model, there is 1 branch instruction
> with a myriad of sub instructions {Br ==, Br !=, Br s>, BR s>=, Br s<, BR s<=, Br u<
> Br u<=, Br u>, BR u>=, and more} all as 1 single Major OpCode (front 6 bits). I
> use the bits where the destination register would be to encode the condition.
> <
> In contrast, the Compare-Branch model only has a few compares-and-branches
> due to not wanting to waste too much OpCode space of the compare-branch.
> <
> My compare then branch model uses a bit vector condition, so branch on bit
> falls out for free. Over on the compare-to-zero-and-branch side, I have a complete
> set of compare-to-zero-branch instructions, some with the ability to test for
> situations outside of normal programmatics--for example: test if an ATOMIC
> sequence has suffered from interference.


I ended up defining some RISC-V style compare-and-branch instructions.
The only real reason they are enabled is that the RISC-V mode needs them.

They end up taking up ~ 1/4 of the F1 block (mostly used for Load/Store
Disp9 ops).


IIRC, I did add compiler support for them, but they are still "pretty
meh": "if(a<b)" as a single op is not that big of a win when the target
must be within +/- 256 bytes.

Ironically, "if(a<0)" tends to end up being more useful (these ended up
existing as separate ops in BJX2 mostly because of the lack of an
architectural zero register).


In the case of integer compare ops, the majority of cases tend to be
comparisons with a constant, which is on average better served, say, by
"CMPxx Imm10, Rn; BT/BF lbl"


>>
>> It is often not necessary to do the compare, in some processors the condition
>> codes are set by many instructions and a branch can take place without needing
>> an explicit compare. In that case the instructions are just as compact as doing a
>> compare-and-branch in one instruction.
> <
> The SPARC experience indicates that 80% of this free work is non-useful.

Most status flag updating is noise.

If one does status-flags like on x86, then far more often the status
flags will end up being stomped before one can do anything with the results.

Status flags are more useful, ironically, when only a very limited
selection of instructions modify them.


Similar for True/False bit predication:
To be useful for Then/Else blocks, also implies that no instruction
inside the Then/Else block may update this bit.

>>
>> RISCV has only a 12-bit displacement field which is adequate most of the time,
>> but no doubt on rare occasions extra code needs to be inserted to branch further
>> away.
> <
> My 66000 has 16-bits of word displacement in the OpCode, which is inadequate
> a whole lot less often than RISC-V's. You would need a subroutine with more than
> 32767 instructions.

In my case, BT/BF are 20-bit, which is almost always sufficient.

But, yeah, a branch within +/- 32K words will hit a lot more often than
+/- 2K words (which will hit more often than +/- 256 words).


>>
>> Another consideration is that compare instructions are needed in the instruction
>> set anyway, it may be less hardware than to use separate compare and branch
>> instructions. Compares are needed for expressions like: X = a < b. which are
>> sometimes recorded instead of branched on.
> <
> My 66000 compare instructions build a bit-vector of relations. Then my shift
> instructions can extract the signed set {-1,0} or the unsigned set {0,+1}, so
> different languages can access what is their definition of the right result.

In my case, they come in "flavors":
EQ
GT, GE (Signed)
HI, HS (Unsigned)

BT/BF choice may be used to build the other possibilities.
Strictly speaking, one only needs EQ, GT, and HI; but this would leave
immediate-form instructions unable to directly express some cases (LT
and GE; and if these cases need a separate constant-load, this is lame).

So, one needs, say:
CMPEQ Reg, Reg
CMPEQ Imm, Reg
CMPGT Reg, Reg
CMPGT Imm, Reg
CMPGE Imm, Reg

CMPHI Reg, Reg
CMPHI Imm, Reg
CMPHS Imm, Reg
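A sketch of why that particular immediate-form subset suffices (the compare helpers below are hypothetical stand-ins for the CMPxx ops): register forms can swap operands to get LT/LE from GT/GE, but an immediate form cannot swap, so the GE/HS immediates plus the BF (branch-if-false) polarity cover the otherwise-missing cases.

```python
M = (1 << 64) - 1   # assume 64-bit registers

def cmpgt(a, b): return a > b                  # signed compare
def cmpge(a, b): return a >= b
def cmphi(a, b): return (a & M) > (b & M)      # unsigned compare
def cmphs(a, b): return (a & M) >= (b & M)

# Register forms: swap operands for the missing relations.
a, b = 5, 9
assert cmpgt(b, a) == (a < b)              # LT via swapped GT
assert cmpge(b, a) == (a <= b)             # LE via swapped GE

# Immediate forms: can't swap, so use GE/GT/HS with BF polarity.
imm = 7
assert (not cmpge(a, imm)) == (a < imm)    # "a < imm"  = CMPGE Imm + BF
assert (not cmpgt(a, imm)) == (a <= imm)   # "a <= imm" = CMPGT Imm + BF
assert (not cmphs(a, imm)) == ((a & M) < (imm & M))   # unsigned LO = CMPHS Imm + BF
```

Without the GE/HS immediate forms, "a < imm" would need a constant load into a scratch register first, which is the "lame" case noted above.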


>>
>> My latest core stores the results in condition code registers like the PowerPC
>> rather than gprs. It is just a bit different. I manage then to get a 21-bit branch
>> displacement. For branch-to-subroutine 24-bit displacements are possible.
> <
> How often is 21-bits sufficient (where 16-bits would have been insufficient) ??
> Do you know of a subroutine that is bigger than 1,000,000 instructions ?
> <
> As for direct BR and CALL one has 26-bits

Thus far, none of my test programs have exceeded the existing 20-bit
limit (1MB), for the entire ".text" section.

When this happens, things like function calls will start needing to use
a bigger encoding (thus far mostly untested; and will not currently be
predicted by the branch predictor). As-is, there are both Disp33s and
Abs48 encodings as possible fallback cases.

For most sane functions, for internal branches, this is likely in
"probably not going to happen" territory.


Closest to this limit thus far is ROTT, which weighs in at a (relatively
massive) 754K of ".text" (for reference, the x86-64 builds are nearly 2MB).

Though, part of this seems to be that the people at Apogee used excessive
amounts of macros and copy/paste.


Something like a 22-bit branch might be better in the sense of "yeah,
4MB .text section is not implausible".

Though, 26-bits seems a little overkill.


Non-local calls and branches need not use relative branch instructions
though, so this case does not matter.




In other news, my experiment for Huffman coding + LZ with 4 interleaved
bitstreams was a moderate success (on my Ryzen 2700X, MSVC):
Decode speeds ~ 400 to 700 MB/sec with Huffman coding;
Decode speeds ~ 600 to 1200 MB/sec with fixed-length (byte) symbols;
Compression ratios also slightly better than Deflate.

GCC still seems to be doing its "approx 40% faster" thing, so with GCC,
in a few cases, I was getting Huffman-coded decode speeds of around 1.0 GB/sec.

General approach should also map over to BJX2.

The LZ format is a little wonky though on account of the use of parallel
buffers for encoding/decoding stuff. Other than the parallel buffer
wonkiness, the design is vaguely similar to that of LZ4 (just with tags,
literal, distance, and extra bytes, being spread into 4 parallel buffers).
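A toy rendering of the parallel-buffer idea (the format details below are invented for illustration, not BGB's actual format): an LZ4-like token stream where the tag bytes, literal bytes, distance bytes, and length-extension bytes each go into their own buffer, so the decoder never interleaves reads within one stream.

```python
def find_match(data, pos):
    """Naive longest-match search (window limited to 64KB distances)."""
    best_len, best_dist = 0, 0
    for cand in range(max(0, pos - 65535), pos):
        l = 0
        while pos + l < len(data) and data[cand + l] == data[pos + l]:
            l += 1
        if l > best_len:
            best_len, best_dist = l, pos - cand
    return best_len, best_dist

def encode(data):
    tags, lits, dists, extras = bytearray(), bytearray(), bytearray(), bytearray()
    def put_extra(n):                     # LZ4-style length-extension bytes
        while n >= 255:
            extras.append(255); n -= 255
        extras.append(n)
    pos = lit_start = 0
    while pos < len(data):
        mlen, dist = find_match(data, pos)
        if mlen >= 4:
            llen = pos - lit_start
            lcode, mcode = min(llen, 15), min(mlen - 3, 15)  # mcode 0 = "end" sentinel
            tags.append((lcode << 4) | mcode)
            if lcode == 15: put_extra(llen - 15)
            lits += data[lit_start:pos]
            if mcode == 15: put_extra(mlen - 18)
            dists += bytes((dist & 255, dist >> 8))
            pos += mlen; lit_start = pos
        else:
            pos += 1
    llen = len(data) - lit_start          # final literal-only token, mcode 0 = end
    lcode = min(llen, 15)
    tags.append(lcode << 4)
    if lcode == 15: put_extra(llen - 15)
    lits += data[lit_start:]
    return bytes(tags), bytes(lits), bytes(dists), bytes(extras)

def decode(tags, lits, dists, extras):
    out = bytearray(); ti = li = di = ei = 0
    def get_extra(n):
        nonlocal ei
        while True:
            b = extras[ei]; ei += 1; n += b
            if b < 255: return n
    while True:
        tag = tags[ti]; ti += 1
        llen = tag >> 4
        if llen == 15: llen = get_extra(15)
        out += lits[li:li + llen]; li += llen
        mcode = tag & 15
        if mcode == 0:
            return bytes(out)
        mlen = mcode + 3
        if mcode == 15: mlen = get_extra(18)
        dist = dists[di] | (dists[di + 1] << 8); di += 2
        for _ in range(mlen):             # byte-at-a-time copy handles overlap
            out.append(out[-dist])
```

Each of the four streams could then be entropy-coded independently, which is where the interleaved-decode speedup comes from.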


This is also with a 12-bit symbol length limit (can do 13 bit, which
helps compression at the cost of decoding speed).


Still, technically only a moderate improvement over a codec based on normal
(symbol at a time) decoding with a 12-bit symbol length limit (vs, say, the
speed improvement of a 12-bit limit vs 15 bits).

But, at least, it is not being particularly bottle-necked by the entropy
coder (in the profiler), which is something (also spends a lot of time
in handling the LZ format, and in the match copying function).


Had also designed the format to be able to deal with deltas between
buffers (partly inspired by Sierra's video format). Though this part was
not being tested in this case.


So, would at least appear to be "kinda fast-ish", at least as far as
entropy-encoded designs go.

...

robf...@gmail.com

Dec 10, 2022, 3:51:40 AM
Found I can use the condition registers as complicated predicate registers
with the use of the PRED instruction modifier. I am liking the My66000 PRED
modifier idea. The original Thor core had 16 predicate registers which were
basically, condition code registers that were coupled with a four-bit condition
specifier. Thus, eight bits were tacked onto the start of every instruction to
specify a predicate. I found most of the time the predicate was not used so a
byte of memory was wasted for many instructions. Eventually decided to
shelve the core. But now with My66000 style PREDs there is one extra
instruction for the PRED, 32 bits, but it can be applied for up to the following
seven instructions (in my core). And the bytes are used only when a predicate
is needed. So, now there are eight condition (predicate) registers instead of
16. But the number of predicate conditions has expanded to allow a variety of
floating-point conditions. So, a six-bit condition field is used. This is nine bits
total, but it only must be present in the PRED instruction which has lots of bits
available.
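A toy sketch of the PRED-shadow execution model described above (the instruction tuples and the 'T'/'E' mask format are invented for illustration): PRED reads a condition register and casts a then/else mask over the next few instructions, which stay in the fetch stream but are nullified if their mask letter disagrees with the condition.

```python
def run(program, regs):
    """Execute a tiny instruction list with a My66000-style PRED shadow."""
    shadow, taken = [], False    # pending 'T'/'E' marks for following instructions
    for op in program:
        if shadow:
            mark = shadow.pop(0)
            if (mark == 'T') != taken:
                continue                        # nullified: wrong side of the PRED
        kind = op[0]
        if kind == 'PRED':                      # ('PRED', cond_reg, mask), mask like 'TE'
            _, creg, mask = op
            taken, shadow = bool(regs[creg]), list(mask)
        elif kind == 'CMPGT':                   # ('CMPGT', cond_reg, rs1, rs2)
            _, creg, rs1, rs2 = op
            regs[creg] = regs[rs1] > regs[rs2]
        elif kind == 'MOV':                     # ('MOV', rd, rs)
            _, rd, rs = op
            regs[rd] = regs[rs]
    return regs

# max(r1, r2) -> r3 with one PRED covering a then-slot and an else-slot
MAXPROG = [
    ('CMPGT', 'c0', 'r1', 'r2'),
    ('PRED', 'c0', 'TE'),
    ('MOV', 'r3', 'r1'),   # then: runs only if r1 > r2
    ('MOV', 'r3', 'r2'),   # else: runs only if not
]
```

The interrupt question above corresponds to what happens if execution is suspended while `shadow` is non-empty: either the mask and condition must be saved as state, or interrupts must wait out the shadow.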
One complication is dealing with external interrupts that may happen. My first
thought is to simply not allow them in the shadow of a PRED instruction. It
would be delaying interrupt processing by seven or eight instructions max.
Otherwise, the PRED state would need to be saved somehow and restored after
the interrupt. Possibly by saving a copy of the interrupted instruction plus
predicate.

My core uses condition registers to store a bit vector of comparison flags in a
manner analogous to the My66000 use of a GPR.
I am hoping to have more GPRs available for other use by offloading the compare
results to dedicated registers. The branch instruction then also need only specify
three bits for the condition register.

EricP

Dec 10, 2022, 10:41:08 AM
robf...@gmail.com wrote:
> One complication is dealing with external interrupts that may happen. My first
> thought is to simply not allow them in the shadow of a PRED instruction. It
> would be delaying interrupt processing by seven or eight instructions max.
> Otherwise, the PRED state would need to be saved somehow and restored after
> the interrupt. Possibly by saving a copy of the interrupted instruction plus
> predicate.

An alternative is a PRED prefix that only applies to a single instruction.

Of course it won't win awards on code size but decode sees the pairing
as a single instruction and the shadow complications do go away.
Internally it still has many of the complications of predicated uOps,
but the dependency chain is simplified because it doesn't interact
with the shadow mask and its invisible global state -
it is just an optional extra source operand on each uOp.



BGB

Dec 10, 2022, 1:55:20 PM
And, burning 2 or 3 bits in the encoding means no space (or cycles) are
wasted on a prefix.

Granted, one could argue that the predicate field and the existence of 16-bit
ops are the dividing line for the practicality of 6-bit register fields.

Say, 6b:
zztt-tttt-zzzz-ssss ssnn-nnnn-zzzz-pp11 //10b for opcode
zztt-tttt-zzzz-ssss ssnn-nnnn-zzzz-zpp1 //11b for opcode
zztt-tttt-zzzz-ssss ssnn-nnnn-zzzz-zzzz //14b for opcode

Vs, 5b:
zzzt-tttt-zzzz-ssss szzn-nnnn-zzzz-pp11 //13b for opcode
zzzt-tttt-zzzz-ssss szzn-nnnn-zzzz-zpp1 //14b for opcode
zzzt-tttt-zzzz-ssss szzn-nnnn-zzzz-zzzz //17b for opcode
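The opcode-bit counts in the layouts above can be checked mechanically: every `z` in a layout string is a free opcode bit, so counting them reproduces the `//Nb for opcode` comments.

```python
def opcode_bits(layout):
    """Count free opcode bits ('z') in an encoding-layout string."""
    return layout.count('z')

# 6b register fields
assert opcode_bits("zztt-tttt-zzzz-ssss ssnn-nnnn-zzzz-pp11") == 10
assert opcode_bits("zztt-tttt-zzzz-ssss ssnn-nnnn-zzzz-zpp1") == 11
assert opcode_bits("zztt-tttt-zzzz-ssss ssnn-nnnn-zzzz-zzzz") == 14

# 5b register fields
assert opcode_bits("zzzt-tttt-zzzz-ssss szzn-nnnn-zzzz-pp11") == 13
assert opcode_bits("zzzt-tttt-zzzz-ssss szzn-nnnn-zzzz-zpp1") == 14
assert opcode_bits("zzzt-tttt-zzzz-ssss szzn-nnnn-zzzz-zzzz") == 17
```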



With my existing ISA (in the above scheme):
zzzz-znst-tttt-zzzz 111p-zzpz-nnnn-ssss //12b for opcode

Have noted, one doesn't really want to have much less opcode than this.

Had hacked on some 32b encodings with 6b fields, say:
zzzz-znst-tttt-zzzz 0111-wnst-nnnn-ssss //9b for opcode
zzzz-znsi-iiii-iiii 1001-wnsz-nnnn-ssss //6b for opcode

Which lack predication and only cover a subset of the ISA, and then
stuff gets more hacky...

There is the Op64 encoding, which supports both 6b register fields and
predication, but not bundling, and needs 64b to encode.

There is also 2x Op40, which supports both (in a bundle), as a 96-bit
encoding, but is a bit of a hack (in the decoder, it fakes it as if
there were a pair of op64 encodings; just with a lot of the bits
hard-wired to 0).



With 6b fields, a "break even" would be to give up entirely on 16-bit
ops but keep a 2b pp field, say:
zztt-tttt-zzzz-ssss ssnn-nnnn-zzzz-zzpp //12b for opcode
Or, the RISC-V style notation of breaking up by field:
zz-tttttt-zzzz-ssssss-nnnnnn-zzzzzz-pp //12b for opcode

Code density could suffer slightly, but it could allow both uniform
6-bit register access and per-instruction predication (but not multiple
predicate bits).

Another option being to just start with 11b and try a little harder to
optimize the use of the encoding space (will still need to keep Jumbo
encodings or similar as an "escape hatch", or for less-common instructions).

Though, 16b encodings won't go very far if trying for 6b fields:
zznn-nnnn-ssss-ssz0 //3b opcode... yeah, won't go far.

Almost makes more sense to not bother if going directly for 64 GPRs, and
instead skip out on 16b ops (gaining an extra opcode bit).


Well, and/or treat the low 3 bits as a combined field:
000: 32b
001: 32b
010: 16b (Opt, 1)
011: 32b
10z: 32b
11z: 32b

1: So, there would be half the encoding space which would effectively be
disallowed from being used outside of Lane 1. This encoding space could
be left to 16b ops, which will necessarily need to operate on a subset
of the register space.

...


MitchAlsup

Dec 10, 2022, 2:57:48 PM
signed {=, !=, >, >=, <, <=} unsigned {>, >=, <, <=} float {=, !=, >, >=, <, <=, OR, UN} and
double {=, !=, >, >=, <, <=, OR, UN}
<
I count 26 which fits in 5-bits.
<
> One complication is dealing with external interrupts that may happen. My first
> thought is to simply not allow them in the shadow of a PRED instruction. It
> would be delaying interrupt processing by seven or eight instructions max.
> Otherwise, the PRED state would need to be saved somehow and restored after
> the interrupt. Possibly by saving a copy of the interrupted instruction plus
> predicate.
>
> My core uses condition registers to store a bit vector of comparison flags in a
> manner analogous to the My66000 use of a GPR.
<
Another convert........

John Levine

Dec 10, 2022, 4:37:58 PM
According to EricP <ThatWould...@thevillage.com>:
>robf...@gmail.com wrote:
>> One complication is dealing with external interrupts that may happen. My first
>> thought is to simply not allow them in the shadow of a PRED instruction. It
>> would be delaying interrupt processing by seven or eight instructions max.
>> Otherwise, the PRED state would need to be saved somehow and restored after
>> the interrupt. Possibly by saving a copy of the interrupted instruction plus
>> predicate.
>
>An alternative is a PRED prefix that only applies to a single instruction.

Fifty years ago we called those skip instructions.

On a PDP-6, to put the larger of A and B into accumulator AC:

MOVE AC,A ; put A into AC
CAMGE AC,B ; skip if AC >= B
MOVE AC,B ; put B into AC
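The skip mechanism is just a conditional extra increment of the program counter. A toy interpreter rendering of the sequence above (mnemonics only loosely PDP-6-like, and registers stand in for both accumulators and memory operands):

```python
def run(prog, regs):
    """Tiny interpreter where a skip is a conditional extra PC increment."""
    pc = 0
    while pc < len(prog):
        op, a, b = prog[pc]
        if op == 'MOVE':                 # regs[a] = regs[b]
            regs[a] = regs[b]
        elif op == 'CAMGE':              # skip next instruction if regs[a] >= regs[b]
            if regs[a] >= regs[b]:
                pc += 1                  # the "skip": same state sequence, bigger step
        pc += 1
    return regs

# the max() idiom from the post above
MAXPROG = [('MOVE', 'AC', 'A'), ('CAMGE', 'AC', 'B'), ('MOVE', 'AC', 'B')]
```

Note the skip never changes the shape of the execute sequence, only the size of the PC step, which is the implementation-cost point made later in the thread.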

--
Regards,
John Levine, jo...@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

robf...@gmail.com

Dec 10, 2022, 5:07:25 PM
>signed {=, !=, >, >=, <, <=} unsigned {>, >=, <, <=} float {=, !=, >, >=, <, <=, OR, UN} and
>double {=, !=, >, >=, <, <=, OR, UN}
><
>I count 26 which fits in 5-bits.

I followed the 68k FPU's float tests, of which there are 32 just for one precision. They
went a bit crazy testing for NaNs in combination with other relations. Then there are
the 16 integer conditions based on the flags nf,vf,cf,zf. That leaves 16 conditions to be
defined.
I was thinking of including magnitude greater than and less than too, which would take
the abs() of the numbers first. I have a low bar for including ops in the instruction set,
as a lot can be supported. Somewhere around 0.1% used.

Just added triple precision decimal-float to my 68k core. Primarily using the 68k core
to test the decimal-float instructions. They use separate compare and branch
instructions. Found an error in recent docs for the 68881/68882 FPU.



MitchAlsup

Dec 10, 2022, 7:09:40 PM
On Saturday, December 10, 2022 at 3:37:58 PM UTC-6, John Levine wrote:
> According to EricP <ThatWould...@thevillage.com>:
> >robf...@gmail.com wrote:
> >> One complication is dealing with external interrupts that may happen. My first
> >> thought is to simply not allow them in the shadow of a PRED instruction. It
> >> would be delaying interrupt processing by seven or eight instructions max.
> >> Otherwise, the PRED state would need to be saved somehow and restored after
> >> the interrupt. Possibly by saving a copy of the interrupted instruction plus
> >> predicate.
> >
> >An alternative is a PRED prefix that only applies to a single instruction.
> Fifty years ago we called those skip instructions.
<
Which comes back to the important point about predication ::
<
You do not want to predicate any farther than is already covered by your FETCH
momentum. Predication is entirely present to avoid disrupting the current FETCH
stream--and no farther. If you have to (or want to) disrupt the FETCH point, then
branch (and friends) is the proper tool.
>
> On a PDP-6, to put the larger of A and B into accumulator AC:
>
> MOVE AC,A ; put A into AC
> CAMGE AC,B ; skip if AC >= B
> MOVE AC,B ; put B into AC
>
There were a lot of cool things about the PDP-6 and the PDP-10 follow ons.

EricP

Dec 11, 2022, 11:25:47 AM
John Levine wrote:
> According to EricP <ThatWould...@thevillage.com>:
>> robf...@gmail.com wrote:
>>> One complication is dealing with external interrupts that may happen. My first
>>> thought is to simply not allow them in the shadow of a PRED instruction. It
>>> would be delaying interrupt processing by seven or eight instructions max.
>>> Otherwise, the PRED state would need to be saved somehow and restored after
>>> the interrupt. Possibly by saving a copy of the interrupted instruction plus
>>> predicate.
>> An alternative is a PRED prefix that only applies to a single instruction.
>
> Fifty years ago we called those skip instructions.
>
> On a PDP-6, to put the larger of A and B into accumulator AC:
>
> MOVE AC,A ; put A into AC
> CAMGE AC,B ; skip if AC >= B
> MOVE AC,B ; put B into AC
>

Not really - skip is a very short branch, forward only.

Skip/Branch removes intervening instructions from the pipeline
and no resources are assigned for them.
(Consider how skip would behave in an in-order pipeline without branch
prediction - it would stall at fetch until its conditional resolved.)

Predication keeps the pipeline running, feeding the predicated
instructions into the pipeline and tentatively assigns resources
to them in case they do execute.
Later if they do not execute it must patch the result environment
so it looks like the disabled instructions never existed,
which can possibly include propagating values to different locations
and dynamically editing dependency chains accordingly,
then clean up and recover the assigned resources.
This must be done in such a way that it can take an interrupt/exception
at any point and resolve to a predictable and restartable state.

BTW on the simple cpu designs in days of yore, the advantage of
skip conditional over an unconditional branch vs a conditional branch
is for skip the HW instruction execute state machine always has the same
state sequence and just changes the value added to IP, whereas BRcc has to
change its whole state sequence depending on CC to perform/not-perform
the offset fetch and add. So BRcc is more complex to implement than SKIPcc,
and when you are paying per-gate for the cpu the cost difference matters.



MitchAlsup

Dec 11, 2022, 1:10:54 PM
On Sunday, December 11, 2022 at 10:25:47 AM UTC-6, EricP wrote:
> John Levine wrote:
> > According to EricP <ThatWould...@thevillage.com>:
> >> robf...@gmail.com wrote:
> >>> One complication is dealing with external interrupts that may happen. My first
> >>> thought is to simply not allow them in the shadow of a PRED instruction. It
> >>> would be delaying interrupt processing by seven or eight instructions max.
> >>> Otherwise, the PRED state would need to be saved somehow and restored after
> >>> the interrupt. Possibly by saving a copy of the interrupted instruction plus
> >>> predicate.
> >> An alternative is a PRED prefix that only applies to a single instruction.
> >
> > Fifty years ago we called those skip instructions.
> >
> > On a PDP-6, to put the larger of A and B into accumulator AC:
> >
> > MOVE AC,A ; put A into AC
> > CAMGE AC,B ; skip if AC >= B
> > MOVE AC,B ; put B into AC
> >
> Not really - skip is a very short branch, forward only.
>
> Skip/Branch removes intervening instructions from the pipeline
> and no resources are assigned for them.
> (Consider how skip would behave in an in-order pipeline without branch
> prediction - it would stall at fetch until its conditional resolved.)
>
> Predication keeps the pipeline running, feeding the predicated
> instructions into the pipeline and tentatively assigns resources
> to them in case they do execute.
<
There may be a short range (like 1 cycle) where the front end has
to guess whether the then-clause or the else-clause is to be executed,
by the following cycle that has resolved. And with the mask available
in the PRED instruction, it is easy for the front end to skip forward
over the then-clause to arrive at the else-clause without executing
any (but 1 might get nullified) instruction in the then-clause. So,
you don't put all the instructions in the pipeline.
<
Plus fixing up the rename table is harder than skipping over
the un-executed clause.
<
> Later if they do not execute it must patch the result environment
> so it looks like the disabled instructions never existed,
> which can possibly include propagating values to different locations
> and dynamically editing dependency chains accordingly,
> then clean up and recover the assigned resources.
> This must be done in such a way that it can take an interrupt/exception
> at any point and resolve to a predictable and restartable state.
>
> BTW on the simple cpu designs in days of yore, the advantage of
> skip conditional over an unconditional branch vs a conditional branch
> is for skip the HW instruction execute state machine always has the same
> state sequence and just changes the value added to IP, whereas BRcc has to
> change its whole state sequence depending on CC to perform/not-perform
> the offset fetch and add. So BRcc is more complex to implement than SKIPcc,
> and when you are paying per-gate for the cpu the cost difference matters.
<
The Front-End has to treat branches as resets--take this address and
fetch a bunch of instructions, whereas Predication tells the Front-End
Just continue using your current fetch strategy.

John Levine

Dec 11, 2022, 2:07:35 PM
According to EricP <ThatWould...@thevillage.com>:
>> Fifty years ago we called those skip instructions.

>Not really - skip is a very short branch, forward only.
>
>Skip/Branch removes intervening instructions from the pipeline
>and no resources are assigned for them.
>(Consider how skip would behave in an in-order pipeline without branch
>prediction - it would stall at fetch until its conditional resolved.)
>
>Predication keeps the pipeline running, feeding the predicated
>instructions into the pipeline and tentatively assigns resources
>to them in case they do execute.

Seems to me that's more of an implementation decision. When it sees
the skip why couldn't it treat that as a predicate on the next instruction?
I realize that when we designed machines with skip instructions, they
didn't have pipelines.

>BTW on the simple cpu designs in days of yore, the advantage of
>skip conditional over an unconditional branch vs a conditional branch
>is for skip the HW instruction execute state machine always has the same
>state sequence and just changes the value added to IP, whereas BRcc has to
>change its whole state sequence depending on CC to perform/not-perform
>the offset fetch and add. So BRcc is more complex to implement than SKIPcc,
>and when you are paying per-gate for the cpu the cost difference matters.

I don't ever recall seeing a machine that had condition codes and skip
instructions. The skip instructions all did some kind of comparison or
bit test. If you wanted the effect of a conditional jump you'd reverse
the test and skip over the jump. The point of the skip was that the
memory address in the skip instruction was the location to test or
(on PDP-6/10) sometimes an immediate operand or bit mask. There
wasn't room for a jump address.

MitchAlsup

Dec 11, 2022, 2:14:37 PM
On Sunday, December 11, 2022 at 10:25:47 AM UTC-6, EricP wrote:
> John Levine wrote:
> > According to EricP <ThatWould...@thevillage.com>:
> >> robf...@gmail.com wrote:
> >>> One complication is dealing with external interrupts that may happen. My first
> >>> thought is to simply not allow them in the shadow of a PRED instruction. It
> >>> would be delaying interrupt processing by seven or eight instructions max.
> >>> Otherwise, the PRED state would need to be saved somehow and restored after
> >>> the interrupt. Possibly by saving a copy of the interrupted instruction plus
> >>> predicate.
> >> An alternative is a PRED prefix that only applies to a single instruction.
> >
> > Fifty years ago we called those skip instructions.
> >
> > On a PDP-6, to put the larger of A and B into accumulator AC:
> >
> > MOVE AC,A ; put A into AC
> > CAMGE AC,B ; skip if AC >= B
> > MOVE AC,B ; put B into AC
> >
> Not really - skip is a very short branch, forward only.
>
> Skip/Branch removes intervening instructions from the pipeline
> and no resources are assigned for them.
> (Consider how skip would behave in an in-order pipeline without branch
> prediction - it would stall at fetch until its conditional resolved.)
>
> Predication keeps the pipeline running, feeding the predicated
> instructions into the pipeline and tentatively assigns resources
> to them in case they do execute.
<
This might be a reasonable model if every instruction has its own
condition and maybe a unique condition code. It is not a reasonable
model where one condition casts a then-shadow and an else-shadow
over a number of instructions.
<
> Later if they do not execute it must patch the result environment
> so it looks like the disabled instructions never existed,
<
Then there are the cases where the same registers are used differently
in the then and else clauses. Patching this stuff up is "non trivial".

Stephen Fuld

Dec 12, 2022, 11:34:23 AM
I didn't know much about the PDP-6/10 except that they were 36 bit
systems, so I spent a little time researching them. I was struck by the
similarity of the instruction formats, etc. to the Univac 1100 series
(as opposed to other 36 bit systems such as the IBM 704 or the Honeywell
DPS6).

That is a preface to saying that the 1108 had skip instructions (they
were called "Test" instructions in 1100 parlance), and conditional jump
instructions. It was pretty much as you say, conditional jump for things
like testing a register for zero, etc., but as operations like comparing
two operands required both operands, there were Test instructions as
there was not room in the instruction format for the two operands and a
jump target address.

As for condition codes, there were none for "operand compares", but
there were conditional jumps for "conditions" like overflow, FP
underflow, etc., which were kept as bits in the CPU's Processor Status
Register, which was not generally accessible by user programs.

The relative timings were interesting. Most "typical" instructions took
750ns. Note that this was before caches and basic memory access time was
also 750 ns. (Fetch next instruction was normally overlapped with
executing the current instruction.) A conditional jump instruction took
750 ns if not taken and 1,500 ns if taken (Required extra time to fetch
the instruction at the target address) A test instruction took 875 ns
if the next instruction was executed, but 1,625 ns if it was skipped. I
have forgotten the reason for the "extra" time on test instructions.

EricP

Dec 12, 2022, 12:26:18 PM
John Levine wrote:
> According to EricP <ThatWould...@thevillage.com>:
>>> Fifty years ago we called those skip instructions.
>
>> Not really - skip is a very short branch, forward only.
>>
>> Skip/Branch removes intervening instructions from the pipeline
>> and no resources are assigned for them.
>> (Consider how skip would behave in an in-order pipeline without branch
>> prediction - it would stall at fetch until its conditional resolved.)
>>
>> Predication keeps the pipeline running, feeding the predicated
>> instructions into the pipeline and tentatively assigns resources
>> to them in case they do execute.
>
> Seems to me that's more of an implementation decision. When it sees
> the skip why couldn't it treat that as a predicate on the next instruction?
> I realize that when we designed machines with skip instructions, they
> didn't have pipelines.

It's cheapest to do a conditional increment on the IP.

>> BTW on the simple cpu designs in days of yore, the advantage of
>> skip conditional over an unconditional branch vs a conditional branch
>> is for skip the HW instruction execute state machine always has the same
>> state sequence and just changes the value added to IP, whereas BRcc has to
>> change its whole state sequence depending on CC to perform/not-perform
>> the offset fetch and add. So BRcc is more complex to implement than SKIPcc,
>> and when you are paying per-gate for the cpu the cost difference matters.
>
> I don't ever recall seeing a machine that had condition codes and skip
> instructions. The skip instructions all did some kind of comparison or
> bit test. If you wanted the effect of a conditional jump you'd reverse
> the test and skip over the jump. The point of the skip was that the
> memory address in the skip instruction was the location to test or
> (on PDP-6/10) sometimes an immediate operand or bit mask. There
> wasn't room for a jump address.

Skip on condition code is what I've seen.

PDP-8 has a skip instruction for various combination of zero & carry.
It also has increment memory and skip EQ/NE zero.

DG Nova has a 3-bit skip field on ALU instructions for various
carry and zero tests on the result.

The RCA 1802 8-bit microprocessor has skip-over-1-or-2-bytes instructions
for various carry and zero tests. It also has conditional branch
instructions, which is strange, as skip should make them unnecessary.
But the 1802 does lots of strange things, so that is par for the course.
Like the short and long branches are 2 or 3 bytes but the skip is
over 1 or 2 bytes, so it really could only skip over a short branch.
They could have made it a conditional skip over 1, 2, or 3 bytes, but instead
they added two whole sets of short and long conditional branches.
And the branches are really jumps to short or long absolute addresses.
Ah, the 1970s.



EricP

Dec 12, 2022, 12:26:18 PM
MitchAlsup wrote:
> On Sunday, December 11, 2022 at 10:25:47 AM UTC-6, EricP wrote:
>>
>> Predication keeps the pipeline running, feeding the predicated
>> instructions into the pipeline and tentatively assigns resources
>> to them in case they do execute.
> <
> This might be a reasonable model if every instruction has its own
> condition and maybe a unique condition code. It is not a reasonable
> model where one condition casts a then-shadow and an else-shadow
> over a number of instructions.

An architecture with predication must have an internal execution mechanism
for enabling/disabling the various uOps based on matching a predicate
condition to a predicate value, and selecting what enable/disable actions
each uOp takes. That execution mechanism should be independent of how the
predicate conditions are distributed to the uOps.

One simple but size-costly distribution method would be a PRED prefix
for each macro instruction.

From that point of view the PRED shadow mask is a code space saving
method for mass distribution of the predicate condition to multiple uOps.
But the back end execution mechanism should remain the same.

> <
>> Later if they do not execute it must patch the result environment
>> so it looks like the disabled instructions never existed,
> <
> Then there are the cases where the same registers are used differently
> in the then and else clauses. Patching this stuff up is "non trivial".

As I see it, the only really viable predicate execute mechanism
(because the alternatives have exponential complexity expansion)
is to treat most uOps similar to a CMOV, where each has two actions:

- if the uOp is enabled it operates on its source registers
and writes its result as usual
- if the uOp is disabled it either (for in-order) patches the scoreboard
dependencies or (for OoO) copies the original dest register value.

Once the predicate value has resolved and is forwarded to all its
dependent uOps, then those uOps can prune out source register dependencies
that are no longer relevant so they don't block execution.

The net result of the above is that all predicated instructions
always execute an action, usually either operate or copy.
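That CMOV-like mechanism can be sketched roughly as follows, assuming the OoO variant with a renamed physical register file (the uOp tuple format, the RAT, and all names here are invented for illustration):

```python
# Sketch: every predicated uOp either operates or copies its old dest
# value into the freshly renamed destination, so the result environment
# looks as if disabled uOps never existed. Not a real microarchitecture.

def execute_predicated(uop, prf, rat):
    """prf: physical register file (list of values)
       rat: register alias table, arch reg name -> physical reg index"""
    op, dest, src1, src2, pred_cond, pred_val = uop
    old_phys = rat[dest]
    new_phys = len(prf)              # allocate a fresh physical register
    if pred_val == pred_cond:        # enabled: operate as usual
        a, b = prf[rat[src1]], prf[rat[src2]]
        prf.append(a + b if op == "ADD" else a - b)
    else:                            # disabled: CMOV-style copy of the
        prf.append(prf[old_phys])    # old dest, so later readers see the
                                     # unchanged value via the new mapping
    rat[dest] = new_phys
```

Note the disabled case still does work (a copy), which is the point: all predicated uOps execute *some* action.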

John Levine

Dec 12, 2022, 12:42:33 PM
According to EricP <ThatWould...@thevillage.com>:
>Skip on condition code is what I've seen.
>
>PDP-8 has a skip instruction for various combination of zero & carry.
>It also has increment memory and skip EQ/NE zero.

I also spent a lot of time programming a PDP-8 and I can promise you it
did not have condition codes. It had a single accumulator, and a "link"
bit which was more or less a carry flag. The skip instructions looked
at the AC and link, e.g., skip on negative AC, skip on zero link.

ISZ increment and skip if zero was how you did loop counters. There
was no ISNZ.
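The ISZ idiom works by storing the *negative* loop count and incrementing it until it wraps to zero. A Python stand-in (not PDP-8 assembly; the 12-bit mask is the only machine-specific detail):

```python
# How ISZ (increment and skip if zero) drives a loop counter on the
# PDP-8: keep -count in a memory word, ISZ it each iteration, and the
# skip fires (exiting the loop) when the word wraps to zero.

def run_loop(count):
    counter = -count & 0o7777        # 12-bit two's-complement -count
    iterations = 0
    while True:
        iterations += 1              # ... loop body would go here ...
        counter = (counter + 1) & 0o7777   # ISZ: increment memory word
        if counter == 0:             # skip taken: leave the loop
            break
    return iterations
```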

>DG Nova has a 3-bit skip field on ALU instructions for various
>carry and zero tests on the result.

The Nova's skip instructions were similar to the PDP-8, not surprising
since the same guy designed them. Skip on carry, skip on zero register.
No condition codes either.

>RCA 1802 8-bit microprocessor has skip over 1 or 2 bytes instructions
>on various carry and zero test. It also has branch conditional
>instructions which is strange as skip should make them unnecessary.

Didn't use that one.

I did use the Varian 620i which had one- and two-word instructions,
but the skips only skipped one word. That led to some amusing bugs.

John Levine

Dec 12, 2022, 1:06:24 PM
According to Stephen Fuld <sf...@alumni.cmu.edu.invalid>:
>I didn't know much about the PDP-6/10 except that they were 36 bit
>systems, so I spent a little time researching them. I was struck by the
>similarity of the instruction formats, etc. to the Univac 1100 series
>(as opposed to other 36 bit systems such as the IBM 704 or the Honeywell
>DPS6).

The 704 was the ur-36 bit machine in 1954 that inspired everything
else. (The 701 was sort of the beta version and the 704 fixed many of
its mistakes like half-word addressing and Williams tube memory.) The
704 architecture was carried forward to the 709, 7090, and 7094, then
killed in favor of S/360 but not without a fight within IBM.

The DPS6 was a 16/32 bit machine so I'm guessing you mean the 6000
series which was the 36-bit GE 600 series, on which DTSS and Multics
ran. It had extremely complicated addressing but only a single AC and
MQ which in retrospect wasn't a great choice. It did have condition
codes, zero, negative, carry, overflow, and conditional branches on
them.

All of the 36 bit machines ran out of address bits. Some had extended
addressing hacks which didn't help enough.

>As for condition codes, there were none for "operand compares", but
>there were conditional jumps for "conditions" like overflow, FP
>underflow, etc., which were kept as bits in the CPU's Processor Status
>Register, which was not generally accessible by user programs.

We called them flags and didn't use them much. They weren't much like
condition codes, they were set when something like overflow happened
and stayed on until tested and reset by a JFCL or overwritten by a
JRST instruction. In all the time I programmed a PDP-10 I don't recall
writing a JFCL other than as a no-op and a JRST other than as an
unconditional jump. JSP AC,.+1 saved the flags in a register if you
wanted for some reason to test them without resetting them. Never did
that either.

>The relative timings were interesting. Most "typical" instructions took
>750ns. Note that this was before caches and basic memory access time was
>also 750 ns. (Fetch next instruction was normally overlapped with
>executing the current instruction.) A conditional jump instruction took
>750 ns if not taken and 1,500 ns if taken (Required extra time to fetch
>the instruction at the target address) A test instruction took 875 ns
>if the next instruction was executed, but 1,625 ns if it was skipped. I
>have forgotten why the "extra" time for test instructions.

Probably the same as for jump, it had to refill the prefetch buffer when
it didn't execute the skipped instruction.

Stephen Fuld

Dec 12, 2022, 1:44:05 PM
On 12/12/2022 10:06 AM, John Levine wrote:
> According to Stephen Fuld <sf...@alumni.cmu.edu.invalid>:
>> I didn't know much about the PDP-6/10 except that they were 36 bit
>> systems, so I spent a little time researching them. I was struck by the
>> similarity of the instruction formats, etc. to the Univac 1100 series
>> (as opposed to other 36 bit systems such as the IBM 704 or the Honeywell
>> DPS6).
>
> The 704 was the ur-36 bit machine in 1954 that inspired everything
> else.

The Univac 1103 was a 36 bit machine introduced in 1953, though it was
not compatible with later 1100 series machines.

https://en.wikipedia.org/wiki/UNIVAC_1100/2200_series



> (The 701 was sort of the beta version and the 704 fixed many of
> its mistakes like half-word addressing and Williams tube memory.) The
> 704 architecture was carried forward to the 709, 7090, and 7094, then
> killed in favor of S/360 but not without a fight within IBM.
>
> The DPS6 was a 16/32 bit machine so I'm guessing you mean the 6000
> series which was the 36-bit GE 600 series, on which DTSS and Multics
> ran.

Yes. My sloppy research. :-(


> It had extremely complicated addressing but only a single AC and
> MQ which in retrospect wasn't a great choice. It did have condition
> codes, zero, negative, carry, overflow, and conditional branches on
> them.
>
> All of the 36 bit machines ran out of address bits. Some had extended
> addressing hacks which didn't help enough.

I think it is fair to say that *all* computers introduced before say the
1980s, that survived, ran out of address bits. The 1100 series
introduced their Extended Mode, that survives (albeit in emulation) to
this day. I don't think that lack of address bits was the primary
reason for its decline.

I have often said that a reason for its decline was their marketing
department's utter failure to convince the world that 36 was an integral
power of 2. :-)

BGB

Dec 13, 2022, 4:57:10 AM
On 12/12/2022 11:25 AM, EricP wrote:
> MitchAlsup wrote:
>> On Sunday, December 11, 2022 at 10:25:47 AM UTC-6, EricP wrote:
>>>
>>> Predication keeps the pipeline running, feeding the predicated
>>> instructions into the pipeline and tentatively assigns resources to
>>> them in case they do execute.
>> <
>> This might be a reasonable model if every instruction has its own
>> condition and maybe a unique condition code. It is not a reasonable
>> model where one condition casts a then-shadow and an else-shadow
>> over a number of instructions.
>
> An architecture with predication must have an internal execution mechanism
> for enabling/disabling the various uOps based on matching a predicate
> condition to a predicate value, and selecting what enable/disable actions
> each uOp takes. That execution mechanism should be independent to how the
> predicate conditions are distributed to the uOps.
>

In my case, the predication mode travels along with the opcode, and once
it moves into EX1, it gets remapped if needed.

Most normal operations turn into a NOP operation. Branches split into
BRA and BRA_NB (Branch Non-Branch). The main purpose of BRA_NB is that
if the branch-predictor had predicted a branch as taken, then BRA_NB
will initiate a branch to the following instruction (essentially a "BRA 0").


> One simple but size-costly distribution method would be a PRED prefix
> for each macro instruction.
>
> From that point of view the PRED shadow mask is a code space saving
> method for mass distribution of the predicate condition to multiple uOps.
> But the back end execution mechanism should remain the same.
>

As noted, I went with essentially dedicating 2 bits to this in every
instruction.


This is a little less than 32-bit ARM, albeit there are 3 bits spent on
encoding the 16/32 split.


Well, apart from 7xxx and 9xxx, which are more a case of these spaces
having been "mostly" unused in the 16-bit opcode map (whereas most other
options would have broken binary compatibility).


I had originally almost considered making XGPR a special operating
mode that would have essentially reclaimed all of the 16-bit space, say:
wZnm-ZeoZ
wZnm-Zeii

Where the high E/F has the '111' bits replaced by the (inverted) high
bits of the 3 register fields.

This encoding would have mostly avoided the orthogonality issues, but I
ended up choosing the "more conservative" option.

But, faced with some other issues, I am almost left to wonder if going
that route might have been better in the long run.


It is possible that I could still consider this option.

Pros: Regains some amount of orthogonality.
Cons: Operating mode hassle, complete loss of 16-bit ops in this mode.


So, say:
ADD?T R17, 0x123, R37 | SUB?F R17, 0x145, R37

Could then be encoded as two 32-bit instructions.


Handling would be otherwise similar to that of RISC-V mode; may need a
different PE/COFF machine-ID so that the PE loader knows to start the
program in this mode. Calling between modes would be slightly less of an
issue than BJX2 <-> RISC-V, since at least both sides will (more or
less) agree on the same ABI.

Basically, it is similar to the ARM32<->Thumb situation.


A mode switch is needed because (for fairly obvious reasons) this
would otherwise break binary compatibility in a pretty major way.


>> <
>>> Later if they do not execute it must patch the result environment so
>>> it looks like the disabled instructions never existed,
>> <
>> Then there are the cases where the same registers are used differently
>> in the then and else clauses. Patching this stuff up is "non trivial".
>
> As I see it, the only really viable predicate execute mechanism
> (because the alternatives have exponential complexity expansion)
> is to treat most uOps similar to a CMOV, where each has two actions:
>
> - if the uOp is enabled it operates on its source registers
>   and writes its result as usual
> - if the uOp is disabled it either (for in-order) patches the scoreboard
>   dependencies or (for OoO) copies the original dest register value.
>
> Once the predicate value has resolved and is forwarded to all its
> dependent uOps then those uOps can prune out source dependencies register
> that are no longer relevant so they don't block execution.
>
> The net result of the above is that all predicated instructions
> always execute an action, usually either operate or copy.
>

Probably depends on the design of the core.

In my design, simply turning them into NOPs is sufficient.

Granted, my core is strictly in-order, and there is no scoreboard. So,
the NOPs are just "empty spots" in the pipeline.

As with NOPs, these instructions may still eat clock-cycles.
Because interlocking happens before they enter the EX stage, they may
also still trigger interlock stalls on other instructions.

EricP

Dec 13, 2022, 2:08:39 PM
WRT interlocking, yes that is exactly my point - they are not really
NOPs because in-order still has to track RAW and WAW register
dependencies and diddle the tracking flip-flops.

Say I have a predicate prefix that tests if the lsb of a register
matches a predicate condition. Eg
ET rN: Execute True prefix enables if rN[0] == 1
EF rN: Execute False prefix enables if rN[0] == 0

#1 CMP r0 = r9 > 1234 // set/clear r0 lsb
#2 ET r0: ADD r3 = r2 + r1 // enable if r0[0] == 1
#3 ADD r5 = r4 + r3

In the above instruction #2 the prefix tests lsb of r0 and
enables if it matches the T==1 or F==0 predicate condition.
#2 is RAW dependent on r0 for the predicate value
and may additionally be RAW on r2 and r1 if enabled.
#3 may be RAW dependent on r3 until r0 is resolved.

Either (a) the pipeline stalls #2 at Reg Read stage until r0 resolves,
or (b) it reads r2 and r1 anyway and marks r3 as Write Pending.
If it chooses (b) then #3 will RAW stall on r3, and if later r0 resolves
and disables #2 then #2 has to reset the Write Pending flag on r3
allowing #3 to continue execution.

In both cases instruction #2 either stalls for the r0 value or
proceeds to perform work like register read and then does housekeeping
on tracking flags to patch up after.
In both cases a disabled instruction still has visible side effects
so is not a NOP.
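Option (b) above can be sketched as follows, with the "Write Pending" bookkeeping made explicit (the structures and names are invented; a real scoreboard tracks far more state):

```python
# Sketch of option (b): the instruction marks its dest Write Pending at
# issue, and when the predicate later resolves it must clear that flag
# even if it was disabled, so RAW-stalled consumers can proceed.
# This is why a predicated-away instruction is not a true NOP.

def issue(instr, write_pending):
    """At issue, before the predicate resolves: reserve the dest."""
    write_pending.add(instr["dest"])

def resolve(instr, regs, write_pending, pred_bit):
    """When the predicate register's lsb resolves."""
    if pred_bit == instr["cond"]:            # ET/EF condition matched
        regs[instr["dest"]] = regs[instr["a"]] + regs[instr["b"]]
    # Enabled or not, clear Write Pending so dependents unblock.
    write_pending.discard(instr["dest"])
```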


BGB

Dec 13, 2022, 2:54:45 PM
Sorta, I guess...

It is a NOP in terms of "high level behavior" or "How EX stages see it".

Or, not necessarily a NOP if one demands that "Every NOP eats exactly 1
clock cycle". These may eat multiple clock cycles for any register
dependencies that may have otherwise occurred (since, this part happens
before it enters EX1, and thus before it can be known whether or not the
instruction will execute).


In theory, it could be possible to shortcut the predication during
ID1->ID2 by looking forward and verifying that nothing in the ID2 or EX1
stages can touch SR.T, and if so using the prior value of SR.T, however,
this isn't likely to save enough clock cycles to make it worth the extra
cost.

...

MitchAlsup

Dec 13, 2022, 3:53:18 PM
On a 10-wide machine, does a NoOp eat exactly 1 cycle ?? and what does
1 cycle mean on a 10-wide machine ??
<
> These may eat multiple clock cycles for any register
> dependencies that may have otherwise occurred (since, this part happens
> before it enters EX1, and thus before it can be known whether or not the
> instruction will execute).
<
On GBOoO machines, many to most instructions do not appear to add any
latency to the rest of the executing instructions !! This makes the above
definition intractable.
>
>
> In theory, it could be possible to shortcut the predication during
> ID1->ID2 by looking forward and verifying that nothing in the ID2 or EX1
> stages can touch SR.T, and if so using the prior value of SR.T, however,
> this isn't likely to save enough clock cycles to make it worth the extra
> cost.
<
The way predication is defined in My 66000 is that the front end is
free to skip all instructions that are not supposed to be executed--but
this comes with a restriction that all of the instructions can be categorized
into exactly 1 of 2 groups (then-clause and else-clause). I started out by
allowing instructions to be common, but the compiler could not figure
out how to use that, and we simplified the encoding--I may further simplify
the encoding from here........
>
> ...

BGB

Dec 13, 2022, 8:52:10 PM
This was a response to the assertion that a NOP from a predicated-away
instruction was not a true NOP because it still has to deal with
register dependencies and similar (so, more like a phantom version of
the instruction, rather than a "true NOP").

The only real alternative then, is to assume that a NOP has a known
latency (say, 1 cycle).

But, yeah, 0 cycles is another possible interpretation, but 0 cycle NOPs
aren't really a workable definition for an in-order machine.


>> These may eat multiple clock cycles for any register
>> dependencies that may have otherwise occurred (since, this part happens
>> before it enters EX1, and thus before it can be known whether or not the
>> instruction will execute).
> <
> On GBOoO machines, many to most instructions to not appear to add any
> latency to the rest of the executing instructions !! This makes the above
> definition intractable.

Possibly, I was thinking for a simple in-order machine.

Instruction latency on GBOoO is more a "throw hands in the air and say
'who knows'?" thing.


Then again, I guess it was interesting to observe that some types of
optimizations which help on BJX2 also help on x86-64, despite x86-64
having not enough registers to actually make use of the optimization as
expressed (x86-64 seems to be able to pretend as-if the stack
spills/fills were registers).


Meanwhile, I am left needing to use lots of registers, and occasional
optimizations which try to side-step a few (store then load) cases, then
recently needing to fix a bug where one of these optimizations broke due
to a type-handling issue in my compiler (was using the type of the value
stored to an array, rather than the type of the value as loaded from the
array).

Well, along with other issues, like local variables having their storage
culled while missing that some of these variables were stored to but
not loaded from, resulting in them having a wild-card offset relative to
the stack-pointer (thus leading to some annoying stack-corruption bugs).

...


>>
>>
>> In theory, it could be possible to shortcut the predication during
>> ID1->ID2 by looking forward and verifying that nothing in the ID2 or EX1
>> stages can touch SR.T, and if so using the prior value of SR.T, however,
>> this isn't likely to save enough clock cycles to make it worth the extra
>> cost.
> <
> The way predication is defined in My 66000 is that the front end is
> free to skip all instructions that are not supposed to be executed--but
> this comes with a restriction that all of the instructions can be categorized
> into exactly 1 of 2 groups (then-clause and else-clause). I started out by
> allowing instructions to be common, but the compiler could not figure
> out how to use that, and we simplified the encoding--I may further simplify
> the encoding from here........

Nothing requires that they are not skipped in my case either, just that
this is beyond what my current implementation can manage.

Rules are mostly similar to normal instructions, just with the
difference that ?T and ?F instructions are allowed to have register
conflicts with each other in bundles on account of them being mutually
exclusive.



BTW: I put up a vote on Twitter wanting to see what the general
sentiment was on possible ways to "resolve" the encoding orthogonality
issues with R32..R63:
https://twitter.com/cr88192/status/1602801136590782466


Partly, because "add a new CPU mode and sort of half-way address the
issue" is kind of, not exactly a clear win...

The new encoding would at least make R32..R63 usable in all the same
contexts as R0..R31, but still can't address some of the
other orthogonality issues, mostly because there is basically no way
possible to fit everything I would want into a 32 bit instruction word.


I have already experimented with adding some of the features needed for
the new mode to the Verilog, and everything seems to survive (mostly
involves some minor tweaks to the handling of the Link Register and
similar, increasing the number of CPU modes from 4 to 16).


Say:
0: BJX2 + Scalar (16/32)
1: BJX2 + WEX (16/32/64/96)
2: RISC-V (Generic 16/32)
3: RISC-V + WEX (TBD)
4: BJX2_XGP2 + Scalar (32)
5: BJX2_XGP2 + WEX (32/64/96)
6..15: Reserved

This will come at the cost that the status-flag bits for SIMD comparison
(PQRO) will no longer be preserved across function calls (replaced by
additional CPU-mode bits).

Was also tempted to replace '3', noting that this mode is probably DOA
(in favor of handling RISC-V as superscalar). Mode-3 would have required
a load-time bundle conversion, which in turn requires additional
metadata to do safely.


They will also have new Machine IDs in PEL4, say:
B232: Start in Mode 0|1 (32-bit pointers)
B264: Start in Mode 0|1 (64-bit pointers)
B296: Start in Mode 0|1 (but uses 128-bit pointers)
B265: Start in Mode 4|5 (64-bit pointers)
B297: Start in Mode 4|5 (128-bit pointers)

5032: RISC-V (32-bit), N/A on BJX2
5064: RISC-V (64-bit), Starts in Mode 2

These modes would also apply to COFF objects (if COFF were used).


Mode 0/1 differ mostly in that 0 executes instructions like a scalar
RISC machine, whereas 1 will execute instructions as bundles (code may
jump between these modes based on the result of the WEXMD instruction).

Similar would apply to 4/5.


Jumps between other modes would be accomplished via setting up bits in
function pointers (or, implicitly in the case of the Link Register;
which will implicitly return the mode to that used by the caller).

Could in theory direct-call between BJX2 and RISC-V, but doing so in
practice would require some mechanism to auto-generate glue thunks to
gloss over the (relatively significant) differences in the C ABI.

Though, thus far, running RISC-V code on BJX2 still hasn't really been
tested much beyond trivial test cases.

...


MitchAlsup

Dec 13, 2022, 11:19:48 PM
On Tuesday, December 13, 2022 at 7:52:10 PM UTC-6, BGB wrote:
> On 12/13/2022 2:53 PM, MitchAlsup wrote:

> Then again, I guess it was interesting to observe that some types of
> optimizations which help on BJX2 also help on x86-64, despite x86-64
> having not enough registers to actually make use of the optimization as
> expressed (x86-64 seems to be able to pretend as-if the stack
> spills/fills were registers).
>
The stack on x86 is defined to always be aligned to the size of the GPRs.
>
> Meanwhile, I am left needing to use lots of registers, and occasional
> optimizations which try to side-step a few (store then load) cases, then
> recently needing to fix a bug where one of these optimizations broke due
> to a type-handling issue in my compiler (was using the type of the value
> stored to an array, rather than the type of the value as loaded from the
> array).
<
If there is only 1 set of registers, this problem vanishes.
> <snip>
> Nothing requires that they are not skipped in my case either, just that
> this is beyond what my current implementation can manage.
>
This was a discussion about NoOps and the necessity of having them in the ISA.
If they are present they have to have some defined meaning, EricP and I
are trying to ascertain exactly what the prescription needs to be. Being able
to be skipped fails several of the assertions in this thread.
<
> Rules are mostly similar to normal instructions, just with the
> difference that ?T and ?F instructions are allowed to have register
> conflicts with each other in bundles on account of them being mutually
> exclusive.
>
>
>
> BTW: I put up a vote on Twitter wanting to see what the general
> sentiment was on possible ways to "resolve" the encoding orthogonality
> issues with R32..R63:
> https://twitter.com/cr88192/status/1602801136590782466
>
There was a recent poll on some site and 56% of Americans do not think
that Arabic numerals should be taught in schools, too; and 15% don't have
an opinion.