On 6/11/2023 11:59 AM, robf...@gmail.com wrote:
> On Sunday, June 11, 2023 at 10:57:39 AM UTC-4, Anton Ertl wrote:
>> EricP <ThatWould...@thevillage.com> writes:
>>> I was driven to instructions with 2-dest results by my desire to
>>> (a) not have condition codes and (b) wanting to handle add with carry
>>> in a straight forward way.
>> I have expanded my answer to that problem into a paper:
>> <http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf>.
>>
>> Of course, the question is whether you consider this answer
>> straightforward.
>>> Variable length instructions open the ISA to much useful functionality
>>> like full size immediate operands but requires a more complicated fetch
>>> unit than fixed size. This cost is generally accepted today.
>> Counterexample: ARM A64 (and ARM has ample experience with
>> variable-size instructions).
>
> I just decided not too long ago to go with fixed-size instructions because
> variable-length calculations ended up on the critical timing path for an FPGA
> design. Postfixes can effectively extend the instruction with constants.
>
I went with prefixes.
If I designed another (new) variable-length ISA, it is probable I would
still go with prefixes.
Though, I would likely start (this time) with the assumption of the
32-bit instructions as baseline, and the 16-bit as a shorthand. My
encoding is currently a little messy in that (in its original form) the
32-bit instructions were themselves a prefix encoding (of two 16-bit parts).
Granted, prefixes have the downside of turning the immediate values into
confetti... But, then again, there is RISC-V with fixed-length
instructions and the immediate fields are still confetti...
This doesn't really affect the CPU much, but it is admittedly kind of a pain
for the assembler and linker (the linker needs a larger number of reloc
types). Though, one can reduce the annoyance, say, by making all PC-rel
offsets either Byte or Word (depending on the op), and all GBR-rel ops
Byte based, which at least cuts down on the number of reloc types needed.
Well, except when the encoding rules change from one instruction to
another, which is admittedly thus far why I haven't bothered with
RISC-V's 'C' extension.
Like, RVC is sort of like someone looked at Thumb and was then like
"Hold my beer!".
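As a concrete illustration of the immediate "confetti" in a fixed-length encoding, here is a small Python sketch of how the offset of a RISC-V JAL (J-type) has to be gathered from four separate bit fields. The field positions follow the RISC-V base ISA; the function names jal_imm and jal_encode are made up for this example.

```python
def jal_imm(insn: int) -> int:
    """Reassemble the signed J-type offset from a 32-bit RISC-V JAL word."""
    imm20 = (insn >> 31) & 0x1       # offset bit 20 (sign)
    imm10_1 = (insn >> 21) & 0x3FF   # offset bits 10..1
    imm11 = (insn >> 20) & 0x1       # offset bit 11
    imm19_12 = (insn >> 12) & 0xFF   # offset bits 19..12
    imm = (imm20 << 20) | (imm19_12 << 12) | (imm11 << 11) | (imm10_1 << 1)
    return imm - (1 << 21) if imm20 else imm


def jal_encode(rd: int, offset: int) -> int:
    """Inverse: scatter an even offset in [-2**20, 2**20) into a JAL word."""
    imm = offset & 0x1FFFFF
    return ((((imm >> 20) & 0x1) << 31)
            | (((imm >> 1) & 0x3FF) << 21)
            | (((imm >> 11) & 0x1) << 20)
            | (((imm >> 12) & 0xFF) << 12)
            | ((rd & 0x1F) << 7)
            | 0x6F)                  # JAL major opcode
```

A linker applying a JAL relocation needs the same scatter step as jal_encode, which is the sort of per-format reloc handling that a contiguous immediate field would avoid.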
One other option would be, say, to have variable length ops with a
size/form bucket, say:
  000: 32-bit op, 3R Forms
  001: 32-bit op, 3RI Forms
  010: 64-bit op, 3R/4R/etc forms
  011: 64-bit op, 3RI Forms, Imm32
  100: 96-bit op, -
  101: 96-bit op, 3RI Forms, Imm64
Where, say:
  000z-zzzz-zzdd-dddd ssss-sstt-tttt-zzzz  //3R
  001z-zzzz-zzdd-dddd ssss-ssii-iiii-iiii  //3RI (Imm10)
  011z-zzzz-zzdd-dddd ssss-sszz-zzzz-zzzz  //3RI (Imm32)
  iiii-iiii-iiii-iiii iiii-iiii-iiii-iiii  //Imm
  101z-zzzz-zzdd-dddd ssss-sszz-zzzz-zzzz  //3RI (Imm64)
  iiii-iiii-iiii-iiii iiii-iiii-iiii-iiii  //Imm
  iiii-iiii-iiii-iiii iiii-iiii-iiii-iiii  //Imm
Where, say, the Imm32 and Imm64 encodings operate in (mostly) the same
space as the 3R encodings, but will normally suppress the Rt field in
favor of an immediate. The Imm10 encodings would represent a slightly
different encoding space (likely exclusively Load/Store ops and ALU
Immediate instructions).
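A rough Python sketch of a decoder for the layouts above, to make the bucket idea more concrete. Several things here are assumptions not pinned down above: the bit patterns are read MSB first, an instruction is a run of 32-bit words in the order shown, the Imm64 is stored low word first, buckets 110 and 111 are treated as reserved, and the field names (op, rd, rs, rt, op2) are invented. Buckets 010 and 100 have no layout given, so only their length is reported.

```python
# Length in 32-bit words per 3-bit bucket (110/111 assumed reserved).
BUCKET_WORDS = {0b000: 1, 0b001: 1, 0b010: 2, 0b011: 2, 0b100: 3, 0b101: 3}


def decode(words, pos=0):
    """Decode one instruction at words[pos]; returns (fields, next_pos)."""
    w = words[pos]
    bucket = w >> 29
    if bucket not in BUCKET_WORDS:
        raise ValueError(f"reserved bucket {bucket:03b}")
    n = BUCKET_WORDS[bucket]
    if pos + n > len(words):
        raise ValueError("truncated instruction")
    f = {"bucket": bucket, "words": n}
    if bucket in (0b000, 0b001, 0b011, 0b101):
        f["op"] = (w >> 22) & 0x7F   # z bits 28..22
        f["rd"] = (w >> 16) & 0x3F   # d bits 21..16
        f["rs"] = (w >> 10) & 0x3F   # s bits 15..10
    if bucket == 0b000:              # 3R
        f["rt"] = (w >> 4) & 0x3F
        f["op2"] = w & 0xF
    elif bucket == 0b001:            # 3RI, Imm10 (left unsigned here)
        f["imm"] = w & 0x3FF
    elif bucket == 0b011:            # 3RI, Imm32 in the next word
        f["op2"] = w & 0x3FF
        f["imm"] = words[pos + 1]
    elif bucket == 0b101:            # 3RI, Imm64 (assumed low word first)
        f["op2"] = w & 0x3FF
        f["imm"] = words[pos + 1] | (words[pos + 2] << 32)
    return f, pos + n
```

The length comes out of the top 3 bits alone, and the immediate arrives as whole words, so neither the fetch logic nor the linker needs to reassemble anything from scattered fields.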
This would be nicer for an assembler and linker, but would implicitly
mean that the instruction stream would no longer be "self-aligning".
Say, with a self-aligning stream, if one starts decoding/disassembling
instructions from an arbitrary location in the "text" sections, then the
decoding will realign with the instruction stream within a few
instruction words (this property is useful for tools like debuggers).
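The loss of self-alignment can be shown with a small Python sketch. It assumes the bucket table above with lengths counted in 32-bit words (110/111 treated as reserved), and uses a made-up stream in which the immediate words happen to carry 101 in their top bits. Decoding from word 1 instead of word 0 then follows a consistent but wrong set of boundaries.

```python
# Length in 32-bit words per 3-bit bucket (110/111 assumed reserved).
BUCKET_WORDS = {0b000: 1, 0b001: 1, 0b010: 2, 0b011: 2, 0b100: 3, 0b101: 3}


def boundaries(words, start):
    """Word indices that a decoder starting at `start` takes as op starts."""
    out, pos = [], start
    while pos < len(words):
        n = BUCKET_WORDS.get(words[pos] >> 29)
        if n is None or pos + n > len(words):  # reserved bucket or cut-off tail
            break
        out.append(pos)
        pos += n
    return out


# Three Imm64 ops; the immediate words also start with bits 101, so they
# are indistinguishable from opcode words when seen in isolation.
OP = (0b101 << 29) | (1 << 22)
stream = [OP, 0xA0000000, 0xA0000000] * 3

right = boundaries(stream, 0)   # decoding from the true start
wrong = boundaries(stream, 1)   # decoding from inside the first immediate
```

Here the two walks never share a boundary, since nothing marks an immediate word as "not the start of an instruction".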
Possible, but likely to get rather annoying, rather fast.
One almost may as well have some designated architectural scratchpad
registers in this case.
Say:
  DMULU R4, R5   //MACH:MACL = R4*R5
  STS   MACL, R2 //R2 = MACL
But, yeah, I know of an ISA that did the multiplier this way (cough, SH).
Though, to be fair, it is less bad than doing the multiplier via MMIO
(cough, MSP430).
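For reference, a rough Python model of what the DMULU / STS pair above does, assuming DMULU is an unsigned 32x32->64 multiply into the MACH:MACL pair and STS copies MACL into a general register. The MacRegs class and the function names are invented for the sketch.

```python
MASK32 = 0xFFFFFFFF


class MacRegs:
    """The implicit destination pair used by the multiplier."""

    def __init__(self):
        self.mach = 0
        self.macl = 0


def dmulu(mac, rm, rn):
    """MACH:MACL = rm * rn (unsigned, both inputs truncated to 32 bits)."""
    prod = (rm & MASK32) * (rn & MASK32)
    mac.mach = prod >> 32
    mac.macl = prod & MASK32


def sts_macl(mac):
    """Value a following 'STS MACL, Rn' would place in Rn."""
    return mac.macl


mac = MacRegs()
dmulu(mac, 0xFFFFFFFF, 0xFFFFFFFF)
r2 = sts_macl(mac)
```

The multiply itself has no register destination, so the cost of the second result moves into the extra STS instruction.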
In my case, the 32-bit widening multiply ops generate a 64-bit result:
  MULU.L  R4, R5, R2 //Zero-Extended
  DMULU.L R4, R5, R2 //Widening
The 64-bit ops are only available in low/high variants, and are slow. I had
considered a widening variant (with a 128-bit result), but given these
were only really added for the sake of the RISC-V 'M' extension, and 'M'
doesn't have widening ops, I didn't bother in this case.
Though, a widening version would at least (in theory) make it more
useful for implementing 128-bit multiply, never mind that it would still
currently be faster to build 128-bit multiply using the 32-bit multiply
ops (though this part could be streamlined slightly if I added helper ops
to do the high/low and high-high multiplies without needing to use a
bunch of shift instructions...).
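A minimal Python sketch of the usual way to build a 64x64->128 unsigned product out of four 32x32->64 widening multiplies, which is the building block for a wider multiply. This is the generic schoolbook decomposition, not a description of any particular code sequence; mulu32_wide is a stand-in for the widening multiply op, and the names are made up.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def mulu32_wide(a, b):
    """Stand-in for a 32x32->64 unsigned widening multiply."""
    return (a & MASK32) * (b & MASK32)


def mulu64_wide(a, b):
    """64x64->128 unsigned multiply; returns (hi, lo) as 64-bit halves."""
    a_lo, a_hi = a & MASK32, (a >> 32) & MASK32
    b_lo, b_hi = b & MASK32, (b >> 32) & MASK32
    ll = mulu32_wide(a_lo, b_lo)   # weight 2**0
    lh = mulu32_wide(a_lo, b_hi)   # weight 2**32
    hl = mulu32_wide(a_hi, b_lo)   # weight 2**32
    hh = mulu32_wide(a_hi, b_hi)   # weight 2**64
    # Middle column: bits 32..63 of the result, plus carries into the top.
    mid = (ll >> 32) + (lh & MASK32) + (hl & MASK32)
    lo = (ll & MASK32) | ((mid & MASK32) << 32)
    hi = (hh + (lh >> 32) + (hl >> 32) + (mid >> 32)) & MASK64
    return hi, lo
```

The shifts and masks that split and recombine the partial products are the part that dedicated high/low and high-high helper ops would absorb.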
Well, and the (slow) 64-bit multiplier was still left out to try to
shoe-horn a 3-wide version of the BJX2 core into an XC7S50 (along with a
bunch of other features).
But, I still need to find a way to shave off some more LUTs, as I want
to be able to have a CPI-based camera interface module, and I still
don't currently have the LUT budget for this.
... decided to leave out going into the specifics of the CPI camera
interface or how the module would work (or my annoyances trying to fit
all this into the XC7S50 on the Arty S7...).
But, also annoyingly: I can't use my Nexys A7 board either, as it
doesn't have enough unassigned IO (all those
LEDs/switches/7-segment-display/VGA-port/... eating up much of the
FPGA's total IO pins). Its "sibling" (the Arty A7) mostly just leaves a
lot more of this to PMODs and similar.