I didn't make this convention.
It was more one of those things that ended up "grandfathered in".
I just sort of made the stylistic change from "@Reg" to "(Reg)", but for
the most part the assembler will still accept "@Reg".
BGBCC will also accept auto-increment notation, but this is faked with a
multi-op sequence:
MOV.L R4, @-R6
Emitted as:
ADD -4, R6
MOV.L R4, (R6)
Could have faked a "MOV.L @R4, @R5" instruction, but didn't, as the
ancestor ISA didn't have this either.
In my newer (still incomplete) TKUCC effort, the handling for most of
these "fake" instructions was dropped, so the ASM will need to be
written more in terms of what instructions actually exist.
Though, one other difference is that TKUCC generally assumes that jumbo
prefixes always exist.
>>
>> So, the design didn't originally "evolve" out of something like RISC-V
>> or similar, rather, it evolved out of SuperH with influence from TMS320,
>> but then managed to go in a convergent direction towards RISC-V in some
>> areas...
>>>>
>>>
>>>>>> Ironically, despite being a microcontroller RISC, the IMM
>>>>>> prefix-instruction in MicroBlaze is also functionally similar to a jumbo
>>>>>> prefix.
>>>>> <
>>>>> STD 3.141592653589278643,[R3,R7<<3,DISP64]
>>>>> <
>>>>> Is 1 instruction, issues in 1 cycle, wastes no temporary registers,.......
>>>>> That is, you can store an arbitrary constant anywhere in memory
>>>>> using any addressing mode at any time with a single instruction.
>>>> Possible.
>>>>
>>>> Pulling similar off in my case would likely require 3 instructions
>>>> (assuming the RiMOV extension), or 4 (otherwise).
>>> <
>>> RISC-V typically uses 3 or 4 instructions:
>>> AUPIC; LD const; LDHI; ST location
>> You would need more than this to represent such an address mode.
> <
> Where you is not me but is most other RISCs.
I meant on RISC-V...
That addressing mode kinda "steps in it".
>>
>> I would estimate this case would need more like 8 instructions for RISC-V:
>> AUIPC; LDD; AUIPC; LDD; SLL; ADD; ADD; STD
> <
> Which is why RISC-V is mediocre at best.
>>
>>
>> My case, it is mostly the addressing mode:
>> MOV Imm64, R16
>> MOV Disp64, R17
>> ADD R3, R17, R17
>> MOV.Q R16, (R17, R7)
> <
> Still 1 instruction in my ISA
> <
> STD 3.141592653589278643,[R3,R7<<3,DISP64]
>>
>> If it were a Disp33:
>> MOV Imm64, R16
>> LEA.B (R3, Disp33s), R17
>> MOV.Q R16, (R17, R7)
> <
> DISP32 form saves 1 word::
> <
> STD 3.141592653589278643,[R3,R7<<3,DISP32]
>
Here it saves a constant load, since the largest allowed displacement
encoding is 33 bits. While a larger fixed displacement could
theoretically be encoded, it would not easily be supported by the
current implementation.
If one were to use a simpler addressing mode, this case could drop to 2
instructions.
There is not currently any way to directly store a constant to memory.
Similarly, there are still some other implementation limits at present,
such as only supporting a single immediate/displacement for a given
instruction (at least short of using multiple lanes and some additional
decoder hackery).
With the RiMOV extension, there is, however:
MOV.Q R2, (R3, R7, Disp11u)
But, unlike the normal displacements, this displacement is unscaled and
can't currently be expanded with a jumbo prefix.
A similar encoding was used for instructions like:
DMACS.L R4, R5, R6, R7 //R7=R4*R5+R6
But, this feature is also an optional extension.
>>> All of the unused OpCode Groups are reserved for the future. There are 22
>>> (out of 64) Major OpCodes for future expansion. Given that I consumed
>>> 21 for 16-bit immediates, I think there is plenty (at least for the rest of
>>> my lifetime.) Also notice I got Vectorization and SIMD into 2 instructions.
>> As noted, I would have assumed having enough opcode space to fit
> <
>> ideally, say, several thousand unique instructions.
> <
> Certainly my ISA has room, but remember I get both vectorization and SIMD
> out of exactly 2 instructions--instead of 1300..........
There end up needing to be a lot of special cases even for integer ops, say:
ADD Rm, Ro, Rn
ADD Rm, Imm9u, Rn //zero extended
ADD Rm, Imm9n, Rn //one extended
ADD Imm16u, Rn
ADD Imm16n, Rn
ADDS.L Rm, Ro, Rn //sign-extend result from 32-bits
ADDS.L Rm, Imm9u, Rn //zero extended
ADDS.L Rm, Imm9n, Rn //one extended
ADDU.L Rm, Ro, Rn //zero-extend result from 32-bits
ADDU.L Rm, Imm9u, Rn //zero extended
ADDU.L Rm, Imm9n, Rn //one extended
...
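The widening-add and one-extension semantics above can be sketched in C (a minimal sketch of the semantics as described, not the actual implementation):

```c
#include <stdint.h>

/* ADDS.L: add the low 32 bits, sign-extend the result to 64 bits. */
static uint64_t adds_l(uint64_t rm, uint64_t ro)
{
    int32_t sum = (int32_t)((uint32_t)rm + (uint32_t)ro);
    return (uint64_t)(int64_t)sum;   /* sign extension */
}

/* ADDU.L: add the low 32 bits, zero-extend the result to 64 bits. */
static uint64_t addu_l(uint64_t rm, uint64_t ro)
{
    uint32_t sum = (uint32_t)rm + (uint32_t)ro;
    return (uint64_t)sum;            /* zero extension */
}

/* "One extended" (Imm9n): the high bits are filled with 1s, unlike
 * ordinary sign extension, so the immediate always encodes a value in
 * the negative range. */
static uint64_t one_extend9(uint64_t imm9)
{
    return imm9 | ~0x1FFull;
}
```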
Or, variant semantics:
FADD Rm, Ro, Rn //FPU ADD, Binary64, fixed RNE
FADDG Rm, Ro, Rn //FPU ADD, Binary64, dynamic rounding mode
FADDA Rm, Ro, Rn //FPU ADD, Binary64, fake Binary32 RNE
...
Or:
FADD Rm, Imm5fp, Rn //FPIMM
...
SIMD ops, eg:
PADD.H Rm, Ro, Rn //Packed ADD 4x Binary16
PADD.F Rm, Ro, Rn //Packed ADD 2x Binary32
PADDX.F Xm, Xo, Xn //Packed ADD 4x Binary32
PADD.W Rm, Ro, Rn //Packed ADD 4x Int16
PADD.L Rm, Ro, Rn //Packed ADD 2x Int32
...
Didn't bother with signed and unsigned saturate variants, at least for
32-bit encodings (things like "PADDSS.W"/"PADDUS.W"/... would add a lot
of ops).
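As a sketch of the packed-integer semantics (PADD.W as described above: four independent 16-bit lanes in a 64-bit register, wrap-around since the saturating variants were left out):

```c
#include <stdint.h>

/* PADD.W sketch: four independent 16-bit integer adds within a 64-bit
 * register; no carry propagates between lanes, and each lane wraps
 * modulo 2^16. */
static uint64_t padd_w(uint64_t rm, uint64_t ro)
{
    uint64_t rn = 0;
    for (int i = 0; i < 4; i++) {
        uint16_t a = (uint16_t)(rm >> (16 * i));
        uint16_t b = (uint16_t)(ro >> (16 * i));
        rn |= (uint64_t)(uint16_t)(a + b) << (16 * i);
    }
    return rn;
}
```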
And, a lot of format converter ops, ...
PLDCH Rm, Rn //2x Binary16 (Low bits) to 2x Binary32
PLDCHH Rm, Rn //2x Binary16 (High bits) to 2x Binary32
PLDCXH Rm, Xn //4x Binary16 to 4x Binary32
PSTCH Rm, Rn //2x Binary32 to 2x Binary16
...
RGB5UPCK64 Rm, Rn //Unpack RGB555 to 64-bit (16b per component)
RGB5PCK64 Rm, Rn //Pack 64-bit to RGB555
...
Though, these sorts of converter ops have resulted in a fair number of
mnemonics.
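As an illustration of the RGB555 pack/unpack pair, a hedged sketch (the exact lane order and bit placement of the real instructions is guessed here, with each 5-bit component landing in the low bits of a 16-bit lane; the pair is at least self-consistent, so pack(unpack(x)) round-trips):

```c
#include <stdint.h>

/* RGB5UPCK64 sketch: unpack a 15-bit RGB555 pixel into a 64-bit value
 * with one component per 16-bit lane (bit placement is a guess). */
static uint64_t rgb5upck64(uint16_t px)
{
    uint64_t r = (px >> 10) & 31, g = (px >> 5) & 31, b = px & 31;
    return (r << 32) | (g << 16) | b;
}

/* RGB5PCK64 sketch: inverse operation, taking the low 5 bits of each
 * lane and repacking them as RGB555. */
static uint16_t rgb5pck64(uint64_t v)
{
    uint16_t r = (v >> 32) & 31, g = (v >> 16) & 31, b = v & 31;
    return (uint16_t)((r << 10) | (g << 5) | b);
}
```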
But, yeah, as noted, assuming that no more "heavy eaters" are added, the
remaining F3 and F9 blocks have theoretically enough space for 1024 more
3R ops (or 32768 if one wanted to use it all for 2R ops...).
Potentially, relocating BRA/BSR to the F8 block could free up 64 more 3R
spots in the F0 block, but would be a pretty major "breaking change".
And, potentially, one could need some more Imm16 ops, and there was
debate over the possibility of, say, adding "BRGT Rn, Disp12s" ops and
similar (say, because the usefulness of the existing Disp8s ops is
limited by the small displacement size; and "conditional branch that
doesn't stomp SR.T" is potentially useful for combining predication
with modulo-scheduling, ...).
Granted, one could do the latter case by faking:
BRGT R4, .L0
As:
BRLE R4, .L1
BRA .L0
.L1:
In cases where .L0 is outside the 256 byte limit.
But, this is bulkier and less efficient.
The 2-register cases, eg:
BREQ R4, R5, Label
Would not see such an upgrade (if done, it would only be for the
"compare register with 0" cases).
Well, and then there was debate for, if added, whether to put these in
the F8 block or in the reclaimed space in the F0 block.
A lot comes down to uncertainty over whether, in the future, I might
need any more Imm16 ops than the ones I have already (since, as noted,
this block is already 5/8 full, or 6/8 if including the space reserved
for the Disp12s compare-with-0 branches).
Partly, it is that it is not exactly difficult to write loop bodies
which exceed the Disp8s limit, in which case the current typical
fallback is:
CMPGT 0, R4
BT Label //Encoded as BRA?T
Which is, technically, 2 ops and stomps SR.T, but can branch +/- 1MB.
>>
>
>>>> After the decode stage, all the pipeline sees is a 33-bit value...
>>> <
>>> On a 64-bit machine ?!?
>> Yeah. If you want to pass a 64-bit immediate, it eats multiple lanes...
> <
> Then it is not really a 64-bit machine in a similar manner that Mc 68000
> was a 16-bit machine that could perform 32-bit calculations.
> <
The registers and ops are still 64 bits...
Just the immediate field from the decoders remains 33 bits.
For reasons, the width of the immediate field has a disproportionate
impact on LUT cost (so, it was "better" to have the decoders spit out
33-bit halves and glue them together later, than have each decoder emit
a full-fledged 64-bit immediate).
It sorta works:
No (single) 32-bit instruction can produce more than 33 bits.
For the Jumbo96 encodings, one can special case it, with the Lane1
decoder dealing with the low 32 bits, and the Lane2 decoder with the
remaining 32 bits.
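The gluing step can be sketched as plain bit twiddling (names invented; in the actual core this happens in hardware at a later pipeline stage): the Lane 1 decoder supplies the low 32 bits and the Lane 2 decoder the high 32, and a later stage concatenates them.

```c
#include <stdint.h>

/* Sketch of gluing two per-lane immediate halves into one 64-bit
 * value: each decoder emits at most a 33-bit value; for a Jumbo96
 * Imm64, Lane 1 carries the low 32 bits and Lane 2 the high 32. */
static uint64_t glue_imm64(uint64_t lane1_imm, uint64_t lane2_imm)
{
    return (lane1_imm & 0xFFFFFFFFu) | ((lane2_imm & 0xFFFFFFFFu) << 32);
}
```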
Imm57 and Imm53 cases add ugly (and not cheap) special cases, which is
why I was on the fence about them.
They require the other decoder to know what is going on in Lane 1:
F0 block: Imm53
F1 block: Imm57 (Invalid)
F2 block: Imm57
F3 block: Imm53?
F4..F7: Invalid
F8 block: Imm64
F9 block: Imm53?
FA/FB: Imm48
FC..FF: Invalid
...
Naturally, otherwise, all Lane 2 would see would be a pair of Jumbo
prefixes:
Lane 2 sees the prefixes in Lane 2 and Lane 3 spots;
Lane 1 would see the instruction word and the prefix in Lane 2.
If Lane3 sees a jumbo prefix, or Lane2 a solitary prefix (excluding
Imm48), it does nothing (just sorta behaves as if it were a NOP).
>> The decoders cooperate to produce a 64-bit value split across two 33-bit
>> immediate fields (which may then be glued back together at a later stage).
>>
>> "A 33 bit immediate should be big enough for anyone..."
>>
> Even Floating Point ??
Say:
MOV Imm64, Rn
Can load a full Binary64, but is technically a 2-lane operation (that
the two 32-bit halves are glued together in the pipeline is invisible).
Or:
FLDCH Imm16, Rn //Load immediate as Binary16 to Binary64
Routes the immediate through a format converter.
For the FpImm experiment, had ended up needing to make the decoders
perform a 5-bit to Binary16 conversion, with Binary16 to Binary64
converters shoved into the register-file module (only valid on certain
register ports).
These sorts of cases are handled with "fake" internal registers that
essentially tell the register file: "Hey, there is a Binary16 value in
the Imm33 field, fetch its value as converted to Binary64".
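The Binary16-to-Binary64 widening itself is just the standard format conversion; a software sketch for reference (the actual hardware converter's details are not shown in the post):

```c
#include <stdint.h>
#include <string.h>

/* Standard Binary16 -> Binary64 conversion (sign/exponent/fraction
 * rebias), including subnormal, zero, and Inf/NaN cases. */
static double fp16_to_f64(uint16_t h)
{
    uint64_t sign = (uint64_t)(h >> 15) << 63;
    int exp = (h >> 10) & 0x1F;
    uint64_t frac = h & 0x3FF;
    uint64_t bits;

    if (exp == 0x1F) {                    /* Inf / NaN */
        bits = sign | (0x7FFull << 52) | (frac << 42);
    } else if (exp == 0) {
        if (frac == 0) {
            bits = sign;                  /* +/- 0 */
        } else {                          /* subnormal: renormalize */
            int e = -14;
            while (!(frac & 0x400)) { frac <<= 1; e--; }
            frac &= 0x3FF;                /* drop the implicit leading 1 */
            bits = sign | ((uint64_t)(e + 1023) << 52) | (frac << 42);
        }
    } else {                              /* normal number: rebias */
        bits = sign | ((uint64_t)(exp - 15 + 1023) << 52) | (frac << 42);
    }

    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}
```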
>>
>> Basically, the decoder deals with 64-bit values in a similar way to how
>> the ALU ops deal with 128-bit values, namely by having multiple narrower
>> units cooperate and give the illusion of a wider unit.
>>>>
>>>> Granted, during decode, the decoder needs to deal with all of the
>>>> various possible instruction layouts.
>>>>
>>>> So, say:
>>>> Lookup opcode based on the various bits;
>>> <
>>> if( 6 <= inst.major <= 14 ) then OpCode format is from OpCode
>>> else OpCode format is from Major
>>> // but the important thing is that all register specifiers are always in the same
>>> // bit positions
>>> <
>>>> Finds where it is routed to;
>>> if( 9<= inst.major <= 10) MODIF determines routing
>>> if( inst.major == 12 ) MOD determines routing
>> My instruction format wasn't organized based on where the instruction is
>> routed. In some cases, this routing has changed around based on design
>> changes within the core (adding or removing units, ...).
> <
> No, you misunderstand:: it is not where instructions are routed to that I am
> talking about, it is where OPERANDS are routed from.
OK.
In my case, how to decode the operands is determined by the FormID,
which is determined based on looking up the opcode bits.
Basically, it looks up a few parameters:
NMID (6b): Major opcode (function unit)
FMID (5b): Major instruction layout
UCMDIX(6b): Minor Opcode / Control Bits
ITY (4b): Layout sub-type (*1)
BTY (3b): (Load/Store ops): Data type for memory access
UCTY (3b): Control for multi-lane/conditional operations, etc.
*1: Ordering of register ports, zero/one extension for immeds, ...
So, say, 3R ops: (Rm,Ro,Rp,Rn)
Rm, Ro, Rn, Rn
Ro, Rm, Rn, Rn
2R ops:
Rn, Rm, Rn, Rn
Rm, Rn, Rn, Rn
ZR, Rm, Rn, Rn
Rm, ZR, Rn, Rn
Cm, ZR, Rn, Rn
Rm, ZR, Cn, Cn
...
So, the ITY field is a "necessary evil" here.
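What the ITY "port ordering" does can be sketched as a small selector (the ITY code values here are invented; only the orderings themselves come from the list above):

```c
/* Sketch: given the raw Rm/Ro/Rn register fields, ITY selects which
 * value feeds each of the four ports (Rm, Ro, Rp, Rn). */
typedef struct { int rm, ro, rp, rn; } Ports;

static Ports route_ports(int ity, int rm, int ro, int rn)
{
    Ports p;
    switch (ity) {
    case 0:  p = (Ports){ rm, ro, rn, rn }; break; /* 3R: Rm,Ro,Rn,Rn */
    case 1:  p = (Ports){ ro, rm, rn, rn }; break; /* 3R, operands swapped */
    case 2:  p = (Ports){ rn, rm, rn, rn }; break; /* 2R: Rn,Rm,Rn,Rn */
    default: p = (Ports){ 0,  rm, rn, rn }; break; /* 2R: ZR,Rm,Rn,Rn */
    }
    return p;
}
```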
Then another set of blocks does all the unpacking based on the FMID and
similar.
Some preceding logic unpacks all the register fields and possible
immediate values (based on mode and presence/absence of jumbo-prefixes,
...).
So, the FMID logic is basically a big mess of case blocks to plug the
correct unpacked values into the correct output ports (I suspect this
part is where the bulk of the LUTs is going).
Granted, I guess an alternate strategy could have been to specify a
per-port permutation field, say:
Rm/Ro/Rp/Rn outputs: Select from Rm/Ro/Rn/Cm/Cn/IMM/ZR/...
Imm33 output: Select from
0/Imm5u/Imm5n/Imm6u/Imm6n/Imm9u/Imm9n/Imm10u/Imm10n/Imm16u/Imm16n/Imm16s/Imm20s/Imm24u/Imm24n
But, likely this would have ended up needing more LUTs than the FMID+ITY
approach.
Though, having the FMID drive a selector for the Imm33 output vs handle
the Immed bits directly could be worth looking at (could potentially
save some LUTs).
Inner decoder Outputs:
Rm, Ro, Rp, Rn: Registers, each 7 bits
Rm/Ro/Rp: Source Ports (elsewhere Rs/Rt/Rp)
Rn: Destination Port
For most ops, Rp==Rn.
Imm: 33 bits
UCmd: 9 bits (6b major op, 3b control)
UIxt: 9 bits (6b minor op, 3b control)
uFl: 20 bits
Control-flags for decoding multi-lane ops
Secondary Load/Store operation or inline-shuffle value.
(The values for these are stashed in Lane 3).
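The inner-decoder outputs listed above, sketched as a struct (the field widths come from the post; the struct layout and the helper's bit packing, with control in the high 3 bits, are only guesses):

```c
#include <stdint.h>

/* Inner-decoder output bundle (widths as listed; packing illustrative). */
typedef struct {
    uint8_t  rm, ro, rp, rn;  /* register ports, 7 bits each */
    uint64_t imm;             /* 33-bit immediate */
    uint16_t ucmd;            /* 9 bits: 6b major opcode + 3b control */
    uint16_t uixt;            /* 9 bits: 6b minor opcode + 3b control */
    uint32_t ufl;             /* 20 bits of multi-lane control flags */
} DecOut;

/* Pack a 6-bit opcode and 3-bit control field into one 9-bit value,
 * masking each input to its stated width. */
static uint16_t pack_op9(unsigned op6, unsigned ctl3)
{
    return (uint16_t)(((ctl3 & 7u) << 6) | (op6 & 63u));
}
```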
The outer part of the decoder then packs these outputs into the pipeline
outputs (based on the bundle layout and similar), and also deals with a
lot of the special handling for multi-lane operations.
Generally, in the outer stage, there are, say:
3x BJX2 Op32 decoders
1x BJX2 Op16 decoder
1x (or 2x) RISC-V Op32 decoders.
1x RISC-V Op16/RVC decoder (incomplete, RVC = blarg).
The RISC-V decoder uses a similar 2-stage approach to the BJX2 decoder
(but then annoys me with its dog-chewed immediate fields).
Started on an RVC decoder, but its encoding is dog-chewed and there are
too many "special case" encodings, ... So, I just sort of gave up. My
preference is to avoid the proliferation of "one off" cases.
>>
>> Rather, things were more organized by instruction format, so 3R
>> instructions are near other 3R instructions, most 2R instructions are
>> consolidated into big blocks, ...
>>
> Yes, I have this setup, too, but INS and FMAC sit in the same subGroup.
OK.
Apart from the F8/Imm16 block (which has its own layout), the other
blocks have the same layout, so theoretically nothing would have
prevented putting F1 or F2 style ops in F0, or F0 style ops in F2, ...
But, I had consolidated all the Disp9 LD/ST ops and Imm9/Imm10 ops into
larger blocks for organizational reasons.
Also, I found it preferable to have most of the 2R ops consolidated
rather than spread all over the place. I suspect this sort of
consolidation is also likely better for LUT cost in the decoder.
Say, the "casez" doesn't need to check any of the opcode bits for 2R ops
if no 2R ops are in the area (Vivado appears to do a fair job in this area).
>>
>> In effect, there is a giant set of nested "casez" blocks for every
>> instruction in the ISA.
> <
> I do this with tabularized subroutines:: three_operand[opcode](arguments);
> where the routing <from> is performed as setup to arguments.
OK.
>>
>