On 3/16/2025 8:53 AM, Michael Clark wrote:
> On 3/16/25 18:18, BGB wrote:
>> This is all going a little OT for here, but I will respond...
>>
>>
>> IMHO:
>> The balance of the encoding scheme doesn't make much sense to me.
>
> it comes from vectorized software decoding. if I drop it to 2 bits or
> bump to 4 bits then it needs an irregular scheme. it is surprising to me
> how well the numbers came out. we use extension by striping field
> bits over packets, sort of like REX but much less complex.
>
FWIW:
I also use multiple decoders running in parallel for the larger
encodings. Though, in my case, it is 3x 32-bit decoding.
I didn't mess with 16-bits as:
In 16/32 code, it is infrequent to find contiguous runs of 16-bit ops;
The logic is harder (because RV-C is so dog chewed);
The 16-bit decoders are expensive (again because of dog chew);
...
Superscalar also doesn't work with mixed 16/32 code in my case, because
dealing with it would have added significant cost. So, it was cheaper to
make a superscalar implementation that only handles 32-bit instructions.
Granted, one could make a 16-bit-first superscalar using parallel 16-bit
decoders. However, this would require all parts of an instruction to use
exactly the same layout, which is just kinda weird.
>> 16-bit ops:
>> 3-bits for register fields is a bit small.
>> Sorta works for 16-bit ops,
>> but 2R with 16-registers is likely better.
>
> X86 had 8 registers in two byte instructions. opcode plus ModRM. there
> are a small number of one byte codes but they chew lots of opcode space.
>
In x86, most operations are also either loads/stores or operate
directly on memory.
For an x86 style ISA, it is possible to add memory access to the
pipeline, but this generally requires a longer pipeline. A longer
pipeline in turn adds cost in other areas, and increases the latency of
branch operations. Branches are common enough that ideally the cost
should be kept low.
For sake of keeping pipeline latency low, it is preferable to use a
Load/Store design.
Another mechanism is to instead perform memory operations directly in
the D$, but in this case the number of operations that can act directly
on memory needs to be kept small (say, for example, XCHG,
ADD/SUB/RSUB, ...). This does not allow for general-case LoadOp
(like one would see in x86).
As noted, the above sort of mechanism (also needed for the RISC-V 'A'
extension) can allow for a limited set of x86-like operations. Even so,
I didn't explore this path all that far (and it would be fairly limited
in how far it could scale in operation complexity). There is a limited
set of basic LoadOp instructions which exist more as an experiment, and
which are generally only present if the 'A' extension is enabled.
While the 'A' extension is technically required as part of RV64G, GCC by
itself does not appear to use any of the AMO instructions (leaving 'A'
as effectively semi-optional, with RV64IMFD being what is actually used).
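For example (assuming a fairly standard riscv64 GCC with 'A' enabled),
the AMOs only show up once the source asks for them explicitly, say via
C11 atomics:

/* Hedged example: with a typical riscv64 GCC and the 'A' extension
   enabled, atomic_fetch_add should compile down to a single amoadd.w;
   plain single-threaded code never generates AMOs on its own. */
#include <stdatomic.h>

int fetch_add(atomic_int *p, int x)
{
    return atomic_fetch_add(p, x);  /* -> amoadd.w on RV64 */
}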
In a RISC style ISA, it may be preferable to keep the working values
primarily in registers, but this means that one needs more registers (of
the appropriate types) than there are working values.
Having 8 registers isn't really sufficient, and in my case, my compiler
works in a way that its native register usage does not map well to RV-C.
It is possible to use RV-C more like x86:
Load working values from memory to registers as needed;
Operate on values;
Usually evict fairly quickly (writing back to memory).
But, this strategy is worse for performance.
This sorta works on x86 mostly because the CPUs are better at hiding
memory access costs, but it is very much not ideal for a simple
in-order CPU.
Having 16 registers would at least be closer to manageable.
Say, X8..X15 and X24..X31:
6 Arguments (X10..X15);
6 Callee Save (X8/X9, X24..X27);
4 General Scratch (X28..X31).
Would likely be more generally usable (to allow for a higher percentage
of 16-bit ops while also keeping spill/fill rate acceptably low).
>> Though, compilers may differ, but IME one is likely to have better
>> results with 16 registers in 16-bit encodings than 8 registers.
>> Any more is impractical for a 16-bit encoding.
>> If RV based, possibly:
>> X8..X15 and X24..X31
>>
>> 32-bit ops:
>> Immediate fields are a little small;
>> Ideally, one wants 9 or 10 bits for Imm/Disp with 2 registers.
>> 12 bits is a little over.
>> 6-bit registers mostly good.
>> 32 or 64 is near optimal
>> 32 is traditionally more popular;
>> But, isn't enough to eliminate spill-and-fill.
>> 64 allows slightly better performance.
>> Best results IME have around 28 to 32 callee save registers.
>> Though, 24 also works "mostly OK".
>> This can reduce the amount of spill-and-fill.
>> So:
>> OP Rd6, Rs6, Imm10
>> OP Rd6, Imm16
>> Mostly for LI and ADDI and similar.
>> Though, "Disp16(GP)" is also useful.
>
> yes disp16 will be useful.
>
Disp16 in my ISAs is currently only used for GP (or GBR), mostly as I
use an ABI design where GP tends to point at the start of ".data", with
both ".data" and ".bss" accessed relative to this register.
So, operations that directly support Disp16 being (in RV terms):
LW Rd, Disp16u*4(GP) //Range: 256K
LD Rd, Disp16u*8(GP) //Range: 512K
SW Rd, Disp16u*4(GP)
SD Rd, Disp16u*8(GP)
ADDI Rd, GP, Disp16u*8
These cases are most useful for global variables, which are commonly
used in a lot of the programs I am testing with.
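As a quick illustration (hypothetical codegen, just following the
scheme above), a small global turns into a single GP-relative access
rather than an AUIPC+Load style pair:

/* hypothetical codegen sketch for the GP-relative scheme above */
int counter;              /* lands in .bss, addressed relative to GP */

int bump(void)
{
    return ++counter;     /* -> LW Rd, Disp16u*4(GP); ADDI; SW ...   */
}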
Though, I natively use a different ASM syntax for my ISAs (more along
the lines of PDP/VAX/M68K/SuperH syntax), so:
MOV.L (GBR, Disp16u), Rd //Destination on right
MOV.L Disp16u(GBR), Rd //Equivalent to the above
For indexed loads, typically:
MOV.L (Rb, Ri), Rd //Indexed Load
Vs:
MOV.L (Rb)[Ri], Rd //VAX style, uglier IMHO
MOV Rd, DWORD PTR [Rb+Ri*4] //x86 style
But, in an RV group, it makes sense to use RV-like ASM notation.
Reason for not having PC here:
Apart from possibly wanting constant pools (rather than inline
constants), there is often little within range.
And, if constant pools are used, it usually makes sense to consolidate
constants into ".rodata" and strings into ".strtab", which are generally
not within +/- 128K of PC. Taking the address of a function is
relatively infrequent, and often still outside this range.
For other base registers, a 64-bit encoding with a Disp33 is used
instead (if a Disp10 is insufficient).
Can note that SP relative addressing is usually within the Disp10 range,
as are most structs. Though, I typically use scaled displacements, so a
DWORD Disp10s gives a +/- 2K range, and a QWORD Disp10s gives +/- 4K.
For the Disp33s encodings, both scaled and unscaled forms exist:
Scaled Disp33s: +/- 32GB
Unscaled Disp33s: +/- 4GB
For things like non-local function calls, it is more common to load the
target function's address from memory and branch via a register.
In RV terms, something like:
LD X5, Disp33s(PC) //Points at import-address
JALR RA, 0(X5)
Though, I can note that I am using PE/COFF, which generally does not use
a GOT (and generally requires programs to be more explicit about DLL
imports/exports).
> you can imagine the full picture with the 32-bit ops in there. we can do
> 8-bit, 16-bit, 32-bit, 64-bit and 128-bit constants and addresses will
> have CISC-style relocations. fast. just a store. but we are depending on
> bypassing larger immediate values (>12) from this constant operand bus
> if it ever ended up in hardware. in a translator we can turn it back
> into whatever works on the target. one front-end instruction stream.
>
> the problem with any change to this design is that it breaks the super
> regular scheme as the balance between opcode bits and operand bits is
> careful, we just don't have LUI/AUIPC type instructions. if we don't
> stick to 3/6/12 bits and bond the register slots for larger immediate
> values, but not too large where we just use a constant. because I don't
> want to add any more instruction forms for this experiment. that is the
> experiment. use a dedicated constant fetcher and less i-forms.
>
I personally cared slightly less about design aesthetics, and more
about costs.
Can note that my addressing modes are mostly:
Disp10, Disp33
GBR+Disp16 (special)
Register-Indexed
Internally, the address generation always behaves as if it were doing
an indexed load/store, just with displacements being treated as a
constant index.
For the most part, the same mechanism is also used for branch calculations.
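A minimal C sketch of that unification (names hypothetical): everything
reduces to base + (index << scale), with displacement forms feeding the
displacement in as if it were a constant index register:

#include <stdint.h>

/* hypothetical sketch of a unified address-generation step */
static inline uint64_t agu(uint64_t base, int64_t index, int scale)
{
    /* scale is log2 of the element size */
    return base + ((uint64_t)index << scale);
}

/* e.g.: "LW Rd, Disp10*4(Rb)"  -> agu(rb, disp10, 2) */
/*       "MOV.L (Rb, Ri), Rd"   -> agu(rb, ri, 2)     */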
> also we can go to 5 x 6-bit operands in 64-bit instructions which we
> will need if we want to map in something like AVX-512. static config
> bits like length come out of the opcode space. i did do some numbers on
> it and it seems like it is going to fit okay.
>
> there is a lot of flexibility for opcodes but I just can't come up with
> any other numbers for the divisions without breaking the super regular
> design. and unlike REX/VEX/EVEX, we expand registers similarly by adding
> bits in successive packets. but we do so in super regular way that will
> be kind to a vectorized software decoder.
>
> it's a risk. we will have to see how it comes out if a compiler back end
> for this target ever gets built.
>
Note that hardware design tradeoffs are different from software design
tradeoffs.
Generally, hardware is much more accepting of "extreme bit twiddling".
Though, that said, one does want to limit needless variation (things
like variability in the encodings still have a cost).
>> Where, I can note that generally small immediate values fallow a bell
>> curve:
>> 0 is very common;
>> Small values are most common;
>> For many operations, positive values are significantly more common.
>> ADD/SUB: Signed bell curve around 0;
>> AND: Positive is more common, but negative is also common;
>> OR/XOR: Almost exclusively positive;
>> Load/Store Displacements: Strongly positive dominant.
>> Branch displacements: Weakly negative dominant.
>> Around 95% of immediate values fall between 0 to 511.
>> It drops to around 40% at 0 to 63.
>> Only around 3% between 512 and 2047.
>> Of the remaining percentage, the main hump is between 15 and 17 bits.
>> Followed by another smaller hump at 31 to 33 bits.
>> IME, 9 (zero extended) or 10 (sign extended) were usually best.
>> For local branches, most are within +/- 512 bytes.
>> For non-local branches (such as function calls): Around 1 to 4 MB.
>>
>> This is not to say that larger immediate values are unnecessary, but
>> rather that the relative gain from 10 to 12 is small, and it may be
>> preferable to stick with 10 bits and deal with larger values in a
>> different way.
>
> i want to see numbers. I won't be convinced until there is a compiler.
>
These stats came mostly from my own compiler. I have it set up so that
every time it compiles something, it dumps a bunch of stats to the
console (generally also dumping preprocessed output, ASTs, and ASM /
disassembly, ..., into a "dump" directory).
Granted, it uses primarily inline constants (and, in present use,
primarily 32/64-bit encodings).
In my own ISA designs, there are 96-bit encodings, but these are
comparably rare. Strict 32/64 can still get good results.
>> Using a pipeline mechanism to load immediate values from memory also
>> doesn't make much sense to me (even if built into the hardware / ISA).
>>
>> At best, it is more awkward to work with than inline immediate values.
>
> all of the GPUs do it and they get a lot more bandwidth. the problem is
> stall latency may go up 1 or 2 cycles. if you have a good constant
> branch predictor you can maintain throughput and they appear like normal
> immediate values on the operand bus once you are at execution ports.
>
> that is the theory at least.
>
A branch predictor helps, but doesn't fully eliminate the cost of
unpredictable branches.
One may end up in a situation where, say, 96% of the branches take 1 or
2 clock cycles but the remaining 4% take 10 cycles.
Predication can help reduce the number of branches, but mostly applies
to small-scale cases (like small if-branches). But it is another feature
that can't be readily accommodated in RV's encoding scheme.
While it is possible to special case short-forward-branches into a form
of predication, this creates wonk for interrupt handling (if an
interrupt happens, it needs to land after the skipped instruction block,
rather than in the middle of it).
>> For hardware, it is likely to be more expensive and have a higher
>> latency (it is likely to require an extra cache or memory-port and
>> require higher pipeline latency; where longer pipelines hurt in terms
>> of adding resource cost and also making branches slower).
>>
>> Better, IMHO, is to use larger instruction formats for bigger
>> immediate values.
>>
>> Say, for example, a 64-bit encoding for 33 bit immediates, and a 96-
>> bit encoding for 64 bit immediates.
>
> that breaks our super regular design. it is very deliberate to lose the
> 1-bit of coding space in the 16-bit space so that we can do 16/32/64/128
> but not larger, with 2 wires from each packet. the currently provisional
> RISC-V scheme needs 7-wires per packet, even when decoding length of 16-
> bit instructions due to 64-bit. I know folks mentioned taking out 48-
> bit. if you stick to just the 32-bit ops, RISC-V could adopt a scheme
> like we have proposed. but the 16-bit ops needs constants to work.
>
> so based on a distribution of lengths, the RISC-V scheme might be better
> but it is more complex for combinatorial logic and I am certain of this.
>
> the choice of 64-bit constants for the 16-bit space is access to larger
> mask constants as we see them a lot and their synthesis is a little
> unwieldy. Stanford bit-twiddling hacks style stuff. a bswap on RISC-V
> without extensions is terrible due to constant synthesis. 32-bit does
> inline synthesis and 64-bit is gp-relative loads. I deliberately don't
> want to evict data cache lines due to using a constant in these types of
> pure functions. I want dedicated constant fetch bandwidth.
>
My modified scheme for RISC-V mostly just uses 32 and 64 bit encodings.
16 and 48 bit encodings can be supported, but are not used in my case.
Using either 16 or 48 bit ops will cause the CPU to drop down to running
instructions one at a time, so I would almost assume skipping both,
except that 16/32/48 seems to be more popular in the rest of RISC-V
land, versus the 32/64 encoding scheme that I am using here.
Though, keeping support for 16 and 48 bit ops at least allows code using
them to run, albeit with a modest performance penalty.
I still mostly prefer/use inline constants (rather than any sort of
memory loads).
>> A constant pool can make sense, but:
>> Probably still better left to conventional memory loads (and base
>> registers);
>> Probably better if not be used as a "default".
>
> it works in GPUs. it could work in an OoO CPU with some careful design
> but yes it remains to be seen except in GPUs and we don't have Linux or
> Windows running on those. another page table format. what a pain.
>
I am mostly focused on optimizing for in-order designs.
While maximum performance is lower for in-order vs OoO, in-order still
wins in terms of Perf/Watt and Perf/Area.
>> As for the bigger encodings:
>> Usually these make sense for larger immediate fields;
>> Large register fields do not make sense.
>> With 64 registers, most variables fit into registers;
>> The relative gains of going much bigger here don't make sense.
>
> we split and go to 6 operands (actually 5, things like masks, and 2 out,
> etc and the other one we use for things like rounding modes etc).
>
OK.
At least 6-operand makes sense...
It came off like it was for 12-bit register IDs, which would not have
made sense.
As for "SLT Rd, Rs, Rt", it does:
Rd = Rs < Rt;
Though, with RV64G, for the other combinations, it needs
multi-instruction sequences:
Rd = (Rs == Rt); //3 ops
Rd = (Rs != Rt); //4 ops
Rd = (Rs >= Rt); //2 ops
Rd = ((Rs & Rt) == 0); //2 ops
Rd = ((Rs & Rt) != 0); //3 ops
Rd = (Rs > Rt); //1 op (flip S/T)
Rd = (Rs <= Rt); //2 ops (flip S/T)
Things get a little more complicated when immediate values get involved
(adding 1 or 2 instructions to most cases, with a penalty for "Rs>Imm"
as it is no longer possible to simply flip the arguments).
My compiler ended up using conditionals enough that these multi-op
sequences had around a 3% code-size penalty (and around 5% for run-time
performance).
I ended up adding 32-bit encodings for these cases.
Previously, I had 64-bit encodings for them, so the gain from the 32-bit
forms relative to those is smaller. However, adding the 32-bit encodings
does allow increasing the immediate size for the 64-bit encodings from
17 to 26 bits.
So:
b = (x == 0x123456);
Can be encoded in a single 64-bit instruction.
I expect GCC would use them less often though, as it uses different
strategies for things like compound branches, etc. In my compiler, it
was fastest to turn compound expressions (when the sub-expressions were
free of side effects) into plain boolean logic driving a single branch,
rather than using short-circuiting and multiple branches (the C-standard
behavior, and what GCC seems to do here).
Though, there is a certain amount of "dark arts" in all of this (which
strategy is faster will depend a lot on the runtime branch
probabilities).
Short-circuiting will often be faster if the branches are highly
predictable, but consolidated boolean expressions may be faster in the
case of lower predictability.
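As a rough illustration of the two strategies (my paraphrase, not
actual compiler output):

void do_thing(void);

/* short-circuit form (C-standard evaluation): two conditional branches */
void f1(int x, int y)
{
    if (x > 0 && y > 0)
        do_thing();
}

/* consolidated form: one branch driven by plain boolean logic,
   valid here since the tests are free of side effects */
void f2(int x, int y)
{
    int t = (x > 0) & (y > 0);
    if (t)
        do_thing();
}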
>> Some of my 64-bit encodings can merge the X and F register spaces;
>> Though, RV64 is still mostly better left as a 32-register ISA.
>> The gains from 64 GPRs are less than cost of the 64-bit ops.
>> In this mode, a significant number of ops become 64b forms.
>>
>>
>> In my own ISA designs, I have beat RISC-V performance by around 40%,
>> but I suspect that this is near a theoretical hard limit for generic
>> integer code.
>>
>> Within RV, "LI Xd, Imm17s" can save some, as there are still a fair
>> number of values for which Imm17s is sufficient and which do not need
>> a "LUI+ADDI" pair.
>>
>> Etc...
>
> we definitely need to be able to experiment.
>
> I want to try out those numbers and see what codegen is like. it needs
> the 32-bit opcodes in place which we don't have yet. it was mostly I
> wanted to make a small proof-of-concept and test out the branch
> instructions. I need to write proper tests and wire them down but that
> makes sense when there are some 32-bit opcodes which are not there yet.
>
Yeah, a compiler is a pretty useful thing to have for ISA design.
My ISA design process was more based on tinkering and fine-tuning over a
period of years.
Well, and when I started out, I was actually tinkering with an ISA known
as "Hitachi SuperH SH-4". Where I am now, it is pretty much
unrecognizable as such, as nearly everything has changed...
In some ways, it became more RISC-V like, and I was then able to throw a
RISC-V decoder onto my existing CPU core. After filling in the gaps, and
back-porting some RV features to my own ISA, things ended up getting
more similar.
However, plain RISC-V isn't so great in terms of performance. Porting
some features from my ISA over to the RISC-V encoding scheme does
improve performance (but, there are limits as to what can be done here).
For my newer ISA variant (XG3), my compiler treats it like an augmented
RV variant, but it uses a different encoding scheme.
Between my extended RISC-V variant (RV+Jx) and XG3:
There is around a 10% speed difference (XG3 is 10% faster) and a 6%
binary size difference (XG3 is 6% smaller) for plain integer workloads.
So, pretty close...
Major difference between them is that XG3 uses 64 registers natively
(using both X and F registers as a single larger register space), vs
RV+Jx still mostly limited to 32 registers (or split X / F spaces) for
practical reasons.
XG3 is slightly slower than its predecessor XG2. The feature set between
them is nearly 1:1, the most obvious differences being in the ABI (XG3
uses an ABI based on the RV ABI, with 24 callee-save registers and 8
argument registers, rather than 31 callee-save and 16 arguments).
The temptation would be to tweak the ABI, possibly switching F4..F7 over
to callee-save and allowing F10..F17 as 8 more function arguments. But,
the possible performance gains are likely not worth the interop issues
this would create.
Keeping nybbles consistent makes it easier to visually decode things in
a hex dump.
OK.
I prefer inline constants, which mostly use I$.
But, as noted, the bigger instruction formats generally end up saving
more than they cost...
OK.
I mostly optimized for FPGA.
For emulation, it is decoded in advance into traces which hold arrays of
decoded instructions, each with a function pointer.
As such, decoding cost isn't really the limiting factor for the emulator.
Like, say, the main interpreter loop is like:
VM_Trace *tr;
int n;
...
tr=VM_GetTraceForAddr(ctx, ctx->regs[VM_REG_PC]); //trace at current PC
n=limit; //max traces per time-slice
while(tr && n--)
{
tr=tr->Run(ctx, tr); //run trace, returns the next trace
}
Each trace function then does something like:
VM_Opcode *op, **ops;
ctx->tr_next=tr->tr_next; //initial next trace
ctx->cyc+=tr->cyc; //clock cycle cost
ops=tr->ops;
op=ops[0]; op->Run(ctx, op);
op=ops[1]; op->Run(ctx, op);
op=ops[2]; op->Run(ctx, op);
...
return(ctx->tr_next);
Dispatch and modeling are more the main costs.
It is fairly expensive to model the cache hierarchy and keep track of
how many cycles are spent on things like L1 misses, etc.
Though, if not going for being cycle-accurate, some of this could be
skipped.
It essentially builds a trace graph, and in this case I$ flush is
explicit and will generally involve flushing the entire graph (it was
generally faster to flush the entire graph and start over than to try to
invalidate the parts of the graph that may have been overwritten).
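A minimal sketch of that policy (hypothetical names, in the same spirit
as the loop above):

/* hypothetical: on I$ flush, free every trace and let the
   dispatcher re-decode on demand */
void VM_FlushAllTraces(VM_Context *ctx)
{
    VM_Trace *tr, *next;
    for(tr=ctx->trace_list; tr; tr=next)
    {
        next=tr->alloc_next; /* allocation chain, not control flow */
        VM_FreeTrace(ctx, tr);
    }
    ctx->trace_list=NULL; /* next lookup rebuilds the graph */
}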
However, in this case the FPGA version of the CPU runs at 50 MHz and it
isn't too difficult to keep the emulator fast enough to emulate it in
real time on a desktop PC.
Though, real-time emulation on a cell-phone or RasPi is a fair bit harder.
Note that performance was generally being measured in terms of how many
clock cycles it would take on an FPGA, and not the time needed to
emulate it in the emulator.
In past VMs, I had used a JIT stage. But, in this VM, the JIT fell into
disrepair as generally the interpreter was fast enough.
>> *: My CPU core can do in-order superscalar and also explicit
>> parallelism (in my older ISA variants), but only for natively-aligned
>> 32-bit instructions. If using 16-bit ops or if the instruction stream
>> is misaligned, it falls back to running instructions one-at-a-time
>> (generally, larger encodings also run one at a time).
>>
>> Superscalar with 16-bit instructions would add too much cost and
>> complexity to be worthwhile.
>>
>> My newest ISA variant has also gone over to superscalar (but only has
>> 32/64/96 bit encodings).
>
> that's somewhere where I do have numbers because I have evaluated
> combinatoric expansion of unaligned decoders for various widths and I
> think it is practical up to 256-bit (16 x 16-bit ops, 8 x 32-bit ops).
> which is probably good up to about 12-wide execution.
>
> also we use a lot less wires than the current RISC-V scheme at the
> expense of that 1 bit in the 16-bit opcode space, but we win some back
> from short access to large constants.
>
> I still want to see code density numbers for this experiment because it
> feels like it might be okay. surely okay for a virtual machine. likely
> better than CLR or JVM and closer to a scheme that could be reified in
> hardware. like a RISC machine with constant memory. certainly somewhat
> decent for vectorized decoding.
>
OK.
Can note that I am mostly going for 3-wide:
1 to 3 instructions in parallel: 32-bit ops only;
One instruction at a time: all other cases.
Going much wider than 3-wide likely isn't practical for the class of
FPGA I am targeting (mostly testing on an XC7A100T).
A simple 1-bit flag (SR.T in this case) works OK.
But, yeah, in my ISAs, instruction names in ASM could have an optional
suffix:
?T: If-True (run if SR.T is set; NOP otherwise);
?F: If-False (run if SR.T is clear; NOP otherwise).
There were BT/BF pseudo instructions, encoded as BRA?T and BRA?F.
This was demoted to optional for XG3, mostly because it has some effect
on the architecture, and it is possible someone might want to do the
reverse and implement XG3 on top of an otherwise pure RISC-V core, in
which case they wouldn't necessarily want to support features that
require new architectural state.
>> Granted, people are free to doubt...
>> Does possibly seem a little absurd on face value.
>>
>>
>> Main path to size reductions (and performance) being to reduce the
>> number of instructions needed in common cases.
>>
>> Though, I am mostly avoiding going too far outside of RISC territory
>> (and avoiding adding or relying on features which have a significant
>> added cost in terms of logic).
>>
>> Mostly it is lots of "little things"; "death by 1000 paper cuts" type
>> stuff.
>
> right. what I have right now is enough for a tiny MCU without any
> privileged instructions or MMU. it's mostly just an experiment with, not
> just using constant memory, but treating the base register like PC and
> modifying our branch instructions, along with a constant block branch
> instruction. it doesn't even have MUL/DIV. because the focus is on a
> small but complete 16-bit ISA as the basis. I still have to explore the
> scheme for the opcode space to expand because we stripe across packets.
>
> the constant block branches are a bit like graph traversal. imho it is
> surprising this has not been put into a general purpose CPU yet.
>
OK.
In my case, there is an MMU based around a 256x 4-way software-managed
TLB.
This wouldn't make as much sense for an emulator, but it is a cheaper
option to implement in an FPGA compared with a hardware page-table
walker. It is also more flexible than a hardware page walker, though it
has a higher performance overhead (since handling a TLB miss involves
taking an interrupt).
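A minimal sketch of the idea (names and layout hypothetical, not my
actual implementation): the miss interrupt walks whatever structure the
OS keeps and writes an entry back into the hardware TLB:

#include <stdint.h>

#define TLB_SETS 256
#define TLB_WAYS 4

typedef struct { uint64_t vpn_flags, ppn_flags; } TLBE;

extern TLBE os_walk_page_tables(uint64_t vaddr); /* OS-side, assumed */
extern void tlb_write(int set, int way, TLBE e); /* CSR/MMIO, assumed */

void tlb_miss_handler(uint64_t miss_vaddr)
{
    TLBE e=os_walk_page_tables(miss_vaddr); /* software does the walk */
    int set=(miss_vaddr>>12)&(TLB_SETS-1);  /* index on low VPN bits  */
    static int rover;                       /* cheap replacement      */
    tlb_write(set, rover++&(TLB_WAYS-1), e);
    /* return from interrupt; the faulting access retries and hits */
}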
For an emulator, a page-walker is arguably a better option.
I have considered the possibility of an "inverted page table" (similar
to what was used on Itanium); however, this would be a more complicated
option and has implementation issues similar to those of a page walker
(an inverted page table also needs to be able to access memory to check
for matching TLB entries, and then trigger an exception if the
memory-backed TLB lacks a valid TLBE).
I had looked into it some, and noted that if I wanted to run a RISC-V Linux
build, I would likely need a UEFI implementation, which would likely
itself need to be implemented as a DOS-like OS (even a basic UEFI
implementation is likely to be more than I can fit into a 32K or 48K
ROM). So, the "UEFI BIOS" would likely also need to be loaded from the
SDcard.
However, in this case, the "firmware" could likely provide its own
page-walker (and likely hide the lack of actual hardware support from
the OS).
>> Though, for most non microcontroller use-cases, saving code-density is
>> not likely to matter as much. In my 32/64/96 ISA variants, I was
>> mostly optimizing for performance (trying to reduce instruction
>> counts) and mostly got code-density savings as a side effect.
>>
>> Note, this isn't going quite so far as using Load-Op encodings or
>> auto- increment addressing though (which generally go a little outside
>> the scope of RISC territory).
>>
>> Say, for example:
>> (*ptr)+=x;
>> Will still need 3 instructions in my ISA designs, but could in
>> theory be reduced to a single instruction, as could "y=*ptr++;"
>>
>> But, then this goes into "is it worth it?" territory. My current
>> leaning is that these cases aren't likely worth it.
>
> my POV is I don't really like CLR/JVM/WASM as a target and I would like
> a virtual machine that is closer to a metal scheme. if I have time to
> write a translator I can test it out. but it seems unlikely I can manage
> implementing GCC/LLVM backends on my own. or synthesizable logic. I just
> don't have the bandwidth to attempt that.
>
OK.
Though, the tradeoffs for a VM are different from those of an "actual
hardware" ISA (or, at least, FPGA; ASIC is its own thing, but I am
unlikely to target ASIC as I am not made of money).
> but I can advance a tiny experiment. anyway, thanks for your feedback.
>
> I think we need a chance to test these ideas out. and I am open to
> changes in the opcodes, but I can't really change the instructions forms
> without breaking the design premise of this experiment. it will be more
> interesting when the 32-bit space is there and we have access to the
> full suite of constant sizes: 8/16/32/64/128.
>
> and I think I will come out with 64 and 128 XLEN.
>
> as mentioned the 16-bit space is all XLEN wide because it is designed
> for prolog/epilog spill/refill call, load constant etc. so if the XLEN
> were 128 then the entire 16-bit page changes in width but the opcodes
> remain the same. we might lose a few percent from keeping it regular.
> but I want to try this out in a software virtual machine. I get a good
> feel for the instruction count so long as the 32-bit space has all of
> the bitmanip instructions and the typical SIMD stuff.
>
OK.
Can note that I did 128-bit SIMD using paired 64-bit registers.
Though, in many cases, 64-bit SIMD remains dominant:
Mostly 4x 16-bit packed-integer and Binary16 / Half-Float.
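For reference, the classic SWAR trick for the packed-integer case
(generic C, not my actual implementation): a 4x 16-bit packed add
within a 64-bit register, masked so no carry crosses a lane boundary:

#include <stdint.h>

/* 4x 16-bit packed add in one 64-bit register (SWAR style):
   add the low 15 bits of each lane, then XOR in the lane top
   bits so carries never propagate across lanes. */
uint64_t padd_4x16(uint64_t a, uint64_t b)
{
    const uint64_t H=0x8000800080008000ULL; /* lane sign bits */
    return ((a&~H)+(b&~H))^((a^b)&H);
}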
I have an OpenGL implementation which uses a lot of 128-bit 4x Binary32
SIMD as well, but this is currently only usable with my own ISA.
On a 50 MHz CPU, running OpenGL on RV64G is glacial; it needs SIMD to
have any real hope of being usable.
> Regards,
> Michael.
>