On 3/4/2023 5:48 PM, 'MitchAlsup' via RISC-V ISA Dev wrote:
>
>
> On Saturday, March 4, 2023 at 4:50:37 PM UTC-6 BGB wrote:
>
> On 3/4/2023 1:36 PM, L Peter Deutsch wrote:
> JIT can work, and can also help to separate a higher level
> "distribution
> format" from the more specialized native ISA used by a given machine. A
> bytecode format could work here, apart from the seeming inability of
> anyone to develop and adopt a bytecode format that "doesn't suck".
>
>
> Is LLVM IR too high-level, or does it "suck too much"?
>
Unless you are using LLVM itself, LLVM IR is very painful to work
with. The text format is "slightly less evil" in some ways than the
"bitcode", but still not ideal as a starting point (using a text-based
format and parser is needless overhead).
The bitcode is less a bytecode format, and more a binary pickling scheme
for LLVM's C++ classes being passed off as a bytecode.
For a small target machine, the memory overhead of LLVM would also make
it basically unusable.
If going the SSA route, something almost like SPIR-V might make sense as
a starting point, but as I see it, SSA is itself a big up-front cost.
Unlike either RPN or a straightforward 3-register bytecode, SSA is also
not particularly usable in an interpreter, and only really makes sense
if one assumes always using a full compiler as the backend. But then, it
is easy enough to convert a non-SSA bytecode into SSA form.
Something more like JVM bytecode makes sense for an interpreter, but
directly interpreting a design like JVM bytecode will give lackluster
performance: it needs a lot of operations to get much done, and a direct
interpreter is mostly bottlenecked by how quickly it can spin in a loop
and dispatch bytecode instructions.
Something like .NET bytecode makes less sense for a direct interpreter
due to the lack of per-instruction type information (trying to pass
types on the stack and deal with them in each operation will ruin
performance).
So, say:
  JVM-like IR: OK for interpreter and compiler;
    But bad for most any language "not Java".
  .NET-like IR: Better for compiler, bad for interpreter.
  SSA form: Good for compiler, bad for interpreter.
  LLVM IR: OK for LLVM, bad for everything not LLVM.
  Three-register IR (non-SSA): OK for compiler, good for interpreter.
> <snip>
> While it may be tempting to try to gloss over 32 vs 64 bit pointers; in
> my experience trying to gloss over this when compiling C code is "non
> trivial", so the IR would likely still need to assume either 32 or 64
> bit pointers on the target.
>
> No reason to gloss over displacement widths or pointer widths.
>
It is tempting to try to make a bytecode that works "natively" on both
32-bit and 64-bit machines.
But making this work with typical C code and header files is basically a
no-go.
C is a leaky abstraction, and there is basically no way within the
existing language to entirely sidestep things like "sizeof(void *)",
"sizeof(long)", and similar.
EFI Byte Code (EBC) once attempted to address this, but it had some
severe limitations and was apparently mostly DOA (while still not being
general-purpose enough to be useful outside of hardware drivers).
>
> Otherwise, for a CPU, I would more assume having multiple tiers of
> implementation, and making some features optional at lower-stat tiers,
> but potentially required for higher-stat tiers.
>
> <snip>
> As-is, last I have heard there is still not even an extension in RISC-V
> for indexed load/store, which in my experience tends to be the second
> most common addressing mode after fixed displacement.
>
>
> This is a RISC-V problem not inherently a JIT problem
>
Yeah.
>
> A general-use ISA without indexed addressing personally seems misguided
> IMO. Its costs (such as potentially needing to get a 3rd register read
> port from somewhere) seem mostly moot by the time one has an MMU and
> FPU
> (and almost entirely moot by the time one has superscalar; as by then
> the register file already needs to have 4 or 6 read ports in any case).
>
> As soon as you have FMAC (y = x×y+z) you have the 3-register operand
> requirement in DECODE. [Rbase+Rindex<<scale+DISP] only has a 2-register
> operand requirement in DECODE. AGEN must receive 3 operands, but
> one of them does not arrive from the register file or need forwarding.
>
In my case, the AGU is handled in EX1 (with immediate values handled
internally as a special register), and so things like memory stores
need 3 register ports.
So, as noted (in my case):
  RISC-like profile: 3R1W register file;
    Only one op at a time.
  2-wide: 4R2W;
    But MEM eats both lanes.
    Only being able to do "ALU | ALU" is lame...
  3-wide: 6R3W;
    MEM eats the 3rd lane.
    "ALU | MEM": fairly common.
    "ALU | ALU | ALU": occasionally happens.
So, I went with 3-wide because it wasn't too much more expensive than
2-wide, but was in many ways significantly more useful.
For the "GPU Profile" core, the idea was to drop to a 6R2W register
file, with a limit of 2 instructions per clock, the last two register
ports serving exclusively to supply a 3rd read port to the other
instructions. This allows memory ops and MAC/FMAC and similar to not eat
the second lane, and still leaves enough ports for the current set of
128-bit SIMD operations.
My current thinking would be also that this mode would be hard-wired to
use my "XG2" ISA mode (fixed-length 32-bit instructions with 64 GPRs).
Similarly, it would strip off most other ISA features not directly
relevant to 3D rendering and similar. So, for example, it would probably
still keep the "LDTEX" instruction (loads a texel value from a
block-compressed texture), but a lot of other stuff would go away.
If I had a dedicated GPU core, this might be reason to disable LDTEX in
the main core(s); it being a bit niche, and not useful for much other
than drawing texture-mapped spans (and the mechanism for this
instruction isn't entirely free either).
There are possible expansions to the SIMD ISA, such as a possible debate
over whether to add, say:
  Pxxx.H Rm, Imm52, Rn
  Pxxx.F Xm, Imm52, Xn
SIMD ops which use a 4x FP13 (S.E5.F7) immediate.
As-is, it is necessary to load a 4x Binary16 constant:
  JLDI Imm64, Rn //64b as 4x Binary16
  PLDCH Imm64, Xn //64b 4x FP16, converted to 4x Binary32
which burns an extra clock cycle.
But, this is only really likely to make a difference for neural nets,
which is a bit niche (and would be partially counteracted in this case
by no longer being able to put vector shuffles in parallel with the SIMD
ops in the case of Binary16).
Encoding would essentially be sticking two Jumbo_Imm prefixes onto a
SIMD op, with the decoder interpreting the presence of the jumbo
prefixes to mean decoding it as an immediate form rather than as a 3R
register form (likely extending the mechanism for floating-point
immediate values to also handle SIMD vector immediates and similar).
Interestingly, with the new FPGA board (with an XC7A200T-1), there is
enough space for two CPU cores and a GPU core. So, roughly triple core;
almost but not quite enough LUTs to go quad-core. Will see how long this
lasts...
Though, I am still working on getting the DDR3 RAM module and similar
working. And it seems the XC7A200T -1 is more prone to fail timing
than the XC7A100T -1, which was in turn more prone to fail timing than
the XC7S50 -1 (at least until the FPGAs start to run out of space or
give "routing congestion" warnings).
I guess more space to put stuff into means longer paths and more "net
delay"...
Can put more stuff in the FPGA, but can't have more than 16K of L1
cache... Can at least sort of counterbalance the 16K L1s by now being
able to have 512K of L2 cache (still throwing most of the block RAM in
the FPGA at the L2 cache).
Would have gotten a nice speedup with a -2 FPGA, but the board I got was
a little cheaper and only has a -1 speed grade.
Then again, it is possibly all a little moot, as it has been a good
number of months since I last ran any testing on actual hardware, and
general stability on actual FPGAs seems to have declined slightly during
this time (stuff is annoyingly crash-prone at the moment, much more so
than in the simulations).
TBD whether doubling the size of the L2 cache will counterbalance the
RAM being roughly 50% slower in this case. Minimum CAS latency
effectively goes from 3 cycles to 6 cycles, but the RAM chip also goes
from 128MB to 256MB. Mostly it is that DDR3 was designed for a higher
minimum clock speed than DDR2 (it was designed for 667MHz, not
50MHz...).
Could in theory use Vivado MIG, but meh, still don't want to deal with
AXI or similar.
>
> Meanwhile, opcode fusion is an expensive feature that doesn't scale
> well. Not an ideal approach for either small or large implementations.
>
> Whereas a well rounded ISA does not have these problems..........
>
Yeah.
I am not needing to fuse any opcodes in BJX2...
Then again, I have a different set of issues...