Yeah.
There are a few reasons I went with VLIW rather than superscalar...
Both VLIW and an in-order superscalar require similar logic from the
compiler in order to be used efficiently, and the main cost of VLIW here
reduces to losing 1 bit of instruction entropy and some additional logic
in the compiler to detect if/when instructions can run in parallel.
Superscalar effectively requires a big glob of pattern recognition early
in the pipeline, which seems like a roadblock.
I had considered supporting superscalar RISC-V by using a lookup to
classify instructions as "valid prefix" or "valid suffix", along with
logic to check for register clashes, and then behaving as if there were
a "virtual WEX bit" set based on this.
Hadn't gotten around to finishing this. RISC-V support on the BJX2 core
is still mostly untested, and will still be limited to in-order
operation for the time being (and the superscalar mechanism as designed
would only be able to give 2-wide operation, and only for 32-bit
instructions).
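Roughly the sort of pairing check I had in mind, sketched here in C (the
classification table and pairing rules are placeholders for
illustration; the real thing would be a small lookup plus comparators in
the decode stage, written in Verilog):

#include <stdint.h>

/* Hypothetical classification flags; a real core would derive these
   from a small lookup keyed on the opcode/funct bits. */
#define CLS_VALID_PREFIX  1  /* may issue in the first lane  */
#define CLS_VALID_SUFFIX  2  /* may issue in the second lane */

/* Very rough classifier: treat simple OP/OP-IMM/LUI/AUIPC as pairable,
   everything else as single-issue (an assumption, not the real table). */
static int classify_rv32(uint32_t insn) {
    switch (insn & 0x7F) {
    case 0x33: /* OP      */
    case 0x13: /* OP-IMM  */
    case 0x37: /* LUI     */
    case 0x17: /* AUIPC   */
        return CLS_VALID_PREFIX | CLS_VALID_SUFFIX;
    default:
        return 0;
    }
}

/* Returns nonzero if insn_a and insn_b could run as a 2-wide bundle,
   i.e. behave as if a "virtual WEX bit" were set on insn_a. */
static int can_pair_rv32(uint32_t insn_a, uint32_t insn_b) {
    int cls_a = classify_rv32(insn_a);
    int cls_b = classify_rv32(insn_b);
    uint32_t rd_a  = (insn_a >> 7)  & 0x1F;
    uint32_t rd_b  = (insn_b >> 7)  & 0x1F;
    uint32_t rs1_b = (insn_b >> 15) & 0x1F;
    uint32_t rs2_b = (insn_b >> 20) & 0x1F;

    if (!(cls_a & CLS_VALID_PREFIX)) return 0;
    if (!(cls_b & CLS_VALID_SUFFIX)) return 0;

    /* Register clash: the second instruction may not read or write the
       result of the first within the same bundle (x0 excluded). For
       formats where the rs1/rs2 fields are actually immediate bits this
       check is overly conservative. */
    if (rd_a != 0 && (rd_a == rs1_b || rd_a == rs2_b || rd_a == rd_b))
        return 0;
    return 1;
}

The set of instructions allowed in each lane would need tuning to
whatever the second lane can actually execute.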
Running POWER ISA code on the BJX2 pipeline would be a bit more of a
stretch though (I added RISC-V as it was already pretty close to a
direct subset of BJX2 at that point).
Things like "modulo scheduling" could in theory help with a VLIW
machine, but in my experience modulo-scheduling could likely also help
with some big OoO machines as well (faking it manually in the C code
being a moderately effective optimization strategy on x86-64 machines as
well).
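As a toy example of the manual version (not from any particular
codebase), the idea is mostly to rotate the loads one iteration ahead of
their use, with a prologue and epilogue:

/* Plain loop: each iteration's load feeds directly into its multiply,
   so the multiply stalls waiting on the load latency. */
void scale_plain(int *dst, const int *src, int n, int k) {
    for (int i = 0; i < n; i++)
        dst[i] = src[i] * k;
}

/* Manually "modulo scheduled" version: the load for iteration i+1 is
   issued while the multiply/store for iteration i is still in flight,
   so the load latency is hidden behind the previous iteration's work. */
void scale_pipelined(int *dst, const int *src, int n, int k) {
    if (n <= 0) return;
    int cur = src[0];                 /* prologue: prime the pipeline */
    for (int i = 0; i < n - 1; i++) {
        int nxt = src[i + 1];         /* load for the next iteration  */
        dst[i] = cur * k;             /* compute/store for this one   */
        cur = nxt;
    }
    dst[n - 1] = cur * k;             /* epilogue: drain the pipeline */
}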
Apparently, clang supports this optimization, but this sort of thing is
currently a bit outside the scope of what I can manage in BGBCC.
Though, I have observed that this strategy seems to be counter-productive
on ARM machines (where it often seems to be faster to not try to
manually modulo-schedule the loops). Though, this may depend on the ARM
core (possibly an OoO ARM core might fare better; most of the ones I had
tested on were in-order superscalar).
Though, I wouldn't expect there to be all that huge of a difference
between AArch64 and BJX2 on this front, where manual modulo-scheduling
is generally effective on BJX2.
But, as noted in my case on BJX2, latency is sort of like:
1-cycle:
  Basic converter ops, like sign and zero extension;
  MOV reg/reg, imm/reg, ...
  ...
2-cycle:
  Most ALU ops (ADD/SUB/CMPxx/etc);
    Some ALU ops could be made 1-cycle, but "worth the cost?".
  More complex converter-class instructions ('CONV2');
    Many of the FPU and SIMD format converters go here.
3-cycle:
  MUL (32-bit only);
  "low-precision" SIMD-FPU ops (Binary16, opt Binary32*);
  Memory Loads;
  The newer RGB5MINMAX and RGB5CCENC instructions;
  ...
*: The support for Binary32 is optional, and was pulled off by fiddling
the FPU to "barely" give correct-ish Binary32 results, with the notable
restriction that it is truncate-only rounding.
RGB5MINMAX was basically:
  Cycle 1: Figure out the RGB555 Y values;
  Cycle 2: Compare and select based on Y values;
  Cycle 3: Deliver output from Cycle 2.
Initial attempt had routed this through the CONV2 path, but it was bad
for cost and timing, so I had reworked it to share the RGB5CCENC module,
which also had similar logic (both needed to find Y based on RGB555
values, ...).
RGB5CCENC was basically:
  Cycle 1:
    Figure out the RGB555 Y values (for pixels);
    Figure out the RGB555 Y values (for Mid, Lo-Sel, Hi-Sel);
  Cycle 2: Compare and generate selector indices based on Y values;
  Cycle 3: Deliver output from Cycle 2.
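To illustrate the shared Y step, a rough C model of the RGB5MINMAX side
(the luma weights and the interface here are guesses for illustration;
the actual instruction works on packed RGB555 values and the exact
weighting may differ):

#include <stdint.h>

/* Crude luma approximation for an RGB555 pixel (Y ~= 2G + R + B).
   The weights here are an assumption, not necessarily what the real
   RGB5MINMAX/RGB5CCENC logic uses. */
static int rgb555_luma(uint16_t px) {
    int r = (px >> 10) & 0x1F;
    int g = (px >>  5) & 0x1F;
    int b =  px        & 0x1F;
    return 2 * g + r + b;
}

/* Software model of an RGB5MINMAX-style operation on a block of n
   RGB555 pixels (n >= 1): pick the pixels with the lowest and highest
   Y. Cycle 1 ~ the luma computations, Cycle 2 ~ the compare/select,
   Cycle 3 ~ delivering the result. */
static void rgb5_minmax(const uint16_t *px, int n,
                        uint16_t *min_px, uint16_t *max_px) {
    int ymin = 0x7FFFFFFF, ymax = -1;
    for (int i = 0; i < n; i++) {
        int y = rgb555_luma(px[i]);
        if (y < ymin) { ymin = y; *min_px = px[i]; }
        if (y > ymax) { ymax = y; *max_px = px[i]; }
    }
}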
Some longer non-pipelined cases:
  6-cycle: FADD/FMUL/etc (main FPU)
  10-cycle: FP-SIMD via main FPU ("high precision").
  40-cycle: Integer DIVx.L and MODx.L
  80-cycle: Integer DIVx.Q and MODx.Q, 64-bit MULx.Q, ...
  120-cycle: FDIV.
  480-cycle: FSQRT.
For integer divide and modulo, it is mostly a toss-up between the ISA
instruction and "just doing it in software".
For 64-bit integer multiply, doing it in software is still faster.
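The software 64-bit multiply mostly just builds the low 64 bits of the
product from 32-bit partial products (assuming a widening 32x32->64
multiply is available; only three of them are needed):

#include <stdint.h>

/* 64x64->64 multiply built from 32-bit partial products; the high*high
   term only affects bits above 63, so it can be dropped entirely. */
static uint64_t mul64_sw(uint64_t a, uint64_t b) {
    uint32_t al = (uint32_t)a, ah = (uint32_t)(a >> 32);
    uint32_t bl = (uint32_t)b, bh = (uint32_t)(b >> 32);

    uint64_t lo  = (uint64_t)al * bl;      /* low * low   */
    uint64_t mid = (uint64_t)al * bh
                 + (uint64_t)ah * bl;      /* cross terms */
    return lo + (mid << 32);
}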
For floating-point divide, doing it in software is faster, but the
hardware FDIV is able to give more accurate results (software N-R
seemingly being unable to correctly converge the last few low-order bits).
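The software path is roughly the usual Newton-Raphson reciprocal
(generic sketch, not the actual runtime code; no handling of zero, inf,
NaN, sign, or denormals here). The final couple of multiplies are where
the low-order bits tend to get lost:

#include <stdint.h>
#include <string.h>

/* Software divide via Newton-Raphson: refine a reciprocal estimate with
   x' = x * (2 - b*x), then multiply by the numerator. Each step roughly
   doubles the number of correct bits, but the rounding of the last
   x*(2-b*x) and a*x multiplies can leave the low-order bits of the
   quotient slightly off versus a correctly-rounded divide. */
static double sw_fdiv(double a, double b) {
    uint64_t bits;
    double x;

    /* Crude initial estimate of 1/b by flipping the exponent (max
       relative error around 12.5% for normal, positive inputs). */
    memcpy(&bits, &b, sizeof bits);
    bits = 0x7FE0000000000000ULL - bits;
    memcpy(&x, &bits, sizeof x);

    for (int i = 0; i < 5; i++)      /* 5 steps: ~full double precision */
        x = x * (2.0 - b * x);

    return a * x;
}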
The HW FDIV was based on noting that I could basically hack the "Binary
Long Division" hardware divider to also handle floating-point by making
it run ~ 50% longer (and adding a little extra special-case logic for
the exponents).
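Conceptually something like the following (bit counts simplified, and
the real thing also has to deal with normalization, rounding, and
special cases):

#include <stdint.h>

/* Hypothetical unpacked FP value: 53-bit normalized mantissa + exponent. */
typedef struct { uint64_t mant; int exp; } sfp;

/* Sketch of reusing a shift-subtract ("binary long division") divider
   for FP: an integer divide of the mantissas would stop at the integer
   part, so the loop is simply run for more steps to also produce the
   fraction bits of the quotient, while the exponents are handled
   separately. */
static sfp fp_div_longdiv(sfp a, sfp b) {
    uint64_t rem = a.mant;   /* both mantissas assumed to have bit 52 set */
    uint64_t quo = 0;
    sfp q;

    /* 56 steps: 1 integer bit + 53 fraction bits + a few guard bits;
       the extra fraction-bit steps are roughly where the "run it
       longer" part comes from. */
    for (int i = 0; i < 56; i++) {
        quo <<= 1;
        if (rem >= b.mant) { rem -= b.mant; quo |= 1; }
        rem <<= 1;
    }

    q.mant = quo;            /* still needs normalization and rounding */
    q.exp  = a.exp - b.exp;  /* plus the small extra exponent logic    */
    return q;
}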
The attempt at hardware FSQRT is basically a boat anchor (doing it in
software is significantly faster). Not aware of any "good/cheap" way to
do SQRT in hardware.
> Even though slow an FPGA is great for experimenting with things like
> register renaming.
> One issue with the FPGA is it cannot handle non-
> clocked logic at all. So some logic that relies on a circuit settling in
> a loop cannot be done very well in an FPGA.
>
Yeah.
> For my most recent core I am starting with a simple scalar
> sequential machine. It should be able to run at 40 MH+, but likely
> taking 3 or more clocks per instruction. I have not been able to
> isolate a signal that is missing and causing about half the core
> to be omitted when built.
>
In early design attempts, I was trying for 100 MHz, but was mostly being
defeated. Eventually, I dropped to 50 MHz, and have mostly settled there.
Some people have gotten 200 MHz cores to work, but these were mostly 8-
and 16-bit designs (typically with no external RAM interface, so only
using Block-RAM as RAM).
Though, apparently, some people have done Commodore 64 clones using the
XC7A100T and XC7A200T this way, since the FPGA has more Block-RAM than
the entire memory space of the C64 (then mostly implementing the 6502
ISA and various peripheral devices).
Not sure of the point of a "massively overclocked C64" though.
Doing it on an FPGA makes more sense than what some other people are
doing, trying to build it more "authentically" out of DIP chips (and
trying to source a bunch of chips that have been out of production for
almost as long as I have been alive).
Then, apparently, people have done "replacement clone" DIP chips, which
are effectively just FPGAs on small PCBs made to mimic the original DIP
pinouts (but then one ends up with a C64 clone mostly running off a
bunch of smaller FPGAs rather than a single bigger FPGA).
But, I guess my project is mostly different in that it wasn't
particularly motivated by "childhood nostalgia", and some of the major
games that I played when I was still young (namely Quake 1 and 2), are
still a little heavyweight for the BJX2 core.
Others, like Half-Life, never had their source code released (there is
partial source, such as for the game-logic, etc, but it is not under
anything like GPL or similar). In theory, someone could do a clone of
the GoldSrc engine based on a modified GLQuake or Quake2 engine or similar.
Well, and game consoles like the PlayStation, with games like Mega-Man
X4/Legends/etc, FF7, etc. Or DreamCast with games like Sonic Adventure
and similar.
Well, and sadly, my software OpenGL rasterizer still falls short of
matching the performance of the GPU in the original PlayStation.
Though, possibly, doing a game with graphics on par with something like
Mega-Man Legends could work (the scenes were mostly made out of a
limited number of fairly large polygons).
Ironically, games like "Resident Evil" and similar would be easier, as
they were mostly using low-detail 3D models over static image backgrounds.
Or other possible trickery involving cube-maps or skyboxes to fake a
bigger and more detailed scene than could actually be rendered directly
(stuff near the camera being 3D rendered, and everything else offloaded
to what is effectively a skybox).
> I think getting a VLIW machine working would be very difficult
> to do. Getting the compiler to generate code making use of the
> machine would be a significant part of it. Risc with a lot of registers
> makes it easier on the compiler.
>
Yeah.
I have done a VLIW ISA, but with the compiler mostly treating it like a
RISC.
This is kinda crappy, but basically works...
Sadly, this does mean that "performance sensitive parts" end up needing
to fall back to ASM far more often than would be needed on a more modern
system (on a modern PC, whether there is any real speedup from ASM is a
bit hit-or-miss).
...