On 2/11/2024 7:03 AM, Tommy Murphy wrote:
> I'm not totally clear on the rationale behind this comparison or what
> the ultimate aim is?
>
>> this has allowed me to more directly compare performance between my own ISA and RV64G.
>> ...
>> In this case, the idea is that the CPU core has decoders for both my
>> own ISA (BJX2) and for RV64.
>> ...
>> There is also a possible difference due to my compiler/ISA using ...
>> ...
>> I am using a custom C library
>
> Surely you're not comparing just ISAs but some combination of ISAs, ISA
> microarchitecture implementations on a particular FPGA (and, from what
> Bruce says, a possibly sub-optimal RV64G implementation), different
> compilers (GCC and your own BJX2, or is it XG2?, compiler), different C
> libraries etc.? Given the number of variables it's difficult to see what
> general conclusions can be drawn. But maybe I've misunderstood something
> here?
>
It is one CPU core design that runs both my own ISA and RV64G.
The Boot ROM will boot into the ISA Mode corresponding to the binary
that was loaded.
In both cases, I am using the same C library:
A heavily modified version of PDPCLIB from Paul Edwards...
As in, I ended up rewriting probably half of it.
If booting bare metal, the C library also includes most of the "OS"
functionality, like low-level memory management, hardware interfacing,
and filesystem support.
I didn't really want to do performance comparisons until I got RV64
running on my stuff, since comparing across two different CPUs and C
libraries doesn't exactly lead to accurate results.
This way, I can run the comparisons with things at least
semi-equivalent.
But, yeah, my own compiler is BGBCC.
I will note that currently, it can't yet compile RISC-V code (I had
started working on it, but the C ABIs are rather different).
So:
BJX2 is the name of the overall ISA;
XG2 is a sub-variant of the ISA;
It gets slightly better performance, but worse code density.
Instruction sizes are 32/64/96 bits.
Though, primarily, 32-bit is the most common instruction size.
BGBCC is the compiler I am using.
It is basically full-custom, but has also been around for a while.
I will resist going into the history of BGBCC here.
When it was revived, it was first targeted to build for SuperH/SH-4.
In effect, my ISA design is distantly related to SH-4,
even if now pretty much unrecognizable as such.
But, in terms of the compiler mismatch, I don't think this negatively
affects RISC-V. Code generation in my compiler is "not particularly
good" at times, and GCC is generally much better at clever optimizations.
So, I don't feel comparing BGBCC and GCC output here is particularly
unfair against RV64. If anything, I would expect BGBCC to lose due to
lackluster code generation.
Like, for example, register allocation strategy:
My compiler has two basic ways of dealing with registers:
Statically assign a variable to a register for the whole function;
Dynamically assign a variable within a single basic-block;
Any such variable will be loaded as needed,
and then spilled at the end of the basic-block.
GCC seems to assign variables to registers point-by-point,
with the registers flowing from one basic-block to another.
Though, this was a partial incentive for my ISA ending up with 64 registers:
This allows a larger number of functions to use a "statically assign
every variable to a register" strategy, which results in less spill and
fill.
Also, GCC has other clever abilities:
Ability to propagate constant values through variables;
Ability to inline functions and similar
(never mind that I disabled inlining in this case).
...
Note that the C library stuff also includes all the "OS level" APIs I
was using for the hardware interfacing (I had been working on moving
away from programs interfacing with the hardware directly, and instead
going through APIs).
Though, I can note that Dhrystone is fairly sensitive to "strcmp()"
speed and similar, where currently the logic for strcmp looks like:
__PDPCLIB_API__ int strcmp(const char *s1, const char *s2)
{
    const unsigned char *p1;
    const unsigned char *p2;
    u64 c0, c1;
    u64 li0, li1, lj;

    p1 = (const unsigned char *)s1;
    p2 = (const unsigned char *)s2;

    /* SWAR zero-byte test: for each byte b,
       ((b | (b + 0x7F)) & 0x80) is 0x80 iff b is nonzero. */
    c0 = 0x8080808080808080ULL;
    c1 = 0x7F7F7F7F7F7F7F7FULL;

    li0 = *(const u64 *)p1;
    li1 = *(const u64 *)p2;
    lj = (li0 | (li0 + c1)) & c0;

    /* Compare 8 bytes at a time while the words match
       and contain no zero byte. */
    while ((li0 == li1) && (lj == c0))
    {
        p1 += 8;  p2 += 8;
        li0 = *(const u64 *)p1;
        li1 = *(const u64 *)p2;
        lj = (li0 | (li0 + c1)) & c0;
    }

    /* If the low 4 bytes still match and are all nonzero, skip them. */
    if ((((u32)li0) == ((u32)li1)) && (((u32)lj) == 0x80808080ULL))
        { p1 += 4; p2 += 4; }

    /* Finish with a byte-at-a-time loop. */
    while (*p1 != '\0')
    {
        if (*p1 < *p2) return (-1);
        else if (*p1 > *p2) return (1);
        p1++;
        p2++;
    }

    if (*p2 == '\0') return (0);
    else return (-1);
}
This is generally somewhat faster than relying solely on a naive byte
loop (like the one at the end).
Here, 'u64' and 'u32' are basically equivalent to 'uint64_t' and
'uint32_t'.
Where, in this case, de-referencing values from pointers is basically
the fastest strategy (granted, apparently some/all of the SiFive chips
would have horridly slow misaligned access; misaligned access is
generally fast in my case). Partly this was motivated by things like
wanting to be able to have LZ4 decoding and similar being "not horridly
slow" (note that copying using byte operations will hit a hard limit of
around 20MB/s at 50MHz, vs around 150MB/s if copying in 64-bit chunks,
or 300 MB/s with 128-bit chunks).
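The shape of the fast path is roughly the following (a sketch, not the
actual library code; the MB/s figures above come from the 50MHz core
itself, not from this snippet; memcpy into a local is just the
strict-aliasing-safe way to express a single 64-bit load/store, which a
compiler will typically emit as one instruction):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Copy n bytes in 64-bit chunks, falling back to bytes for the tail.
   Assumes misaligned 64-bit access is acceptable on the target
   (true of my core; notably not fast on some RISC-V chips). */
static void copy64(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n >= 8)
    {
        uint64_t v;
        memcpy(&v, s, 8);   /* one 64-bit load */
        memcpy(d, &v, 8);   /* one 64-bit store */
        d += 8; s += 8; n -= 8;
    }
    while (n--)             /* byte-at-a-time tail */
        *d++ = *s++;
}
```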
One of the major such APIs thus far is something I am calling "TKGDI",
which sort of vaguely takes inspiration from the Windows GDI and also
VFW (Video For Windows).
It can be used for both standalone full-screen programs, and for
creating windows within a limited GUI style context (not available with
bare-metal booting).
Basically, the program sets up an output display/window by describing
the requested parameters via BITMAPINFOHEADER objects (with the ability
to use these to also query supported graphics modes; in a vaguely
similar way to how codec configuration works in VFW).
The program can then draw into off-screen buffers, and then draw them
into an "HDC" (Handle for Device Context).
It also supports audio output, MIDI commands, input events, etc. Though,
input events are handled more like how it works in X, in that the
program uses a polling loop to request events from the HDC (unlike the
Windows GDI, which had used callback functions for this part).
Other than this, the API design practices also take inspiration from OpenGL.
Otherwise, most of the APIs are POSIX-like.
Note that internally, a lot of the APIs work via something akin to COM
objects, but these typically have a C style API wrapper. In the "not
bare metal" use-case, these objects can generally be used for
"inter-task calls"; where, say, the application front-end, TKGDI
backend, etc, would run in different logical tasks.
Typically, things like system calls were also handled by using context
switches.
Note that, unlike the RISC-V privileged spec, I am not using multiple
sets of registers. Instead, interrupt handlers generally need to
manually save and restore all of the registers every time an interrupt
happens. The execution context inside of interrupt handlers is fairly
limited (they can only interact with physically addressed memory), so in
this case the most practical way to handle syscalls is to use a SYSCALL
interrupt handler primarily to perform a task switch, with the
system-call task running as its own logical process (just effectively
running in "Supervisor Mode").
When the syscall is done, it invokes the SYSCALL handler again to
transfer control back to the caller (or, some other task, as needed).
For COM-style objects, each method effectively invokes a "special
syscall", and the idea is that the SYSCALL interrupt handler will
task-switch to the task corresponding to the object whose method has
been called.
Note that unlike normal tasks, these handler tasks are not actively
scheduled; instead they sit around idle and are only scheduled when one
of their methods is called (they will get the request, finish
dispatching the method, and then transfer control back to the caller),
at which point they go silent until the next time they are used.
Though, as of yet, I haven't really ported a lot of the mechanisms
needed for all this over to RV64 mode.
Note that the application and OS kernel don't need to run in the same
ISA mode, so the original idea was to run the OS kernel as BJX2 code,
but then allow applications in RISC-V.
This got derailed though, mostly by the difficulty of getting usable
output from GCC (in the form of ELF binaries that I can freely load
anywhere within the virtual address space).
Actually, for my own ISA, I was using a modified PE/COFF, but seemingly
GCC doesn't support a RISC-V + PE/COFF option either (with the added
requirement that the binaries still have base relocations and similar).
>> compare performance between my own ISA and RV64G.
>
> Wouldn't RV64GC be a more representative RISC-V ISA to compare against
> given that it (or maybe more specifically RV64GC_Zicsr_Zifencei) is the
> base ISA for most Linux/rich-OS platforms?
>
I still haven't fully implemented support for the 'C' extension, partly
as the instruction formats are kinda hairy, and I just sort of ended up
putting it off.
I can note that I have now experimentally implemented support for
superscalar decoding for RV64, but it seems to be still very buggy.
Luckily, if there is a merit to RISC-V, it is that implementing the
logic needed to check for superscalar with it is fairly straightforward
(and it doesn't blow out resource cost or FPGA timing, so seems
worthwhile to work in this direction).
This currently seems able to gain a roughly 17% increase in performance
over purely scalar operation.
It seems Dhrystone with RV64G will now beat Dhrystone in XG2 mode:
88k vs 79k (superscalar decoding used along with 1-cycle ALU ops).
Though, Doom and similar is still faster with XG2.
Doom sorta boots in the Verilog implementation with superscalar enabled,
but at the moment it seems that in the demo loop, the player immediately
noclips out of the world and then crashes not long after (along with
some other graphical glitches).
Most likely, I will guess something is ending up in Lane 2 that
shouldn't be running in Lane 2 (to get this far, already needed to
special-case SLT and similar, as these are effectively "Lane-1 only" in
this implementation). Note that the current design will not attempt to
make use of Lane 3 in this case.
But, thus far it seems promising.
Less priority on doing similar for my own ISA, as generally the compiler
will flag which instructions can run in parallel, and does a "pretty OK"
job at this part.
>> but can note that the instructions from the 'A' extension do not seem to make an appearance in the GCC output.
>
> As far as I know the compiler will never unilaterally generate A
> instructions - they would normally be manually used in hand crafted
> assembly by the relevant OS related atomicity primitives or linked in
> via some library if necessary.
>
Makes sense.
I was not seeing them in my debugging effort.
>> Supporting F and D was a bit of work, as these had a lot instructions
> which lacked a direct equivalent, and the way the FPU is used is different.
>> ...
>> `-march=rv64g -mabi=lp64`
>
> Seems to me that by passing `-mabi=lp64` rather than, say,
> `-mabi=lp64d`, you're telling the compiler to never generate *any* hard
> float/double instructions (not just `fmadd`) which seems sub-optimal?
This passes the floating point values in integer registers, but
otherwise still uses FPU instructions.
This may not be optimal for floating-point-intensive programs, but
should be OK in most other respects.
But, yeah, it was "-ffp-contract=off" that managed to eliminate the
FMADD style instructions.
I mostly used this ABI as previously I was building for RV64IMA, then
switched over to 'G' once I got enough implemented. This ABI allowed
linking RV64G code against RV64IMA code.
I can note that both Doom and Dhrystone make very little use of
floating-point, as they are almost entirely integer code.