Misc: Relative testing between RV64G and my own ISA.


BGB

Feb 10, 2024, 4:29:26 PM
to RISC-V ISA Dev
I had recently been able to get the "RISC-V Mode" in my own project
working, and this has allowed me to more directly compare performance
between my own ISA and RV64G.

Make of this what you will; I have been trying to keep the tests "fair".

If there are any objections to my testing or analysis, people can state
their opinions. I am not trying to "sell" my project here (and, as I see
it, RV64 actually did pretty well here, all things considered).


In this case, the idea is that the CPU core has decoders for both my own
ISA (BJX2) and for RV64.


These are handled via "Modes":
  BJX2 Baseline: the more original form of my ISA.
    16/32/64/96 bit instructions;
    5-bit register fields.
  XG2 Mode:
    Drops 16-bit instructions;
    Expands register fields to 6 bits.
  RV Mode:
    Switches to a RISC-V decoder;
    Uses a modified register-numbering scheme.
  XG2RV Mode:
    Uses the decoder for my ISA, but with RV Mode's register numbering;
    Idea is that this mode will also use the RISC-V C ABI.

In this case, Mode switches are encoded via function-pointer and
link-register values:
  (63:48): Mode/Status bits
  (47: 1): PC Address
  (    0): Mode change flag.
    If 0, high bits are ignored; target uses the same mode as the caller.
    If 1, high bits are loaded into the correct places.

Note that JAL/JALR in this case will produce tagged pointers with the
MSB set. This does not seem to affect GCC output, but is needed for
correct operation of RV Mode in this implementation.
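As a sketch, the tag layout above could be packed and unpacked like this (the helper names and the mode value are invented for illustration; they are not the actual BJX2 encodings):

```c
#include <stdint.h>

/* Sketch of the tagged function-pointer layout described above:
 * bits 63:48 = mode/status, bits 47:1 = PC, bit 0 = mode-change flag.
 * Helper names and mode-bit values are invented for illustration. */
static inline uint64_t make_tagged_ptr(uint16_t mode, uint64_t pc)
{
    /* Flag bit set: the high bits are honored on an indirect branch. */
    return ((uint64_t)mode << 48) | (pc & 0x0000FFFFFFFFFFFEull) | 1ull;
}

static inline uint64_t tagged_pc(uint64_t p)   { return p & 0x0000FFFFFFFFFFFEull; }
static inline uint16_t tagged_mode(uint64_t p) { return (uint16_t)(p >> 48); }
static inline int      mode_change(uint64_t p) { return (int)(p & 1ull); }
```

With flag bit 0 clear, a branch-to-register would simply ignore the high bits and stay in the caller's mode.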

Theoretically, it could be used to compose programs with multiple ISAs
in use at the same time. The creation of XG2RV was mostly so that my own
ISA could be used as-needed in ASM blobs (with less awkwardness, by
allowing both to use the same register numbering and ABI).


I have beaten on stuff enough to have more or less functioning support
for RV64G. This includes the F and D extensions, though I can note that
instructions from the 'A' extension do not seem to make an appearance in
the GCC output.

Supporting F and D was a bit of work, as these have a lot of
instructions which lack a direct equivalent, and the way the FPU is used
is different.

My ISA was almost a direct superset of the base RV64I ISA, and most
internal features mapped pretty close to 1:1 (apart from needing to add
compare-and-branch, and support for a non-fixed link register).

Otherwise, neither ISA uses conventional ALU status flags, etc.



At present, the CPU core itself does not support superscalar operation
of RV64 code, but for testing I had modeled things as-if it did, albeit
the relative performance difference seems to be fairly modest.

My ISA supports explicit bundling and static scheduling (LIW style),
which if disabled also has a relatively modest impact.

It seems much of what executing instructions in parallel gives,
instruction latency takes away. The differences in these cases seem to
be less than the relative performance differences between the ISAs.


Similarly, the pipeline design in this case is fairly naive and strictly
in-order, and is presently incapable of splitting or fusing instructions.

The design is using a 6R3W register file, which allows up to:
  3x 128-bit inputs and 1x 128-bit output for scalar operation;
  3x 64-bit inputs and 1x 64-bit output for dual-lane operation;
  2x 64-bit inputs and 1x 64-bit output for triple-lane operation.

Though, some smaller cores had used a 4R2W register file:
  2x 128-bit inputs and 1x 128-bit output for scalar operation;
  3x 64-bit inputs and 1x 64-bit output for scalar operation;
  2x 64-bit inputs and 1x 64-bit output for dual-lane operation.

As can be noted, RISC-V only uses 2-input instructions for integer
operations, whereas my own ISA design has 3-input instructions (such as
indexed-store, and integer multiply-accumulate).

Internally, the CPU treats all load/store operations as-if they were
indexed, with the immediate going into the instruction via one of the
register ports. Only 1 immediate field is supported per lane. Though,
store could potentially support a second immediate, because it uses the
3rd lane for its value port.

Here, the lanes are asymmetric:
  Lane 1: Allows all instructions.
    Load/Store, Integer Multiply/Divide, Branch, etc.,
      are only allowed in Lane 1.
    Note that Branch may only be used as a scalar operation.
  Lane 2: Allows many instructions, except:
    Load/Store, Branch, status-modifying instructions, ...,
      are not allowed in Lane 2;
    FPU operations are only allowed if not already present in Lane 1,
      except for certain instructions that may be co-issued.
    Note that Integer Multiply (and Divide) are not allowed.
  Lane 3: Only basic ALU operations.
    ADD/SUB/AND/OR/XOR, MOV, etc.
    At present, does not allow Shift
      (Shift is a comparably heavy operation).
    This lane is rarely used for actual instructions.

Load/Store:
  Supports misaligned loads, 8/16/32/64 bits.
  Supports 128-bit load/store, with 64-bit alignment.
    The 128-bit case is not used by RV64G.
  At present, 3-cycle latency for the generic case;
    may reduce to 2 cycles for aligned 32/64 bit access.


For RV64G:
I had to spend a while trying to figure out how to get GCC to not emit
the FMADD instructions, which are actually slower than using separate
FMUL and FADD on my implementation.


This is, in part, because actual FMA is more expensive and has a higher
latency than separate FMUL and FADD instructions. Note that the FPU also
follows DAZ/FTZ semantics (so it is not strictly IEEE 754 compliant;
full compliance would make the FPU significantly more expensive).

Note that, at present, FDIV, FSQRT, FMA ops, etc, exist solely to
support the F and D extensions (at some cost). Cheaper would have been
to leave them out (dealing with the former in software). With the
current implementation, doing FDIV and FSQRT in software is actually
faster as well (via unrolled Newton-Raphson iteration).

Note that on smaller FPGAs, such as the XC7S25, it is necessary to leave
out the FPU entirely (though, at present, the smallest I am bothering
with is the XC7S50; it is still kind of a pain to shoehorn my CPU design
into an XC7S50, and some features still need to be left out).



Here is an example of my GCC options for testing:
-nostdinc -nostdlib -nostartfiles -fno-builtin -O3 \
-fwrapv -fno-strict-aliasing -fno-inline \
-march=rv64g -mabi=lp64 -mno-fdiv \
-mno-strict-align -ffast-math -ffp-contract=off

First line:
  I am using a custom C library; these are needed for stuff to build.
Second line:
  These options are generally needed for software to work;
  "-fno-inline" is optional, can go either way.
Third/Fourth:
  "-mno-fdiv": the FDIV instruction is very slow;
    software Newton-Raphson is slightly faster.
  "-mno-strict-align":
    mostly so "memcpy()" doesn't use byte moves,
    which are slower on my target.
  "-ffp-contract=off":
    the FMADD ops are slower than separate ops.


Some other options used:
-lvxcore_rv64 -ltkgdi_rv64 -Wl,--emit-relocs \
-ffunction-sections -Wl,-gc-sections \
-Wl,-T,./doomsrc2/elf64rv_dflmod.x

A customized linker script was used, mostly to move the starting address
to 0x01100000, as 0x10000 is not valid memory in my case.

The second-line options make the ".text" section smaller, albeit with
"--emit-relocs" the ELF image is still quite large.


In this case, I can't yet load up the RV64 programs within my OS, so
mostly they are being compiled into standalone images and then booted
into directly.

Note that nothing from the privileged spec is supported, and in this
area my CPU design is very different (very different MMU and
interrupt-handling mechanisms, ...). So, it is not as if something like
Linux will boot on this.


I have generally been comparing against a variant of my own ISA which I
am calling "XG2", which eliminates the use of 16-bit ops, but gains the
ability to directly encode all 64 GPRs for every instruction within a
32-bit instruction word.

For the RV64 Mode, F0..F31 are mapped to R32..R63. In effect, the
existence of FPRs is mostly treated as an artifact of the decoder.

In my own ISA, the floating point operations share the GPR space with
the integer instructions (with no real distinction between these cases;
with SIMD and similar also existing in the GPR space).



So, general observations:
  RV64G binaries have ".text" sections that are around 14% smaller.
  For Doom and Dhrystone:
    Code in my ISA seems to be around 22% faster.
    Say, Doom averaging ~ 18fps for RV64 at 50MHz, ~ 23fps for XG2.
      After a pipeline optimization, jumped to ~ 21 and ~ 27 fps (*1).
    Dhrystone is roughly:
      50k for RV64 (~ 0.57 DMIPS/MHz)
      64k for XG2 (~ 0.73 DMIPS/MHz)
    After pipeline optimization (*1):
      69k for RV64
      72k for XG2 (~ 0.82 DMIPS/MHz)
  For Software Quake:
    It is pretty close to break-even.
    Both perform in single-digit territory at 50MHz.
    Both are more or less using a plain C software renderer,
      though it had been modified to use RGB555 rather than indexed color.
  For GLQuake:
    There is roughly an order of magnitude performance difference;
    on RV64, my OpenGL backend currently seems to be extremely slow.
    XG2: ~ 8-12 fps.
    RV64: ~ 0 fps.


*1: Initially, there was a 2-cycle latency for ALU ops, and a 3-cycle
latency for Load/Store. I had added special-case optimizations to reduce
ALU to 1 cycle for a subset of operations, and Load to 2 cycles for a
subset of operations (natively aligned 32 or 64 bit). This gives around
a 17% speedup for both ISAs, but didn't significantly affect the ratio
between them (for Doom).

This optimization did apparently somewhat reduce the difference ratio
for Dhrystone.

It seems that RV64 is more strongly negatively affected by 2-cycle ALU
ops (and rather needs basic ALU ops to be 1 cycle to perform well).


For Doom, it looks like the biggest performance factor is the lack of
indexed Load/Store operations in RV64.

For Doom, a fair amount of time is spent in R_DrawSpan and R_DrawColumn,
both of which are array-oriented functions. Likewise, ADD and Shift
operations are very high in the ranking of which instructions eat the
most clock-cycles.


In many cases, array oriented code seems to be a notable bottleneck.

Likely, Zba would help, but I haven't yet implemented or tested support
for it. I suspect actual indexed ops would help more, though; at the
least, the performance difference would probably drop with Zba.
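For reference, the hot pattern in question is base-plus-scaled-index addressing; a generic example (not the actual Doom code):

```c
#include <stdint.h>
#include <stddef.h>

/* Generic example of the array-oriented hot pattern: each iteration is
 * a base + scaled-index memory access. With indexed load/store, each
 * access is one memory op; base RV64G needs a separate shift+add (or,
 * with Zba, a fused sh2add) before each one. */
static void remap_span(uint32_t *dst, const uint32_t *pal,
                       const uint8_t *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = pal[idx[i]];   /* two indexed loads + one indexed store */
}
```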


In all, not "that huge" of a performance difference; the performance
differences seem to be smaller than expected (though I did sort of
expect RV64 to win for Dhrystone, given "GCC magic", but the magic
didn't seem to manifest in this case).

In other contexts, GCC tended to significantly outperform other
compilers when it came to Dhrystone score, just, not in this case it seems.

It seems that, quite possibly, if RV64 had indexed load/store, it could
have won here.


There is also a possible difference due to my compiler/ISA using 128-bit
loads/stores to save and restore registers in the function prolog and
epilog, but this seems to be a more minor difference (also, GCC
generally seems to save/restore fewer registers, while at the same time
having less spill/fill).

Well, and some other differences, like my compiler also tending to use
stack canaries by default (it writes a magic number in the prolog, then
verifies the number in the epilog, and if it doesn't match, it triggers
a breakpoint). GCC does not seem to use stack canaries.
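In C terms, a prolog/epilog canary amounts to roughly the following sketch (the magic value and the trap stand-in are invented here; the real mechanism is a compiler-emitted breakpoint):

```c
#include <stdint.h>

/* Illustrative only: what a compiler-inserted stack canary amounts to.
 * The magic value and trap function are made up for this sketch. */
#define CANARY_MAGIC 0x5AFE5AFE5AFE5AFEull

static int canary_tripped = 0;
static void debug_break(void) { canary_tripped = 1; }  /* stands in for a breakpoint trap */

static long demo(long x)
{
    uint64_t canary = CANARY_MAGIC;     /* prolog: write the magic number */
    long buf[4] = { x, x + 1, x + 2, x + 3 };
    long sum = buf[0] + buf[1] + buf[2] + buf[3];
    if (canary != CANARY_MAGIC)         /* epilog: verify; trap on mismatch */
        debug_break();
    return sum;
}
```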



Even if no one else implemented the extension, it would still be good to
have an extension with some encoding reserved for it.

I had informally claimed:
* 00110ss-ooooo-mmmmm-ttt-nnnnn-01-01111 Lt Rn, (Rm, Ro) (BGB: ScIx)
** 00110ss-ttttt-mmmmm-000-nnnnn-01-01111 LB Rn, (Rm, Rt*Sc)
** 00110ss-ttttt-mmmmm-001-nnnnn-01-01111 LH Rn, (Rm, Rt*Sc)
** 00110ss-ttttt-mmmmm-010-nnnnn-01-01111 LW Rn, (Rm, Rt*Sc)
** 00110ss-ttttt-mmmmm-011-nnnnn-01-01111 LD Rn, (Rm, Rt*Sc)
** 00110ss-ttttt-mmmmm-100-nnnnn-01-01111 LBU Rn, (Rm, Rt*Sc)
** 00110ss-ttttt-mmmmm-101-nnnnn-01-01111 LHU Rn, (Rm, Rt*Sc)
** 00110ss-ttttt-mmmmm-110-nnnnn-01-01111 LWU Rn, (Rm, Rt*Sc)
** 00110ss-ttttt-mmmmm-111-nnnnn-01-01111 LX Rn, (Rm, Rt*Sc)

* 00111ss-ooooo-mmmmm-ttt-nnnnn-01-01111 St (Rm, Ro), Rn (BGB: ScIx)
** 00111ss-ttttt-mmmmm-000-nnnnn-01-01111 SB (Rm, Rt*Sc), Rn
** 00111ss-ttttt-mmmmm-001-nnnnn-01-01111 SH (Rm, Rt*Sc), Rn
** 00111ss-ttttt-mmmmm-010-nnnnn-01-01111 SW (Rm, Rt*Sc), Rn
** 00111ss-ttttt-mmmmm-011-nnnnn-01-01111 SD (Rm, Rt*Sc), Rn
** 00111ss-ttttt-mmmmm-100-nnnnn-01-01111 SBU (Rm, Rt*Sc), Rn
** 00111ss-ttttt-mmmmm-101-nnnnn-01-01111 SHU (Rm, Rt*Sc), Rn
** 00111ss-ttttt-mmmmm-110-nnnnn-01-01111 SWU (Rm, Rt*Sc), Rn
** 00111ss-ttttt-mmmmm-111-nnnnn-01-01111 SX (Rm, Rt*Sc), Rn

But, this is assuming no one else has already claimed this space.


A secondary issue seems to be that RISC-V code is frequently loading
constants from memory.
I suspect adding something like an "LI Xd, Imm17s" instruction could
help. One possibility could be using one of the remaining spots in the
OP-IMM-32 space for this.

I had informally claimed:
* 0iiiiii-iiiii-iiiii-110-nnnnn-00-11011 ? SHORI Rn, Imm16u (BGB)
* iiiiiii-iiiii-iiiii-111-nnnnn-00-11011 ? LI Rn, Imm17s (BGB)

Though, at present, nothing has been done here yet.


Note that small integer constants tend to follow a bell curve, and
neither 10 nor 12 bits entirely covers this. A 17-bit constant load
could cover a fair number of cases that fall outside the 12-bit range,
without the expense of a memory load or similar (or LUI+ADD; but, for
whatever reason, GCC appears more inclined to use memory loads than
LUI+ADD).
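For illustration, the encoding question is just a signed-range check per immediate width:

```c
#include <stdint.h>

/* Range check for an N-bit signed immediate field; illustrates how much
 * more of the constant "bell curve" an Imm17s catches versus an Imm12s. */
static int fits_simm(int64_t v, int bits)
{
    int64_t lo = -((int64_t)1 << (bits - 1));
    int64_t hi =  ((int64_t)1 << (bits - 1)) - 1;
    return (v >= lo) && (v <= hi);
}
```

So, e.g., 4096 misses Imm12s (range -2048..2047) but fits Imm17s (range -65536..65535).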



Software Quake was initially slower, until I worked around a few of the
slow cases:
  FADD.S and FMUL.S have a 3-cycle latency (pipelined, 3L 1T);
  FADD.D and FMUL.D have a 6-cycle latency (non-pipelined, 6c);
  FMADD.S/etc. have a 14-cycle latency (non-pipelined);
  FDIV.x currently has a 122-cycle cost.
    It is cheaper to do Newton-Raphson in software.

Now, it is closer to parity.
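For reference, the software Newton-Raphson divide looks roughly like this (a generic sketch, not BGB's actual implementation; the initial-estimate constant is the usual exponent bit-trick, and it assumes a normal, positive divisor):

```c
#include <stdint.h>
#include <string.h>

/* Generic sketch of software divide via a Newton-Raphson reciprocal:
 * y(n+1) = y(n) * (2 - d*y(n)); each step roughly doubles the number of
 * correct bits, so a crude estimate plus a few unrolled steps suffices. */
static double nr_divide(double a, double d)
{
    /* Crude 1/d estimate from the exponent bits (untuned magic constant;
     * valid only for normal, positive d). */
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    bits = 0x7FDE000000000000ull - bits;
    double y;
    memcpy(&y, &bits, sizeof y);

    /* Unrolled refinement steps. */
    y = y * (2.0 - d * y);
    y = y * (2.0 - d * y);
    y = y * (2.0 - d * y);
    y = y * (2.0 - d * y);
    y = y * (2.0 - d * y);
    return a * y;
}
```

On a core where FMUL pipelines at a few cycles, a handful of these multiplies can easily beat a 122-cycle FDIV.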


For GLQuake:
In my ISA, this made significant use of specialized SIMD features.
Falling back to generic C code, with no SIMD ops, has significantly hurt
in this case (and some of the functions with large numbers of local
variables don't exactly help; the existence of functions with 100+ local
variables was a factor in me migrating mostly to 64-GPR configurations).

Most other factors seem to not play in quite as significantly.


Though, as-implemented, for RV64 Mode there is (theoretically) a sort of
SIMD that exists with single-precision ops; as most were implemented on
top of the SIMD operators (in my ISA, the FPU natively uses Binary64 for
all of the scalar FPU stuff, dealing with both Binary32 and Binary16
primarily via format converters, and SIMD ops).

For implementing the F extension, these operators were (mostly) mapped
over to the corresponding SIMD operations.



As for some other differences:
  RV64 has 12-bit immediate and displacement values.
  XG2 has 10-bit immediate and displacement values;
    displacements are signed;
    ALU immediate values are mostly unsigned.

However, the displacements are scaled by the native element size in my
case, so for 64-bit load/store, the instructions have a slightly larger
reach than their RV64 counterparts.

Though, for ALU ops, 10-bit-unsigned misses more often than 12-bit-signed.

Though "Jumbo Prefixes" allow encoding an Imm33/Disp33 case, they are
relatively infrequently used and don't seem to be a significant
contributor to the performance difference.

Similar for predicated instructions.


Branches:
  RV64:
    20-bit: JAL
    12-bit: Compare+Branch
  XG2:
    23-bit: BRA/BSR/BT/BF
    13-bit: CompareZero+Branch
      (1 register, always compared with 0)

In these areas, it is more of a toss-up.


Compiler differences:
GCC seems to do much better than my compiler at register allocation and
at inferring things, as well as being better at avoiding extraneous
instructions (such as needlessly shuffling values between registers).
Despite having fewer GPRs to work with, it also seems to produce fewer
register spills.

However, the instruction scheduling seems to be fairly poor in GCC (it
appears as if no scheduling is performed), so the RV64 code is far more
often stepping on register interlock (RAW hazard) penalties.

My compiler uses a mechanism that checks dependencies between
instructions and attempts to shuffle them into a more favorable order.


ABI is also a bit different:
  RV64:
    Passes up to 8 arguments in A0..A7;
    No spill space;
    Return value in A0;
    Struct call/return: copy via stack.
  XG2:
    Uses R4..R7, R20..R23, R36..R39, R52..R55,
      passing up to 16 arguments in registers;
    Uses a 128-byte spill space
      (may be reduced to 64 bytes in some cases);
    Return value in R2;
    Passes/returns structs by reference
      (for return, the destination address is passed in via R2);
    If under 16 bytes, pass/return via a 128-bit register pair.
    Design loosely inspired by the WinCE SH-4 ABI and the Windows X64 ABI.


But, yeah, this is just what I am seeing here...

Any comments?...

Bruce Hoult

Feb 10, 2024, 11:35:40 PM
to BGB, RISC-V ISA Dev
I don't think it makes sense to compare RISC-V running on a microarchitecture designed for something else. In the best case scenario you might find a "clock cycles per program" difference, but that doesn't take account of differences in achievable clock speed, silicon area (cost per die, number of cores on same size die, etc).

>Dhrystone is roughly:
>     50k for RV64 (~ 0.57 DMIPS/MHz)

That's awful. The very first RISC-V SoC sold, the FE310 microcontroller (HiFive1) in 2016, did 1.6 DMIPS/MHz. And moreover it did 320 MHz in a 180nm process, while comparable Arm cores (Cortex M3) topped out at 180 MHz. So RISC-V uses 10% (say) more instructions, but gets 75% higher clock speed. The end result is way ahead.


--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/23a2366c-ae99-4d62-b9d6-7a3048fcce21%40gmail.com.

BGB

Feb 11, 2024, 12:51:20 AM
to Bruce Hoult, RISC-V ISA Dev
On 2/10/2024 10:35 PM, Bruce Hoult wrote:
> I don't think it makes sense to compare RISC-V running on a
> microarchitecture designed for something else. In the best case scenario
> you might find a "clock cycles per program" difference, but that doesn't
> take account of differences in achievable clock speed, silicon area
> (cost per die, number of cores on same size die, etc).
>

Granted.

I will not make any claim that my CPU core's design is doing a "good"
job at running RISC-V code... (Arguably, yeah, it is pretty terrible.)


Granted, yes, it would make more sense to run RV64 on a CPU actually
designed to run RV64, rather than one where it was sort of bolted on as
an afterthought (and where the fit is rather awkward in some places, and
where the RV64 implementation does not make use of many of the features
built into the core, such as the 128-bit FP-SIMD unit, ...).


But, this is what I am seeing of it, at the moment.

It seemed like an interesting experiment, and maybe worthwhile to at
least mention some of my observations/results in these areas.

Well, after I got it "actually working", after mostly ignoring it for
over a year (mostly because "riscv64-unknown-elf" only seems able to
generate plain static-linked ELF binaries; rather than "something more
useful" for my uses, like FDPIC or similar...).


Where, as noted, I am mostly running an "OS" which currently puts
everything into a large shared virtual address space, so ideally I need
an ABI, like FDPIC, that is designed to be able to load images wherever
and have ".text" and ".data"/".bss" as two independent entities in memory.

Note that the ABI I am using for my own ISA does have separate code and
data areas (albeit using a different mechanism from that used in FDPIC).

Well, in particular: rather than using GOT entries where each
function pointer also has an associated GOT pointer, as in FDPIC, each
program image is keyed to access a lookup table via GP/GBR, from which
it can fetch the new value of this register corresponding to a given
binary (at which point it can be used as a base register to access
global variables).

Though, the main difference here being that FDPIC puts the work on the
caller side, whereas my ABI (PBO) puts this work on the callee side.
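In C-like terms, the split is roughly where the data-base reload happens; a sketch with invented names (the real mechanisms live in registers and prolog code, not function arguments):

```c
#include <stddef.h>

/* Sketch of caller-side (FDPIC-like) vs callee-side (PBO-like) binding
 * of the per-module global-data base. All names are invented. */

typedef struct {
    void (*fn)(void *gp);   /* code address */
    void *gp;               /* that module's data base */
} fdpic_desc;

/* FDPIC-style: the CALLER loads the callee's data base from the
 * function descriptor and supplies it along with the call. */
static void call_fdpic(const fdpic_desc *d)
{
    d->fn(d->gp);
}

/* PBO-style: the caller just jumps; the CALLEE's prolog fetches its own
 * data base from a table reachable via the current GP/GBR. */
static void *module_table[4];

static void *pbo_prolog(int module_key)
{
    return module_table[module_key];   /* new base for this binary */
}
```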


Though, as a result of this, running RISC-V code was mostly limited to
directly booting ELF images (rather than being able to launch programs
from a shell or similar).



>>Dhrystone is roughly:
>>     50k for RV64 (~ 0.57 DMIPS/MHz)
>
> That's awful. The very first RISC-V SoC sold, the FE310 microcontroller
> (HiFive1) in 2016, did 1.6 DMIPS/MHz. And moreover it did 320 MHz in a
> 180nm process, while comparable Arm cores (Cortex M3) topped out at 180
> MHz. So RISC-V uses 10% (say) more instructions, but gets 75% higher
> clock speed. The end result is way ahead.
>

Admittedly, I did expect better results than this...


But, this initial result was showing considerable hurt due to having
2-cycle latency on ADD, and increased to 0.79 DMIPS/MHz with ADD reduced
to a 1-cycle latency (though, this mechanism currently only handles
32-bit ADD, and 64-bit ADD where the values fall into signed-32-bit range).

Some initial results were a lot worse, but this was due to it running
versions of "strcmp()" and "strcpy()" that worked 1 byte at a time (I
had then modified these to work 8 bytes at a time). Partly this is
because Dhrystone also seems fairly sensitive to "strcmp()" performance.


Note that this is with scalar execution...

It looks like Dhrystone would increase to around 1.0 DMIPS/MHz if it was
running with 2-wide superscalar (in combination with the 1-cycle ALU
latency).

However, at present, I don't have a superscalar decoder for RISC-V (and
the relative cost of superscalar is why my ISA design had originally
gone with LIW).


But, generally hard to pull off 1.0 or above, if limiting everything to
1-wide / scalar execution.


Similarly, as noted, the core is strictly in-order, and will stall the
pipeline whenever there is an L1 miss or RAW hazard or similar.


But, if needed, could try to implement a superscalar fetch/decode
process for RV64, which could at least give modest gains in terms of
performance (the bigger uncertainty is whether I can pull it off without
too significant of an impact of FPGA timing constraints or LUT cost).



I don't know how my core would perform if implemented as an ASIC. As
noted, it passes timing at 50 MHz on the FPGAs I am using (XC7S50,
XC7A100T, and XC7A200T, generally all at a -1 speed grade).

Most "actual hardware" testing had been on a "Nexys A7" and a QMTECH
board (was ~ $99 on AliExpress, at the time; The "Nexys A7" costs a bit
more...). Other than this, had tested on an "Arty S7 50" board.

Had gotten a smaller configuration onto a CMod-S7 board, but this was
too limited, and something like an RV32I core or similar would make a
lot more sense for this class of FPGA board (also, unlike the CMod-A7,
they don't have any external RAM chip, whereas the CMod-A7 had a 512K
QSPI SRAM chip... Only having the FPGA's internal Block-RAM available
does somewhat limit things).



Had experimentally got it to 75 MHz, but the compromises needed to pull
this off have tended to hurt more than what is gained by running at a
higher clock-speed.

Granted, it is possible that someone could get a 100MHz RV64 core on an
FPGA, but this would likely be limited to a scalar design.

If designing specifically for this, it should also be possible to do it
at a lower LUT cost.


But, I guess, one downside of trying to do a plain RV64 core, is that
lots of other people have already done "fairly competent" RV64 cores...

Robert Finch

Feb 11, 2024, 7:15:46 AM
to RISC-V ISA Dev, BGB, RISC-V ISA Dev, Bruce Hoult
>>Dhrystone is roughly:
>>     50k for RV64 (~ 0.57 DMIPS/MHz)

Is that 50k a typo? Is not 0.57 MIPS 570k?

>
> That's awful. The very first RISC-V SoC sold, the FE310 microcontroller
> (HiFive1) in 2016, did 1.6 DMIPS/MHz. And moreover it did 320 MHz in a
> 180nm process, while comparable Arm cores (Cortex M3) topped out at 180
> MHz. So RISC-V uses 10% (say) more instructions, but gets 75% higher
> clock speed. The end result is way ahead.
>


Tommy Murphy

Feb 11, 2024, 8:03:34 AM
to BGB, RISC-V ISA Dev
I'm not totally clear on the rationale behind this comparison or what the ultimate aim is?

> this has allowed me to more directly compare performance between my own ISA and RV64G. 
> ...
> In this case, the idea is that the CPU core has decoders for both my own ISA (BJX2) and for RV64.
> ...
> There is also a possible difference due to my compiler/ISA using ...
> ...
> I am using a custom C library

Surely you're not comparing just ISAs but some combination of ISAs, ISA microarchitecture implementations on a particular FPGA (and, from what Bruce says, a possibly sub-optimal RV64G implementation), different compilers (GCC and your own BJX2, or is it XG2?, compiler), different C libraries etc.? Given the number of variables it's difficult to see what general conclusions can be drawn. But maybe I've misunderstood something here?

> compare performance between my own ISA and RV64G.

Wouldn't RV64GC be a more representative RISC-V ISA to compare against given that it (or maybe more specifically RV64GC_Zicsr_Zifencei) is the base ISA for most Linux/rich-OS platforms?

>  but can note that the instructions from the 'A' extension do not seem to make an appearance in the GCC output.

As far as I know the compiler will never unilaterally generate A instructions - they would normally be manually used in hand crafted assembly by the relevant OS related atomicity primitives or linked in via some library if necessary.

> Supporting F and D was a bit of work, as these had a lot instructions which lacked a direct equivalent, and the way the FPU is used is different.
> ...
> `-march=rv64g -mabi=lp64`

Seems to me that by passing `-mabi=lp64` rather than, say, `-mabi=lp64d`, you're telling the compiler to never generate *any* hard float/double instructions (not just `fmadd`) which seems sub-optimal?

Bruce Hoult

Feb 11, 2024, 9:04:14 AM
to Tommy Murphy, BGB, RISC-V ISA Dev
> > `-march=rv64g -mabi=lp64`
>
> Seems to me that by passing `-mabi=lp64` rather than, say, `-mabi=lp64d`, you're telling the compiler to never generate *any* hard float/double instructions (not just `fmadd`) which seems sub-optimal?

No, it just means a soft float ABI, passing FP values in integer registers, but code compiled with rv64g will copy the values to FP registers to do arithmetic on them. 


Tommy Murphy

Feb 11, 2024, 9:17:45 AM
to Bruce Hoult, BGB, RISC-V ISA Dev
> No, it just means a soft float ABI, passing FP values in integer registers, but code compiled with rv64g will copy the values to FP registers to do arithmetic on them. 

Thanks Bruce - I guess that `-march=rv64ima` (i.e. architecture with no F/D extensions) would do what I was saying so?
But isn't `-mabi=lp64` in the presence of the F/D extensions  likely to generate sub-optimal floating point code?
If the target is RV64G (aka RV64IMAFD) and thus the FP registers are available then isn't it better to use them?
Why would one not?

BGB

Feb 11, 2024, 2:26:43 PM
to Tommy Murphy, RISC-V ISA Dev
On 2/11/2024 7:03 AM, Tommy Murphy wrote:
> I'm not totally clear on the rationale behind this comparison or what
> the ultimate aim is?
>
>> this has allowed me to more directly compare performance between my own ISA and RV64G.
>> ...
>> In this case, the idea is that the CPU core has decoders for both my
> own ISA (BJX2) and for RV64.
>> ...
>> There is also a possible difference due to my compiler/ISA using ...
>> ...
>> I am using a custom C library
>
> Surely you're not comparing just ISAs but some combination of ISAs, ISA
> microarchitecture implementations on a particular FPGA (and, from what
> Bruce says, a possibly sub-optimal RV64G implementation), different
> compilers (GCC and your own BJX2, or is it XG2?, compiler), different C
> libraries etc.? Given the number of variables it's difficult to see what
> general conclusions can be drawn. But maybe I've misunderstood something
> here?
>

It is one CPU core design that runs both my own ISA and RV64G.
The Boot ROM will boot into the ISA Mode corresponding to the binary
that was loaded.

In both cases, I am using the same C library:
A heavily modified version of PDPCLIB from Paul Edwards...
As-in, I ended up rewriting probably half of it.

If booting bare metal, the C library also includes most of the "OS"
functionality, like low-level memory management, hardware interfacing,
and filesystem support.


I didn't really want to do performance comparisons until I got RV64
running on my stuff, since comparing across two different CPUs and C
libraries doesn't exactly lead to accurate results.



I ran the comparisons while trying, at least, to get things
semi-equivalent.


But, yeah, my own compiler is BGBCC.
I will note that currently it can't yet compile RISC-V code (I had
started working on it, but I can note that the C ABIs are rather
different).

So:
  BJX2 is the name of the overall ISA.
  XG2 is a sub-variant of the ISA.
    It gets slightly better performance, but worse code density.
    Instruction sizes are 32/64/96 bits,
      though 32-bit is by far the most common instruction size.
  BGBCC is the compiler I am using.
    It is basically full-custom, but has also been around for a while.
      I'll resist going into the history of BGBCC.
    When it was revived, it was first targeted at SuperH/SH-4.
      In effect, my ISA design is distantly related to SH-4,
      even if now pretty much unrecognizable as such.


But, in terms of the compiler mismatch, I don't think this negatively
affects RISC-V. Code generation in my compiler is "not particularly
good" at times, and GCC is generally much better at clever optimizations.

So, I don't feel comparing BGBCC and GCC output here is particularly
unfair to RV64. If anything, I would expect BGBCC to lose due to
lackluster code generation.


Like, for example, register allocation strategy.
My compiler has two basic ways of dealing with registers:
  Statically assign a variable to a register for the whole function;
  Dynamically assign a variable within a single basic block
    (any such variable is loaded as needed,
    then spilled at the end of the basic block).
GCC seems to assign variables to registers point-by-point,
with the registers flowing from one basic block to another.

Though, this was a partial incentive for my ISA ending up with 64
registers: it allows a larger number of functions to use a "statically
assign every variable to a register" strategy, which results in less
spill and fill.

Also, GCC has other clever abilities:
  Ability to propagate constant values through variables;
  Ability to inline functions and similar
    (nevermind that I disabled this in this case);
  ...




Note that the C library stuff also includes all the "OS level" APIs I
was using for the hardware interfacing (I had been working on moving
away from programs interfacing with the hardware directly, and instead
going through APIs).

Though, I can note that Dhrystone is fairly sensitive to "strcmp()" speed
and similar, where currently the logic for strcmp looks like:
__PDPCLIB_API__ int strcmp(const char *s1, const char *s2)
{
    const unsigned char *p1;
    const unsigned char *p2;
    u64 c0, c1;
    u64 li0, li1, lj;

    p1 = (const unsigned char *)s1;
    p2 = (const unsigned char *)s2;

    c0 = 0x8080808080808080ULL;
    c1 = 0x7F7F7F7F7F7F7F7FULL;

    li0 = *(const u64 *)p1;
    li1 = *(const u64 *)p2;
    lj = (li0 | (li0 + c1)) & c0;
    while ((li0 == li1) && (lj == c0))
    {
        p1 += 8; p2 += 8;
        li0 = *(const u64 *)p1;
        li1 = *(const u64 *)p2;
        lj = (li0 | (li0 + c1)) & c0;
    }

    if ((((u32)li0) == ((u32)li1)) && (((u32)lj) == 0x80808080ULL))
        { p1 += 4; p2 += 4; }

    while (*p1 != '\0')
    {
        if (*p1 < *p2) return (-1);
        else if (*p1 > *p2) return (1);
        p1++;
        p2++;
    }
    if (*p2 == '\0') return (0);
    else return (-1);
}

As this was generally somewhat faster than using solely a naive
byte-loop (like at the end).

Where, 'u64' is basically equivalent to 'uint64_t'.
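The mask can be sanity-checked in isolation. A minimal sketch (the helper names here are mine, not from the actual library); one caveat I'm fairly confident of: the `(b | (b + 0x7F))` form is exact for ASCII input, but the add can carry into the next byte's lane, so a zero byte immediately following a byte >= 0x81 can be missed, which the classic carry-free mask avoids:

```c
#include <stdint.h>

/* The mask from the strcmp above: for each byte b, (b | (b + 0x7F))
   has bit 7 set iff b != 0 (for b <= 0x80; above that the add carries
   into the next lane). lj == c0 means "no zero byte seen". */
static int has_zero_byte_post(uint64_t w)
{
    const uint64_t c0 = 0x8080808080808080ULL;
    const uint64_t c1 = 0x7F7F7F7F7F7F7F7FULL;
    return ((w | (w + c1)) & c0) != c0;
}

/* The classic carry-free variant: nonzero iff some byte of w is zero. */
static int has_zero_byte_classic(uint64_t w)
{
    return ((w - 0x0101010101010101ULL) & ~w & 0x8080808080808080ULL) != 0;
}
```

For pure-ASCII strings (as in Dhrystone) the two agree; they differ only on inputs like `0x61626364656600FF`, where the low byte's carry hides the zero from the first form.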

Where, in this case, de-referencing values from pointers is basically
the fastest strategy (granted, apparently some/all of the SiFive chips
would have horridly slow misaligned access; misaligned access is
generally fast in my case). Partly this was motivated by things like
wanting to be able to have LZ4 decoding and similar being "not horridly
slow" (note that copying using byte operations will hit a hard limit of
around 20MB/s at 50MHz, vs around 150MB/s if copying in 64-bit chunks,
or 300 MB/s with 128-bit chunks).
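A minimal sketch of the kind of word-at-a-time copy meant here (my own illustration, not the actual LZ4 code); the `memcpy` into a `uint64_t` temporary keeps it legal C even for misaligned pointers, and compilers typically lower it to plain 64-bit loads/stores on targets where misaligned access is cheap:

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Copy n bytes in 64-bit chunks with a byte-loop tail. */
static void copy64(void *dst, const void *src, size_t n)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;
    while (n >= 8) {
        uint64_t t;
        memcpy(&t, s, 8);   /* one 64-bit load  */
        memcpy(d, &t, 8);   /* one 64-bit store */
        d += 8; s += 8; n -= 8;
    }
    while (n--)             /* byte tail */
        *d++ = *s++;
}
```

Note this assumes non-overlapping buffers; an actual LZ4 match copy with an overlap distance under 8 would still need the byte loop (or more careful chunking).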



One of the major ones thus far is something I am calling "TKGDI", which
sort of vaguely takes inspiration from the Windows GDI and also VFW
(Video For Windows).

It can be used for both standalone full-screen programs, and for
creating windows within a limited GUI style context (not available with
bare-metal booting).
Basically, the program sets up an output display/window by describing
the requested parameters via BITMAPINFOHEADER objects (with the ability
to use these to also query supported graphics modes; in a vaguely
similar way to how codec configuration works in VFW).

The program can then draw into off-screen buffers, and then draw them
into an "HDC" (Handle for Device Context).

It also supports audio output, MIDI commands, input events, etc. Though,
input events are handled more like in X, in that the program uses a
polling loop to request events from the HDC (unlike the Windows GDI,
which had used callback functions for this part).

Other than this, the API design practices also take inspiration from OpenGL.


Otherwise, most of the API's are POSIX like.

Note that internally, a lot of the APIs work via something akin to COM
objects, but these typically have a C style API wrapper. In the "not
bare metal" use-case, these objects can generally be used for
"inter-task calls"; where, say, the application front-end, TKGDI
backend, etc, would run in different logical tasks.


Typically, things like system calls were also handled by using context
switches.

Note that, unlike the RISC-V privileged spec, I am not using multiple
sets of registers. Instead, interrupt handlers generally need to
manually save and restore all of the registers every time an interrupt
happens. The execution context inside of interrupt handlers is fairly
limited (they can only interact with physically addressed memory), so in
this case the most practical way to handle syscalls is to use a SYSCALL
interrupt handler primarily to perform a task switch, with the
system-call task running as its own logical process (just effectively
running in "Supervisor Mode").

When the syscall is done, it invokes the SYSCALL handler again to
transfer control back to the caller (or, some other task, as needed).
For COM-style objects, each method effectively invokes a "special
syscall", and the idea is that the SYSCALL interrupt handler will
task-switch to the task corresponding to the object whose method has
been called.
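As a rough illustration of the COM-style-object-behind-a-C-wrapper pattern (all names here are hypothetical, and the syscall/task-switch trampoline is elided, so this shows only the in-process shape):

```c
#include <stddef.h>

/* Hypothetical sketch, not the actual TKGDI/OS interfaces. In the real
   system each vtable slot would trampoline through the SYSCALL
   mechanism into the task owning the object, not be a direct call. */
typedef struct tk_object_s tk_object;

typedef struct tk_vtable_s {
    int (*AddRef)(tk_object *self);
    int (*Release)(tk_object *self);
} tk_vtable;

struct tk_object_s {
    const tk_vtable *vt;
    int refcount;
};

/* Trivial in-process implementation (no task switch). */
static int tk_addref_impl(tk_object *self)  { return ++self->refcount; }
static int tk_release_impl(tk_object *self) { return --self->refcount; }

static const tk_vtable tk_vt = { tk_addref_impl, tk_release_impl };

/* C-style API wrappers, so callers never touch the vtable directly. */
static int tkAddRef(tk_object *o)  { return o->vt->AddRef(o); }
static int tkRelease(tk_object *o) { return o->vt->Release(o); }
```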

Note that unlike normal tasks, these handler tasks are not actively
scheduled, but instead sit around idle, being scheduled only when one
of their methods is called (they will get the request, finish
dispatching the method, and then transfer control back to the caller),
at which point they go silent until the next time they are used.

Though, as of yet, I haven't really ported a lot of the mechanisms
needed for all this over to RV64 mode.


Note that the application and OS kernel don't need to run in the same
ISA mode, so the original idea was to run the OS kernel as BJX2 code,
but then allow applications in RISC-V.

This got derailed though, mostly by the difficulty of getting usable
output from GCC (in the form of ELF binaries that I can freely load
anywhere within the virtual address space).

Actually, for my own ISA, I was using a modified PE/COFF, but seemingly
GCC doesn't support a RISC-V + PE/COFF option either (with the added
requirement that the binaries still have base relocations and similar).




>> compare performance between my own ISA and RV64G.
>
> Wouldn't RV64GC be a more representative RISC-V ISA to compare against
> given that it (or maybe more specifically RV64GC_Zicsr_Zifencei) is the
> base ISA for most Linux/rich-OS platforms?
>

I still haven't fully implemented support for the 'C' extension, partly
because the instruction formats are kinda hairy, so I just sort of ended
up putting it off.


I can note that I have now experimentally implemented support for
superscalar decoding for RV64, but it still seems to be very buggy.


Luckily, if there is a merit to RISC-V, it is that implementing the
logic needed to check for superscalar with it is fairly straightforward
(and it doesn't blow out resource cost or FPGA timing, so seems
worthwhile to work in this direction).


This currently seems able to gain a roughly 17% increase in performance
over purely scalar operation.
It seems Dhrystone with RV64G will beat Dhrystone in XG2 mode.
88k vs 79k (used along with 1-cycle ALU ops)
Though, Doom and similar is still faster with XG2.


Doom sorta boots in the Verilog implementation with superscalar enabled,
but at the moment it seems that, in the demo loop, the player immediately
noclips out of the world and then crashes not long after (along with
some other graphical glitches).

Most likely, I will guess something is ending up in Lane 2 that
shouldn't be running in Lane 2 (to get this far, already needed to
special-case SLT and similar, as these are effectively "Lane-1 only" in
this implementation). Note that the current design will not attempt to
make use of Lane 3 in this case.

But, thus far it seems promising.


Less priority on doing similar for my own ISA, as generally the compiler
will flag which instructions can run in parallel, and does a "pretty OK"
job at this part.


>>  but can note that the instructions from the 'A' extension do not seem to make an appearance in the GCC output.
>
> As far as I know the compiler will never unilaterally generate A
> instructions - they would normally be manually used in hand crafted
> assembly by the relevant OS related atomicity primitives or linked in
> via some library if necessary.
>

Makes sense.
I was not seeing them in my debugging effort.


>> Supporting F and D was a bit of work, as these had a lot instructions
> which lacked a direct equivalent, and the way the FPU is used is different.
>> ...
>> `-march=rv64g -mabi=lp64`
>
> Seems to me that by passing `-mabi=lp64` rather than, say,
> `-mabi=lp64d`, you're telling the compiler to never generate *any* hard
> float/double instructions (not just `fmadd`) which seems sub-optimal?

This passes the floating point values in integer registers, but
otherwise still uses FPU instructions.

This may not be optimal for floating-point-intensive programs, but
should be OK in most other respects.

But, yeah, it was "-ffp-contract=off" that managed to eliminate the
FMADD style instructions.



I mostly used this ABI because I was previously building for RV64IMA,
then switched over to 'G' once I got enough implemented. This ABI
allowed linking RV64G code against RV64IMA code.

I can note that both Doom and Dhrystone make very little use of
floating-point, as they are almost entirely integer code.


BGB

unread,
Feb 11, 2024, 3:20:58 PMFeb 11
to Robert Finch, RISC-V ISA Dev
On 2/11/2024 6:15 AM, Robert Finch wrote:
>>>Dhrystone is roughly:
>>>     50k for RV64 (~ 0.57 DMIPS/MHz)
>
> Is that 50k a typo? Is not 0.57 MIPS 570k?

DMIPS/MHz = (Score/1757)/MHz
So, here, 50k does give 0.57 at 50MHz.

And, in this case, I am running at 50MHz, not 600MHz.
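As a worked check of that formula (1757 being the VAX 11/780 reference score, i.e. 1 DMIPS):

```c
/* DMIPS/MHz = (Dhrystones-per-second / 1757) / MHz */
static double dmips_per_mhz(double score, double mhz)
{
    return (score / 1757.0) / mhz;
}
```

So 50000 Dhrystones/s at 50MHz gives about 0.569 DMIPS/MHz, matching the ~0.57 figure; and 88k at 50MHz lands at roughly 1.0 DMIPS/MHz.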


For contrast, can note that on my Ryzen, previously got Dhrystone scores
of, say, IIRC:
12M via MSVC (1.85 DMIPS/MHz);
40M via GCC (6.15 DMIPS/MHz).
So, same hardware, same OS, just different compiler.

So, GCC was a fair margin ahead of both MSVC and Clang (though, Clang
was also a fair bit ahead of MSVC here, around 28M IIRC), which had made
me a little suspicious of Dhrystone scores with GCC (more so when a
similar delta was not seen in other things I was using as benchmarks).


Though, in the case of the RV64 testing, the GCC Dhrystone score seems
to be more in the area of what I would expect to see (and more
consistent with my other tests).

But, at the same time, I wouldn't have been terribly surprised to see
150k either. In the past I had noted that, if scaled based on MHz,
performance from BGBCC wasn't too far off from a linear extrapolation of
MSVC numbers.

So, it wasn't entirely implausible that whatever caused such a
difference on x86-64, to also cause such a difference for RV64, but it
seems it didn't happen in this case.


Granted, this is noting that one generally needs to compile Dhrystone in
GCC with "-fno-inline" for more consistent results, otherwise it is
"stupidly fast"...

As I see it, this is also not unfair for ISA comparison, as basically
neither compiler will use inlining in this case (GCC, because it is
disabled; mine, because it doesn't support function inlining, *1, ...).


*1: Though, it does still support the "ye olde" trick of sticking big
blobs of code into macros. Generally this is not a good strategy though,
as it can result in unreasonable levels of code bloat.

Also doesn't support loop unrolling either.

It is actually fairly close to a direct linear translation of the
original C code (well, albeit with the funkiness of going through a
stack-machine stage as part of the IR):
C -> PP-C -> AST -> Stack-Machine IR.
Stack-IR -> 3AC -> machine code.

No separate assembler and linker stage in my case, rather:
The Stack-IR takes the place of object files and linking;
ASM blobs are actually parsed and fed back into the compiler backend.
ASM is passed though the stack IR stage as ASCII text blobs.

For those familiar with the Java JVM or .NET CIL (formerly MSIL), it is
a similar sort of idea, just in my case, was mostly using it to express
C rather than a Java-like language...

Note that, like the former (and unlike, say, WASM), control-flow in the
IR stage is primarily based on an "if-goto" style mechanism (any
higher-level control-flow constructs first being translated and
expressed in terms of if-goto).
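The kind of lowering meant here can be shown directly in C, with a structured loop next to its if-goto equivalent (a sketch of the idea, not actual BGBCC output):

```c
/* The same loop twice: once structured, once in the if-goto form that
   higher-level control flow gets reduced to in the IR. */
static int sum_structured(int n)
{
    int s = 0, i = 0;
    while (i < n) { s += i; i++; }
    return s;
}

static int sum_if_goto(int n)
{
    int s = 0, i = 0;
L_head:
    if (!(i < n)) goto L_exit;  /* if-goto: branch out when test fails */
    s += i;
    i++;
    goto L_head;
L_exit:
    return s;
}
```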


Had evaluated other options, but in general, a stack-machine abstraction
seems to work pretty well as an IR / IL stage (though, is translated
into 3AC before the backend actually gets to work on the code-generation
parts; which eliminates most of the spurious stack manipulations that
are added via the RPN stage).

Though, some recent optimizations did involve more complex pattern
recognition in the ASTs, and needing to get more clever with the
ordering of the stack operations, as the original logic was prone to
generating a lot of otherwise unnecessary temporary variables in the 3AC
stage (which in turn may manifest as additional register spill/fill and
MOV instructions and similar).

Bruce Hoult

unread,
Feb 12, 2024, 12:24:59 AMFeb 12
to Tommy Murphy, BGB, RISC-V ISA Dev
On Mon, Feb 12, 2024 at 3:17 AM Tommy Murphy <tommy_...@hotmail.com> wrote:
> No, it just means a soft float ABI, passing FP values in integer registers, but code compiled with rv64g will copy the values to FP registers to do arithmetic on them. 

Thanks Bruce - I guess that `-march=rv64ima` (i.e. architecture with no F/D extensions) would do what I was saying so?
But isn't `-mabi=lp64` in the presence of the F/D extensions  likely to generate sub-optimal floating point code?

Sure, it's slower than using a float ABI. But it's a heck of a lot faster than emulating FP using integer instructions. If you have a code base written for machines with an FPU and then upgrade to one using an FPU then it can be easier to keep using the same argument passing ABI even while using FP instructions.

 
If the target is RV64G (aka RV64IMAFD) and thus the FP registers are available then isn't it better to use them?
Why would one not?

--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.

BGB

unread,
Feb 12, 2024, 3:42:30 AMFeb 12
to Bruce Hoult, Tommy Murphy, RISC-V ISA Dev
On 2/11/2024 11:24 PM, Bruce Hoult wrote:
> On Mon, Feb 12, 2024 at 3:17 AM Tommy Murphy <tommy_...@hotmail.com
> <mailto:tommy_...@hotmail.com>> wrote:
>
> > No, it just means a soft float ABI, passing FP values in integer
> registers, but code compiled with rv64g will copy the values to FP
> registers to do arithmetic on them.
>
> Thanks Bruce - I guess that `-march=rv64ima` (i.e. architecture with
> no F/D extensions) would do what I was saying so?
> But isn't `-mabi=lp64` in the presence of the F/D extensions  likely
> to generate sub-optimal floating point code?
>
>
> Sure, it's slower than using a float ABI. But it's a heck of a lot
> faster than emulating FP using integer instructions. If you have a code
> base written for machines with an FPU and then upgrade to one using an
> FPU then it can be easier to keep using the same argument passing ABI
> even while using FP instructions.
>

Yeah. The slowdown is small, particularly when not testing with
particularly FPU intensive workloads.

In this case, I can note that I started this sub-project with RV64IMA,
and this ABI allows going more mix-and-match as needed.


However, I can also note that Quake is basically unusable with FP
emulation, which basically made F/D support more of a priority (so had
to beat this into working).


Also now got superscalar fetch/decode and the branch-predictor seemingly
working for RV64 code, so my initial results are not necessarily the
final results.

I guess, a logical next step may be to try to get the Zba extension
working, ... (or, maybe make another go at facing off against the 'C'
extension).


Well, where the 'C' extension does kinda seem like someone looked at
Thumb, with its respective level of dog-chew and bit-mashing, and was
like "hold my beer".

Would have been happier with an encoding more like SuperH's, where
instructions had fairly consistent layout patterns:
ZZZZ-nnnn-mmmm-ZZZZ
ZZZZ-nnnn-ZZZZ-ZZZZ
ZZZZ-nnnn-iiii-iiii
ZZZZ-ZZZZ-iiii-iiii
But, modified to fit with the 32-bit instructions, say:
ZZZZ-mmmm-nnnn-ZZZZ
ZZZZ-ZZZZ-nnnn-ZZZZ
iiii-iiii-nnnn-ZZZZ
iiii-iiii-nnnn-ZZZZ
Where:
...-xxxx-xxxx-xx11 (32-bit ops go here)
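As a sketch of how cheaply such layouts decode (purely illustrative field extraction for the hypothetical patterns above, written with bit 15 as the leftmost digit; this is not an implemented encoding):

```c
#include <stdint.h>

/* Field extraction for the 16-bit layouts sketched above,
   e.g. ZZZZ-mmmm-nnnn-ZZZZ and iiii-iiii-nnnn-ZZZZ. */
static int is_32bit_op(uint16_t w)   { return (w & 0x3) == 0x3; } /* ...-xx11 */
static unsigned field_n(uint16_t w)  { return (w >> 4) & 0xF; }   /* nnnn */
static unsigned field_m(uint16_t w)  { return (w >> 8) & 0xF; }   /* mmmm */
static unsigned field_i8(uint16_t w) { return (w >> 8) & 0xFF; }  /* iiii-iiii */
```

The point being that the register and immediate fields sit in fixed positions for every form, unlike the scattered field placements of RVC.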

I guess people can disagree with me on this one if they want...



Nevermind whether or not this sub-project actually makes sense.

But, say, RV64 was much closer to my own ISA design than some of the
other possibilities (so, apart from all the stuff I needed to add to
support RV64G; it was still not too far outside the scope of "throw an
alternate instruction decoder at the problem").


Like, for example, neither ISA has condition-codes, neither has
auto-increment addressing, etc...

Like, can draw strong contrast with something like AArch64 or POWER,
which would have been a whole lot more involved than "throw an alternate
decoder at it and try to patch up the holes...".


Well, nevermind that the interrupt handling mechanism is almost entirely
different.

Like, in my case, interrupt handling is more minimal:
  Copy PC to SPC;
  Copy low 32-bits of SR to high 32-bits of EXSR;
  Copy CPU mode bits from high bits of VBR, set interrupt mode;
    Implicitly causes SP and SSP to swap places in the decoder.
  Perform a computed branch:
    { VBR[47:6], Index[2:0], 3'b000 };

Though, spec requires 256 byte alignment, in case the table gets bigger
(it branches into the table, which then branches to the corresponding
ISR entry point).

Then, 'RTE' reverses this process:
  Copy high bits of EXSR into low bits of SR;
    Implicitly causes SP and SSP to unswap.
  Branch to SPC.
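The entry/return sequence can be modeled in a few lines of C (a toy model with simplified field handling; the mode-bit copy from VBR is elided, and nothing here is the actual BJX2 spec):

```c
#include <stdint.h>

typedef struct {
    uint64_t pc, spc;   /* program counter, saved PC */
    uint64_t sr, exsr;  /* status register, exception-saved SR */
    uint64_t vbr;       /* vector base register */
} cpu_state;

static void enter_interrupt(cpu_state *c, unsigned index)
{
    c->spc  = c->pc;                                     /* PC -> SPC */
    c->exsr = (c->exsr & 0xFFFFFFFFULL) | (c->sr << 32); /* SR[31:0] -> EXSR[63:32] */
    /* computed branch: { VBR[47:6], Index[2:0], 3'b000 } */
    c->pc   = (c->vbr & 0x0000FFFFFFFFFFC0ULL) | ((uint64_t)(index & 7) << 3);
}

static void rte(cpu_state *c)
{
    c->sr = (c->sr & ~0xFFFFFFFFULL) | (c->exsr >> 32);  /* EXSR[63:32] -> SR[31:0] */
    c->pc = c->spc;                                      /* branch to SPC */
}
```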


ISR prolog and epilog then doing a bit of a dance to get all of the
registers saved and restored (IOW: a long series of load/store
instructions, with some wonk to make sure all of the registers are saved
and restored correctly, from a starting point of no free registers to
work with).

Granted, this is not the highest performance design possible, but was at
least relatively cheap for the hardware.


Something like the privileged-spec's description of having 3 copies of
all of the registers seems like a bit of a cause for concern...

Well, along with other fun details:
Only has direct-mapped caches (say: 32K L1D, 16K L1I);
Software refilled TLB (where the TLB is 4-way set-associative);
...


Well, counter-balanced with a semi-expensive SIMD unit that can
theoretically do 200 MFLOP at 50MHz.

Though, limited mostly by needing other non-SIMD instructions in the
mix. Had experimented with combined op-and-shuffle, which in some
use-cases can reduce costs associated with burning extra cycles on
shuffle ops.

If working with 4x Binary16 vectors, does at least allow putting SIMD
ops in parallel with memory load and store operations (for fetching the
vector data), etc. Though, doesn't work with 4x Binary32, as the 128-bit
SIMD ops eat all 3 lanes (and thus can't be co-issued with memory ops).

...


> If the target is RV64G (aka RV64IMAFD) and thus the FP registers are
> available then isn't it better to use them?
> Why would one not?
>

Tommy Murphy

unread,
Feb 12, 2024, 7:54:07 AMFeb 12
to BGB, Bruce Hoult, RISC-V ISA Dev

BGB

unread,
Feb 12, 2024, 5:48:55 PMFeb 12
to Tommy Murphy, Bruce Hoult, RISC-V ISA Dev
On 2/12/2024 6:53 AM, Tommy Murphy wrote:
> One (possibly tangential) thing that I meant to respond to:
>
>> GCC does not seems to use stack canaries.
>
> Not by default - but it can do - e.g.:
>
> *
> https://mcuoneclipse.com/2019/09/28/stack-canaries-with-gcc-checking-for-stack-overflow-at-runtime/ <https://mcuoneclipse.com/2019/09/28/stack-canaries-with-gcc-checking-for-stack-overflow-at-runtime/>
> *
> https://security.stackexchange.com/questions/265438/why-are-stack-canaries-not-enabled-by-default-on-gcc <https://security.stackexchange.com/questions/265438/why-are-stack-canaries-not-enabled-by-default-on-gcc>
> *
> https://developers.redhat.com/articles/2022/06/02/use-compiler-flags-stack-protection-gcc-and-clang <https://developers.redhat.com/articles/2022/06/02/use-compiler-flags-stack-protection-gcc-and-clang>

OK, useful to know.
I had been using stack canaries in my own compiler output, whereas GCC
did not use them, which seems like it could affect things.

Though, OTOH, it sounds like the GCC strategy might be more expensive,
since it involves potentially printing a message and aborting, rather
than checking and immediately triggering a breakpoint instruction.



As another thing I can note:
My project currently lacks any direct equivalent of the 'FENCE'
instruction, and I had mostly been handling this part in software
(partly by using a "cache knocking" strategy, making use of the
behavioral properties of the direct-mapped caches).

Where it is possible, for any address in memory, to compose another
address into a dummy area, where accessing memory through the second
address will knock the first address out of the cache.

Though, if one wants to flush the whole cache:
  Use the L1 cache flush instruction;
    Not strictly needed for DM caches,
      would allow the mechanism to still work with associative caches.
    Is given either a specific address to flush (via a register),
      or values indicating the entire L1 I/D or L2 cache.
  Do a series of loads across 32K or 64K of memory (dummy area).
    This knocks any dirty lines out into the L2 cache.
An L2 flush is also possible, but:
  Currently, all cores will share the same L2 cache;
  Involves signaling an L2 flush;
  Then, a sweep over 512K or 1MB or so of physically-mapped addresses;
  Will not work correctly with virtual addresses.
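The address computation behind the knocking trick is simple for a direct-mapped cache (sizes per the 32K L1D mentioned earlier; the dummy-region base is hypothetical):

```c
#include <stdint.h>

/* "Cache knocking" for a direct-mapped cache: an access whose index
   bits match address A's evicts A's line. dummy_base should be aligned
   to the cache size and map to harmless memory. */
#define L1D_SIZE (32 * 1024)  /* direct-mapped: index = addr mod size */

static uintptr_t knock_addr(uintptr_t dummy_base, uintptr_t a)
{
    return (dummy_base & ~(uintptr_t)(L1D_SIZE - 1))
         | (a & (uintptr_t)(L1D_SIZE - 1));
}
```

Loading from `knock_addr(dummy_base, a)` forces `a`'s line out (writing it back to L2 if dirty), which is why a sweep of loads over a cache-sized dummy area flushes the whole L1.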


Treating the FENCE as a NOP is probably not ideal though.

Though, I could probably instead treat it like a TRAP / ECALL, and let
an interrupt handler deal with it. Though, if used as shown in the RV
spec, this would be horridly slow (would be preferable in this case to
do any spinlock locking via a system-call).


Note that I am also using a weak memory model.



Otherwise, seems I still don't have the superscalar stuff for RV64 fully
debugged. Doom is able to start up, but still has graphical issues and
crashes.

Debugging this is slow, as it takes a good number of hours to boot Doom
in a Verilator simulation.

As-is, the logic consists of:
In the IF stage, check every position within the cache lines, flagging
the word with some flags:
Valid in Lane 1 with an operation in Lane 2;
Valid in Lane 1 with an operation in Lane 3;
Valid in Lane 2;
Valid in Lane 3.
And some logic to check for register port conflicts with the adjacent
instruction words (for each instruction word).

Some logic to MUX the correct bits based on the low-order bits of the
input PC.
Some checks to make sure that the combination is allowed and that the
register conflict bits are clear.

In this case, the superscalar handling is similar to the normal
bundle-handling, just pretending as-if there were a wide-execute bit
which is set for instructions that pass the superscalar check. This
logic needs to happen in the IF (Instruction Fetch) stage.

In the superscalar case, it will set the PC step to 8 rather than 4, and
the ID stage (Instruction Decode) will interpret a PC step of 8 (or 12)
in RV Mode as signaling the use of superscalar decode.


Then one realizes there are some registers that may not be modified
outside of Lane 1: SP, LR (RA), GBR (GP), or DHR (X5).

The existing logic doesn't check this, will probably need to do so.
These registers are "special", and may have side-channels.

Note that TBR (TP) is also special, but currently has the property that
it is treated as read-only in Usermode (if Usermode can stomp TBR/TP, it
could take down the whole OS in the process).


Doesn't currently check for more advanced co-issue cases, so thus far,
FPU ops and similar are treated as Lane-1 only. Note that Mem-Ops are
also Lane-1 only (there is only a single memory-access port).

Can note that memory access is pipelined, but follows a pattern:
EX1: Calculate Address, Feed request to L1 D$;
EX2: L1 D$ does its thing;
May stall the pipeline here if an L1 miss happens.
( Fast path: Result arrives here early )
EX3:
Result arrives from L1 D$;
Memory store also happens on this cycle.


Partly, it seems worth mention as some other cores seem to use a
different strategy (such as marking registers as being valid/invalid in
the register file, with Load/Store going through a FIFO or similar, and
permitting or blocking entry of instructions into the Execute stages
based on whether its registers were marked as valid).

However, the strategy I had used seemed simpler and cheaper.


One of the cores that had used the other strategy had a fairly high LUT
cost and ran at a lower clock speed (IIRC, ran RV32G, but ate most of an
XC7A100T and could only run at around 25 or 33 MHz). Granted, didn't
study the Verilog too closely, so may have misunderstood how it worked.


OTOH:
I can push my core up to 75MHz, but this generally requires increasing
instruction latency values, disabling the SIMD unit, and reducing the L1
cache sizes. The performance impacts of these changes generally outweigh
all of the gains of the higher clock speed.

Thus far, haven't had much luck getting any multi-lane cores to run at
100MHz or above (this seemingly limited mostly to scalar core designs).

Can also note that MicroBlaze also seemingly topped out at around 100MHz
on these FPGAs (so, likely isn't too much of an issue of "bad Verilog
skills").

...


Al Martin

unread,
Feb 13, 2024, 2:12:02 PMFeb 13
to RISC-V ISA Dev, BGB, RISC-V ISA Dev, Tommy Murphy, Bruce Hoult
Have you thought about lowering the frequency target?  Making as many instructions as possible have execution latency = 1 might give you better overall performance.

BGB

unread,
Feb 13, 2024, 7:41:30 PMFeb 13
to Al Martin, RISC-V ISA Dev, Tommy Murphy, Bruce Hoult
On 2/13/2024 1:12 PM, Al Martin wrote:
> Have you thought about /lowering /the frequency target?  Making as many
> instructions as possible have execution latency = 1 might give you
> better overall performance.
>

I had looked into this before, but unless I can find a good way to get
higher IPC, dropping below 50MHz (say, to 25 or 33 MHz) would generally
make performance worse (even if nearly every instruction could have a
1-cycle latency).

It seemed like targeting 50MHz was roughly the local optimum:
Allows keeping moderately low latency and allows moderately large L1 caches.
At 25 or 33 MHz, things are mostly bottle-necked by how quickly the CPU
can throw instructions through the pipeline;
And, at 75MHz, mostly by the latency needed to get the CPU to pass timing.


I had mostly excluded a 64-bit scalar core running at 100MHz fairly
early on, mostly because it was too much of a pain dealing with timing
(and a 32-bit core is, meh).

Similarly, a scalar core running at 75MHz didn't outperform a 3-wide
core running at 50MHz (it could almost win, apart from the issue of
still needing smaller L1 caches and higher latency to pass timing
constraints).



But, yeah, at the moment, latency values are:
  1 cycle:
    MOV (LUI, etc), ADDS.L (ADDW), SUBS.L (SUBW), AND/OR/XOR
    ADD/SUB: If inputs fall in 32-bit range.
    EXTS.L, EXTU.L (no direct equivalent in RV64G)
      Though, EXTS.L is approximated via "ADDIW Xd, Xs, 0"
      Zba would add 'ADD.UW', IIRC.
    etc...
  2 cycle:
    ADD/SUB (inputs fall outside 32-bit range)
    LEA.x (or, SHnADD, when implemented)
    Shift instructions:
      SHAD/SHLD/SHADQ/SHLDQ/SHLR/SHAR/SHLRQ/SHARQ (names in my ISA)
      SLL, SRL, SRA, SLLW, SRLW, SRAW (RV64)
    Memory Load:
      32 or 64-bit, aligned, no memory RAW dependency.
    CMP: SLT/SLTU, FCMP: FLE/FLT/FEQ, ...
  3 cycle:
    Memory Load (generic case)
    32-bit multiply MUL.L (MULW), ...
    FADD.S / FSUB.S / FMUL.S (if SIMD unit is enabled, ...)
  Longer (and non-pipelined):
    FADD/FSUB/FMUL (FADD.D/...), 6 cycle
      Stalls the pipeline for 5 cycles, delivers result in EX2.
    No SIMD unit: FADD.S/FSUB.S/FMUL.S: 10 cycle
    FMAC (FMADD/FMSUB/...): 14 cycle
    DIVS.L/DIVU.L (DIVW/DIVUW): 36 cycles
    DIVS.Q/DIVU.Q (DIV/DIVU): 68 cycles
    FDIV.S/FDIV.D: 122 cycles
      All of these are implemented via a "shift and add" unit.
      Integer: Mildly faster to use DIV ops;
      FPU: Currently faster to do unrolled Newton-Raphson in software.
  ...

Where, I have an ~ 8-stage pipeline:
PF IF ID1 ID2 EX1 EX2 EX3 WB
Or, more accurate names:
PF IF ID RF EX1 EX2 EX3 WB

Where:
PF: PC arrives at L1I$
IF: Check hit/miss, determine instruction length.
RV64: Figure out if superscalar execution can be used.
BJX2: Instructions encode whether they execute in parallel.
ID1/ID:
Decode the instructions.
Branch predictor does its thing.
Overrides PC sent to L1I$.
Integrates results from a "normal branch".
ID2/RF:
Gather up all the register inputs for the various register ports;
May be forwarded from the EX1/EX2/EX3 stages.
Determine whether predicated instructions are executed or not.
Instructions are one of: Always/Never/True/False
True/False cases are effectively remapped to Always/Never here.
The True/False status depends on a True/False status flag.
EX1/EX2/EX3: Execute the instruction;
WB: Register results written back to register file.


If an L1I$ or L1D$ miss happens, the entire pipeline is stalled until
the miss is resolved (pretty much everything in the CPU core needs to
hold still until execution can continue). This also happens whenever the
main FPU is doing an operation, or the Shift-Add divider is working on
something.

If ID2/RF needs something, and it isn't available yet:
The PF/IF/ID1/ID2 stages are stalled, whereas EX1/EX2/EX3/WB continue;
This injects NOPs into the EX stages.
This creates a "pipeline bubble";
Also happens if a predicated op directly follows CMPxx or similar.

As noted before, main configuration uses a 6R+3W register file, with 3
lanes:
Source ports: S, T, U, V, X, Y
Destination Ports: M, N, O
(Z=Zero)
For normal scalar ops:
OP1 S, T, Y->M
OP2 U, V, X->N | OP1 S, T, Y->M
OP3 X, Y, Z->N | OP2 U, V, X->N | OP1 S, T, Y->M
Where: OP1/OP2/OP3 (Operation in Lane 1/2/3)
Lanes 2 and 3 exist as prefixes to Lane 1.
For 3 wide bundles, effectively only 2-input instructions are usable.

Store typically pulls its value from the Y or X/Y ports.
S: Base Address
T: Index or Displacement
Y | X/Y: Value

Where:
Imm/Disp cases are treated as equivalent to registers as far as executed
instructions are concerned. The immediate or displacement is handled
as-if it were a special register, whose value is given as the immediate
field associated with the decoded instruction (so, as far as the EX
stages are concerned, indexed addressing is the only addressing mode).

Where, register ID space looks like (7-bit):
00..3F: GPRs (General Purpose Registers)
40..5F: SPRs and pseudo-registers (ZR, IMM, ...)
60..7F: CRs (Control Registers)

Though, the EX stages add an extra bit to the ID:
0: Normal register, value may be forwarded from this stage;
1: Held register, value may not be forwarded.
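The immediate-as-pseudo-register idea can be sketched as a toy operand-fetch function (the exact ID assignments for ZR/IMM within the 40..5F SPR range are my guesses):

```c
#include <stdint.h>

/* Toy model of the operand fetch described above: the immediate is
   exposed as a pseudo-register, so the EX stages only ever see
   "register" operands. Slot numbers are assumed, not the real map. */
#define REG_ZR  0x40  /* pseudo-register: always zero (assumed slot) */
#define REG_IMM 0x41  /* pseudo-register: decoded immediate (assumed slot) */

typedef struct {
    uint64_t gpr[64];  /* GPRs: IDs 00..3F */
    uint64_t imm;      /* immediate field of the decoded instruction */
} dec_state;

static uint64_t read_operand(const dec_state *s, unsigned id)
{
    if (id < 0x40)     return s->gpr[id];
    if (id == REG_ZR)  return 0;
    if (id == REG_IMM) return s->imm;  /* value "as-if" a register */
    return 0;                          /* other SPRs/CRs: not modeled */
}
```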


With Lane 1 being "do everything", Lane 2 being more restricted, and
Lane 3 being minimal (only does MOV and basic ALU operations). Mostly
this is because the compiler could rarely find much to put there, and if
it did, it was typically a MOV or ALU operation, so not worthwhile to
invest more resources on it.

But why bother with 3 lanes? Mostly it is an excuse to justify 6 read
ports, which allows running 3-input instructions in parallel with other
instructions (so, a 4R+2W register file has an adverse effect on many
2-wide bundles; but is a little cheaper).
Had also looked into 6R+2W configurations with no 3rd lane, but this had
almost the same cost as the full 3-wide configuration.


Another merit is the 3-wide fetch/decode does also allow for some 96-bit
jumbo-encodings, for operations like:
MOV Imm64, Rn
Or, in RV terminology:
LI Xd, Imm64


Had looked into going wider than 3, but LUT cost goes crazy past that
point (the register forwarding has roughly N^2 complexity, and will
quickly go out of control).

More so, my compiler has a hard enough time finding enough ILP even to
keep the existing pipeline busy.



At present, for my own ISA, I seem to be getting in the area of an
average bundle-size of around 1.35 instructions/bundle.

With the superscalar approach, for RISC-V, it seems to be around 1.20
instructions/bundle.

Where, say, the CPU core is generally running roughly 0.5 to 0.7
bundles/clock or so.

When running Dhrystone (RV64G, -O3):
31 MIPs at 50MHz, 1.12 instructions/bundle, 0.554 bundles/clock.
2% interlock, 7% mem access,
6% branch-miss (88.5% branch-predict accuracy),
12% divide,
100% L1I$ hit, 99.6% L1D$ hit.
Score: 88k at 50MHz
1.62 DMIPS per RV64 MIPS
~ 1.0 DMIPS/MHz
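As a sanity check on these numbers, against the usual 1757 Dhrystones/sec VAX 11/780 baseline:

```c
/* Worked check of the Dhrystone figures above: 88k Dhrystones/sec at
   50 MHz and ~31 MIPs, against the 1757/sec VAX 11/780 baseline. */
static double dmips(double dhrystones_per_sec)
{
    return dhrystones_per_sec / 1757.0;  /* standard DMIPS definition */
}
/* 88000/1757 ~= 50.1 DMIPS -> ~1.0 DMIPS/MHz, ~1.62 DMIPS per MIPS */
```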


Top instructions (descending order):
JAL, ADD, LD/SD, DIV, LW/SW, BEQ, BNE
LDBU, ADDW, ...

Shift seems to be lower on the ranking for Dhrystone, but near the top
for Doom, likely because Doom is spending most of its time in
array-processing functions.

Top functions (descending):
main, strcmp, strcpy, Proc_1, Func_1, ...


So, apart from main, the main hot-path functions are strcmp and strcpy.
However, it is not super obvious how to make these much faster; had
already made use of the trick of making them work 8 characters at a time.

As noted, for strcmp (8 bytes at a time):
uint64_t c0, c1, li0, li1, lj;
c0=0x8080808080808080ULL;
c1=0x7F7F7F7F7F7F7F7FULL;
li0=*(uint64_t *)p1;
li1=*(uint64_t *)p2;
lj=(li0|(li0+c1))&c0;  /* bit 7 set per byte iff that byte is nonzero */
while((li0==li1) && (lj==c0))
...
Which effectively generates a bit-mask that distinguishes between zero
and non-zero bytes.
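A self-contained version of the zero-byte test, for anyone who wants to play with it (assumes ASCII-range bytes, since the add can carry across byte lanes for byte values >= 0x81):

```c
#include <stdint.h>

/* Returns nonzero if any byte of x is zero, using the mask trick
   described above: (x | (x + 0x7F..7F)) & 0x80..80 sets bit 7 of each
   byte lane iff that byte is nonzero. Assumes ASCII-range input. */
static int has_zero_byte(uint64_t x)
{
    const uint64_t c0 = 0x8080808080808080ULL;
    const uint64_t c1 = 0x7F7F7F7F7F7F7F7FULL;
    uint64_t lj = (x | (x + c1)) & c0;
    return lj != c0;  /* some byte's mask bit stayed clear -> zero byte */
}
```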

But then, Dhrystone does tend to be largely a benchmark of one's
"strcmp()" function, so, ...


At the moment, for both ISAs, the overall MIPS values seem to be similar
(both around 30 MIPs +/- 5 MIPs).

Superscalar seems to have the most obvious effect on Dhrystone, whereas
Doom and Quake don't see as big of a difference from this.

As noted before, in terms of performance, at present:
Dhrystone: Currently now won by RV64G
Doom: Currently won by BJX2-XG2
Array dominated (DrawSpan and DrawColumn).
SW Quake: Roughly a tie.
GLQuake: No contest (RV64G lacks SIMD).

Neither ISA has particularly complex addressing modes:
RV64: (Rs, Disp)
BJX2: (Rs, Disp); (Rs, Ri)
With some differences:
RV64 uses 12s displacements with no scale;
BJX2 uses 9u or 10s with fixed scales (based on access size).
Register index scale is also fixed based on the access size.
If a different scale is needed, a multi-op sequence is used.
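In C terms, the two effective-address forms boil down to something like the following (function names are mine; "scale" is the log2 of the access size, per the fixed-scale rule above):

```c
#include <stdint.h>

/* Sketch of the two addressing forms: (Rs, Disp) and (Rs, Ri),
   with the index implicitly scaled by the access size. */
static uint64_t ea_disp(uint64_t base, int64_t disp)
{
    return base + (uint64_t)disp;      /* register + displacement */
}

static uint64_t ea_index(uint64_t base, uint64_t idx, unsigned sz_log2)
{
    return base + (idx << sz_log2);    /* register + (index << scale) */
}
```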

Generally, addressing modes are much more limited if compared with
something like ARM or similar. Had experimented with some more complex
addressing modes, but they didn't seem worth it.



From what I can gather:
Performance per clock appears to beat early 90s era x86 machines;
Performance per clock seems to be on-par with 90s era RISC machines.

Annoyingly, I still seem to need roughly 100 MHz to get decent
framerates in Quake (assuming cache sizes and timings remain the same).

From what I can gather though, Quake performance on sub 100MHz machines
wasn't exactly good in the 90s either. Though, most PCs back in my
childhood (in the 90s) could run Quake and similar pretty OK.


Or, we need roughly 60+ MIPs to get SW Quake into consistently
double-digit territory...


>
> On Monday, February 12, 2024 at 2:48:55 PM UTC-8 BGB wrote:
>
> On 2/12/2024 6:53 AM, Tommy Murphy wrote:
> > One (possibly tangential) thing that I meant to respond to:
> >
> >> GCC does not seems to use stack canaries.
> >
> > Not by default - but it can do - e.g.:
> >
> > *
> >
> https://mcuoneclipse.com/2019/09/28/stack-canaries-with-gcc-checking-for-stack-overflow-at-runtime/
> > *
> >
> https://security.stackexchange.com/questions/265438/why-are-stack-canaries-not-enabled-by-default-on-gcc
> > *
> >
> https://developers.redhat.com/articles/2022/06/02/use-compiler-flags-stack-protection-gcc-and-clang