Such is the issue in my case:
MUL is 1-3 cycles in my case (32x32=>64);
MUL is also relatively infrequent in my tests;
MAC saves typically 1 cycle over MUL+ADD;
...
I didn't see much gain from adding it, because there are generally just
not enough integer multiplies to begin with.
Most of the integer MUL+ADD cases were in the form of "arrays of structs"
and similar, which benefited primarily from an "Rn=Rp+Rm*Imm" instruction.
However, even with this instruction handling the use-case fairly
effectively, there weren't enough such cases in general to make much
visible impact on performance.
One could argue that it would be "better" to have a full 64-bit hardware
multiply and not "fake it" with 32-bit multiply ops; however, the logic
for "faking it" fits into the pipeline moderately well. The most obvious
addition would be a DMULS.L variant which could take the high half of a
register (and avoid the need for shift ops).
Say (64x64=>128):
MOV 0, R17 | DMULUH.L R4, R5, R16 //1c, Lo(R4) * Hi(R5)
MOV 0, R19 | DMULUH.L R5, R4, R18 //1c, Lo(R5) * Hi(R4)
DMULSHH.L R4, R5, R3 //1c, Hi(R4) * Hi(R5)
DMULU.L R4, R5, R2 //1c, Lo(R4) * Lo(R5)
ADDX R16, R18, R20 //2c (interlock)
SHLDX R20, 32, R20 //1c (*2)
ADDX R2, R20, R2 //1c
RTS //2c (predicted)
Time: ~ 10 clock cycles (excluding function-call overheads).
Otherwise, as-is, one would need to spend an extra clock cycle on a pair
of "SHLD.Q" operations or similar.
*2: Newly added encoding.
The closest exception to this (DMACS.L being "kinda pointless") was
Dhrystone, which has some multidimensional arrays that were used heavily
enough to have a visible effect, but even then the effect was still
fairly modest.
Dhrystone score is still pretty weak though (~ 69k ATM; 0.79 DMIPS/MHz).
Though, this still seems pretty solid, at least going by "vintage stats"
(e.g., results for this benchmark as posted back in the 1990s).
Testing on my PC, there is actually a fairly large difference between
compilers and between optimization settings when it comes to this
benchmark (and both GCC and Clang seem to give notably higher numbers
than MSVC on this test; need to compare "-Os" vs "/O2" or similar to
give MSVC much hope of winning this one).
I still haven't gotten "RISC-V Mode" working well enough to test an
RV64I build of Dhrystone and see how it compares, with both versions
running on the same hardware (which could confirm or deny the "GCC is
using arcane magic on this benchmark" hypothesis).
Though, probably doesn't help that BGBCC kinda sucks even vs MSVC when
it comes to things like register allocation (despite x86-64 having half
as many registers, MSVC still somewhat beats BGBCC at the "not spilling
registers to memory all the time" game).
I actually got a lot more performance gains recently, mostly from some
fiddling with the register-allocation logic (and was, for the first time
in a while, actually able to get a visible improvement in terms of Doom
framerate, *).
*: Doom now runs at ~ 15-25 fps for the most part, occasionally hitting
the 30 fps limiter, and no longer drops into single-digit territory.
This was a tweak that mostly eliminated a certain amount of "double
loading" (loading the same value from memory into multiple registers
using multiple memory loads), as well as some "spill value to memory,
immediately reload into another register" cases.
Mostly this was done by adding logic to check whether a given variable
will be referenced again within the same basic block: if so, the value
is preferentially loaded into a register and the in-register version is
used; if not, it is preferentially spilled to, or operated on in, memory
(via scratch registers if needed). The previous register-allocation
logic did not use any sort of "forward looking" behavior.
This went along with recently running into some code (while working on
DMACS.L) which was handling scaled addressing as if it were still
generating code for SuperH, namely trying to build the index scale from
fixed-shift operators and ADD operations, seemingly unaware that the ISA
now has 3R and 3RI operations (*3).
Could still be better here.
*3: Like, BJX2 is well past the stage of needing to do things like:
MOV RsrcA, Rdst
ADD RsrcB, Rdst
...
Or, trying to implement things like Rd=Rs*1280 as:
MOV Rs, Rd
SHLL2 Rd
ADD Rs, Rd
SHLL8 Rd
Just sorta used "#if 0" on a lot of this, since a 3RI MUL is now the
faster option (but there are still some "dark corners" like this in the
codegen, I guess). Granted, the ISA has been enough of a moving target
that much of the codegen is sort of a mass of layers, partly divided
along different stages of the ISA's development.
So, some high-level parts of the codegen still pretend they are
targeting SuperH, then emit instructions with encodings for a much
earlier version of the ISA, which are then bit-twiddled into their newer
encodings. Some amount of it should probably be rewritten, but endless
"quick and dirty hacks" was an easier prospect than "just go and rewrite
all this from a clean slate".
Well, and if I did this, I may as well go and finally banish the
"XML Demon" from the compiler frontend (because, well, using DOM as the
basis for one's C compiler AST system was not such a great idea in
retrospect; I have spent over a decade dealing with the fallout from
this decision, but it was never quite bad enough to justify "throw it
out and rewrite the whole thing from the ground up").
Though, my younger self was from an era when XML and SQL and similar
were hyped as the silver bullets to end all of one's woes (well, also
Java; but my younger self was put off enough by how painful and awkward
it was, and at the time its performance was kinda trash, which didn't
exactly help matters). Actually, this seems to be a common theme of that
era: most of the "silver bullet" technologies were about like asking
someone to build a house with an oversized lead mallet.
I guess this also slightly reduces the ranking position of "MOV Rm, Rn",
but there is still a lot more 'MOV' than there probably should be.
Probably still need to work on things, e.g.:
Trivial functions referenced via function pointers should probably not
create stack frames and save/restore GBR;
...
> Adding one or two register-size chunks to an N*N->2N MUL is however
> quite useful, even more so on architectures without carry flag(s).
>
The widening MUL is the default in my case; the narrow MUL is actually
implemented in hardware by taking the widening version and then sign or
zero extending the result.
Note that sign- and zero-extending arithmetic operators are fairly
useful for keeping old C code behaving as expected. Some of the code I
am working with is prone to misbehave if integer values go "out of
range" rather than wrapping modulo 2^32; which in turn means either
having sign/zero-extended versions of any operations prone to producing
out-of-range results, or needing to insert explicit sign or zero
extensions.
I suspect this may also be a reason why ARM64 and RV64I seemingly
include 32-bit subsets of the various integer operations (even if,
arguably, it would be both simpler and "more RISC" to simply do
everything using 64-bit ALU operations).
But, then, the split seems to be between, e.g.:
The x86-64 and ARM64 route: provide a nearly full redundant set of
32-bit and 64-bit operations, with 32-bit ops typically being
zero-extended;
The BJX2 and RV64I route: provide a more limited set of sign/zero
extended operators.
> Terje
>