On 1/23/2024 4:10 PM, MitchAlsup1 wrote:
> BGB wrote:
>
>> On 1/23/2024 6:06 AM, Robert Finch wrote:
>>>
>
>> IME, the main address modes are:
>> (Rm, Disp) // ~ 66% +/- 10%
>> (Rm, Ro*FixSc) // ~ 33% +/- 10%
>> Where: FixSc matches the element size.
>> Pretty much everything else falls into the noise.
>
> With dynamically linked libraries one needs:: k is constant at link time
>
> LD Rd,[IP,GOT[k]] // get a pointer to the external variable
> and
> CALX [IP,GOT[k]] // call external entry point
>
> But now that you have the above you can easily get::
>
> CALX [IP,Ri<<3,Table] // call indexed method
> // can also be used for threaded JITs
>
These are unlikely to be particularly common cases except when using a
GOT or similar; if one does not use a GOT, this is less of an issue.
Granted, this does mean that, if importing variables is supported, it
will come with a penalty. It is either this, or adding a mechanism where
one can use an absolute addressing mode and then fix up every instance
of the variable during program load.
Say:
MOV Abs64, R4
MOV.Q (R4), R8
Though, neither ELF nor PE/COFF has a mechanism for doing this.
Not currently a huge issue, as this would first require the ability to
import/export variables in DLLs.
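As a sketch of the tradeoff (a toy Python model with invented names, not
how BGBCC or any real loader is structured): with a GOT the loader
patches one table slot per imported symbol, while absolute-address
fixups require patching every use site in the image.

```python
# Toy model (names invented): GOT patching vs. absolute-address fixups.

def load_with_got(got, imports, resolve):
    # Patch each GOT slot once; code then loads through the table.
    for slot, name in imports.items():
        got[slot] = resolve(name)

def load_with_fixups(image, fixup_sites, resolve):
    # Patch the absolute address embedded at every use site.
    for offset, name in fixup_sites:
        image[offset] = resolve(name)

# Toy symbol table standing in for the exporting DLL.
symbols = {"extern_var": 0x10020000}
resolve = symbols.__getitem__

got = [0, 0]
load_with_got(got, {0: "extern_var"}, resolve)

image = [0, 0, 0, 0]          # "code" with 3 references to extern_var
sites = [(0, "extern_var"), (2, "extern_var"), (3, "extern_var")]
load_with_fixups(image, sites, resolve)
```

One patched slot versus one patch per reference; the fixup scheme also
requires the object format to record every reference, which is the part
ELF and PE/COFF loaders don't provide here.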
>> RISC-V only has the former, but kinda shoots itself in the foot:
>> GCC is good at eliminating most SP relative loads/stores;
>> That means, the nominal percentage of indexed is even higher...
>
> A funny thing happens when you get rid of the "extra instructions"
> most RISC ISAs cause you to have in your instruction stream::
> a) the number of instructions goes down
> b) you get rid of the easy instructions
> c) leaving all the complicated ones remaining
>
Possibly.
RISC-V is at a stage where execution is dominated by ALU ops;
BJX2 is at a stage where it is mostly dominated by memory Load/Store.
Being Ld/St bound seems like it would be worse, but part of this is
because it isn't burning quite so many ALU instructions on things like
address calculations.
Technically, part of the role had been moved over to LEA, but the LEA
ops are a bit further down the ranking.
>> As a result, the code is basically left doing excessive amounts of
>> shifts and adds, which (vs BJX2) effectively dethrone the memory
>> load/store ops for top-place.
>
> These are the easy instructions that are not necessary when ISA is
> properly conceived.
>
Yeah.
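To put the quoted point in concrete terms (a toy Python model with
invented helper names; the comments show the corresponding
instructions): loading a[i] with 8-byte elements costs a shift, an add,
and the load on RV64, versus a single scaled-index memory op with
(Rm, Ro*FixSc).

```python
# Illustrative sketch (not real encodings): the effective-address work
# for loading a[i], where each array element is 8 bytes.

def rv64_style(base, i):
    # RV64: the shift and add are separate ALU instructions ahead of
    # the actual load.
    t = i << 3              # SLLI t, i, 3
    addr = base + t         # ADD  addr, base, t
    return addr, 3          # plus the LD itself -> 3 instructions

def indexed_style(base, i, scale=8):
    # (Rm, Ro*FixSc): the scale-and-add folds into the memory op.
    return base + i * scale, 1

a0, n0 = rv64_style(0x1000, 5)
a1, n1 = indexed_style(0x1000, 5)
```

Same address either way; the difference is only how many instructions
carry the address math.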
>> Likewise, the moment one exceeds 12 bits on much of anything, RISC-V
>> also shoots itself in the foot. Because, not only has one hit the
>> limits of the ALU and LD/ST ops, there are no cheap fallbacks for
>> intermediate range constants.
>
> My 66000 has constants of all sizes for all instructions.
>
At present:
BJX2: 9u for ALU and LD/ST, 10u/10s in XG2.
Though, the scaled 9u can give 2K / 4K of reach for L/Q.
The Disp10s might, in retrospect, have been better as 10u.
RV64: 12s, unscaled for LD/ST.
This gives a slight advantage to RV64 for ALU ops.
For loading constants:
BJX2:
Can load Imm17s into R0-R31 in Baseline, R0..R63 in XG2;
Can load Imm25s into R0.
RV64:
No single-op option larger than 12 bits;
LUI and AUIPC don't really count here.
RV64 can encode a 32-bit constant in a 2-op sequence;
BJX2 can encode an arbitrary 33-bit immediate in a 64-bit encoding, or a
64-bit constant in a 96-bit encoding.
RV64IMA has no way to encode a 64-bit constant in fewer than 6 ops.
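For reference, the usual RV64 2-op sequence is LUI followed by ADDIW,
with a +0x800 rounding step to compensate for ADDIW sign-extending its
12-bit immediate. A small Python model of the arithmetic (semantics per
the RISC-V spec; the helper names are mine):

```python
# Toy model of RV64's LUI+ADDIW pairing for materializing a
# sign-extended 32-bit constant in two instructions.

def sext(value, bits):
    # Interpret the low `bits` bits of value as two's-complement signed.
    sign = 1 << (bits - 1)
    return ((value & ((1 << bits) - 1)) ^ sign) - sign

def lui_addiw(const32):
    # ADDIW sign-extends its 12-bit immediate, so the upper 20 bits must
    # be rounded up whenever the low 12 bits land in the negative range.
    hi = ((const32 + 0x800) >> 12) & 0xFFFFF   # LUI immediate
    lo = sext(const32, 12)                     # ADDIW immediate
    return sext(((hi << 12) + lo) & 0xFFFFFFFF, 32)
```

A full 64-bit constant, by contrast, needs the value built in pieces
with intervening shifts (LUI/ADDIW plus SLLI/ADDI steps), which is where
the 6-op sequences come from.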
Seems like GCC's solution to a lot of this is "yeah, just use memory
loads for everything" (though still using 2-op sequences for PC-relative
address generation).
>> If my compiler, with its arguably poor optimizer and barely functional
>> register allocation, is beating GCC for performance (when targeting
>> RISC-V), I don't really consider this a win for some of RISC-V's
>> design choices.
>
> When you benchmark against a strawman, cows get to eat.
>
Yeah.
Would probably be a somewhat different situation against a similarly
clocked ARMv8 core.
Though, some people were claiming that RISC-V can match ARMv8
performance?...
I would expect ARMv8 to beat RV64 for similar reasons to how BJX2 can
beat RV64, but with ARMv8 also having the advantage of a more capable
compiler.
Then again, I can note that BGBCC generally also uses stack canaries:
On function entry, it puts a magic number on the stack;
On function return, it reads the value back and checks that it is
intact; if not, it triggers a breakpoint.
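In pseudocode terms, the check amounts to the following (a minimal
Python sketch; the constant and the "breakpoint" exception are
stand-ins, not BGBCC's actual values):

```python
# Minimal sketch of the canary scheme described above; the magic number
# and the exception are invented stand-ins.

CANARY = 0x5A5A5A5A

def function_entry(stack):
    stack.append(CANARY)        # prolog: push the magic number

def function_return(stack):
    # epilog: re-read the value and verify it is intact
    if stack.pop() != CANARY:
        raise RuntimeError("canary clobbered: trigger breakpoint")
```

An overflow in the function body that clobbers the canary is then caught
at return, before the corrupted return path is used.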
Well, there are also some boilerplate tasks:
Saving/reloading GBR, and going through a ritual to reload GBR as needed
(say, in case the function is called from somewhere where GBR was set up
for a different program image);
Also using an instruction that enables/disables WEX support in the CPU
based on the requested WEX profile;
...
There was also some amount of optional boilerplate (per function) to
facilitate exception unwinding (and the possibility of using try/catch
blocks). But, I am generally disabling this on the command-line
("-fnoexcept") as it is N/A for C. If enabled, every function needs this
boilerplate, or else it will not be possible to unwind through these
stack-frames on an exception.
These things eat a small amount of code-space and clock-cycles;
generally, GCC doesn't seem to do any of this.
I am guessing it also has some other way to infer that it doesn't
need exception-unwinding boilerplate for plain C programs?...
>> And, if GCC in its great wisdom, is mostly loading constants from
>> memory (having apparently offloaded most of them into the ".data"
>> section), this is also not a good sign.
>
> Loading constants:
> a) pollutes the data cache
> b) wastes energy
> c) wastes instructions
>
Yes.
But, I guess it does improve code density in this case, since the
constants are "somewhere else" and thus don't contribute to the size of
'.text'; the program just puts a few kB worth of constants into '.data'
instead...
It does make the code density slightly less impressive.
Granted, one can argue the same of prolog/epilog compression in my case:
Save some space on prolog/epilog by calling or branching to prior
versions (since the code to save and restore GPRs is fairly repetitive).
>> Also, needing to use shift-pairs to sign and zero extend things is a
>> bit weak as well, ...
>
> See cows eat above.
>
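For reference, the shift-pair idiom being referred to, modeled in Python
on 64-bit values (base RV64I has no single-instruction 16-bit extend, so
narrowing a value to halfword width takes two shifts either way; helper
names are mine):

```python
# Shift-pair sign/zero extension as done on base RV64I, modeled on
# 64-bit register values.

MASK64 = (1 << 64) - 1

def zext16_via_shifts(x):
    # SLLI x, x, 48 ; SRLI x, x, 48  (logical right shift)
    return ((x << 48) & MASK64) >> 48

def sext16_via_shifts(x):
    # SLLI x, x, 48 ; SRAI x, x, 48  (arithmetic right shift)
    t = (x << 48) & MASK64
    if t & (1 << 63):           # negative in two's complement
        t -= 1 << 64
    return t >> 48              # Python's >> is arithmetic for ints
```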
>>
>
>> Also, as a random annoyance, RISC-V's instruction layout is very
>> difficult to decipher from a hexadecimal view. One basically needs to
>> dump it in binary to make it viable to mentally parse and lookup
>> instructions, which sucks.
>
> When you consume 3/4ths of the instruction space for 16-bit instructions,
> you create stress in other areas of the ISA.
BJX2 Baseline originally burned 7/8 of the encoding space for 16-bit
ops.
For XG2, this space was reclaimed, generally to:
Expand register fields to 6 bits;
Expand Disp and Imm fields:
  Imm9/Disp9 -> Imm10/Disp10 (3RI),
  Imm10 -> Imm12 (2RI);
Expand BRA/BSR displacements from 20 to 23 bits.
IOW: XG2 now has +/- 8MB of reach for branch ops.
...
The bigger difference for mental decoding, I think, has to do with how
the bits are organized. In BJX2, most things are organized around 4-bit
nybbles, and immediate fields are mostly contiguous and also
nybble-aligned. The result is generally that it is much easier to
visually match the opcode and extract the register fields.
With RISC-V, absent dumping the whole instruction in binary, this is
very difficult.
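For example (field layout per the RV32I spec; the helper name is mine),
even a simple R-type op scatters its fields across nybble boundaries, so
the hex digits tell you almost nothing directly:

```python
# RV32I R-type field layout: register fields sit at bits [11:7],
# [19:15], and [24:20], none of which align to 4-bit hex digits.

def decode_rtype(insn):
    return {
        "opcode": insn & 0x7F,
        "rd":     (insn >> 7)  & 0x1F,
        "funct3": (insn >> 12) & 0x07,
        "rs1":    (insn >> 15) & 0x1F,
        "rs2":    (insn >> 20) & 0x1F,
        "funct7": (insn >> 25) & 0x7F,
    }

# ADD x1, x2, x3: the hex form 0x003100B3 gives no direct view of
# rd=1 / rs1=2 / rs2=3, since every field straddles a nybble boundary.
fields = decode_rtype(0x003100B3)
```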
This was a bit painful when trying to debug Doom booting in RISC-V mode
in my Verilog core via "$display()" statements.
But, luckily, I did at least eventually get it working.
So, at least to the limited extent of being able to boot directly into
Doom and similar, RISC-V mode does seem to be working...