I am using BGBCC, where:
Much of the front-end was hacked together when I was in my early 20s,
based on a fork of my BGBScript interpreter, which was originally a
JavaScript clone; the fork had been modified to accept C syntax and use
a C-like typesystem (with the original wonkiness that the ASTs in the
original BGBScript VM were implemented on top of a repurposed XML DOM
API; slightly later versions had switched over to cons-cell based
S-Expressions, *).
*: This needs clarifying because someone else had also gone and used
the term "S-Expressions" for something not based on Lisp/Scheme or the
use of cons-cells (with only a vague, superficial similarity to
Lisp-style syntax).
I couldn't debug it at the time, so I basically shelved it for around a
decade (apart from some use as a tool for mining stuff from C headers
for the BGBScript VM).
Later, I started working on BJX1 (back when I was still in my early 30s,
yeah...), and needed a compiler for this.
I had also looked at LCC, which would still have required writing a
backend, and it seemed like less effort to revive BGBCC as a base than
to try to write a backend for LCC.
I had by this point already gained a bit more experience working with
code-generation and similar, mostly from continuing to work on the
BGBScript VMs (which had by this point mutated in a more Java/C# style
direction).
Well, and I had also tried making a "C-Aux" VM/language, which was
intended to be a sort of C/C# hybrid (but it turns out "looks like C
but can't compile non-trivial C code" is "kinda lame"). Some bits of
the C-Aux effort were used as a basis for parts of my compiler middle
and backend though (noting that there wasn't that huge of a jump from a
Dalvik-inspired VM to a RISC-style ISA; just that the latter can be
made to run on an FPGA).
...
Never mind that my first choice here ended up being to base things
around SH-4 as the base ISA, sorta lazily copying the hardware
interfaces and memory map from the SEGA Dreamcast and similar.
Where this mutated into BJX1, which was then soft-rebooted into BJX2
(mostly in an attempt to clean up the awful mess that had resulted, and
try to rework it into something that could be more viable to implement
on an FPGA).
Things might have gone differently had I started out with RISC-V rather
than SH-4 (might have been less tempted to pile on extensions in an
attempt to "make it not suck").
>>
>> Options are one of:
>> Types match exactly, allow it through as-is;
>> Types are "compatible", quietly coerce the type;
>> Quietly convert it into a type-cast operation;
>> Actually do the load.
> <
> What if the int was 32-bits and short* was 64-bits ? Somehow, 32-bits must
> get invented prior to the store.
Values in my case are normally stored in a sign-or-zero extended form.
But, moving from 'int' to 'short' normally requires sticking in a sign
extension op.
If the expression is 'magically' int, and still has bits that should
have been cut off by the store (and reload), this isn't quite right and
can lead to bugs. In this case, one has to sign- or zero-extend the
value so that it looks as if it had been stored to memory (and
truncated in the process) and then reloaded as the intended type (and
also report the type as 'short' rather than 'int').
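As a minimal C-level sketch of the failure mode (values hypothetical):

  int roundtrip(int i) {
      short s = (short)i;  /* store truncates to 16 bits */
      return s;            /* reload sign-extends back to int */
  }

  /* roundtrip(0x12345) must give 0x2345; if the compiler forwards the
     original 'i' without inserting the sign-extension op, the stale
     high bits leak through and it wrongly gives 0x12345. */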
Similar also applies to storing and reloading pointers...
It was stuff breaking hard on a pointer-type mismatch that pointed out
the issue (it was using the type of the stored pointer rather than the
type from the array).
>>
>>>> <snip>
>>>> BTW: I put up a vote on Twitter wanting to see what the general
>>>> sentiment was on possible ways to "resolve" the encoding orthogonality
>>>> issues with R32..R63:
>>>>
https://twitter.com/cr88192/status/1602801136590782466
>>>>
>>> There was a recent poll on some site and 56% of Americans do not think
>>> that Arabic numerals should be taught in schools, too; and 15% don't have
>>> an opinion.
>>>
https://www.snopes.com/fact-check/teaching-arabic-numerals/
>>> We are well on the way to Idiocracy.
>> "Oh those squiggles, what do they mean. Radix-10 positional arithmetic,
>> what sorcery is this! We all know true numbers look like MCMXCIX!", then
>> proceeds to lose their crap if someone tries to bring up zero or
>> negative numbers...
> <
> I had a 7-th grade algebra teacher state that you cannot do Multiplication and
> Division in Roman Numerals. It took but a single day to show her the fallacy of
> her ways.
Could be possible, but probably isn't worthwhile.
In some ways, it would almost be nicer if everyone went over to
hexadecimal, as the rules are more consistent.
Yet, for whatever reason, there are a lot more people pushing for
duodecimal than for hexadecimal, which I don't really understand.
Like, with hexadecimal, all arithmetic can be carried straight down to
the level of boolean operators and similar if need be.
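For example (a minimal sketch, not anything from BGBCC): since each hex
digit is exactly four bits, addition can be expressed with nothing but
XOR/AND/shift:

  /* Addition from boolean operators alone: XOR gives the carry-less
     sum, AND finds the carry bits, and a shift propagates them. */
  unsigned add_via_bool(unsigned a, unsigned b) {
      while (b != 0) {
          unsigned carry = a & b;  /* positions generating a carry */
          a ^= b;                  /* sum, ignoring carries */
          b = carry << 1;          /* carries move up one bit */
      }
      return a;
  }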
Granted, in some contexts it might be "better" if there were some
semantic difference between the hexadecimal digits A..F and the letters
A..F. I suspect the usual "math person" solution would be to use a fancy
font, and then rely on "semantically relevant information conveyed via
the choice of font."
>>
>>
>> Thus far, the dominant response seems to be that people are against
>> having 64 GPRs. Would have preferred better prompts, but each was
>> limited to 25 characters, which isn't really enough for this.
> <
> More than 60% of the subroutines from the LLVM front end, EMBench, and
> CoreMark can be compiled <essentially> optimally with the 16-temporary
> registers my ABI provides. Of the rest only 2 subroutines (from 1000+)
> do any stack push/pops of temporary values (not associated with
> subroutine calling), and this is without a FP RF ! and only 32 total registers.
> <
> I am not against 64 registers, I just don't see the "value add" of consuming
> that many more instruction bits for "that fewer" instructions. That is: does
> 64 registers buy anything. The old data was 16->32 registers bought 15%
> while 32->64 bought only 3% and may constrain the OpCode layout. So,
> is it worth it:: does it buy more than it costs? .....
> <
> One thing I will note is that having constants universally available is like
> having 3-5 more registers in your file. Sort-of-like LD-Ops make the ISA
> as efficient as if it were a LD-only machine with 3-6 more registers.
It goes from "a majority of the leaf functions can static-assign
everything" to "nearly all of the functions can static-assign everything".
In non-leaf functions, there are fewer registers for static assignment,
so a big chunk of non-leaf functions end up with only a subset of
variables being static-assigned (with the rest being dynamically assigned).
In the case of BJX2, the base encoding already burns 3 bits on the 16/32
split, so the 'XG2' encoding (as I am now calling it) effectively
reclaims these bits.
For speed optimized code, the loss of 16-bit instructions seems to add a
roughly 7% penalty (which seems pretty close to my theoretical estimate).
>>
>> I was more just wondering what the general sentiment was, rather than
>> committing to follow with whatever option wins.
>>
>>
>>
>> But, yeah, the "least effort" option is "leave everything as-is", where:
>> BGBCC does not use R32..R63 by default unless enabled via a command-line
>> option (except for the 128-bit ABI).
>>
>> If not enabled, as far as BGBCC and the assembler are concerned, these
>> registers do not exist:
>> -fxgpr: Allows ASM to use these registers.
>> -fxgpr_ena: Allow BGBCC to use them for register allocation.
>> -fxgpr_abi: Allow ABI to use them for argument passing
>> Increases the number of register arguments to 16 (*).
>> But, also increases ABI's register spill space to 128 bytes.
>>
>> *: Increasing the number of register arguments from 8 to 16 seems to
>> increase the number of function calls which fit entirely in registers
>> from around 80% to around 98%.
>>
> Interesting data point, thanks. Any idea as to where 15 would fall ??
15 function arguments? Probably right up near 16...
Correction, minor screw-up:
~ 80% turns out to have been for 4 arguments, not 8.
My initial metric was off by a factor of 2 (*).
*: The value had been previously scaled up by 2 "for reasons", need to
divide by 2 again when building stats.
So, it looks like (looking at a few different programs):
4: ~79.2%
8: ~98.6%
15: ~99.9%
16: ~99.9%
18: 100%
As noted, the BJX2 ABI is 8 arguments with 32 GPRs, and 16 arguments
with 64 GPRs.
The length of argument lists seems to follow a nearly geometric
distribution, hitting a peak at around 2 or 3 arguments and then
dropping off rapidly (there are only a few random stragglers much past
12 arguments).
In both programs: the sum of 2 + 3 arguments is nearly half of the total
for all argument lists (with 0 + 1 + 4 making up another 35%).
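Roughly, the kind of tally involved (a sketch; the histogram values
here are made up, not measured):

  #include <stdio.h>

  /* counts[i] = number of call sites seen with i arguments */
  double coverage(const int *counts, int n, int nregs) {
      long total = 0, fit = 0;
      for (int i = 0; i < n; i++) {
          total += counts[i];
          if (i <= nregs)
              fit += counts[i];  /* call fits entirely in registers */
      }
      return total ? 100.0 * fit / total : 0.0;
  }

  int main(void) {
      /* made-up histogram, peaking at 2..3 arguments */
      int counts[19] = {40, 120, 260, 250, 130, 80, 50, 30, 20,
                        10, 5, 3, 2, 1, 1, 1, 0, 0, 1};
      printf(" 4 register args: %5.1f%%\n", coverage(counts, 19, 4));
      printf(" 8 register args: %5.1f%%\n", coverage(counts, 19, 8));
      printf("16 register args: %5.1f%%\n", coverage(counts, 19, 16));
      return 0;
  }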
>>
>> Using R32..R63 can help slightly for things like TKRA-GL and JPEG
>> decoding, but is slightly detrimental in many other cases (globally
>> enabling them is slightly detrimental to performance and code density
>> with the existing encoding scheme).
> <
> And therein lies the rub.
With the base ISA, it is "sometimes helps, sometimes hurts"; the
compiler ends up needing to figure out where to enable them for best
effect.
>>
>> Part of the issue is likely due to the orthogonality issue:
>> Prevents cases where instructions could have been bundled;
>> Forces using 64-bit Op64 encodings in many cases (as a fallback);
>> ...
> rub with liniment.
But, I can at least make it "slightly less bad" via 'XG2'...
Still comes with a 7% penalty in terms of code-density though.
In this case any "loss" (in terms of speed) would be more likely to
come from things like making function prologs/epilogs bigger, rather
than from these registers forcing the compiler to fall back to Op64
encodings (and thus being unable to shuffle or bundle the instructions).
For example, in the baseline ISA:
MOV 0x1234, R18 //Can be encoded in 32 bits
But:
MOV 0x1234, R36 //Needs a 64-bit encoding (...)
Whereas, 'XG2' will allow both to use a 32-bit encoding.
>>
>>
>> This seems to be enough to offset the (arguably small) reduction in the
>> number of stack spill-and-fill (spill-and-fill is bad; but 64-bit
>> instruction encodings being a roadblock to the WEXifier is also bad...).
>>
>> By design, this would eliminate cases where Op64 encodings are needed to
>> deal with XGPR (but where the op could otherwise fit into a 32-bit
>> encoding).
>>
>>
>> But, it does allow the "majority of all functions" to switch almost or
>> entirely to static-assigning all of the local variables to registers.
>> This mostly applies to non-leaf functions in this case; as the majority
>> of leaf functions are already able to go full-static with 32 GPRs, but
>> non-leaf functions can't use scratch registers for static assignment.
> <
> I am seeing (on different applications and not as many of them) that
> 90%-ish of all leaf functions are happy with 16-registers--no prologue
> or epilogue (and no spills/fills because the compiler inserts prologue
> if spills or fills would be required and then uses as many of the 32
> registers as needed.).
It is roughly similar in my case...
>>
>> Say, 32 GPRs:
>> Leaf function: Has ~ 26 registers it can use in this case.
> <
> Leaf function (not using a FP) has 30 registers it can use.
> <
There are a few registers which can't really be used for register
assignment.
It is 11 if only counting scratch registers (partly as some of the
scratch registers are not allowed to hold variables, mostly for
compiler-related reasons).
As can be noted, the majority of leaf functions can already go "full
static".
>> Non-Leaf: Has ~ 14 registers it can use in this case.
> <
> Non-leaf has 15 (no FP) or 14 (FP) it can preserve across subroutine
> calls.
Similar.
The stack pointer is at R15, which cuts out one of the callee-save
registers.
So, theoretically:
R8..R14, R24..R31
R40..R47, R56..R63 (XGPR)
A smart compiler (or ASM code) can use all 15 registers here (within the
base 32).
Also, hand-written ASM can use "the power of the human mind" to figure
out how to effectively use all 16 scratch registers without getting
trapped in a corner (while noting that a few of the registers have
slightly wonky rules for how they can be used).
Like, humans can better deal with occasional non-orthogonality, whereas
the compiler has to be conservative or else it will blindly stumble into
these cases and be like "oh crap" and then die.
Well, and a human can figure out how to assign multiple non-overlapping
temporaries to the same CPU register, ...
Sometimes, it seems like it is asking a bit much even to expect the
compiled program to behave correctly with neither the program nor the
compiler crashing in the process.
...
Or, I sorta get the new ISA mode implemented, but now have to figure out
why BGBCC is generating some "obviously mangled" instructions in this case.
But, alas... I sorta expected this much...
>>
>> With 64 GPRs:
>> Leaf function: Has ~ 58 registers it can use.
>> Non-Leaf: Has ~ 30 registers it can use.
>>
>> Partial reason being, in a leaf function, nothing is going to stomp the
>> scratch registers, but a non-leaf functions need to deal with these
>> registers getting stomped during function calls.
> <
> There is no reason a leaf subroutine cannot dump the preserved registers
> to the stack and use as many as it needs. The contract is that these must
> be restored before returning.
Granted, but if one uses more registers than can be static-assigned,
this means spill and fill.
Also, BGBCC's register allocation is kinda stupid compared with some
other compilers.
For example, GCC will realize that two temporaries have non-overlapping
lifespans, and can assign the same register to both temporaries.
For "static assign everything", BGBCC needs to be able to give every
temporary its own register, thus increasing register pressure for the
"full static" case.