On 6/30/2022 1:24 PM, MitchAlsup wrote:
> On Wednesday, June 29, 2022 at 7:41:31 PM UTC-5, BGB wrote:
>> On 6/29/2022 3:13 PM, MitchAlsup wrote:
>
>>> Why is it discontinuous ?
>> Mostly because it evolved out of the SuperH layout and kept a similar
>> organization pattern.
>>
>> Eg:
>> SH:
>> R0..R7: Scratch
>> R8..R14: Preserved
>> R15: SP
>> BJX1 extended this to 32 registers:
>> R0..R7: Scratch
>> R8..R14: Preserved
>> R15: SP
>> R16-R23: More Scratch
>> R24-R31: More Preserved
>> BJX2 made R0 and R1 "Special"
>> Moved return value and similar from R0 to R2.
>> Otherwise the same.
> <
> Register_remap[] = { R0..R7, R16..R23, R15, R8..R14, R24..R31 }
> <
> emit_register = Register_remap[ compiler_register ];
> <
> Presto done: single level of indirection fixes the whole kit and caboodle.
Possible.
If I at some point do another major iteration of the ISA design, I might
reorder the registers.
When I earlier started a limited-scope effort to add RISC-V support to
BGBCC, I added a partial register remapping to map stuff from the
current register space into the RISC-V register space (though it is
imperfect, as the C ABIs differ in ways that go beyond what register
shuffling can address; and the "usable" part of the RISC-V register
space is a little smaller than it is with BJX2).
>>
> <snip>
>>>>
>>> In my lower end cases, this number is 3.
>> If the BSR was predicted, the function needs at least 2 instructions
>> before the RTS.
>>
>> So, eg:
>> int foo()
>> { return(0); }
>> May need a NOP.
> <
> My 66000 never needs a NoOp, it is inherently interlocked. It just takes cycles.
In my case, it still just takes cycles, just more of them than ideal...
>>>>
>>>> In some cases, this can be dealt with by the compiler detecting these
>>>> cases and adding "magic NOPs" or similar (technically cheaper than
>>>> dealing with this issue via the interlock mechanism; and not common
>>>> enough to make doing so worthwhile).
>>> <
>>> What does the compiler do when you have the resources to build a 10-wide
>>> machine ?
>> By this point, one can probably justify the cost of the interlock case
>> or similar, otherwise dunno...
>>
>> As-is, it is still "safe" at least, but the branch predictor will skip
>> the RTS, forcing a slower non-predicted branch to be used instead.
>>
> So, why not do it NOW ?
It will take somewhat more LUTs to drive the Interlock-Stall mechanism
than it does the "ignore this branch" logic in the branch predictor.
Partly I think it is a case where things which affect stall paths and
similar (of which the interlock path is one) can cause a fairly
significant cost multiplication in any logic connected to them.
>>
>> In the near term, going wider than 3 is unlikely, as it is unlikely to
>> be able to gain any ILP (as-is, I am not even really getting enough ILP
>> to use 3-wide effectively).
> <
> ( SQRT(3) = 1.73 ) * 0.7 = 1.21
> <
> unless you are getting over 1.21 I/C you aren't getting all the ILP.
> {The 0.7 accounts for cache and TLB misses.}
Yeah, it is a bit less than this.
As noted before, compiler output currently gets ~ 1.25
instructions/bundle, and around 0.65 bundles per clock (from a recent
test running Doom).
Some recent compiler fiddling had gotten average bundle size from ~ 1.20
to around 1.25, mostly by changing the relative order in which it tries
to encode some instructions, and a few minor issues in the WEXifier.
Most of the ops bundled by the compiler seem to be one of (in
decreasing probability):
MOV (2-register);
LDI (constant load);
Sign/Zero extension;
ALU ops;
...
Lane 1 borders on being nearly a solid wall of Load/Store instructions.
>>
>> Partial issue being that in-general, the code tends to be almost
>> entirely dominated by instructions which depend tightly on the previous
>> instruction, with relatively little "shuffling" possible.
> <
> Which is why OoO is useful, to overlap dependent instruction streams.
Theoretically, the compiler could do a better job at this part.
But, I guess the great limiting issue here is that it doesn't...
Though, another limiting issue is that it can't really change the
relative order of memory stores, because doing so is prone to make the
program in question "violently explode".
This could maybe be done more often if there were some way to prove (at
the level of machine instructions) that the instructions don't alias
(whatever information might have otherwise been known about pointer
aliasing is lost by the time it reaches the machine-instruction stage).
Granted, one could argue for OoO on the basis that hardware just needs
to look at the memory addresses.
I guess another option could be to allow the compiler to somehow encode
a "this store doesn't alias with anything" flag into the store
instructions, such that the WEXifier can see this and more freely
shuffle it around.
Say, adding special purpose "MOV.RH.L" and "MOV.RH.Q" instructions,
where RH means "Restrict Hint", with these instructions (very likely)
decaying into their baseline forms after the WEXifier has finished.
And/or heuristics, say (assuming both element types match):
(SP,d1) x (SP,d2): Assume no alias if d1!=d2
(SP,d1) x (Rm,d2): Assume no alias if d2!=0
(Rx,d1) x (Ry,d2): Assume no alias if ((Rx==Ry)&&(d1!=d2))
...
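A rough C sketch of those heuristics (the MemAddr type and function name
are made up for illustration; this assumes both accesses have matching
element types, per the note above):

```c
#include <stdbool.h>

#define REG_SP 15  /* stack pointer register number (BJX2 R15) */

/* Hypothetical (base register, displacement) pair for a load/store. */
typedef struct { int base; long disp; } MemAddr;

/* Returns true when the two accesses may be assumed not to alias;
 * false means "be conservative, keep the original store order". */
static bool assume_no_alias(MemAddr a, MemAddr b)
{
    if (a.base == REG_SP && b.base == REG_SP)
        return a.disp != b.disp;               /* (SP,d1) x (SP,d2) */
    if (a.base == REG_SP || b.base == REG_SP)  /* (SP,d1) x (Rm,d2) */
        return (a.base == REG_SP ? b.disp : a.disp) != 0;
    if (a.base == b.base)
        return a.disp != b.disp;               /* (Rx,d1) x (Rx,d2) */
    return false;  /* different, unknown bases: assume they may alias */
}
```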
>>
> <snip>
>>> Extra work, and only prevents a few of the attack vectors. It prevents
>>> array[i++] errors, but not of the array[i+k] errors.
> <
>> The "i++" and "*t++" cases represent the vast majority of typical buffer
>> overflows though...
>>
>> I had a more powerful mechanism (tripwires), but these are currently
>> no-ops, as they would require tagged memory and are currently
>> incompatible with my virtual memory subsystem.
> <snip>
>>> hard is not secure, impossible is secure.
>> Granted.
>>
>> But harder is better than nothing. People are less likely to bother with
>> a buffer overflow if it only works on a single build of a program, vs
>> one which works "across the entire family".
>>
>> ASLR can also help, but reaching "full power" with ASLR still requires
>> more work on the debugging front in my case.
> <
> ASLR should be unnecessary, as it is a crutch.
It exists for a reason, and provides a line of defense for cases where
the other lines of protection have failed.
>>
>> As-is, if I put the stack or program ".text" sections into pagefile
>> backed virtual memory, this is prone to cause stuff to crash, which
>> still somewhat limits my ASLR capabilities at the moment.
>>
>>
>> Direct-remapped memory is still limited to the low 4GB for now, vs
>> anywhere within the 48-bit address space.
>>
>> There is a technical limitation that the loader can't map a loaded PE
>> image across a 4GB boundary, but the loader was already accounting for this.
>>
>> I can at least put the heap and data/bss sections in pagefile backed
>> memory though, which is something.
> <snip>
>>>>> 2 fewer transfers of control
>>>>> 13 fewer instructions
>>>>> 1 less wasted register
>>>> R1 is otherwise reserved by the ABI, and was repurposed into a de-facto
>>>> secondary LR for these sorts of use-cases.
>>> <
>>> As I said:: a wasted register.
>> I took it out of ABI use well before it ended up being reused as a
>> secondary LR.
> <
> So, you accept making your code less efficient by wasting a register.
I am still "wasting" fewer registers than RISC-V in this sense.
At least they are being used, just not really at the normal ABI level.
Also I needed some registers to use for encoding PC/GBR/TBR relative
addressing modes, and these served this role, though:
R0 is only usable as an index register, but not as a base register;
R1 is not usable as either a base or index register (1).
*1: Trying to use R1 as an Index:
With R0 or R1 as Rm, encodes alternate modes;
With Rm>=2, mimics the semantics of the SH "Rm+R0" mode (2).
Or, "MOV.L @(R0,R5), R9" if that is one's preference.
*2: Initially, this was so that SH-derived ASM would still work.
Ironically, one can still write stuff like "MOV.L @R4+, R7";
the assembler will just fake it using multiple instructions.
>>
>> The reason they were cut off from normal use in BJX2 partly goes back to
>> how R0 and R1 were used in earlier forms of the ISA (and effectively
>> goes all the way back to how they were being used in SuperH).
>>
>> Decided to leave out a much longer description of the SH4->BJX1->BJX2
>> evolution path...
>>
> At some point you need to let the black eye heal.
They could be "re-allowed", but at this point it would be unclear what
exactly this would involve:
The registers encode special addressing modes, so can't really be used
again as normal GPRs in all cases.
If allowed as a "Non-Base Scratch Register", BGBCC is already doing this
(in addition to the auxiliary link register case).
R0 could also be used for scratch, but with extra care given the
assembler may use it for scratch without warning (mostly when trying to
use instructions which "don't actually exist" in the ISA).
Both still likely see more use than if I had burned one of the register
spots as a Zero Register (though a Zero Register would have allowed
eliminating some of the 2R encodings).
Say:
NEG Rm, Rn
Could have been, say:
SUB ZR, Rm, Rn
But, practically, making the listing smaller doesn't save *that much*.
I was still able to write a disassembler (more or less) in a single day,
and much of this was spent writing the logic for the various instruction
forms, and filling in the listing table and similar.
Then again, one could argue that maybe an "actually simple" ISA would
have allowed someone to write a more-or-less complete disassembler in,
say, 1-3 hours or so.
Though, in this case the disassembler was mostly to allow looking at the
output after it has been fed through the WEXifier (without also having
to go through my emulator to do so).
It has still needed a bit of bug-fixing and fine-tuning, though.
Disassembler is using a fairly naive "((instr&mask)==pattern)" algorithm
for the pattern matching (as opposed to the "nested switch() tables"
approach used by my emulator). It is generally slower, but simpler and
more compact.
Sadly, the and-masking approach kinda hinders the ability to use
hash-based lookups though (each pattern would effectively need to map to
multiple hash chains for this). So, it kinda uses a linear-lookup
approach, but alas...
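The matching loop is basically just the following (the table entries
here are made-up encodings for illustration, not real BJX2 ones):

```c
#include <stddef.h>
#include <stdint.h>

/* An instruction matches an entry when (instr & mask) == pattern. */
typedef struct {
    uint16_t mask;
    uint16_t pattern;
    const char *name;
} OpcPattern;

static const OpcPattern opc_table[] = {
    { 0xF00F, 0x1003, "EXAMPLE.A" },
    { 0xFF00, 0x7200, "EXAMPLE.B" },
    { 0x0000, 0x0000, "UNKNOWN"   },  /* catch-all, must come last */
};

/* Naive linear lookup: first matching entry wins, so more-specific
 * patterns must precede less-specific ones; the catch-all entry
 * guarantees the loop terminates. */
static const char *disasm_lookup(uint16_t instr)
{
    for (size_t i = 0; ; i++)
        if ((instr & opc_table[i].mask) == opc_table[i].pattern)
            return opc_table[i].name;
}
```

Slower than dispatching on opcode fields, but the whole decoder is one
table and one loop.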
>>
> <snip>
>>> Minimum set of saved registers in My 66000 ABI is zero.
>> You can save fewer registers (or zero); it's just that it no longer
>> makes sense to use the prolog/epilog compression feature for these.
> <
> It is also the added control transfers.
OK.