On 11/28/2021 10:37 AM, Thomas Koenig wrote:
> BGB <
cr8...@gmail.com> schrieb:
>> On 11/27/2021 4:28 AM, Thomas Koenig wrote:
>>> BGB <
cr8...@gmail.com> schrieb:
>>>> On 11/27/2021 2:04 AM, Quadibloc wrote:
>>>>> On Saturday, November 27, 2021 at 12:01:19 AM UTC-7, BGB wrote:
>>>>>
>>>>
>>>> I did more testing, and noted that for dividing sufficiently large
>>>> numbers, my approach:
>>>> r = ((1<<32)+(d-1))/d;
>>>> q = (n*r)>>32;
>>>
>>> Get "Hacker's Delight" and read the stuff about division by constants.
>>> It's quite well written.
>>>
>>
>> Yeah, I don't have this.
>
> It is addressed to compiler writers, so it is definitely something
> that could benefit you. (As you roll your own architecture, you
> could even implement a "signed division by 2" instruction, or at
> least POWER's ADDZE so you only need two instructions for that).
>
I did look it up, and noted that it is apparently freely available
online (well, as opposed to needing to order a book).
But, yeah, such an instruction could be possible.
As-is, signed division by power-of-2 is 4 instructions:
MOV Rs, Rt | CMPGE 0, Rs
ADD?F (d-1), Rt
SHAD Rt, -k, Rd
Where: k=log2(d), and Rt is a scratch register.
> As far as division by constands is concerned,
>
https://oeis.org/A346496 has a sequence for 32-bit unsigned
> division.
>
OK. I had done more testing and noted that my revised form still failed
for larger values with various divisors (7, 14, 19, 28, 31, ...).
I added logic to detect these cases and disable the optimization if it
would lead to unsafe results (and noted that these cases can be detected
without needing to brute-force the entire range).
Also switched the calculation for the shift to:
k = 31+ceil(log2(d));
The table-driven divider also now has a few extra checks for this.
It is possible I could add runtime calls for a 64*64->128 multiply with
a 128-bit right shift, which could potentially be used as a fallback case.
TBD.
Ironically, I did note recently when attempting to run a profiler on my
emulator (having to resort to WSL + gprof as none of the Windows
profilers seem to want to work right now), that apparently:
MOV (IR / 2R), ADD/SUB, SHAD/SHLD, ..., are actually some of the most
common instructions being executed, but are partly hidden away in my
other stats given that they are also typically executed in parallel with
other instructions, and thus partly avoid being counted in the
per-instruction clock-cycle budget.
So, say:
ADD 4, R5 | MOV.L (SP, 24), R4
Will only count the 'MOV.L' for its clock-cycles, and ignore the 'ADD'
(since its clock-cycles are effectively hidden behind the 'MOV.L').
For compiler stats:
~ 40% of F8 block instructions are in Lane 2/3;
~ 20% of F2 block instructions are in Lane 2/3;
~ 13% of F0 block instructions are in Lane 2/3;
None of the F1 block is in these lanes, partly as there is nothing in
the F1 block that is allowed in these lanes.
Annoyingly, some amount of the recent debugging did adversely effect the
Dhrystone score, but alas...