On 12/4/2022 1:11 PM, John Levine wrote:
> According to BGB <cr8...@gmail.com>:
>> In a general sense, I suspect floating-point was likely inevitable.
>>
>> As for what parts could have varied:
>> Location of the sign and/or exponent;
>> Sign/Magnitude and/or Twos Complement;
>> Handling of values near zero;
>> Number of exponent bits for larger types;
>
> I think I've seen variations in all of those.
>
>> For example, one could have used Twos Complement allowing a signed
>> integer compare to be used unmodified for floating-point compare.
>
> Yup, the PDP-6/10 did that.
>
Yep.
Have noted that things can get "fun" for microfloat formats, since (for
8-bit formats) one is operating on the bare edge of what is "sufficient".
So, things like whether or not there is a sign bit, the exact sizes of
the exponent and mantissa, the bias values, etc., will tend to vary by a
significant amount (even within the same domain; note, for example,
that both Mu-Law and A-Law exist).
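As a concrete illustration (a purely hypothetical layout, not any
particular standard), a decoder for one possible 1.4.3 split with a bias
of 7 might look like:

  #include <math.h>

  /* Decode a hypothetical 8-bit minifloat: 1 sign bit, 4 exponent bits
     (bias 7), 3 mantissa bits, with subnormals. The split, the bias, and
     the lack of Inf/NaN handling here are all illustrative choices; as
     noted, real microfloat formats vary on every one of these points. */
  float minifloat8_decode(unsigned char v)
  {
      int   sign = (v >> 7) & 1;
      int   exp  = (v >> 3) & 15;
      int   man  =  v       & 7;
      float f;

      if (exp == 0)
          f = ldexpf((float)man / 8.0f, -6);             /* subnormal range */
      else
          f = ldexpf(1.0f + (float)man / 8.0f, exp - 7); /* normal range    */
      return sign ? -f : f;
  }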
>> For many purposes, subnormal numbers do not matter.
>
> I believe that the argument is that if you don't know a lot about
> numerical analysis, your intuitions about when they don't matter
> are likely to be wrong. Denormals make it more likely that
> naive code will get an accurate answer.
>
Possible, though apart from edge cases (such as dividing by very small
numbers, or multiplying by very large numbers), any effect denormals
would have had on the result is likely insignificant.
In the case where one divides by a "would have been" denormal, turning
the result into NaN or Inf (as if it were 0) almost makes more sense.
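To make that edge case concrete, here is a small (illustrative) Binary32
demo: with subnormals honored, the reciprocal of a tiny value stays
finite, while a DAZ-style flush of the same input turns the division
into Inf:

  #include <stdio.h>
  #include <math.h>

  int main(void)
  {
      float x = 1e-38f;   /* subnormal in Binary32 (min normal ~1.18e-38) */

      /* Subnormals honored: the reciprocal is still finite (~1e+38). */
      printf("1/x           = %g\n", 1.0f / x);

      /* Crude model of DAZ: treat subnormal inputs as zero before dividing. */
      float xz = (fpclassify(x) == FP_SUBNORMAL) ? 0.0f : x;
      printf("1/x (DAZ-ish) = %g\n", 1.0f / xz);   /* inf */
      return 0;
  }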
For Binary32 or Binary64, the dynamic range is large enough that they
are unlikely to matter.
For Binary16, it does mean that the smallest possible normal-range value
is 0.00006103515625, which "could matter", but people presumably aren't
going to be using this for things where "precision actually matters".
For some of my neural-net training experiments, things did tend to start
running into the dynamic-range limits of Binary16, which I ended up
hacking around.
Ironically, adding a case to detect intermediate values going outside of
a reasonable range and then decaying any weights which fed into this
value, seemed to cause it to converge a little more quickly.
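Roughly along these lines (a from-memory sketch; the names, threshold,
and decay factor are placeholders rather than the actual values used):

  /* Sketch: if a neuron's accumulated value drifts outside a "reasonable"
     range, decay the weights that fed into it. Threshold and factor are
     arbitrary illustrative values. */
  #define ACT_LIMIT  64.0f
  #define DECAY      0.95f

  static void decay_if_out_of_range(float acc, float *weights, int n_in)
  {
      if (acc > ACT_LIMIT || acc < -ACT_LIMIT) {
          for (int i = 0; i < n_in; i++)
              weights[i] *= DECAY;
      }
  }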
>> The relative value of giving special semantics (in hardware) to the
>> Inf/NaN cases could be debated. Though, arguably, they do have meaning
>> as "something has gone wrong in the math" signals, which would not
>> necessarily have been preserved with clamping.
>
> Same point, they tell the naive programmer that the code didn't
> work. As some wag pointed out a long time ago, if you don't care
> if the results are right, I can make the program as fast as you want.
>
Granted.
Keeping Inf and NaN semantics as diagnostic signals does at least make
sense. In my case, have mostly kept Inf and NaN as they seem able to
justify their existence.
There are also limits, though, to how much can be gained from cheap
approximations.
For example, working directly on the integer bit patterns, one could try
to define a Binary32 reciprocal as:
  recip=0x7F000000-divisor;
Or, FDIV as:
  quot=0x3F800000+(dividend-divisor);
It does not take long to realize that, for most uses, this is not sufficient.
For things like pixel color calculations, this sorta thing may be
"mostly passable" though (or can be "made sufficient" with 1 to 3
Newton-Raphson stages).
For things like approximate bounding-sphere checks, it may be sufficient
to define square root (again on the bit pattern) as:
  sqrta=0x1FC00000+(val>>1);
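Spelled out in C with the usual bit-cast plumbing (the helper names
f2u/u2f, recip_approx, and sqrt_approx are mine; the refinement loop is
the 1-to-3 Newton-Raphson stages mentioned above):

  #include <stdint.h>
  #include <string.h>

  static uint32_t f2u(float f)    { uint32_t u; memcpy(&u, &f, 4); return u; }
  static float    u2f(uint32_t u) { float f;    memcpy(&f, &u, 4); return f; }

  /* Crude reciprocal via an integer subtract on the bit pattern,
     optionally refined with Newton-Raphson steps: r' = r*(2 - x*r). */
  static float recip_approx(float x, int nr_steps)
  {
      float r = u2f(0x7F000000u - f2u(x));
      while (nr_steps-- > 0)
          r = r * (2.0f - x * r);
      return r;
  }

  /* Crude square root via (roughly) halving the exponent, as above. */
  static float sqrt_approx(float x)
  {
      return u2f(0x1FC00000u + (f2u(x) >> 1));
  }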
As can be noted, some of my neural-net stuff was also essentially using
these sorts of definitions for the operators.
For an integer equivalent, had also noted that there were ways to
approximate distances, say:
  dx=ax-bx; dy=ay-by;
  dist=sqrt(dx*dx+dy*dy);
Being approximated as, say:
  dx=abs(ax-bx); dy=abs(ay-by);
  if(dx>=dy)
    dist=dx+(dy>>1);
  else
    dist=dy+(dx>>1);
Had encountered this nifty trick in the ROTT engine, but had ended up
borrowing it for my BGBTech3 engine experiment.
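Wrapped up as a compilable function (the types and name are mine), the
trick is just:

  #include <stdlib.h>

  /* "Larger delta plus half the smaller" integer distance approximation. */
  static int approx_dist(int ax, int ay, int bx, int by)
  {
      int dx = abs(ax - bx), dy = abs(ay - by);
      return (dx >= dy) ? (dx + (dy >> 1)) : (dy + (dx >> 1));
  }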
Another recent experiment was getting the low-precision FPU in my case
up to (roughly) full Binary32 precision; however, it still needs a little
more testing.
Lots of corner cutting in this case, as the low-precision FPU was built
more to be "cheap" than to be good (even vs. the main FPU, which was
already DAZ+FTZ).
Say, main FPU:
  FADD: 64-bit mantissa, 12-bit exponent, ~9 sub-ULP bits.
  FMUL: 54-bit mantissa, 12-bit exponent;
    6x DSP48 "triangle" multiplier,
    plus a few LUT-based mini-multipliers for the low-order bits.
    Full precision would need 9 DSP48s (with some mini-multipliers),
    or 16 DSP48s (no mini-multipliers).
  Both operators have a 6-cycle latency;
  Main FPU supports traditional rounding modes.
Low Precision FPU (original):
  FADD: 16-bit mantissa, 9-bit exponent (truncated by 7 bits).
  FMUL: 16-bit mantissa, 9-bit exponent (truncated by 7 bits);
    1x DSP48 multiplier.
  These operators had a 3-cycle latency.
The 16-bit mantissa was used as it maps exactly onto the DSP48 in this
case; but, for Binary32, this means truncating the mantissa.
In this version, FADD/FSUB used One's Complement math for the mantissa,
as for truncated Binary32 this tended to be "closer to correct" than
using Two's Complement.
Low Precision FPU (with FP32 extension):
  FADD: 28-bit mantissa, 9-bit exponent, ~2 sub-ULP bits (1).
  FMUL: 26-bit mantissa, 9-bit exponent;
    1x DSP48 multiplier, plus two 9-bit LUT-based multipliers (2).
  These operators still have a 3-cycle latency;
  Effectively hard-wired as Truncate rounding.
The added precision makes it usable in a few cases where the original
low-precision FPU was not sufficient. It can now entirely take over for
Binary32 SIMD ops.
It is also possible to route some Binary64 ops through it as well,
albeit with the same effective dynamic range and precision as Binary32
(but with a 3-cycle rather than 6-cycle latency).
1: s.eeeeeeee.fffffffffffffffffffffff,
Maps mantissa as:
001fffffffffffffffffffffff00
with the high bits allowing for both the sign and carry into a larger
exponent, and the sub-ULP bits allowing for carry into the ULP.
In this version, FADD/FSUB uses Two's Complement for the mantissa.
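In C-ish terms (my own sketch, ignoring zero, Inf, NaN, and the sign
handling), the unpack into that 28-bit form is roughly:

  #include <stdint.h>

  /* Map a Binary32 bit pattern into the 28-bit internal mantissa form
     001fff...fff00: restore the hidden 1, then shift up to leave 2
     sub-ULP bits at the bottom and 3 spare bits (sign/carry) at the top. */
  static uint32_t unpack_mant28(uint32_t f32)
  {
      uint32_t frac = f32 & 0x007FFFFF;    /* 23 stored fraction bits */
      return (0x00800000u | frac) << 2;
  }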
2: The DSP48 does the high-order bits, with the two 9-bit multipliers
producing the rest. The results from the two 9-bit multipliers can be
added to the low-order bits from the DSP48.
Say, the mantissa maps as:
01fffffffffffffffffffffff0
with the top 18 bits of each input fed into the DSP48, producing a
36-bit result.
The 9-bit multipliers deal with multiplying the high bits from one input
against the low bits from the other (the low bits that were out of reach
of the DSP48), with these results being added together and then added
into the appropriate place within the 36-bit initial result.
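As a rough software model of that composition (my own reading; the 18/8
split is an assumption, and the cross terms are computed exactly here
rather than via the narrower LUT multipliers):

  #include <stdint.h>

  /* Model: split each 26-bit mantissa into an 18-bit high part and an
     8-bit low part. hi*hi is what the DSP48 provides, the two cross
     terms stand in for the small LUT multipliers, and lo*lo is never
     computed at all (its contribution is below ~2^-32 of the product). */
  static uint64_t mul26_truncated(uint32_t a, uint32_t b)
  {
      uint32_t ah = a >> 8, al = a & 0xFF;
      uint32_t bh = b >> 8, bl = b & 0xFF;

      uint64_t hi_hi = (uint64_t)ah * bh;                  /* 36-bit DSP result */
      uint64_t cross = (uint64_t)ah * bl + (uint64_t)al * bh;
      return (hi_hi << 16) + (cross << 8);                 /* al*bl dropped */
  }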
In this case, the 9-bit multipliers were built from 3x3->6 multipliers, e.g.:
     0  1  2  3  4  5  6  7
  0 00 00 00 00 00 00 00 00
  1 00 01 02 03 04 05 06 07
  2 00 02 04 06 10 12 14 16
  3 00 03 06 11 14 17 22 25
  4 00 04 10 14 20 24 30 34
  5 00 05 12 17 24 31 36 43
  6 00 06 14 22 30 36 44 52
  7 00 07 16 25 34 43 52 61
Where values here are in Base-8.
In this case, 3x3 multipliers were used because these can fit more easily
into a LUT6 (so are likely more efficient than working on either 2-bit
or 4-bit values in this case).
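(For reference, the table above can be regenerated trivially; products
of 0..7 printed in octal:)

  #include <stdio.h>

  /* Print the 3x3->6 product table, with entries in octal. */
  int main(void)
  {
      for (int i = 0; i < 8; i++) {
          for (int j = 0; j < 8; j++)
              printf("%02o ", (unsigned)(i * j));
          printf("\n");
      }
      return 0;
  }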
The multipliers needed to be built manually, as otherwise Vivado would
see the 9-bit multiplies and try to use DSP48s for these as well, and in
this case I didn't want to burn 8 additional DSPs on the SIMD unit.
However, as noted, since the "low*middle" and "low*low" results are
effectively not calculated, whatever they might have contributed to the
probability of a "correctly rounded ULP" is effectively lost.
But this becomes less a question of numerical precision per se, and more
one of the statistical probability of the results being different, or of
that probability having a visible or meaningful effect on the result.
Apart from a few edge cases, the potential contribution from these
low-order results to the final result effectively approaches zero.
At least, in the absence of hardware support for a Single-Rounded FMAC
operator.
...