On Friday, December 30, 2022 at 10:44:52 PM UTC+2, Terje Mathisen wrote:
> Michael S wrote:
> > On Thursday, December 29, 2022 at 11:13:20 PM UTC+2, Terje Mathisen wrote:
> >> Thomas Koenig wrote:
> >>> Michael S <
already...@yahoo.com> schrieb:
> >>>> But the real reason why I don't believe that my code can be integrates into
> >>>> glibc or even into libgcc is different: I don't support rounding modes.
> >>>> That is, I always round to nearest with breaks rounded to even.
> >>>> I suppose that it's what is wanted by nearly all users, but glibc has different idea.
> >>>> They think that binary128 rounding mode should be the same as current
> >>>> rounding mode for binary64/binary32. They say, it's required by IEEE-754.
> >>>> They could even be correct about it, but it just shows that IEEE-754 is not perfect.
> >>>
> >>> I understand them being sticklers for accuracy (Terje? :-)
> >> Absolutely so!
> >>
> >
> > I am pretty sure that in typical use cases one doesn't want binary128
> > rounding mode to be prescribed by the same control world as binary32/binary64.
> > Generally, binary64 non-default rounding modes are for experimentation.
> > And binary128 is for comparison of results of experimentation with "master"
> > values. And for "master" values you want "best" precision which is achieved in
> > default rounding mode.
> > mprf does not tie its rounding modes to binary32/binary64 and mprf is correct.
> >
> >> Supporting all required rounding modes turns out to be easy however: If
> >> you already support the default round_to_nearest_or_even (RNE), then you
> >> already have the 4 required decision bits:
> >>
> >> Sign, Ulp, Guard & Sticky (well, Sign isn't actually needed for RNE,
> >> but it is very easy to grab. :-) )
> >>
> >> Using those 4 bits as index into a bitmap (i.e. a 16-bit constant) you
> >> get out the increment needed to round the intermediate result.
> >>
> >> Supporting multiple rounding modes just means grabbing the correct
> >> 16-bit value, or you can use the sign bit to select between two 64-bit
> >> constants and then use rounding_mode*8+ulp*4+guard*2+sticky as a shift
> >> count to end up with the desired rounding bit.
> >>
> >> Terje
> >>
> >
> > I am repeating myself for the 3rd or 4th time - the main cost is *not*
> > implementation of non-default rounding modes. For that task branch
> > prediction will do a near perfect job.
> > The main cost is reading of FP control word that contains the relevant bits
> > and interpreting this bits in GPR domain. It is especially expensive if done
> > in portable way, i.e. via fegetround().
> > It's not a lot of cycles in absolute sense, but in such primitives like fadd
> > and fmul every cycle counts. We want to do each of them in less than
> > 25 cycles on average. On Apple silicon, hopefully, in less than 15 cycles.
> Ouch!
>
> Sorry, I had absolutely no intention to suggest you should try to adapt
> to whatever the current HW rounding mode is! IMHO the only sane way to
> do it is to assume the user will set the rounding mode with the standard
> function to do so, and at that point you save away a copy, and load the
> corresponding 16-bit rounding lookup value.
>
> If a library user then goes "behind your back" using direct asm
> instructions to set the rounding mode to something else, then just
> disregard that.
>
> OK?
>
> Terje
> --
> - <Terje.Mathisen at
tmsw.no>
> "almost all programming can be viewed as an exercise in caching"
And now it's more than my hand-waving.
Now I have measurements that show the expense of rounding modes
implemented via querying fegetround().
Computers:
EPY3-L - AMD EPYC-7543P, 3500 MHz, Debian Linux under WSL on WS-2019
EPY3-W - AMD EPYC-7543P, 3600 MHz, Windows Server 2019
SKLC-W - Intel Xeon E-2176G, 4250 MHz, Windows Server 2016
HSWL-W - Intel Xeon E3-1271 v3, 4000 MHz, Windows Server 2008 R2
IVYB-W - Intel Core i7-3770, ~3900 Mhz, Windows 7 Professional
Compilers:
Linux: gcc 10.2.1
Windows: gcc 12.2
Benchmark: Matrix multiplication, written in C using scalar add/mul primitives.
See ./tests/matmulq
Windows tests run with default options
Linux tests run with optin cn=17
Description of primitives:
ag - __addtf3 from libgcc
mg - __multf3 from libgcc
af - __addtf3 from ./addsubq.c.
Supports Invalid Operand exception.
Does not support Inexact exception.
Does not support non-default rounding modes.
mf - __multf3 from ./mulq.c.
Supports Invalid Operand exception.
Does not support Inexact exception.
Does not support non-default rounding modes.
ar - __addtf3 from ./addsubq_rm.c.
Supports Invalid Operand exception.
Does not support Inexact exception.
Support non-default rounding modes by mean of reading current rounding mode
with standard 'C' library function fegetround() once per __addtf3() call.
mr - __multf3 from ./mulq_rm.c.
Supports Invalid Operand exception.
Does not support Inexact exception.
Support non-default rounding modes by mean of reading current rounding mode
with standard 'C' library function fegetround() once per __addtf3() call.
Bencmark results (MFLOP/s):
Computer: EPY3-L EPY3-W SKLC-W HSWL-W IVYB-W
ag+mg 35.8 34.6 54.0 51.1 46.3
af+mf 138.5 108.3 113.8 100.3 77.2
ar+mf 80.9 91.3 92.8 79.7 67.3
af+mr 80.0 81.9 90.6 80.1 64.8
ar+mr 55.9 67.3 77.0 66.3 57.4