
Improved routines for gcc/gfortran quadmath arithmetic


Michael S

Dec 28, 2022, 2:45:19 PM
This thread is for the attention of Thomas Koenig.

A few months ago I spent significant time coding replacement routines for
__addtf3 and __multf3.
Back then I thought they were part of the quadmath library.
The quadmath library appears to be a relatively low-profile project with
little bureaucracy involved, so I thought that integration of my replacements
would be easy.
Later on (in June) I learned that these routines are in fact part of glibc.
That reduced my enthusiasm. My impression is that the glibc project is more
bureaucratic (a good thing, generally, but not for me, a hobbyist who wants his
work integrated) and that glibc developers have strong territorial instincts.
I could be wrong on both points, but that was my impression.
At the same time, I had more "real work" to do, which left little time. Often I came
home tired. And when I don't do this sort of hobby work continuously,
night after night, I tend to lose focus and interest. In short, I almost forgot
about the whole thing. Luckily, by then both routines were already in MUCH better
shape than the originals, in terms of speed.
Today and tomorrow I happen to have time, so I got back to it and put the routines
and tests in my public GitHub repository:
https://github.com/already5chosen/extfloat/tree/master/binary128

Thomas, please look at this code and tell me if there is a chance to integrate it.
Preferably in glibc, but I am not optimistic about that.
If not in glibc, then maybe in more specialized places, like the matrix primitives of
gfortran.
BTW, for later I have potentially more interesting routines: multiplication of an
array by a scalar and addition of two vectors. Due to the reduction in parsing and
call overhead, they are tens of percent faster than __multf3/__addtf3 called in loops.
Hopefully, I will add them to the repository later tonight or tomorrow.

Anton Ertl

Dec 29, 2022, 4:08:42 AM
Michael S <already...@yahoo.com> writes:
>This thread is for the attention of Thomas Koenig.
>
>A few months ago I spent significant time coding replacement routines for
>__addtf3 and __multf3.
>Back then I thought they were part of the quadmath library.
>The quadmath library appears to be a relatively low-profile project with
>little bureaucracy involved, so I thought that integration of my replacements
>would be easy.
>Later on (in June) I learned that these routines are in fact part of glibc.

The names look like they belong to libgcc (part of gcc), not glibc,
and indeed, on a Debian 11 system

objdump -d /lib/x86_64-linux-gnu/libgcc_s.so.1|grep addtf3

shows code for __addtf3, while

objdump -d /lib/x86_64-linux-gnu/libc-2.31.so| grep addtf3

comes up blank (apparently not even a call to __addtf3 there).

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Michael S

Dec 29, 2022, 5:22:43 AM
You are right, I was confused by the similar library names.
It does not change my point - it's still a much higher-profile project than libquadmath.

Thomas Koenig

Dec 29, 2022, 5:36:48 AM
Michael S <already...@yahoo.com> schrieb:
> This thread is for the attention of Thomas Koenig.
>
> A few months ago I spent significant time coding replacement routines for
> __addtf3 and __multf3.
> Back then I thought they were part of the quadmath library.
> The quadmath library appears to be a relatively low-profile project with
> little bureaucracy involved, so I thought that integration of my replacements
> would be easy.
> Later on (in June) I learned that these routines are in fact part of glibc.
> That reduced my enthusiasm. My impression is that the glibc project is more
> bureaucratic (a good thing, generally, but not for me, a hobbyist who wants his
> work integrated) and that glibc developers have strong territorial instincts.

I cannot really speak to that, I hardly know the glibc people. (I
probably ran across them at a GNU Cauldron, but I don't remember talking
to them).

The current status seems to be that these functions are part of libgcc,
and that they are copied over from glibc.

> I could be wrong on both points, but that was my impression.
> At the same time, I had more "real work" to do, which left little time.

I know that only too well (I currently have my own project, outside of
work, which keeps me quite busy).

> Often I came
> home tired. And when I don't do this sort of hobby work continuously,
> night after night, I tend to lose focus and interest. In short, I almost forgot
> about the whole thing. Luckily, by then both routines were already in MUCH better
> shape than the originals, in terms of speed.
> Today and tomorrow I happen to have time, so I got back to it and put the routines
> and tests in my public GitHub repository:
> https://github.com/already5chosen/extfloat/tree/master/binary128
>
> Thomas, please look at this code and tell me if there is a chance to integrate it.

I will test this for a bit. A next step would be a discussion on
the gcc mailing list. A possible scenario would be to create a
branch, and to merge that into trunk once gcc14 development starts.

The code may require some adjustment for other platforms, or for
special conditions in libgcc. For example, I am not sure whether using
x86 intrinsics in libgcc would cause problems, whether using uint64_t
works, or whether using __float128 is correct or should be replaced by
TFMode.

Anyway, I'll drop you an e-mail in the near future.

> Preferably in glibc, but I am not optimistic about that.
> If not in glibc, then maybe in more specialized places, like the matrix primitives of
> gfortran.
> BTW, for later I have potentially more interesting routines: multiplication of an
> array by a scalar and addition of two vectors. Due to the reduction in parsing and
> call overhead, they are tens of percent faster than __multf3/__addtf3 called in loops.
> Hopefully, I will add them to the repository later tonight or tomorrow.

That also sounds interesting :-)

Michael S

Dec 29, 2022, 6:56:28 AM
All uses of x86 intrinsic functions are guarded with #ifdef __amd64.
They are there to work around misbehavior of the gcc compiler, most commonly
its hyperactive SLP vectorizer.
The code is 100% correct without these parts, but non-trivially slower when
compiled with gcc 12.
Making them unneeded in gcc 13 or 14 sounds like a worthy goal for gcc,
but I am not optimistic. Until now the trend has been the opposite - the SLP
vectorizer only gets more aggressive.

> or if using uint64_t works,

My code certainly requires both uint64_t and __int128.
But that appears to be almost o.k., since we don't need to support all gcc platforms.
All we are interested in are platforms that support binary128 FP.
By now those are:
x86-64,
MIPS64,
aarch64,
rv64gc,
rv32gc,
s390x,
power64le,
power64
The first four are o.k.
The last two (or three?) have hardware support, so they don't need a software implementation.
That leaves one odd case: rv32gc. I don't know what to do about it.
IMHO, the wisest course is to drop support for binary128 on this platform.
I can't imagine that it is actually used by anybody.

> or if using __float128 is correct or it should be replaced by
> TFMode.

I don't know what TFMode means. Ideally, it should be _Float128.
I used __float128 because I wanted to compile with clang, which does not
understand _Float128. That's not a good reason, so it probably has to change.


But the real reason why I don't believe that my code can be integrated into
glibc or even into libgcc is different: I don't support rounding modes.
That is, I always round to nearest, with ties rounded to even.
I suppose that is what nearly all users want, but glibc has a different idea.
They think that the binary128 rounding mode should be the same as the current
rounding mode for binary64/binary32. They say it's required by IEEE-754.
They may even be correct about that, but it just shows that IEEE-754 is not perfect.

Support for other rounding modes can be added to my routines, but doing it in a
platform-independent manner will slow things down significantly, and doing it
in a platform-dependent manner will reduce portability (IMHO, *not* for a good reason),
and it would still be somewhat slower than the current code.

That's why I am more hopeful about integration into gfortran and especially into
gfortran's matrix/vector libraries.
My impression is that these places don't care about non-default quadmath rounding modes.

>
> Anyway, I'll drop you an e-mail in the near future.

Normally, I don't check this e-mail address regularly.
I can, of course, but why not keep the conversation public? It could have other benefits.

Thomas Koenig

Dec 29, 2022, 1:07:45 PM
Michael S <already...@yahoo.com> schrieb:
OK, I plugged your code into libgcc/soft-fp/addtf3.c and
libgcc/soft-fp/multf3.c, and, on the first try, hit an error:

In file included from /home/tkoenig/trunk-bin/gcc/include/immintrin.h:39,
from /home/tkoenig/trunk-bin/gcc/include/x86intrin.h:32,
from ../../../trunk/libgcc/soft-fp/addtf3.c:4:
/home/tkoenig/trunk-bin/gcc/include/smmintrin.h: In function 'f128_to_u128':
/home/tkoenig/trunk-bin/gcc/include/smmintrin.h:455:1: error: inlining failed in call to 'always_inline' '_mm_extract_epi64': target specific option mismatch
455 | _mm_extract_epi64 (__m128i __X, const int __N)
| ^~~~~~~~~~~~~~~~~
../../../trunk/libgcc/soft-fp/addtf3.c:70:17: note: called from here
70 | uint64_t hi = _mm_extract_epi64(v, 1);
| ^~~~~~~~~~~~~~~~~~~~~~~
/home/tkoenig/trunk-bin/gcc/include/smmintrin.h:455:1: error: inlining failed in call to 'always_inline' '_mm_extract_epi64': target specific option mismatch
455 | _mm_extract_epi64 (__m128i __X, const int __N)
| ^~~~~~~~~~~~~~~~~
../../../trunk/libgcc/soft-fp/addtf3.c:69:17: note: called from here
69 | uint64_t lo = _mm_extract_epi64(v, 0);
| ^~~~~~~~~~~~~~~~~~~~~~~
make[3]: *** [../../../trunk/libgcc/shared-object.mk:14: addtf3.o] Error 1

(I'm not quite sure what "target specific option mismatch" actually
means in this context). This kind of thing is not unexpected,
because the build environment inside gcc is different from normal
userland. So, some debugging would be needed to bring this into libgcc.
Looking at it through gfortran's glasses, we do not support
REAL(KIND=16) on 32-bit systems, so that would be OK.

>
>> or if using __float128 is correct or it should be replaced by
>> TFMode.
>
> I don't know what TFMode means.

TFMode is the name for 128-bit reals inside gcc (see
https://gcc.gnu.org/onlinedocs/gccint/Machine-Modes.html ). If you
look at libgcc, you will find it is actually implemented as a struct
of two longs. That does not mean it absolutely has to be that way.


>Ideally, it should be _Float128.
> I used __float128 because I wanted to compile with clang, which does not
> understand _Float128. That's not a good reason, so it probably has to change.

> But the real reason why I don't believe that my code can be integrated into
> glibc or even into libgcc is different: I don't support rounding modes.
> That is, I always round to nearest, with ties rounded to even.
> I suppose that is what nearly all users want, but glibc has a different idea.
> They think that the binary128 rounding mode should be the same as the current
> rounding mode for binary64/binary32. They say it's required by IEEE-754.
> They may even be correct about that, but it just shows that IEEE-754 is not perfect.

I understand them being sticklers for accuracy (Terje? :-)

However, this does not mean that your code cannot be included. It would
be possible to have the traditional function as a fallback in the (rare)
case where people want something else - a conditional jump which would
be predicted 99.99.... % accurately in a tight loop.

> Support for other rounding modes can be added to my routines, but doing it in a
> platform-independent manner will slow things down significantly, and doing it
> in a platform-dependent manner will reduce portability (IMHO, *not* for a good reason),
> and it would still be somewhat slower than the current code.
>
> That's why I am more hopeful about integration into gfortran and especially into
> gfortran's matrix/vector libraries.
> My impression is that these places don't care about non-default quadmath rounding modes.

Yes. Fortran, specifically, restores IEEE rounding modes to default
on procedure entry.

I would still prefer to get your code into libgcc, though, because it
would then be useful for every place where somebody uses 128-bit
in a program.

Have you ever compiled gcc? I often use the gcc compile farm at
https://cfarm.tetaneutral.net/ ; it has some rather beefy machines,
which cut down on bootstrap time (but it will still take a couple
of hours). You can request an account at the compile farm.

>>
>> Anyway, I'll drop you an e-mail in the near future.
>
> Normally, I don't check this e-mail address regularly.
> I can, of course, but why not keep conversation public? It could have other benefits.

We will probably have to move over to g...@gcc.gnu.org eventually.

[...]

Michael S

Dec 29, 2022, 3:02:25 PM
That's because I never tried to compile for an AMD64 target that does not have at least SSE4 :(
I always compiled with -march=native on machines no older than 10-11 years.
I'll try to fix it.
In the meantime, you could add -march=nehalem or -march=x86-64-v2 or -march=bdver1
or -msse4 to your set of compilation flags.
I don't think it is that easy.
The problem is not the branch, but reading the rounding mode from wherever
gcc stores it.

> > Support for other rounding modes can be added to my routines, but doing it in a
> > platform-independent manner will slow things down significantly, and doing it
> > in a platform-dependent manner will reduce portability (IMHO, *not* for a good reason),
> > and it would still be somewhat slower than the current code.
> >
> > That's why I am more hopeful about integration into gfortran and especially into
> > gfortran's matrix/vector libraries.
> > My impression is that these places don't care about non-default quadmath rounding modes.
> Yes. Fortran, specifically, restores IEEE rounding modes to default
> on procedure entry.
>
> I would still prefer to get your code into libgcc, though, because it
> would then be useful for every place where somebody uses 128-bit
> in a program.
>
> Have you ever compiled gcc?

Yes, but only for embedded targets, i.e. the last part of the target triple was always -none.
And never with glibc. In fact, I think I never compiled it myself with anything
other than newlib.
But why are you asking?

Terje Mathisen

Dec 29, 2022, 4:13:20 PM
Thomas Koenig wrote:
> Michael S <already...@yahoo.com> schrieb:
>> But the real reason why I don't believe that my code can be integrated into
>> glibc or even into libgcc is different: I don't support rounding modes.
>> That is, I always round to nearest, with ties rounded to even.
>> I suppose that is what nearly all users want, but glibc has a different idea.
>> They think that the binary128 rounding mode should be the same as the current
>> rounding mode for binary64/binary32. They say it's required by IEEE-754.
>> They may even be correct about that, but it just shows that IEEE-754 is not perfect.
>
> I understand them being sticklers for accuracy (Terje? :-)

Absolutely so!

Supporting all required rounding modes turns out to be easy, however: if
you already support the default round_to_nearest_or_even (RNE), then you
already have the 4 required decision bits:

Sign, Ulp, Guard & Sticky (well, Sign isn't actually needed for RNE,
but it is very easy to grab. :-) )

Using those 4 bits as an index into a bitmap (i.e. a 16-bit constant) you
get out the increment needed to round the intermediate result.

Supporting multiple rounding modes just means grabbing the correct
16-bit value, or you can use the sign bit to select between two 64-bit
constants and then use rounding_mode*8+ulp*4+guard*2+sticky as a shift
count to end up with the desired rounding bit.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Michael S

Dec 29, 2022, 4:32:50 PM
Fixed.

Michael S

Dec 29, 2022, 4:56:12 PM
On Thursday, December 29, 2022 at 11:13:20 PM UTC+2, Terje Mathisen wrote:
> Thomas Koenig wrote:
> > Michael S <already...@yahoo.com> schrieb:
> >> But the real reason why I don't believe that my code can be integrated into
> >> glibc or even into libgcc is different: I don't support rounding modes.
> >> That is, I always round to nearest, with ties rounded to even.
> >> I suppose that is what nearly all users want, but glibc has a different idea.
> >> They think that the binary128 rounding mode should be the same as the current
> >> rounding mode for binary64/binary32. They say it's required by IEEE-754.
> >> They may even be correct about that, but it just shows that IEEE-754 is not perfect.
> >
> > I understand them being sticklers for accuracy (Terje? :-)
> Absolutely so!
>

I am pretty sure that in typical use cases one doesn't want the binary128
rounding mode to be prescribed by the same control word as binary32/binary64.
Generally, binary64 non-default rounding modes are for experimentation.
And binary128 is for comparing the results of experimentation against "master"
values. And for "master" values you want the "best" precision, which is achieved in
the default rounding mode.
MPFR does not tie its rounding mode to binary32/binary64, and MPFR is correct.

> Supporting all required rounding modes turns out to be easy however: If
> you already support the default round_to_nearest_or_even (RNE), then you
> already have the 4 required decision bits:
>
> Sign, Ulp, Guard & Sticky (well, Sign isn't actually needed for RNE,
> but it is very easy to grab. :-) )
>
> Using those 4 bits as index into a bitmap (i.e. a 16-bit constant) you
> get out the increment needed to round the intermediate result.
>
> Supporting multiple rounding modes just means grabbing the correct
> 16-bit value, or you can use the sign bit to select between two 64-bit
> constants and then use rounding_mode*8+ulp*4+guard*2+sticky as a shift
> count to end up with the desired rounding bit.
>
> Terje
>

I am repeating myself for the 3rd or 4th time - the main cost is *not* the
implementation of non-default rounding modes. For that task, branch
prediction will do a near-perfect job.
The main cost is reading the FP control word that contains the relevant bits
and interpreting these bits in the GPR domain. It is especially expensive if done
in a portable way, i.e. via fegetround().
It's not a lot of cycles in an absolute sense, but in primitives like fadd
and fmul every cycle counts. We want to do each of them in less than
25 cycles on average. On Apple silicon, hopefully, in less than 15 cycles.

Thomas Koenig

Dec 30, 2022, 4:32:20 AM
Michael S <already...@yahoo.com> schrieb:
> On Thursday, December 29, 2022 at 11:13:20 PM UTC+2, Terje Mathisen wrote:
>> Thomas Koenig wrote:
>> > Michael S <already...@yahoo.com> schrieb:
>> >> But the real reason why I don't believe that my code can be integrated into
>> >> glibc or even into libgcc is different: I don't support rounding modes.
>> >> That is, I always round to nearest, with ties rounded to even.
>> >> I suppose that is what nearly all users want, but glibc has a different idea.
>> >> They think that the binary128 rounding mode should be the same as the current
>> >> rounding mode for binary64/binary32. They say it's required by IEEE-754.
>> >> They may even be correct about that, but it just shows that IEEE-754 is not perfect.
>> >
>> > I understand them being sticklers for accuracy (Terje? :-)
>> Absolutely so!
>>
>
> I am pretty sure that in typical use cases one doesn't want the binary128
> rounding mode to be prescribed by the same control word as binary32/binary64.
> Generally, binary64 non-default rounding modes are for experimentation.
> And binary128 is for comparing the results of experimentation against "master"
> values. And for "master" values you want the "best" precision, which is achieved in
> the default rounding mode.
> MPFR does not tie its rounding mode to binary32/binary64, and MPFR is correct.

I think you have to understand where people are coming from.
This is originally a soft-float routine, meant for implementing in
software functions that the CPU lacks in hardware. With that
in mind, it is clear why people would want it to look at the hardware
settings for its behavior.

We are slightly abusing this at the moment, but I think we can do
much better.

>
>> Supporting all required rounding modes turns out to be easy however: If
>> you already support the default round_to_nearest_or_even (RNE), then you
>> already have the 4 required decision bits:
>>
>> Sign, Ulp, Guard & Sticky (well, Sign isn't actually needed for RNE,
>> but it is very easy to grab. :-) )
>>
>> Using those 4 bits as index into a bitmap (i.e. a 16-bit constant) you
>> get out the increment needed to round the intermediate result.
>>
>> Supporting multiple rounding modes just means grabbing the correct
>> 16-bit value, or you can use the sign bit to select between two 64-bit
>> constants and then use rounding_mode*8+ulp*4+guard*2+sticky as a shift
>> count to end up with the desired rounding bit.
>>
>> Terje
>>
>
> I am repeating myself for the 3rd or 4th time - the main cost is *not* the
> implementation of non-default rounding modes. For that task, branch
> prediction will do a near-perfect job.
> The main cost is reading the FP control word that contains the relevant bits
> and interpreting these bits in the GPR domain. It is especially expensive if done
> in a portable way, i.e. via fegetround().

OK.

I think we can use Fortran semantics for an advantageous solution, at
least for gfortran.

Each procedure has a processor-dependent rounding mode on entry,
and these are restored on exit. (This actually makes a lot of sense:
somebody might have changed the rounding mode somewhere else,
and the results should not depend on that.) So there is no need
to look at global state.

A procedure which does not call IEEE_SET_ROUNDING_MODE does not change
rounding modes, so it can also use the default. This should cover the
vast majority of programs.

If IEEE_SET_ROUNDING_MODE is called, it sets the processor status
register, but it can also do something else, like set a flag local
to the routine.

So, a strategy could be to implement three functions:

The simplest one of them would be called from the compiler if there
is no call to IEEE_SET_ROUNDING_MODE in sight, and it would do
exactly what you have already implemented.

The second one would take an additional argument and implement
all additional rounding modes. This would be called with the
local flag as the additional argument. (If there turns out to be no
speed disadvantage to using this with a constant argument vs.
the first one, the two could also be rolled into one.)

The third one, under the original name, actually reads the processor
status register and then tail-calls the second one to do the work.

For gfortran, because we implement our library in C, we would also
need to add a flag which tells C (or the middle end) just to use
the default rounding mode, which we can then use as a flag when
building libgfortran.

Does this sound reasonable?

> It's not a lot of cycles in an absolute sense, but in primitives like fadd
> and fmul every cycle counts. We want to do each of them in less than
> 25 cycles on average. On Apple silicon, hopefully, in less than 15 cycles.

Absolutely. Like the old filk song "Every cycle is sacred"

Anton Ertl

Dec 30, 2022, 5:16:41 AM
Michael S <already...@yahoo.com> writes:
>It's not a lot of cycles in an absolute sense, but in primitives like fadd
>and fmul every cycle counts. We want to do each of them in less than
>25 cycles on average. On Apple silicon, hopefully, in less than 15 cycles.

If the rounding mode is set only through routines like fesetround(),
all such routines could be changed to either

1) Change __addtf3 etc. to the slow, but rounding-mode-conforming
versions, and change fesetround() etc. to a version that just sets the
rounding mode. I.e., after the first fesetround() etc., you always
get the slow __addtf3, but fesetround() is relatively fast.

2) If the target rounding mode is different from the current rounding
mode: if the target is nearest-or-even, set the fast __addtf3 etc,
otherwise the slow __addtf3 etc. Don't change fesetround(). This
results in fast nearest-or-even __addtf3 etc., but relatively slow
fesetround() etc.

Note that these functions are already using some indirection thanks to
dynamic linking, so they are already changeable.

Michael S

Dec 30, 2022, 5:30:29 AM
If it were about CPUs that come in two models, one with binary128 hardware and one
without, like the presence/absence of an FP co-processor on CPUs
from the 80s, then I would accept this explanation.
But that's not the case. These routines are used primarily on x86-64 and ARM64,
which never had binary128 and, it seems, are not planning to add it in the
[foreseeable] future.
So if procedure Foo calls IEEE_SET_ROUNDING_MODE and then calls
procedure Bar, Bar is expected to operate with the default (==RNE)
rounding mode?
That sounds great for my purposes, but also sounds like a violation of the
intentions of IEEE-754 and maybe even of the letter of the Standard.

> If IEEE_SET_ROUNDING_MODE is called, it sets the processor status
> register, but it can also do something else, like set a flag local
> to the routine.
>
> So, a strategy could be to implement three functions:
>
> The simplest one of them would be called from the compiler if there
> is no call to IEEE_SET_ROUNDING_MODE in sight, and it would do
> exactly what you have already implemented.
>
> The second one would take an additional argument, and implements
> all additional rounding modes. This would be called with the
> local flag as additional argument. (If there turns out to be no
> speed disadvantage to using this with a constant argument vs.
> the first one, the two could also be rolled into one).
>
> The third one, under the original name, actually reads the processor
> status register and then tail-calls the second one to do the work.
>
> For gfortran, because we implement our library in C, we would also
> need to add a flag which tells C (or the middle end) just to use
> the default rounding mode, which we can then use as a flag when
> building libgfortran.
>
> Does this sound reasonable?

Mostly.
The only problem that I see so far is specific to AMD64 Windows.
The 3rd routine will be slow. Well, not absolutely, but slower than
necessary.
That's because by the Windows ABI _Float128 is just a regular structure
that is returned from functions like any other structure
bigger than 64 bits, i.e. the caller passes a pointer to a location on the stack
and the callee stores the value there. So far so good.
The problem is that gcc's tail-call elimination does not work in
this case. We can, of course, hope that they will fix it in v13 or
v14, but considering that they were not able to do it in 30+ years,
I am not holding my breath.

Michael S

Dec 30, 2022, 5:41:02 AM
Things like that should be discussed with the gcc and libgcc maintainers.
I have neither the ambition nor the desire to become one.

As my personal opinion, I hope that one day statically linked
libraries will make a comeback. Not as the default, but as an
option people can use without jumping through hoops.

Thomas Koenig

Dec 30, 2022, 6:15:46 AM
Well, we are not going to be able to change the observable behavior of that
function (politically), but I don't think it matters that much.
We should just avoid calling this inefficient function when it is
possible to avoid it, and I don't think we need to call it from Fortran
at all. For other programming languages, I am not sure.
Correct. It also makes trying to set the rounding flags in a subroutine
a no-op :-)

> That's sound great for my purposes, but also sounds like violation of
> intentions of IEEE-754 and may be even of the letter of the Standard.

Because IEEE-754 is one of those pesky standards where there is no
publicly available final committee draft, I cannot speak to that,
but I certainly think that the J3 people knew what they were doing
when they specified the Fortran interface to IEEE.

Maybe Terje can comment from the IEEE-754 side?
If it is inefficient in Windows, then one other possibility is to
mark the second function inline, and let the compiler take care
of it, or whatever else turns out to be fastest. What I am thinking
about is mainly to reduce the call overhead to the absolute minimum.

Anton Ertl

Dec 30, 2022, 7:00:42 AM
Michael S <already...@yahoo.com> writes:
>On Friday, December 30, 2022 at 12:16:41 PM UTC+2, Anton Ertl wrote:
>> If the rounding mode is set only through routines like fesetround(),
>> all such routines could be changed to either
>>
>> 1) Change __addtf3 etc. to the slow, but rounding-mode-conforming
>> versions, and change fesetround() etc. to a version that just sets the
>> rounding mode. I.e., after the first fesetround() etc., you always
>> get the slow __addtf3, but fesetround() is relatively fast.
>>
>> 2) If the target rounding mode is different from the current rounding
>> mode: if the target is nearest-or-even, set the fast __addtf3 etc,
>> otherwise the slow __addtf3 etc. Don't change fesetround(). This
>> results in fast nearest-or-even __addtf3 etc., but relatively slow
>> fesetround() etc.
>>
>> Note that these functions are already using some indirection thanks to
>> dynamic linking, so they are already changeable.
>> - anton
>> --
>> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
>> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>
>
>Things like that should be discussed with the gcc and libgcc maintainers.

Worse, it needs both gcc (libgcc) and glibc maintainers, because
__addtf3 is part of libgcc, while fesetround() is part of glibc. And
gcc also needs to work with other libcs, not just glibc, so it may not
be an easy problem. I agree that it's not your problem, though.

>As my personal opinion, I hope that one day statically linked
>libraries will make a comeback.

They have been doing for some time. Languages like Go and Rust use
static linking (at least for stuff written in those languages),
because it's apparently too hard to maintain binary compatibility for
libraries written in those languages. I think the growth of main
memory sizes compared to the heydays of dynamic linking in the 1990s
also plays a role.

>Not as default, but as an
>option people can use without jumping through hoops.

Yes, it tends to be hard to statically link a C program these days.

Michael S

Dec 30, 2022, 8:35:26 AM
That's a solution, too.
The library will be bigger, but at the source-code level we don't repeat ourselves.
I don't like that sort of bloat, but relative to the thousands of other bloats that
people today consider acceptable, this one is quite minor.

The best technical solution for Windows would be a change in the API of the
compiler's support functions from pass-by-value to pass-by-reference.
I.e., instead of __float128 __multf3(__float128 srcx, __float128 srcy) it should be
void __multf3(__float128* dst, __float128* srcx, __float128* srcy).
That solves the problem with TCE, but not only that.
In my measurements, such an API ends up significantly faster, at least in
a matmul benchmark.
But such a change means breaking compatibility with previous versions of the compiler.
Also, more problematically, it means that the gcc compiler people would have to do
additional work for the sake of Windows and Windows alone. My impression is
that they hate doing that.
Of course, the same change can be made on Linux, too, but it is harder sell.
both because on Linux the problem of compatibility with previous versions
is more serious and because on Linux there present other popular compiler
(clang) that is sometimes used together with gcc and that supports
_Float64 and people expect interoperability.
Also on Linux the performaance gain from change of the API is smaller.





Thomas Koenig

Dec 30, 2022, 9:24:12 AM
Michael S <already...@yahoo.com> wrote:

[...]
I am not sure it is only that; I have no idea where this ABI is
specified. Did Microsoft publish anything about this?

However, https://godbolt.org/z/7oqYsxG3P tells me that
icc has a different calling convention from both gcc and clang,
presumably on Linux, so the mess is even bigger...

In general, ABI changes are never undertaken lightly, only if there is
a pressing need - if either a mistake in the previous implementation
or a change in a standard requires something new.

> Also, more problematically, it means that gcc compiler people have to do
> additional works for sake of Windows and Windows alone. My impression is
> that they hate to do it.

People are wary to work on systems they don't know well. There are,
however, some people active in gcc development on these platforms.

> Of course, the same change can be made on Linux, too, but it is harder sell.
> both because on Linux the problem of compatibility with previous versions
> is more serious and because on Linux there present other popular compiler
> (clang) that is sometimes used together with gcc and that supports
> _Float64 and people expect interoperability.
> Also on Linux the performaance gain from change of the API is smaller.

For Linux, the ABI is prescribed in the

System V Application Binary Interface
AMD64 Architecture Processor Supplement
(With LP64 and ILP32 Programming Models)
Version 1.0

(to give it its full title, you'll find it as x86-64-psABI-1.0.pdf).
I do not think you will get anybody to change that. Hmm... seems
that Intel is actually in violation of that specification.
Interesting...

Michael S

Dec 30, 2022, 9:39:27 AM
On Friday, December 30, 2022 at 4:24:12 PM UTC+2, Thomas Koenig wrote:
> Michael S <already...@yahoo.com> wrote:
>
> [...]
> > On Friday, December 30, 2022 at 1:15:46 PM UTC+2, Thomas Koenig wrote:
>
> >> If it is inefficient in Windows, then one other possibility is to
> >> mark the second function inline, and let the compiler take care
> >> of it, or whatever else turns out to be fastest. What I am thinking
> >> about is mainly to reduce the call overhead to the absolute minimum.
> >
> > That's a solution, too.
> > The library will be bigger, but at source code level we don't repeat themselves.
> > I don't like that sort of bloat, but relatively to thousands of other bloats that
> > people today consider acceptable this one is quite minor.
> >
> > The best technical solution of Windows would be a change in API of
> > compiler's support functions from pass by value to pass by reference.
> > I.e. instead of __float128 __multf3(__float128 srcx, __float128 srcy) it should be
> > void __multf3(__float128* dst, __float128* srcx, __float128* srcy).
> > It helps the problem with TCE, but not only that.
> > In my measurements such API ends up significantly faster, at least in
> > matmul benchmark.
> > But such change means breaking compatibility with previous version of compiler.
> I am not sure it is only that; I have no idea where this ABI is
> specified. Did Microsoft publish anything about this?
>

As far as Microsoft is concerned, __float128/_Float128 does not exist.
So, naturally, they are not in the official ABI. As I said above, from the Windows
ABI perspective those types, if implemented, are just more 16-byte structures.
You mean the official ABI prescribes the names and prototypes of the
compiler support routines __multf3, __addtf3 and __subtf3?

Thomas Koenig

Dec 30, 2022, 10:06:34 AM
Michael S <already...@yahoo.com> wrote:
OK.
No, it just prescribes the argument passing conventions if
you pass 128-bit floats by value (in two halves, in SSE
registers).

However, there is no ABI-based reason why the auxiliary
multiplication functions for use by Fortran (or other languages)
should use pass by value. It is also possible to use

void foo (__float128 *const restrict a, __float128 *const restrict b,
__float128 *restrict c)

or some variant if that turns out to generate better code.

Thomas Koenig

Dec 30, 2022, 3:16:18 PM
I plugged the current version into the current trunk, and it works. Very good.
A regression test shows the following failures introduced by the patch
(at all optimization levels):

gcc.dg/torture/float128-exact-underflow.c
gcc.dg/torture/float128-ieee-nan.c
gcc.dg/torture/float128-nan.c
gcc.dg/torture/float128-nan-floath.c
gfortran.dg/ieee/large_3.F9

These tests check various IEEE compliance aspects, including IEEE
flags for the Fortran test. I'd have to look in detail what
the causes are, but I suspect that, for C, we will have to
use something with the original functionality.

Terje Mathisen

Dec 30, 2022, 3:44:52 PM
Ouch!

Sorry, I had absolutely no intention to suggest you should try to adapt
to whatever the current HW rounding mode is! IMHO the only sane way to
do it is to assume the user will set the rounding mode with the standard
function to do so, and at that point you save away a copy, and load the
corresponding 16-bit rounding lookup value.

If a library user then goes "behind your back" using direct asm
instructions to set the rounding mode to something else, then just
disregard that.

OK?

Terje
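Terje's scheme can be sketched roughly like this (all names here are invented for illustration, and the per-mode lookup constants are placeholders): every rounding-mode change goes through one wrapper, which keeps the real hardware mode in sync and also caches a small per-mode value that the soft-float code can read cheaply from memory instead of querying the FP control word on every operation.

```c
#include <fenv.h>
#include <stdio.h>

static unsigned cached_lut;   /* whatever per-mode constant the library wants */

/* Placeholder per-mode constants; a real library would store whatever
   precomputed rounding data its fast path consumes. */
static unsigned lut_for_mode(int rm) {
    switch (rm) {
    case FE_TONEAREST:  return 0x0000;
    case FE_TOWARDZERO: return 0x1111;
    case FE_UPWARD:     return 0x2222;
    case FE_DOWNWARD:   return 0x3333;
    default:            return 0x0000;
    }
}

/* The one sanctioned way to change the rounding mode. */
static int my_fesetround(int rm) {
    int rc = fesetround(rm);          /* keep the real HW mode in sync */
    if (rc == 0)
        cached_lut = lut_for_mode(rm);
    return rc;
}

int main(void) {
    my_fesetround(FE_TOWARDZERO);
    /* The soft-float fast path reads cached_lut; asm-level mode changes
       behind the wrapper's back are deliberately unsupported. */
    printf("lut=%#x\n", cached_lut);
    return 0;
}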

Terje Mathisen

Dec 30, 2022, 3:57:15 PM
Microsoft does support 16-byte return values all over the place in the
several hundreds of SSE/AVX SIMD intrinsics which use __m128i as both argument
and return types. Could you (ab)use those via some typedefs?

Terje

MitchAlsup

Dec 30, 2022, 5:12:52 PM
On Friday, December 30, 2022 at 2:44:52 PM UTC-6, Terje Mathisen wrote:
> Michael S wrote:

> > I am repeating myself for the 3rd or 4th time - the main cost is *not*
> > implementation of non-default rounding modes. For that task branch
> > prediction will do a near perfect job.
> > The main cost is reading of FP control word that contains the relevant bits
> > and interpreting this bits in GPR domain. It is especially expensive if done
> > in portable way, i.e. via fegetround().
> > It's not a lot of cycles in absolute sense, but in such primitives like fadd
> > and fmul every cycle counts. We want to do each of them in less than
> > 25 cycles on average. On Apple silicon, hopefully, in less than 15 cycles.
> Ouch!
>
> Sorry, I had absolutely no intention to suggest you should try to adapt
> to whatever the current HW rounding mode is! IMHO the only sane way to
> do it is to assume the user will set the rounding mode with the standard
> function to do so, and at that point you save away a copy, and load the
> corresponding 16-bit rounding lookup value.
>
> If a library user then goes "behind your back" using direct asm
> instructions to set the rounding mode to something else, then just
> disregard that.
<
If you have an environment where you have a standard way to get/set
the rounding mode, and some programmer goes around that interface
with ASM to get/set the rounding mode, that programmer (ASM) should
be summarily dismissed.

MitchAlsup

Dec 30, 2022, 5:13:59 PM
On Friday, December 30, 2022 at 2:57:15 PM UTC-6, Terje Mathisen wrote:
> Michael S wrote:

> > As far as Microsoft is concerned, __float128/_Float128 does not exist.
> > So, naturally, they are not in the official ABI. As I said above, from Windows
> > ABI perspective those types, if implemented, are yet another 16-byte structures.
<
> Microsoft do support 16-byte return values all over the place in thje
> several 100's of AVX SIMD intrinsics which use __m256i as both argument
> and return types. Could you (ab)use those via some typedefs?
<
My 66000 supports 8 doubleword argument and return structures passed
via registers.

Thomas Koenig

Dec 31, 2022, 4:15:41 AM
MitchAlsup <Mitch...@aol.com> wrote:
Or, in the words of Numerical Recipes:

# Come the (computer) revolution, all persons found guilty of such
# criminal behavior will be summarily executed, and their programs
# won’t be!

Thomas Koenig

Dec 31, 2022, 4:21:08 AM
Thomas Koenig <tko...@netcologne.de> wrote:
Looking at the tests a bit more, this is probably due to not using
fesetexceptflag to signal abnormal conditions. The version for
Fortran code which uses the IEEE module would need to do so
(conditionally, so it will be well predicted to do nothing except
in pathological cases).

Michael S

Dec 31, 2022, 11:33:44 AM
Actually, it is much better without restrict.
Calling these routines with the destination equal to one or both sources
is practically useful.

Michael S

Dec 31, 2022, 11:37:55 AM
For exact underflow they are correct. I didn't bother to implement the inexact flag.
IMHO, it's the most stupid IEEE-754 feature.
For NaN, I am pretty sure of myself. Probably a bug in the test.

Michael S

Dec 31, 2022, 12:01:40 PM
Actually, I consider the System V binding of __float128 function parameters and
of the return value to SSE registers a rather big performance mistake.
__float128 has nothing to do with SSE/AVX; 100% of the processing is in the GPR
domain. Copying things back and forth between XMM and GPR just slows
things down. And the absence of non-volatile XMM registers in the System V ABI
makes matters worse.

The problem with the Microsoft ABI is not the non-usage of XMM, but the inability
to pass and return 16-byte structures in a pair of GPRs. Also, the number of
non-volatile GPRs is too big for these particular routines, but that's relatively
less harmful.

Michael S

Dec 31, 2022, 12:40:54 PM
The Intel intrinsic macro _MM_SET_ROUNDING_MODE() is probably used quite widely.
Also, the rounding mode can be changed by _mm_setcsr().
Microsoft has _controlfp().
Linux has something similar; I don't remember the name right now.

So there are many ways to change the rounding mode apart from asm and the official
fesetround() and fesetenv().

Also, what about threads?
Shouldn't your shadow copy of the rounding mode be thread-local?
If so, that immediately rules out a platform-independent C solution.

Terje Mathisen

Jan 1, 2023, 4:00:45 PM
How do you differentiate between SNaN and QNaN? Do you set or clear a
bit? If so, which bit?

All alternatives are legal, but only one combination is in use for each
target CPU afair.

Michael S

Jan 1, 2023, 4:19:07 PM
I don't. Should I?

> All alternatives are legal, but only one combination is in use for each
> target CPU afair.

I never encountered SNaN in practice. Who uses them and why?

MitchAlsup

Jan 1, 2023, 5:01:04 PM
On Sunday, January 1, 2023 at 3:00:45 PM UTC-6, Terje Mathisen wrote:
> Michael S wrote:

> How do you differentiate between SNaN and QNaN? Do you set or clear a
> bit? If so, which bit?
<
IEEE 754-2008 and 2019 both specify bit<52> == 1 -> QNaN; == 0 -> SNaN.
>
> All alternatives are legal, but only one combination is in use for each
> target CPU afair.
<
1985 only specified they had to be distinguished but not how. The later
standards cleaned that up.

MitchAlsup

Jan 1, 2023, 5:05:50 PM
to the 99.99% level: nobody uses them.
<
Kahan, Coonen, and 3 others use them to bound array accesses and other
nefarious programming tricks. Attempting to capture "compiler should not
have created code to reach this point."
<
If IEEE 754 mandated that exceptions were always enabled (excepting
INEXACT) and operands were delivered to signal handling routines (in
some manner) and results from signal handlers returned to excepting
code (in some manner) then they might be of some use to another 55
programmers world wide.
<
This still leaves us with 99.99% of programmers never use them.

Michael S

Jan 1, 2023, 5:07:11 PM
It seems x86-64, ARM64 and POWER FP hardware encode SNaN vs. QNaN the same way:
a QNaN has the MS bit of the significand = 1.
I'd guess the gcc test suite expects the same for soft binary128.
When my addsubq primitive sums Infs with opposite signs, it generates a number
that by this convention will be considered an SNaN, which is a violation of
the standard. Same for multiplication: for 0*Inf my routine generates something
that the test suite interprets as an SNaN.
Quite possibly that's the reason for the failing tests.

I can change it easily with no adverse effect on performance.
But maybe there are more mines on this minefield that I am not aware of?



Michael S

Jan 1, 2023, 6:12:31 PM
Done.

Terje Mathisen

Jan 2, 2023, 1:31:13 AM
Sounds like #ifdef to hide the differences, but stay with the library calls.

>
> Also, what about threads?
> Shouldn't your shadow copy of rounding mode be thread-local?
> If it does, that immediately prevents platform-independent C solution.

Some of the intrinsics are also platform-dependent, right?

I.e. when I wrote the world's fastest (at least at the time) Ogg Vorbis
decoder, I hid all the compiler intrinsics for SSE(n) and Altivec
behind an #ifdef layer which defined my own names for the common subset
that I needed. This worked so well that my code could use a single
library, allowing static linking and no runtime selection of the optimal
version. (The latter is what the previously fastest (closed source)
commercial offering was doing.)

Terje Mathisen

Jan 2, 2023, 2:06:34 AM
Think of them as trapping vs. non-trapping NaNs, except that you can
suppress all the traps if you so want, and most programming environments do.

Terje Mathisen

Jan 2, 2023, 2:25:00 AM
MitchAlsup wrote:
> On Sunday, January 1, 2023 at 3:00:45 PM UTC-6, Terje Mathisen wrote:
>> Michael S wrote:
>
>> How do you differentiate between SNaN and QNaN? Do you set or clear a
>> bit? If so, which bit?
> <
> IEEE 754-2008 and 2019 both specify bit<52> == 1 -> QNaN; == 0 -> SNaN.

So you quiet a SNaN by masking away that top mantissa bit. This also
means that when using binary sorting you get
zero/subnormal/normal/inf/qnan/snan which is fine except that you really
should have included the top-but-one mantissa bit as well!

If you generate an SNaN with just the top mantissa bit set, and later
on the 'S' is masked away, you now get Inf instead of QNaN.

In my favorite version the top two mantissa bits are defined like this:

00 : Inf
01 : QNaN
10 : None (un-initialized or missing variables)
11 : SNaN

>>
>> All alternatives are legal, but only one combination is in use for each
>> target CPU afair.
> <
> 1985 only specified they had to be distinguished but not how. The later
> standards cleaned that up.

Are you sure?

Afair, the DFP people got it right, so when that was added (in 2008 or
one iteration earlier?) they included exact/required decimal NaN handling.

All existing binary FP implementations are still legal, but anyone sane
designing a new cpu would use the same setup as for DFP.

Thomas Koenig

Jan 2, 2023, 3:56:22 AM
Michael S <already...@yahoo.com> wrote:

> I never encountered SNaN in practice. Who uses them and why?

Initializing all floating point variables to a signalling NaN is
an excellent way of catching undefined variable errors, if the
system supports them.

Michael S

Jan 2, 2023, 7:37:58 AM
In the routines in question I allowed myself to be heavily gcc-dependent.
However, I tried to avoid all other platform/CPU dependencies, except
for the requirements to have __int128 and to be consistently endian, i.e.
either little or big endian, with the same endianness for integers
and for floats.

There are a few SSE intrinsics in my code, under #ifdef, but they are not
required for correctness, and hopefully in gcc 13, 14 or 15 they will not
be needed for performance either.
No OS dependencies, as far as I can tell, except that the x86 performance quirk
depends on the ABI. With a better compiler, this dependency could also
be eliminated in the future.

Michael S

Jan 2, 2023, 7:46:58 AM
That works in a signal-friendly environment, like a single-threaded
command-line program that is not part of some larger software complex.
I.e., in the great scheme of things, approximately never.

BTW, did you try to torture updated files?

Thomas Koenig

Jan 3, 2023, 2:34:08