
Improved routines for gcc/gfortran quadmath arithmetic


Michael S

Dec 28, 2022, 2:45:19 PM
This thread is for the attention of Thomas Koenig.

A few months ago I spent significant time coding replacement routines for
__addtf3 and __multf3.
Back then I thought they were part of the quadmath library.
The quadmath library appears to be a relatively low-profile project with
little bureaucracy involved, so I thought that integration of my replacements
would be easy.
Later on (in June) I learned that these routines are in fact part of glibc.
That reduced my enthusiasm. My impression is that the glibc project is more
bureaucratic (a good thing, generally, but not for me, a hobbyist who wants his
work integrated) and that glibc developers have strong territorial instincts.
I could be wrong on both points, but that was my impression.
At the same time, I had more "real work" to do, which left little time. Often I came
home tired. And when I don't do this sort of hobby work continuously,
night after night, I tend to lose focus and interest. In short, I almost forgot
about the whole thing. Luckily, by then both routines were already in MUCH better
shape than the originals, in terms of speed.
Today and tomorrow I happen to have time, so I got back to it and put the routines
and tests in my public GitHub repository:
https://github.com/already5chosen/extfloat/tree/master/binary128

Thomas, please look at this code and tell me if there is a chance to integrate it.
Preferably in glibc, but I am not optimistic about that.
If not in glibc, then maybe in more specialized places, like the matrix primitives of
gfortran.
BTW, for later I have potentially more interesting routines: multiplication of an
array by a scalar and addition of two vectors. Due to the reduction in parsing and
call overhead, they are tens of percent faster than __multf3/__addtf3 called in loops.
Hopefully, I will add them to the repository later tonight or tomorrow.

Anton Ertl

Dec 29, 2022, 4:08:42 AM
Michael S <already...@yahoo.com> writes:
>This thread is for the attention of Thomas Koenig.
>
>A few months ago I spent significant time coding replacement routines for
>__addtf3 and __multf3.
>Back then I thought they were part of the quadmath library.
>The quadmath library appears to be a relatively low-profile project with
>little bureaucracy involved, so I thought that integration of my replacements
>would be easy.
>Later on (in June) I learned that these routines are in fact part of glibc.

The names look like they belong to libgcc (part of gcc), not glibc,
and indeed, on a Debian 11 system

objdump -d /lib/x86_64-linux-gnu/libgcc_s.so.1|grep addtf3

shows code for __addtf3, while

objdump -d /lib/x86_64-linux-gnu/libc-2.31.so| grep addtf3

comes up blank (apparently not even a call to __addtf3 there).

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Michael S

Dec 29, 2022, 5:22:43 AM
You are right, I was confused by the similar library names.
It does not change my point - it's still a much higher-profile project than libquadmath.

Thomas Koenig

Dec 29, 2022, 5:36:48 AM
Michael S <already...@yahoo.com> schrieb:
> This thread is for the attention of Thomas Koenig.
>
> A few months ago I spent significant time coding replacement routines for
> __addtf3 and __multf3.
> Back then I thought they were part of the quadmath library.
> The quadmath library appears to be a relatively low-profile project with
> little bureaucracy involved, so I thought that integration of my replacements
> would be easy.
> Later on (in June) I learned that these routines are in fact part of glibc.
> That reduced my enthusiasm. My impression is that the glibc project is more
> bureaucratic (a good thing, generally, but not for me, a hobbyist who wants his
> work integrated) and that glibc developers have strong territorial instincts.

I cannot really speak to that, I hardly know the glibc people. (I
probably ran across them at a GNU Cauldron, but I don't remember talking
to them).

The current status seems to be that these functions are part of libgcc,
and that they are copied over from glibc.

> I could be wrong on both points, but that was my impression.
> At the same time, I had more "real work" to do, which left little time.

I know that only too well (I currently have my own project, outside of
work, which keeps me quite busy).

> Often I came
> home tired. And when I don't do this sort of hobby work continuously,
> night after night, I tend to lose focus and interest. In short, I almost forgot
> about the whole thing. Luckily, by then both routines were already in MUCH better
> shape than the originals, in terms of speed.
> Today and tomorrow I happen to have time, so I got back to it and put the routines
> and tests in my public GitHub repository:
> https://github.com/already5chosen/extfloat/tree/master/binary128
>
> Thomas, please look at this code and tell me if there is a chance to integrate it.

I will test this for a bit. A next step would be a discussion on
the gcc mailing list. A possible scenario would be to create a
branch, and to merge that into trunk once gcc14 development starts.

The code may require some adjustment for other platforms, or for
special conditions in libgcc. For example, I am not sure whether using
x86 intrinsics in libgcc would cause problems, whether using uint64_t
works, or whether using __float128 is correct or should be replaced by
TFMode.

Anyway, I'll drop you an e-mail in the near future.

> Preferably in glibc, but I am not optimistic about that.
> If not in glibc, then maybe in more specialized places, like the matrix primitives of
> gfortran.
> BTW, for later I have potentially more interesting routines: multiplication of an
> array by a scalar and addition of two vectors. Due to the reduction in parsing and
> call overhead, they are tens of percent faster than __multf3/__addtf3 called in loops.
> Hopefully, I will add them to the repository later tonight or tomorrow.

That also sounds interesting :-)

Michael S

Dec 29, 2022, 6:56:28 AM
All uses of x86 intrinsic functions are guarded with #ifdef __amd64.
They are there to work around misbehavior of the gcc compiler, most commonly
its hyperactive SLP vectorizer.
The code is 100% correct without these parts, but non-trivially slower when
compiled with gcc 12.
Making them unneeded in gcc 13 or 14 sounds like a worthy goal for gcc,
but I am not optimistic. Until now the trend has been the opposite - the SLP
vectorizer only gets more aggressive.

> or if using uint64_t works,

My code certainly requires both uint64_t and __int128.
But that appears to be almost o.k., since we don't need to support all gcc platforms.
All we are interested in are platforms that support binary128 FP.
By now those are:
x86-64,
MIPS64,
aarch64,
rv64gc,
rv32gc,
s390x,
power64le,
power64
The first four are o.k.
The last two (or three?) have hardware support, so they don't need a software implementation.
That leaves one odd case: rv32gc. I don't know what to do about it.
IMHO, the wisest course is to drop support for binary128 on this platform.
I can't imagine that it is actually used by anybody.

> or if using __float128 is correct or it should be replaced by
> TFMode.

I don't know what TFMode means. Ideally, it should be _Float128.
I used __float128 because I wanted to compile with clang, which does not
understand _Float128. That's not a good reason, so it probably has to change.


But the real reason why I don't believe that my code can be integrated into
glibc or even into libgcc is different: I don't support rounding modes.
That is, I always round to nearest, with ties rounded to even.
I suppose that is what nearly all users want, but glibc has a different idea.
They think that the binary128 rounding mode should be the same as the current
rounding mode for binary64/binary32. They say it's required by IEEE-754.
They may even be correct about that, but it just shows that IEEE-754 is not perfect.

Support for other rounding modes can be added to my routines, but doing it in a
platform-independent manner will slow things down significantly, and doing it
in a platform-dependent manner will reduce portability (IMHO, *not* for a good reason),
and it would still be somewhat slower than the current code.

That's why I am more hopeful about integration into gfortran and especially into
gfortran's matrix/vector libraries.
My impression is that these places don't care about non-default quadmath rounding modes.

>
> Anyway, I'll drop you an e-mail in the near future.

Normally, I don't check this e-mail address regularly.
I can, of course, but why not keep the conversation public? It could have other benefits.

Thomas Koenig

Dec 29, 2022, 1:07:45 PM
Michael S <already...@yahoo.com> schrieb:
OK, I plugged your code into libgcc/soft-fp/addtf3.c and
libgcc/soft-fp/multf3.c, and, on the first try, hit an error:

In file included from /home/tkoenig/trunk-bin/gcc/include/immintrin.h:39,
from /home/tkoenig/trunk-bin/gcc/include/x86intrin.h:32,
from ../../../trunk/libgcc/soft-fp/addtf3.c:4:
/home/tkoenig/trunk-bin/gcc/include/smmintrin.h: In function 'f128_to_u128':
/home/tkoenig/trunk-bin/gcc/include/smmintrin.h:455:1: error: inlining failed in call to 'always_inline' '_mm_extract_epi64': target specific option mismatch
455 | _mm_extract_epi64 (__m128i __X, const int __N)
| ^~~~~~~~~~~~~~~~~
../../../trunk/libgcc/soft-fp/addtf3.c:70:17: note: called from here
70 | uint64_t hi = _mm_extract_epi64(v, 1);
| ^~~~~~~~~~~~~~~~~~~~~~~
/home/tkoenig/trunk-bin/gcc/include/smmintrin.h:455:1: error: inlining failed in call to 'always_inline' '_mm_extract_epi64': target specific option mismatch
455 | _mm_extract_epi64 (__m128i __X, const int __N)
| ^~~~~~~~~~~~~~~~~
../../../trunk/libgcc/soft-fp/addtf3.c:69:17: note: called from here
69 | uint64_t lo = _mm_extract_epi64(v, 0);
| ^~~~~~~~~~~~~~~~~~~~~~~
make[3]: *** [../../../trunk/libgcc/shared-object.mk:14: addtf3.o] Error 1

(I'm not quite sure what "target specific option mismatch" actually
means in this context). This kind of thing is not unexpected,
because the build environment inside gcc is different from normal
userland. So, some debugging would be needed to bring this into libgcc.
Looking at it through gfortran's glasses, we do not support
REAL(KIND=16) on 32-bit systems, so that would be OK.

>
>> or if using __float128 is correct or it should be replaced by
>> TFMode.
>
> I don't know what TFMode means.

TFMode is the name for 128-bit reals inside gcc (see
https://gcc.gnu.org/onlinedocs/gccint/Machine-Modes.html ). If you
look at libgcc, you will find it is actually implemented as a struct
of two longs. That does not mean it absolutely has to be that way.


>Ideally, it should be _Float128.
> I used __float128 because I wanted to compile with clang, which does not
> understand _Float128. That's not a good reason, so it probably has to change.

> But the real reason why I don't believe that my code can be integrated into
> glibc or even into libgcc is different: I don't support rounding modes.
> That is, I always round to nearest, with ties rounded to even.
> I suppose that is what nearly all users want, but glibc has a different idea.
> They think that the binary128 rounding mode should be the same as the current
> rounding mode for binary64/binary32. They say it's required by IEEE-754.
> They may even be correct about that, but it just shows that IEEE-754 is not perfect.

I understand them being sticklers for accuracy (Terje? :-)

However, this does not mean that your code cannot be included. It would
be possible to have the traditional function as a fallback in the (rare)
case where people want something else - a conditional jump which would
be predicted 99.99.... % accurately in a tight loop.

> Support for other rounding modes can be added to my routines, but doing it in a
> platform-independent manner will slow things down significantly, and doing it
> in a platform-dependent manner will reduce portability (IMHO, *not* for a good reason),
> and it would still be somewhat slower than the current code.
>
> That's why I am more hopeful about integration into gfortran and especially into
> gfortran's matrix/vector libraries.
> My impression is that these places don't care about non-default quadmath rounding modes.

Yes. Fortran, specifically, restores IEEE rounding modes to default
on procedure entry.

I would still prefer to get your code into libgcc, though, because it
would then be useful for every place where somebody uses 128-bit
in a program.

Have you ever compiled gcc? I often use the gcc compile farm at
https://cfarm.tetaneutral.net/ ; it has some rather beefy machines,
which cut down on bootstrap time (but it will still take a couple
of hours). You can request an account at the compile farm.

>>
>> Anyway, I'll drop you an e-mail in the near future.
>
> Normally, I don't check this e-mail address regularly.
> I can, of course, but why not keep conversation public? It could have other benefits.

We will probably have to move over to g...@gcc.gnu.org eventually.

[...]

Michael S

Dec 29, 2022, 3:02:25 PM
That's because I never tried to compile for an AMD64 target that does not have at least SSE4 :(
I always compiled with -march=native on machines no older than 10-11 years.
I'll try to fix it.
In the meantime, you could add -march=nehalem or -march=x86-64-v2 or -march=bdver1
or -msse4 to your set of compilation flags.
I don't think it is that easy.
The problem is not the branch, but reading the rounding mode from wherever
gcc stores it.

> > Support for other rounding modes can be added to my routines, but doing it in a
> > platform-independent manner will slow things down significantly, and doing it
> > in a platform-dependent manner will reduce portability (IMHO, *not* for a good reason),
> > and it would still be somewhat slower than the current code.
> >
> > That's why I am more hopeful about integration into gfortran and especially into
> > gfortran's matrix/vector libraries.
> > My impression is that these places don't care about non-default quadmath rounding modes.
> Yes. Fortran, specifically, restores IEEE rounding modes to default
> on procedure entry.
>
> I would still prefer to get your code into libgcc, though, because it
> would then be useful for every place where somebody uses 128-bit
> in a program.
>
> Have you ever compiled gcc?

Yes, but only for embedded targets, i.e. the last part of the target triple was always -none.
And never with glibc. In fact, I think I never compiled it myself with anything
other than newlib.
But why are you asking?

Terje Mathisen

Dec 29, 2022, 4:13:20 PM
Thomas Koenig wrote:
> Michael S <already...@yahoo.com> schrieb:
>> But the real reason why I don't believe that my code can be integrated into
>> glibc or even into libgcc is different: I don't support rounding modes.
>> That is, I always round to nearest, with ties rounded to even.
>> I suppose that is what nearly all users want, but glibc has a different idea.
>> They think that the binary128 rounding mode should be the same as the current
>> rounding mode for binary64/binary32. They say it's required by IEEE-754.
>> They may even be correct about that, but it just shows that IEEE-754 is not perfect.
>
> I understand them being sticklers for accuracy (Terje? :-)

Absolutely so!

Supporting all required rounding modes turns out to be easy, however: if
you already support the default round_to_nearest_or_even (RNE), then you
already have the 4 required decision bits:

Sign, Ulp, Guard & Sticky (well, Sign isn't actually needed for RNE,
but it is very easy to grab. :-) )

Using those 4 bits as an index into a bitmap (i.e. a 16-bit constant) you
get out the increment needed to round the intermediate result.

Supporting multiple rounding modes just means grabbing the correct
16-bit value, or you can use the sign bit to select between two 64-bit
constants and then use rounding_mode*8+ulp*4+guard*2+sticky as a shift
count to end up with the desired rounding bit.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Michael S

Dec 29, 2022, 4:32:50 PM
Fixed.

Michael S

Dec 29, 2022, 4:56:12 PM
On Thursday, December 29, 2022 at 11:13:20 PM UTC+2, Terje Mathisen wrote:
> Thomas Koenig wrote:
> > Michael S <already...@yahoo.com> schrieb:
> >> But the real reason why I don't believe that my code can be integrated into
> >> glibc or even into libgcc is different: I don't support rounding modes.
> >> That is, I always round to nearest, with ties rounded to even.
> >> I suppose that is what nearly all users want, but glibc has a different idea.
> >> They think that the binary128 rounding mode should be the same as the current
> >> rounding mode for binary64/binary32. They say it's required by IEEE-754.
> >> They may even be correct about that, but it just shows that IEEE-754 is not perfect.
> >
> > I understand them being sticklers for accuracy (Terje? :-)
> Absolutely so!
>

I am pretty sure that in typical use cases one doesn't want the binary128
rounding mode to be prescribed by the same control word as binary32/binary64.
Generally, binary64 non-default rounding modes are for experimentation.
And binary128 is for comparing the results of experimentation against "master"
values. And for "master" values you want the "best" precision, which is achieved in
the default rounding mode.
MPFR does not tie its rounding mode to binary32/binary64, and MPFR is correct.

> Supporting all required rounding modes turns out to be easy however: If
> you already support the default round_to_nearest_or_even (RNE), then you
> already have the 4 required decision bits:
>
> Sign, Ulp, Guard & Sticky (well, Sign isn't actually needed for RNE,
> but it is very easy to grab. :-) )
>
> Using those 4 bits as index into a bitmap (i.e. a 16-bit constant) you
> get out the increment needed to round the intermediate result.
>
> Supporting multiple rounding modes just means grabbing the correct
> 16-bit value, or you can use the sign bit to select between two 64-bit
> constants and then use rounding_mode*8+ulp*4+guard*2+sticky as a shift
> count to end up with the desired rounding bit.
>
> Terje
>

I am repeating myself for the 3rd or 4th time - the main cost is *not* the
implementation of non-default rounding modes. For that task, branch
prediction will do a near-perfect job.
The main cost is reading the FP control word that contains the relevant bits
and interpreting these bits in the GPR domain. It is especially expensive if done
in a portable way, i.e. via fegetround().
It's not a lot of cycles in an absolute sense, but in primitives like fadd
and fmul every cycle counts. We want to do each of them in less than
25 cycles on average. On Apple silicon, hopefully, in less than 15 cycles.

Thomas Koenig

Dec 30, 2022, 4:32:20 AM
Michael S <already...@yahoo.com> schrieb:
> On Thursday, December 29, 2022 at 11:13:20 PM UTC+2, Terje Mathisen wrote:
>> Thomas Koenig wrote:
>> > Michael S <already...@yahoo.com> schrieb:
>> >> But the real reason why I don't believe that my code can be integrated into
>> >> glibc or even into libgcc is different: I don't support rounding modes.
>> >> That is, I always round to nearest, with ties rounded to even.
>> >> I suppose that is what nearly all users want, but glibc has a different idea.
>> >> They think that the binary128 rounding mode should be the same as the current
>> >> rounding mode for binary64/binary32. They say it's required by IEEE-754.
>> >> They may even be correct about that, but it just shows that IEEE-754 is not perfect.
>> >
>> > I understand them being sticklers for accuracy (Terje? :-)
>> Absolutely so!
>>
>
> I am pretty sure that in typical use cases one doesn't want the binary128
> rounding mode to be prescribed by the same control word as binary32/binary64.
> Generally, binary64 non-default rounding modes are for experimentation.
> And binary128 is for comparing the results of experimentation against "master"
> values. And for "master" values you want the "best" precision, which is achieved in
> the default rounding mode.
> MPFR does not tie its rounding mode to binary32/binary64, and MPFR is correct.

I think you have to understand where people are coming from.
This is originally a soft-float routine, meant for implementing in
software functions that the CPU lacks in hardware. With that
in mind, it is clear why people would want it to look at the hardware
settings for its behavior.

We are slightly abusing this at the moment, but I think we can do
much better.

>
>> Supporting all required rounding modes turns out to be easy however: If
>> you already support the default round_to_nearest_or_even (RNE), then you
>> already have the 4 required decision bits:
>>
>> Sign, Ulp, Guard & Sticky (well, Sign isn't actually needed for RNE,
>> but it is very easy to grab. :-) )
>>
>> Using those 4 bits as index into a bitmap (i.e. a 16-bit constant) you
>> get out the increment needed to round the intermediate result.
>>
>> Supporting multiple rounding modes just means grabbing the correct
>> 16-bit value, or you can use the sign bit to select between two 64-bit
>> constants and then use rounding_mode*8+ulp*4+guard*2+sticky as a shift
>> count to end up with the desired rounding bit.
>>
>> Terje
>>
>
> I am repeating myself for the 3rd or 4th time - the main cost is *not* the
> implementation of non-default rounding modes. For that task, branch
> prediction will do a near-perfect job.
> The main cost is reading the FP control word that contains the relevant bits
> and interpreting these bits in the GPR domain. It is especially expensive if done
> in a portable way, i.e. via fegetround().

OK.

I think we can use Fortran semantics for an advantageous solution, at
least for gfortran.

Each procedure has a processor-dependent rounding mode on entry,
and these are restored on exit. (This actually makes a lot of sense:
somebody might have changed the rounding mode somewhere else,
and the results should not depend on that.) So there is no need
to look at global state.

A procedure which does not call IEEE_SET_ROUNDING_MODE does not change
rounding modes, so it can also use the default. This should cover the
vast majority of programs.

If IEEE_SET_ROUNDING_MODE is called, it sets the processor status
register, but it can also do something else, like set a flag local
to the routine.

So, a strategy could be to implement three functions:

The simplest one of them would be called from the compiler if there
is no call to IEEE_SET_ROUNDING_MODE in sight, and it would do
exactly what you have already implemented.

The second one would take an additional argument and implement
all additional rounding modes. This would be called with the
local flag as the additional argument. (If there turns out to be no
speed disadvantage to using this with a constant argument vs.
the first one, the two could also be rolled into one.)

The third one, under the original name, actually reads the processor
status register and then tail-calls the second one to do the work.

For gfortran, because we implement our library in C, we would also
need to add a flag which tells C (or the middle end) just to use
the default rounding mode, which we can then use as a flag when
building libgfortran.

Does this sound reasonable?

> It's not a lot of cycles in an absolute sense, but in primitives like fadd
> and fmul every cycle counts. We want to do each of them in less than
> 25 cycles on average. On Apple silicon, hopefully, in less than 15 cycles.

Absolutely. Like the old filk song "Every cycle is sacred"

Anton Ertl

Dec 30, 2022, 5:16:41 AM
Michael S <already...@yahoo.com> writes:
>It's not a lot of cycles in an absolute sense, but in primitives like fadd
>and fmul every cycle counts. We want to do each of them in less than
>25 cycles on average. On Apple silicon, hopefully, in less than 15 cycles.

If the rounding mode is set only through routines like fesetround(),
all such routines could be changed to either

1) Change __addtf3 etc. to the slow, but rounding-mode-conforming
versions, and change fesetround() etc. to a version that just sets the
rounding mode. I.e., after the first fesetround() etc., you always
get the slow __addtf3, but fesetround() is relatively fast.

2) If the target rounding mode is different from the current rounding
mode: if the target is nearest-or-even, set the fast __addtf3 etc,
otherwise the slow __addtf3 etc. Don't change fesetround(). This
results in fast nearest-or-even __addtf3 etc., but relatively slow
fesetround() etc.

Note that these functions are already using some indirection thanks to
dynamic linking, so they are already changeable.

Michael S

Dec 30, 2022, 5:30:29 AM
If it were about CPUs that come in two models, one with binary128 hardware and one
without, like the presence/absence of an FP co-processor on CPUs
from the 80s, then I would accept this explanation.
But that's not the case. These routines are used primarily on x86-64 and ARM64,
which never had binary128 and, it seems, are not planning to add it in the
[foreseeable] future.
So if procedure Foo calls IEEE_SET_ROUNDING_MODE and then calls
procedure Bar, Bar is expected to operate with the default (==RNE)
rounding mode?
That sounds great for my purposes, but also sounds like a violation of the
intentions of IEEE-754 and maybe even of the letter of the Standard.

> If IEEE_SET_ROUNDING_MODE is called, it sets the processor status
> register, but it can also do something else, like set a flag local
> to the routine.
>
> So, a strategy could be to implement three functions:
>
> The simplest one of them would be called from the compiler if there
> is no call to IEEE_SET_ROUNDING_MODE in sight, and it would do
> exactly what you have already implemented.
>
> The second one would take an additional argument, and implements
> all additional rounding modes. This would be called with the
> local flag as additional argument. (If there turns out to be no
> speed disadvantage to using this with a constant argument vs.
> the first one, the two could also be rolled into one).
>
> The third one, under the original name, actually reads the processor
> status register and then tail-calls the second one to do the work.
>
> For gfortran, because we implement our library in C, we would also
> need to add a flag which tells C (or the middle end) just to use
> the default rounding mode, which we can then use as a flag when
> building libgfortran.
>
> Does this sound reasonable?

Mostly.
The only problem that I see so far is specific to AMD64 Windows.
The 3rd routine will be slow. Well, not absolutely, but slower than
necessary.
That's because by the Windows ABI _Float128 is just a regular structure
that is returned from functions like any other structure
bigger than 64 bits, i.e. the caller passes a pointer to a location on the stack
and the callee stores the value there. So far so good.
The problem is that gcc's tail-call elimination does not work in
this case. We can, of course, hope that they will fix it in v13 or
v14, but considering that they were not able to do it in 30+ years,
I am not holding my breath.

Michael S

Dec 30, 2022, 5:41:02 AM
Things like that should be discussed with the gcc and libgcc maintainers.
I have neither the ambition nor the desire to become one.

As my personal opinion, I hope that one day statically linked
libraries will make a comeback. Not as the default, but as an
option people can use without jumping through hoops.

Thomas Koenig

Dec 30, 2022, 6:15:46 AM
Well, we are not going to be able to change the observable behavior of that
function (politically), but I don't think it matters that much.
We should just avoid calling this inefficient function when it is
possible to avoid it, and I don't think we need to call it from Fortran
at all. For other programming languages, I am not sure.
Correct. It also makes trying to set the rounding flags in a subroutine
a no-op :-)

> That's sound great for my purposes, but also sounds like violation of
> intentions of IEEE-754 and may be even of the letter of the Standard.

Because IEEE-754 is one of those pesky standards where there is no
publicly available final committee draft, I cannot speak to that,
but I certainly think that the J3 people knew what they were doing
when they specified the Fortran interface to IEEE.

Maybe Terje can comment from the IEEE-754 side?
If it is inefficient in Windows, then one other possibility is to
mark the second function inline, and let the compiler take care
of it, or whatever else turns out to be fastest. What I am thinking
about is mainly to reduce the call overhead to the absolute minimum.

Anton Ertl

Dec 30, 2022, 7:00:42 AM
Michael S <already...@yahoo.com> writes:
>On Friday, December 30, 2022 at 12:16:41 PM UTC+2, Anton Ertl wrote:
>> If the rounding mode is set only through routines like fesetround(),
>> all such routines could be changed to either
>>
>> 1) Change __addtf3 etc. to the slow, but rounding-mode-conforming
>> versions, and change fesetround() etc. to a version that just sets the
>> rounding mode. I.e., after the first fesetround() etc., you always
>> get the slow __addtf3, but fesetround() is relatively fast.
>>
>> 2) If the target rounding mode is different from the current rounding
>> mode: if the target is nearest-or-even, set the fast __addtf3 etc,
>> otherwise the slow __addtf3 etc. Don't change fesetround(). This
>> results in fast nearest-or-even __addtf3 etc., but relatively slow
>> fesetround() etc.
>>
>> Note that these functions are already using some indirection thanks to
>> dynamic linking, so they are already changeable.
>> - anton
>> --
>> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
>> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>
>
>Things like that should be discussed with the gcc and libgcc maintainers.

Worse, it needs both gcc (libgcc) and glibc maintainers, because
__addtf3 is part of libgcc, while fesetround() is part of glibc. And
gcc also needs to work with other libcs, not just glibc, so it may not
be an easy problem. I agree that it's not your problem, though.

>As my personal opinion, I hope that one day statically linked
>libraries will make a comeback.

They have been doing for some time. Languages like Go and Rust use
static linking (at least for stuff written in those languages),
because it's apparently too hard to maintain binary compatibility for
libraries written in those languages. I think the growth of main
memory sizes compared to the heydays of dynamic linking in the 1990s
also plays a role.

>Not as default, but as an
>option people can use without jumping through hoops.

Yes, it tends to be hard to statically link a C program these days.

Michael S

Dec 30, 2022, 8:35:26 AM
That's a solution, too.
The library will be bigger, but at the source-code level we don't repeat ourselves.
I don't like that sort of bloat, but relative to the thousands of other bloats that
people today consider acceptable, this one is quite minor.

The best technical solution for Windows would be a change in the API of the
compiler's support functions from pass-by-value to pass-by-reference.
I.e., instead of __float128 __multf3(__float128 srcx, __float128 srcy) it should be
void __multf3(__float128* dst, __float128* srcx, __float128* srcy).
That solves the problem with TCE, but not only that.
In my measurements, such an API ends up significantly faster, at least in
a matmul benchmark.
But such a change means breaking compatibility with previous versions of the compiler.
Also, more problematically, it means that the gcc compiler people would have to do
additional work for the sake of Windows and Windows alone. My impression is
that they hate doing that.
Of course, the same change can be made on Linux, too, but it is harder sell.
both because on Linux the problem of compatibility with previous versions
is more serious and because on Linux there present other popular compiler
(clang) that is sometimes used together with gcc and that supports
_Float64 and people expect interoperability.
Also on Linux the performaance gain from change of the API is smaller.





Thomas Koenig

Dec 30, 2022, 9:24:12 AM
Michael S <already...@yahoo.com> wrote:

[...]
I am not sure it is only that; I have no idea where this ABI is
specified. Did Microsoft publish anything about this?

However, https://godbolt.org/z/7oqYsxG3P tells me that
icc has a different calling convention from both gcc and clang,
presumably on Linux, so the mess is even bigger...

In general, ABI changes are never undertaken lightly, only if there is
a pressing need - if either a mistake in the previous implementation
or a change in a standard requires something new.

> Also, more problematically, it means that gcc compiler people have to do
> additional works for sake of Windows and Windows alone. My impression is
> that they hate to do it.

People are wary to work on systems they don't know well. There are,
however, some people active in gcc development on these platforms.

> Of course, the same change can be made on Linux, too, but it is harder sell.
> both because on Linux the problem of compatibility with previous versions
> is more serious and because on Linux there present other popular compiler
> (clang) that is sometimes used together with gcc and that supports
> _Float64 and people expect interoperability.
> Also on Linux the performaance gain from change of the API is smaller.

For Linux, the ABI is prescribed in the

System V Application Binary Interface
AMD64 Architecture Processor Supplement
(With LP64 and ILP32 Programming Models)
Version 1.0

(to give it its full title, you'll find it as x86-64-psABI-1.0.pdf).
I do not think you will get anybody to change that. Hmm... seems
that Intel is actually in violation of that specification.
Interesting...

Michael S

Dec 30, 2022, 9:39:27 AM
On Friday, December 30, 2022 at 4:24:12 PM UTC+2, Thomas Koenig wrote:
> Michael S <already...@yahoo.com> wrote:
>
> [...]
> > On Friday, December 30, 2022 at 1:15:46 PM UTC+2, Thomas Koenig wrote:
>
> >> If it is inefficient in Windows, then one other possibility is to
> >> mark the second function inline, and let the compiler take care
> >> of it, or whatever else turns out to be fastest. What I am thinking
> >> about is mainly to reduce the call overhead to the absolute minimum.
> >
> > That's a solution, too.
> > The library will be bigger, but at source code level we don't repeat themselves.
> > I don't like that sort of bloat, but relatively to thousands of other bloats that
> > people today consider acceptable this one is quite minor.
> >
> > The best technical solution of Windows would be a change in API of
> > compiler's support functions from pass by value to pass by reference.
> > I.e. instead of __float128 __multf3(__float128 srcx, __float128 srcy) it should be
> > void __multf3(__float128* dst, __float128* srcx, __float128* srcy).
> > It helps the problem with TCE, but not only that.
> > In my measurements such API ends up significantly faster, at least in
> > matmul benchmark.
> > But such change means breaking compatibility with previous version of compiler.
> I am not sure it is only that; I have no idea where this ABI is
> specified. Did Microsoft publish anything about this?
>

As far as Microsoft is concerned, __float128/_Float128 does not exist.
So, naturally, they are not in the official ABI. As I said above, from the Windows
ABI perspective those types, if implemented, are just more 16-byte structures.
You mean the official ABI prescribes the names and prototypes of the
compiler support routines __multf3, __addtf3 and __subtf3?

Thomas Koenig

Dec 30, 2022, 10:06:34 AM
Michael S <already...@yahoo.com> wrote:
OK.
No, it just prescribes the argument passing conventions if
you pass 128-bit floats by value (in two halves, in SSE
registers).

However, there is no ABI-based reason why the auxiliary
multiplication functions for use by Fortran (or other languages)
should use pass by value. It is also possible to use

void foo (__float128 *const restrict a, __float128 *const restrict b,
__float128 *restrict c)

or some variant if that turns out to generate better code.

Thomas Koenig

Dec 30, 2022, 3:16:18 PM
I plugged the current version into the current trunk, and it works. Very good.
A regression test shows the following failures introduced by the patch
(at all optimization levels):

gcc.dg/torture/float128-exact-underflow.c
gcc.dg/torture/float128-ieee-nan.c
gcc.dg/torture/float128-nan.c
gcc.dg/torture/float128-nan-floath.c
gfortran.dg/ieee/large_3.F9

These tests check various IEEE compliance aspects, including IEEE
flags for the Fortran test. I'd have to look in detail what
the causes are, but I suspect that, for C, we will have to
use something with the original functionality.

Terje Mathisen

Dec 30, 2022, 3:44:52 PM
Ouch!

Sorry, I had absolutely no intention to suggest you should try to adapt
to whatever the current HW rounding mode is! IMHO the only sane way to
do it is to assume the user will set the rounding mode with the standard
function to do so, and at that point you save away a copy, and load the
corresponding 16-bit rounding lookup value.

If a library user then goes "behind your back" using direct asm
instructions to set the rounding mode to something else, then just
disregard that.

OK?

Terje
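Terje's scheme can be sketched roughly like this (all names here are invented for illustration, and the per-mode lookup constants are placeholders): every rounding-mode change goes through one wrapper, which keeps the real hardware mode in sync and also caches a small per-mode value that the soft-float code can read cheaply from memory instead of querying the FP control word on every operation.

```c
#include <fenv.h>
#include <stdio.h>

static unsigned cached_lut;   /* whatever per-mode constant the library wants */

/* Placeholder per-mode constants; a real library would store whatever
   precomputed rounding data its fast path consumes. */
static unsigned lut_for_mode(int rm) {
    switch (rm) {
    case FE_TONEAREST:  return 0x0000;
    case FE_TOWARDZERO: return 0x1111;
    case FE_UPWARD:     return 0x2222;
    case FE_DOWNWARD:   return 0x3333;
    default:            return 0x0000;
    }
}

/* The one sanctioned way to change the rounding mode. */
static int my_fesetround(int rm) {
    int rc = fesetround(rm);          /* keep the real HW mode in sync */
    if (rc == 0)
        cached_lut = lut_for_mode(rm);
    return rc;
}

int main(void) {
    my_fesetround(FE_TOWARDZERO);
    /* The soft-float fast path reads cached_lut; asm-level mode changes
       behind the wrapper's back are deliberately unsupported. */
    printf("lut=%#x\n", cached_lut);
    return 0;
}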

Terje Mathisen

Dec 30, 2022, 3:57:15 PM
Microsoft does support 16-byte return values all over the place in the
several hundreds of SSE/AVX SIMD intrinsics which use __m128i as both argument
and return types. Could you (ab)use those via some typedefs?

Terje

MitchAlsup

Dec 30, 2022, 5:12:52 PM
On Friday, December 30, 2022 at 2:44:52 PM UTC-6, Terje Mathisen wrote:
> Michael S wrote:

> > I am repeating myself for the 3rd or 4th time - the main cost is *not*
> > implementation of non-default rounding modes. For that task branch
> > prediction will do a near perfect job.
> > The main cost is reading of FP control word that contains the relevant bits
> > and interpreting this bits in GPR domain. It is especially expensive if done
> > in portable way, i.e. via fegetround().
> > It's not a lot of cycles in absolute sense, but in such primitives like fadd
> > and fmul every cycle counts. We want to do each of them in less than
> > 25 cycles on average. On Apple silicon, hopefully, in less than 15 cycles.
> Ouch!
>
> Sorry, I had absolutely no intention to suggest you should try to adapt
> to whatever the current HW rounding mode is! IMHO the only sane way to
> do it is to assume the user will set the rounding mode with the standard
> function to do so, and at that point you save away a copy, and load the
> corresponding 16-bit rounding lookup value.
>
> If a library user then goes "behind your back" using direct asm
> instructions to set the rounding mode to something else, then just
> disregard that.
<
If you have an environment where you have a standard way to get/set
the rounding mode, and some programmer goes around that interface
with ASM to get/set the rounding mode, that programmer (ASM) should
be summarily dismissed.

MitchAlsup

Dec 30, 2022, 5:13:59 PM
On Friday, December 30, 2022 at 2:57:15 PM UTC-6, Terje Mathisen wrote:
> Michael S wrote:

> > As far as Microsoft is concerned, __float128/_Float128 does not exist.
> > So, naturally, they are not in the official ABI. As I said above, from Windows
> > ABI perspective those types, if implemented, are yet another 16-byte structures.
<
> Microsoft do support 16-byte return values all over the place in thje
> several 100's of AVX SIMD intrinsics which use __m256i as both argument
> and return types. Could you (ab)use those via some typedefs?
<
My 66000 supports 8 doubleword argument and return structures passed
via registers.

Thomas Koenig

Dec 31, 2022, 4:15:41 AM
MitchAlsup <Mitch...@aol.com> wrote:
Or, in the words of Numerical Recipes:

# Come the (computer) revolution, all persons found guilty of such
# criminal behavior will be summarily executed, and their programs
# won’t be!

Thomas Koenig

Dec 31, 2022, 4:21:08 AM
Thomas Koenig <tko...@netcologne.de> wrote:
Looking at the tests a bit more, this is probably due to not using
fesetexceptflag to signal abnormal conditions. The version for
Fortran code which uses the IEEE module would need to do so
(conditionally, so it will be well predicted to do nothing except
in pathological cases).

Michael S

Dec 31, 2022, 11:33:44 AM
Actually, it is much better without restrict.
Calling these routines with the destination equal to one or both sources
is practically useful.

Michael S

Dec 31, 2022, 11:37:55 AM
For exact underflow they are correct. I didn't bother to implement the inexact flag.
IMHO, it's the most stupid IEEE-754 feature.
For NaN, I am pretty sure of myself. Probably a bug in the test.

Michael S

Dec 31, 2022, 12:01:40 PM
Actually, I consider the System V binding of __float128 function parameters and
of the return value to SSE registers a rather big performance mistake.
__float128 has nothing to do with SSE/AVX; 100% of the processing is in the GPR
domain. Copying things back and forth between XMM and GPR just slows
things down. And the absence of non-volatile XMM registers in the System V ABI
makes matters worse.

The problem with the Microsoft ABI is not the non-usage of XMM, but the inability
to pass and return 16-byte structures in a pair of GPRs. Also, the number of
non-volatile GPRs is too big for these particular routines, but that's relatively
less harmful.

Michael S

Dec 31, 2022, 12:40:54 PM
The Intel intrinsic macro _MM_SET_ROUNDING_MODE() is probably used quite widely.
Also, the rounding mode can be changed by _mm_setcsr().
Microsoft has _controlfp().
Linux has something similar; I don't remember the name right now.

So there are many ways to change the rounding mode apart from asm and the official
fesetround() and fesetenv().

Also, what about threads?
Shouldn't your shadow copy of the rounding mode be thread-local?
If so, that immediately rules out a platform-independent C solution.

Terje Mathisen

Jan 1, 2023, 4:00:45 PM
How do you differentiate between SNaN and QNaN? Do you set or clear a
bit? If so, which bit?

All alternatives are legal, but only one combination is in use for each
target CPU afair.

Michael S

Jan 1, 2023, 4:19:07 PM
I don't. Should I?

> All alternatives are legal, but only one combination is in use for each
> target CPU afair.

I never encountered SNaN in practice. Who uses them and why?

MitchAlsup

Jan 1, 2023, 5:01:04 PM
On Sunday, January 1, 2023 at 3:00:45 PM UTC-6, Terje Mathisen wrote:
> Michael S wrote:

> How do you differentiate between SNaN and QNaN? Do you set or clear a
> bit? If so, which bit?
<
IEEE 754-2008 and 2019 both specify bit<52> == 1 -> QNaN; == 0 -> SNaN.
>
> All alternatives are legal, but only one combination is in use for each
> target CPU afair.
<
1985 only specified they had to be distinguished but not how. The later
standards cleaned that up.

MitchAlsup

Jan 1, 2023, 5:05:50 PM
to the 99.99% level: nobody uses them.
<
Kahan, Coonen, and 3 others use them to bound array accesses and other
nefarious programming tricks. Attempting to capture "compiler should not
have created code to reach this point."
<
If IEEE 754 mandated that exceptions were always enabled (excepting
INEXACT) and operands were delivered to signal handling routines (in
some manner) and results from signal handlers returned to excepting
code (in some manner) then they might be of some use to another 55
programmers world wide.
<
This still leaves us with 99.99% of programmers never use them.

Michael S

Jan 1, 2023, 5:07:11 PM
It seems x86-64, ARM64 and POWER FP hardware encode SNaN vs. QNaN the same way:
a QNaN has the MS bit of the significand = 1.
I'd guess the gcc test suite expects the same for soft binary128.
When my addsubq primitive sums Infs with opposite signs, it generates a number
that by this convention will be considered an SNaN, which is a violation of
the standard. Same for multiplication: for 0*Inf my routine generates something
that the test suite interprets as an SNaN.
Quite possibly that's the reason for the failing tests.

I can change it easily with no adverse effect on performance.
But maybe there are more mines on this minefield that I am not aware of?



Michael S

Jan 1, 2023, 6:12:31 PM
Done.

Terje Mathisen

Jan 2, 2023, 1:31:13 AM
Sounds like #ifdef to hide the differences, but stay with the library calls.

>
> Also, what about threads?
> Shouldn't your shadow copy of rounding mode be thread-local?
> If it does, that immediately prevents platform-independent C solution.

Some of the intrinsics are also platform-dependent, right?

I.e. when I wrote the world's fastest (at least at the time) Ogg Vorbis
decoder, I hid all the compiler intrinsics for SSE(n) and Altivec
behind an #ifdef layer which defined my own names for the common subset
that I needed. This worked so well that my code could use a single
library, allowing static linking and no runtime selection of the optimal
version. (The latter is what the previously fastest (closed source)
commercial offering was doing.)

Terje Mathisen

Jan 2, 2023, 2:06:34 AM
Think of them as trapping vs. non-trapping NaNs, except that you can
suppress all the traps if you so want, and most programming environments do.

Terje Mathisen

Jan 2, 2023, 2:25:00 AM
MitchAlsup wrote:
> On Sunday, January 1, 2023 at 3:00:45 PM UTC-6, Terje Mathisen wrote:
>> Michael S wrote:
>
>> How do you differentiate between SNaN and QNaN? Do you set or clear a
>> bit? If so, which bit?
> <
> IEEE 754-2008 and 2019 both specify bit<52> == 1 -> QNaN; == 0 -> SNaN.

So you quiet a SNaN by masking away that top mantissa bit. This also
means that when using binary sorting you get
zero/subnormal/normal/inf/qnan/snan which is fine except that you really
should have included the top-but-one mantissa bit as well!

If you generate an SNaN with just the top mantissa bit set, and later
on the 'S' is masked away, you now get Inf instead of QNaN.

In my favorite version the top two mantissa bits are defined like this:

00 : Inf
01 : QNaN
10 : None (un-initialized or missing variables)
11 : SNaN

>>
>> All alternatives are legal, but only one combination is in use for each
>> target CPU afair.
> <
> 1985 only specified they had to be distinguished but not how. The later
> standards cleaned that up.

Are you sure?

Afair, the DFP people got it right, so when that was added (in 2008 or
one iteration earlier?) they included exact/required decimal NaN handling.

All existing binary FP implementations are still legal, but anyone sane
designing a new cpu would use the same setup as for DFP.

Thomas Koenig

Jan 2, 2023, 3:56:22 AM
Michael S <already...@yahoo.com> wrote:

> I never encountered SNaN in practice. Who uses them and why?

Initializing all floating point variables to a signalling NaN is
an excellent way of catching undefined variable errors, if the
system supports them.

Michael S

Jan 2, 2023, 7:37:58 AM
In the routines in question I allowed myself to be heavily gcc-dependent.
However, I tried to avoid all other platform/CPU dependencies, except
for the requirements to have __int128 and to be consistently endian, i.e.
either little or big endian, with the same endianness for integers
and for floats.

There are a few SSE intrinsics in my code, under #ifdef, but they are not
required for correctness, and hopefully in gcc 13, 14 or 15 they will not
be needed for performance either.
No OS dependencies, as far as I can tell, except that the x86 performance quirk
depends on the ABI. With a better compiler, this dependency could also
be eliminated in the future.

Michael S

Jan 2, 2023, 7:46:58 AM
That works in a signal-friendly environment, like a single-threaded
command-line program that is not part of some larger software complex.
I.e., in the great scheme of things, approximately never.

BTW, did you try to torture updated files?

Thomas Koenig

Jan 3, 2023, 2:34:08