
What's the correct fmod()-algorithm


Bonita Montero

unread,
Apr 4, 2021, 10:32:39 AM4/4/21
to
If I implement fmod either with ...
x / y - trunc( x / y )
... or with ...
x - trunc( x / y ) * y
... this doesn't always give correct values, e.g. if x is very large
without a fraction and y is very small. So what's the correct algorithm
for fmod()?

Real Troll

unread,
Apr 4, 2021, 10:55:15 AM4/4/21
to
No problems here using this example:

#include <stdio.h>
#include <math.h>

int main()
{
float a, b;
int c;

a = 150.25;
b = 27.67;
c = 23;

printf("Remainder of %f / %d is %lf\n", a, c, fmod(a,c));
printf("Remainder of %f / %f is %lf\n", a, b, fmod(a,b));

return(0);

}

The result should be:

Remainder of 150.250000 / 23 is 12.250000
Remainder of 150.250000 / 27.670000 is 11.900000

Bonita Montero

unread,
Apr 4, 2021, 11:12:18 AM4/4/21
to
> If I implement fmod either with ...
>     x / y - trunc( x / y )

This won't work for:
x = 5290912749423342.0 and y = 10.0
The result is ...
0.18750000000000000
... although it should be integral.

> ... or with ..
>     x - trunc( x / y ) * y

This won't work for:
x = 52909127494233416 and y = 10.0
The result is ...
-8.0

So the naive solution doesn't work for such number combinations.


Bonita Montero

unread,
Apr 4, 2021, 11:15:01 AM4/4/21
to
> No problems here using this example:
>
> #include <stdio.h>
> #include <math.h>
>
> int main()
> {
> float a, b;
> int c;
>
> a = 150.25;
> b = 27.67;
> c = 23;
>
> printf("Remainder of %f / %d is %lf\n", a, c, fmod(a,c));
> printf("Remainder of %f / %f is %lf\n", a, b, fmod(a,b));
>
> return(0);
>
> }
>
> The result should be:
>
> Remainder of 150.250000 / 23 is 12.250000
> Remainder of 150.250000 / 27.670000 is 11.900000

Sorry, you're naive; look at my latest post. I'm looking for an fmod()
implementation and not for code which shows how to use fmod(), and it
must be really tricky.

Bart

unread,
Apr 4, 2021, 12:24:12 PM4/4/21
to
On 04/04/2021 16:12, Bonita Montero wrote:
>> If I implement fmod either with ...
>>      x / y - trunc( x / y )
>
> This won't work for:
>     x = 5290912749423342.0 and y = 10.0
> The result is ...
>     0.18750000000000000
> ... although it should be integral.
>
>> ... or with ..
>>      x - trunc( x / y ) * y

This is the correct formula (but watch out for negative values).


> This won't work for:
>     x = 52909127494233416 and y = 10.0
> The result is ...
>     -8.0
>
> So ne naive solution doesn't work for such number-combinations.

I don't have trunc() on Windows. If I emulate it with a simple version
that converts to int64 then I get the same result as you.

How is trunc() implemented? If it actually returns an int, then it can't
represent large integer values anyway (e.g. 1e100). And if it simply
converts double to long long, then it will not work for values needing
more than 52 bits of precision.

What do you get by printing the value of trunc(52909127494233416)?

If that is wrong, then the fmod result will be wrong.

Bonita Montero

unread,
Apr 4, 2021, 12:46:19 PM4/4/21
to
> What do you get by printing the value of trunc(52909127494233416)?
> If that is wrong, then the fmod result will be wrong.

No, that's the magic of floating point arithmetic.

Bart

unread,
Apr 4, 2021, 1:26:02 PM4/4/21
to
The reason it goes wrong is this:

x =52909127494233416.00000000000000000000
x/y =5290912749423342.00000000000000000000
trunc(x/y) =5290912749423342.00000000000000000000
(x/y)*y =52909127494233424.00000000000000000000

x/y (y is 10.0) should be 5290912749423341.6, but it gives ...3342.0.
Probably because it's reaching the limits of precision. The difference
between first and last figures is 8, rather than 6 (and with the wrong
sign).

I don't know how the internal fmod works; it might be at the bit level.
But it will also start giving problems at a certain point.

Bonita Montero

unread,
Apr 4, 2021, 1:29:07 PM4/4/21
to
> x/y (y is 10.0) should be 5290912749423341.6, but it gives ...3342.0.
> Probably because it's reaching the limits of precision. ...

Right, the difference between x / y and floor( x / y ) becomes larger
than the modulo should be or even negative. I'm currently checking the
floor()-implementation of glibc, but that's very complex.

Paavo Helde

unread,
Apr 4, 2021, 1:39:29 PM4/4/21
to
04.04.2021 19:23 Bart kirjutas:
> On 04/04/2021 16:12, Bonita Montero wrote:
>>> If I implement fmod either with ...
>>>      x / y - trunc( x / y )
>>
>> This won't work for:
>>      x = 5290912749423342.0 and y = 10.0
>> The result is ...
>>      0.18750000000000000
>> ... although it should be integral.
>>
>>> ... or with ..
>>>      x - trunc( x / y ) * y
>
> This is the correct formula (but watch out for negative values).

Arguably the correct definition for intdiv is floor(x/y), not trunc(x/y)
(this avoids disruptions at zero), with the matching definition for
modulus. Python has got this right, whereas C's fmod() is arguably wrong
with negative numbers.
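
A small sketch of that difference, assuming IEEE-754 doubles (the helper name floormod is made up for illustration):

#include <cmath>
#include <cstdio>

// Floor-based modulo: the result takes the sign of the divisor, as in Python.
double floormod(double x, double y)
{
    return x - std::floor(x / y) * y;
}

int main()
{
    std::printf("fmod(-7, 3)     = %g\n", std::fmod(-7.0, 3.0)); // -1 (sign of the dividend)
    std::printf("floormod(-7, 3) = %g\n", floormod(-7.0, 3.0));  //  2 (Python's -7 % 3)
}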

>
>
>> This won't work for:
>>      x = 52909127494233416 and y = 10.0
>> The result is ...
>>      -8.0

I bet fmod() is carefully coded to return most accurate results for all
operand values. If it were trivial to implement it via a simple formula,
there would be no need to include it in the standard library.

>>
>> So ne naive solution doesn't work for such number-combinations.
>
> I don't have trunc() on Windows. If I emulate it with a simple version
> the converts to int64 then I get the same result as you.

trunc() is present in MSVC, at least since VS2013. It returns
floating-point:

https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/trunc-truncf-truncl?view=msvc-160


Alf P. Steinbach

unread,
Apr 4, 2021, 6:38:11 PM4/4/21
to
cppreference says

<<
The double version of fmod behaves as if implemented as follows

double fmod(double x, double y)
{
#pragma STDC FENV_ACCESS ON
double result = std::remainder(std::fabs(x), (y = std::fabs(y)));
if (std::signbit(result)) result += y;
return std::copysign(result, x);
}
>>

<url: https://en.cppreference.com/w/cpp/numeric/math/fmod>

- Alf

Ian Collins

unread,
Apr 4, 2021, 6:38:50 PM4/4/21
to
On 05/04/2021 04:23, Bart wrote:
>
> I don't have trunc() on Windows. If I emulate it with a simple version
> the converts to int64 then I get the same result as you.


https://code.woboq.org/userspace/glibc/sysdeps/ieee754/dbl-64/s_trunc.c.html

It's usually implemented with a builtin (__builtin_trunc) in gcc.

--
Ian.

Keith Thompson

unread,
Apr 4, 2021, 7:24:09 PM4/4/21
to
Bart <b...@freeuk.com> writes:
[...]
> I don't have trunc() on Windows. If I emulate it with a simple version
> the converts to int64 then I get the same result as you.
[...]

How do you not have trunc()? It's been standard C since C99, and
standard C++ since, what, C++11?

#include <math.h>
double trunc(double x);
float truncf(float x);
long double truncl(long double x);

Description
The trunc functions round their argument to the integer value, in
floating format, nearest to but no larger in magnitude than the
argument.

--
Keith Thompson (The_Other_Keith) Keith.S.T...@gmail.com
Working, but not speaking, for Philips Healthcare
void Void(void) { Void(); } /* The recursive call of the void */

Richard Damon

unread,
Apr 4, 2021, 7:32:20 PM4/4/21
to
The basic issue is that you need to realize that computing the value of
x / y in floating point is going to be, in general, inexact,
and that to use this sort of formula, you are going to need to take this
into account.

In particular, if the rounding of x / y causes it to round up to an
integer value, the value of trunc(x/y) will be off, so you need to guard
against that.

One option would be to see if trunc(x / y)*y is bigger than x, and if so
then reduce this value by y to effectively decrease the trunc by one (or
even save the trunc value and decrease it by 1). (This assumes that x
and y are positive; if they might be negative you need to handle those
signs properly.)

You also need to handle the case where the value of x/y doesn't even
have a 1's digit because the value is too big. In this case you have
really had a total loss of precision; I would suspect the defined answer
in this case is 0, and accept that it just won't match at
all what you would get with x - trunc(x/y)*y, since that might require
the value to be bigger than y.
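
A minimal sketch of that guard, assuming finite, positive x and y (fmod_guarded is just an illustrative name). Note that it only forces the result back into [0, y); it does not make it exact: on a typical IEEE-754 round-to-nearest system, for x = 52909127494233416 and y = 10 it returns 2, while the exact remainder is 6.

#include <cmath>
#include <cstdio>

// Naive remainder plus the range correction described above.
// Assumes finite, positive x and y.
double fmod_guarded(double x, double y)
{
    double q = std::trunc(x / y);
    double r = x - q * y;
    if (r < 0.0)            // x/y rounded up past an integer: trunc() was one too big
        r += y;
    else if (r >= y)        // the opposite rounding direction
        r -= y;
    return r;
}

int main()
{
    double x = 52909127494233416.0, y = 10.0;
    std::printf("guarded: %g\n", fmod_guarded(x, y)); // 2 here: in range, but not exact
    std::printf("fmod   : %g\n", std::fmod(x, y));    // 6, the exact remainder
}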

Bart

unread,
Apr 4, 2021, 8:00:52 PM4/4/21
to
On 05/04/2021 00:23, Keith Thompson wrote:
> Bart <b...@freeuk.com> writes:
> [...]
>> I don't have trunc() on Windows. If I emulate it with a simple version
>> the converts to int64 then I get the same result as you.
> [...]
>
> How do you not have trunc()? It's been standard C since C99, and
> standard C++ since, what, C++11?
>
> #include <math.h>
> double trunc(double x);
> float truncf(float x);
> long double truncl(long double x);
>
> Description
> The trunc functions round their argument to the integer value, in
> floating format, nearest to but no larger in magnitude than the
> argument.

I forgot to check other compilers, which do have it.

But it's not part of msvcrt.dll which my bcc relies on.

It doesn't seem to be defined on top of anything else either. But there
/is/ 'floor' which I'd forgotten about.


Keith Thompson

unread,
Apr 4, 2021, 10:04:55 PM4/4/21
to
Bart <b...@freeuk.com> writes:
> On 05/04/2021 00:23, Keith Thompson wrote:
>> Bart <b...@freeuk.com> writes:
>> [...]
>>> I don't have trunc() on Windows. If I emulate it with a simple version
>>> the converts to int64 then I get the same result as you.
>> [...]
>>
>> How do you not have trunc()? It's been standard C since C99, and
>> standard C++ since, what, C++11?
>>
>> #include <math.h>
>> double trunc(double x);
>> float truncf(float x);
>> long double truncl(long double x);
>>
>> Description
>> The trunc functions round their argument to the integer value, in
>> floating format, nearest to but no larger in magnitude than the
>> argument.
>
> I forget to check other compilers which do have it.
>
> But it's not part of msvcrt.dll which my bcc relies on.

Wikipedia says that's the C standard library for the Visual C++ (MSVC)
compiler from version 4.2 to 6.0. I have that file on my Windows 10
laptop, and it appears that it doesn't provide the trunc() function --
which means that the compiler has to find it somewhere else.

Are you using an old version of Microsoft's C compiler? Or some other
compiler that doesn't know where to look for library functions that
aren't in msvcrt.dll? I'm able to use trunc() with Visual Studio 2019.

> It doesn't seem to be defined on top of anything else either. But
> there /is/ 'floor' which I'd forgotten about.

Kaz Kylheku

unread,
Apr 4, 2021, 11:07:00 PM4/4/21
to
["Followup-To:" header set to comp.lang.c.]
While mathematically, fmod(x, y) is always defined for real x, y, y /=
0, it simply might not make sense in floating point for
fabs(x) >> fabs(y).

Think about it.

Suppose that x is so large that the distance between x and x + epsilon
(x and the next consecutive floating-point value) is larger than y.

The intuition behind fmod(x, y) is that if we repeat y enough times
(start with 0 and add y), we will get to a value that is just below x.
The remaining gap to x is then fmod(x, y). But that point "just below x"
is not necessarily representable in that range where consecutive
floats are farther apart than y!

Nevertheless, the mathematical value fmod(x, y) exists, and is some real
number in the [0, y) range, which may have an excellent floating-point
approximation, regardless of the above.

Take the next part with a grain of salt; I am no numerical analyst.

I'm thinking that it could be fruitful to scale the exponent (by powers
of two) of y, to get it into the "ballpark" as x.

So that is, suppose we multiply y by pow(2, k), just so that
y * pow(2, k) <= x, but y * pow(2, k + 1) > x. I.e. k is the highest
exponent such that y * pow(2, k) does not exceed x.

Multiplications by powers of two preserve the mantissa exactly: they are
precise.

Having procured this k we, could subtract these two terms to obtain z:

z = x - y*pow(2, k)

Note that all we have done is subtract a multiple of y from x. The
result has the same fmod against y, namely:

fmod(z, y) = fmod(x, y)

we can calculate fmod(z, y) instead. Even though x >> y, it is not
the case that z >> y.

If we can work with floats at the bit level, the calculation for determining k
should be fast. Basically we just take the exponent field from x, and plant it
into y, tweaking it by at most one to produce a value that does not exceed x,
by comparing mantissas.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
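
A rough sketch of this idea, assuming positive, finite, normal operands and IEEE-754 doubles (fmod_by_scaling is an illustrative name; frexp/ldexp are just one way to do the exponent manipulation described above). Each step subtracts the largest y * 2^k that does not exceed the running remainder; the two operands are then within a factor of two of each other, so the subtraction is exact, and the final remainder is exact whenever x and y are exactly representable.

#include <cmath>
#include <cstdio>

// Reduce x by multiples of y scaled by powers of two; no NaN/Inf/sign handling.
double fmod_by_scaling(double x, double y)
{
    double r = x;
    while (r >= y) {
        int er, ey;
        std::frexp(r, &er);                   // r = m_r * 2^er, m_r in [0.5, 1)
        std::frexp(y, &ey);                   // y = m_y * 2^ey
        double step = std::ldexp(y, er - ey); // y * 2^(er-ey): scaling by 2^k is exact
        if (step > r)
            step = std::ldexp(step, -1);      // overshot by one power of two
        r -= step;                            // step <= r < 2*step, so this subtraction is exact
    }
    return r;
}

int main()
{
    double x = 52909127494233416.0, y = 10.0;
    std::printf("scaled: %.17g\n", fmod_by_scaling(x, y)); // 6
    std::printf("fmod  : %.17g\n", std::fmod(x, y));       // 6
}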

Chris M. Thomasson

unread,
Apr 5, 2021, 12:06:23 AM4/5/21
to
I do not think that standard C/C++ gives any hardcore guarantees on the
level of precision it's going to give for fmod. You should reuse existing
code from compilers.

anti...@math.uni.wroc.pl

unread,
Apr 5, 2021, 12:46:23 AM4/5/21
to
In comp.lang.c Bart <b...@freeuk.com> wrote:
> On 04/04/2021 17:46, Bonita Montero wrote:
> >> What do you get by printing the value of trunc(52909127494233416)?
> >> If that is wrong, then the fmod result will be wrong.
> >
> > No, that's the magic of floating point arithemtics.
>
> The reason it goes wrong is this:
>
> x =52909127494233416.00000000000000000000
> x/y =5290912749423342.00000000000000000000
> trunc(x/y) =5290912749423342.00000000000000000000
> (x/y)*y =52909127494233424.00000000000000000000
Actually (x/y)*y does not equal 52909127494233424, and that is the
core problem in this example. Namely, decimal 52909127494233424
is not representable exactly in floating point.


> x/y (y is 10.0) should be 5290912749423341.6, but it gives ...3342.0.

This does not matter much in this example: 5290912749423342 is the
closest integer to the quotient. Look:

    52909127494233416 - 5290912749423342*10  =  -4

The result is -4 and it is enough to add 10 to get the exact result.

--
Waldek Hebisch
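
The same arithmetic can be checked in exact 64-bit integer arithmetic (a small verification sketch; both intermediate values fit comfortably below 2^63):

#include <cstdint>
#include <cstdio>

int main()
{
    int64_t x = 52909127494233416;
    int64_t q = 5290912749423342;   // the rounded quotient discussed above
    std::printf("%lld\n", (long long)(x - q * 10));      // -4
    std::printf("%lld\n", (long long)(x - q * 10 + 10)); //  6, the exact remainder
}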

Chris M. Thomasson

unread,
Apr 5, 2021, 2:40:02 AM4/5/21
to
On 4/4/2021 7:32 AM, Bonita Montero wrote:
The fight between a programmer and a floating point number:

https://youtu.be/TJRaLnLDMWg

Bonita Montero

unread,
Apr 5, 2021, 2:45:23 AM4/5/21
to
> I do not think that standard C/C++ gives any hardcore guarantees on the
> level of precision its going to give a fmod. You should reap existing
> code in compilers.

fmod works correctly in current implementations
and gives results >= 0 and < y.

Bonita Montero

unread,
Apr 5, 2021, 2:47:07 AM4/5/21
to
> <<
> The double version of fmod behaves as if implemented as follows
>
> double fmod(double x, double y)
> {
> #pragma STDC FENV_ACCESS ON
>    double result = std::remainder(std::fabs(x), (y = std::fabs(y)));
>    if (std::signbit(result)) result += y;
>    return std::copysign(result, x);
> }
Sorry, that doesn't represent the difficulty in implementing fmod().

Bonita Montero

unread,
Apr 5, 2021, 2:50:00 AM4/5/21
to
That's a correct trunc():

double trunc( double d )
{
    static_assert(sizeof(double) == sizeof(uint64_t), "sizeof(double) not equal to sizeof(uint64_t)");
    static_assert(numeric_limits<double>::is_iec559, "double must be IEEE-754");
    // assume size_t is our CPU's native register-width
    static_assert(sizeof(size_t) == sizeof(uint64_t) || sizeof(size_t) == sizeof(uint32_t),
                  "register-width must be 32 or 64 bit");
    if constexpr( sizeof(size_t) == sizeof(uint64_t) )
    {
        // we have 64 bit registers
        unsigned const MANTISSA_BITS = 52,
                       EXP_BIAS      = 0x3FF,
                       INF_NAN_BASE  = 0x7FF;
        uint64_t const EXP_MASK                = (uint64_t)0x7FF << MANTISSA_BITS,
                       SIGN_MASK               = (uint64_t)0x800 << MANTISSA_BITS,
                       MANTISSA_MASK           = 0x000FFFFFFFFFFFFFu,
                       NAN_MASK                = 0x0008000000000000u,
                       MIN_INTEGRAL_DIGITS_EXP = (uint64_t)EXP_BIAS << MANTISSA_BITS,
                       MIN_INTEGRAL_ONLY_EXP   = (uint64_t)(EXP_BIAS + MANTISSA_BITS) << MANTISSA_BITS,
                       INF_NAN_EXP             = (uint64_t)INF_NAN_BASE << MANTISSA_BITS;
        int64_t const MANTISSA_SHIFT_MASK = 0xFFF0000000000000u;
        union
        {
            double   du;
            uint64_t dx;
        };
        du = d;
        uint64_t exp = dx & EXP_MASK;
        if( exp >= MIN_INTEGRAL_DIGITS_EXP )
            // value has integral digits
            if( exp < MIN_INTEGRAL_ONLY_EXP )
            {
                // there are fraction-digits to mask out, mask them
                unsigned shift = (unsigned)(exp >> MANTISSA_BITS) - EXP_BIAS;
                dx &= MANTISSA_SHIFT_MASK >> shift;
                return du;
            }
            else
                if( exp < INF_NAN_EXP )
                    // value is integral
                    return du;
                else
                {
                    uint64_t mantissa = dx & MANTISSA_MASK;
                    // infinite, NaN: return value
                    if( !mantissa || mantissa & NAN_MASK )
                        return du;
                    // SNaN: raise exception on SNaN if necessary
                    feraiseexcept( FE_INVALID );
                    return du;
                }
        else
        {
            // below +/-1.0: return +/-0.0
            dx &= SIGN_MASK;
            return du;
        }
    }
    else if constexpr( sizeof(size_t) == sizeof(uint32_t) )
    {
        // we have 32 bit registers
        unsigned const MANTISSA_BITS    = 52,
                       HI_MANTISSA_BITS = 20,
                       EXP_BIAS         = 0x3FF,
                       INF_NAN_BASE     = 0x7FF;
        uint32_t const EXP_MASK                   = (uint32_t)0x7FFu << HI_MANTISSA_BITS,
                       SIGN_MASK                  = (uint32_t)0x800u << HI_MANTISSA_BITS,
                       HI_MANTISSA_MASK           = 0x000FFFFFu,
                       NAN_MASK                   = 0x00080000u,
                       MIN_INTEGRAL_DIGITS_EXP    = (uint32_t)EXP_BIAS << HI_MANTISSA_BITS,
                       MAX_INTEGRAL32_EXP         = (uint32_t)(EXP_BIAS + HI_MANTISSA_BITS) << HI_MANTISSA_BITS,
                       MIN_INTEGRAL_ONLY_EXP      = (uint32_t)(EXP_BIAS + MANTISSA_BITS) << HI_MANTISSA_BITS,
                       INF_NAN_EXP                = (uint32_t)INF_NAN_BASE << HI_MANTISSA_BITS,
                       NEG_LO_MANTISSA_SHIFT_MASK = 0xFFFFFFFFu;
        int32_t const HI_MANTISSA_SHIFT_MASK = 0xFFF00000;
        union
        {
            double du;
            struct
            {
                uint32_t dxLo;
                uint32_t dxHi;
            };
        };
        du = d;
        uint32_t exp = dxHi & EXP_MASK;
        if( exp >= MIN_INTEGRAL_DIGITS_EXP )
            // value has integral digits
            if( exp < MIN_INTEGRAL_ONLY_EXP )
                // there are fraction-digits to mask out
                if( exp <= MAX_INTEGRAL32_EXP )
                {
                    // the fraction digits are in the upper dword, mask them and zero the lower dword
                    unsigned shift = (unsigned)(exp >> HI_MANTISSA_BITS) - EXP_BIAS;
                    dxHi &= HI_MANTISSA_SHIFT_MASK >> shift;
                    dxLo  = 0;
                    return du;
                }
                else
                {
                    // the fraction digits are in the lower dword, mask them
                    unsigned shift = (unsigned)(exp >> HI_MANTISSA_BITS) - EXP_BIAS - HI_MANTISSA_BITS;
                    dxLo &= ~(NEG_LO_MANTISSA_SHIFT_MASK >> shift);
                    return du;
                }
            else
                if( exp < INF_NAN_EXP )
                    // value is integral
                    return du;
                else
                {
                    uint32_t hiMantissa = dxHi & HI_MANTISSA_MASK;
                    // infinite, NaN: return value
                    if( !(hiMantissa | dxLo) || hiMantissa & NAN_MASK )
                        return du;
                    // SNaN: raise exception on SNaN if necessary
                    feraiseexcept( FE_INVALID );
                    return du;
                }
        else
        {
            // below +/-1.0: return +/-0.0
            dxHi &= SIGN_MASK;
            dxLo  = 0;
            return du;
        }
    }
}

Paavo Helde

unread,
Apr 5, 2021, 3:17:45 AM4/5/21
to
05.04.2021 03:00 Bart kirjutas:
> On 05/04/2021 00:23, Keith Thompson wrote:
>> Bart <b...@freeuk.com> writes:
>> [...]
>>> I don't have trunc() on Windows. If I emulate it with a simple version
>>> the converts to int64 then I get the same result as you.
>> [...]
>>
>> How do you not have trunc()?  It's been standard C since C99, and
>> standard C++ since, what, C++11?
>>
>>      #include <math.h>
>>      double trunc(double x);
>>      float truncf(float x);
>>      long double truncl(long double x);
>>
>>      Description
>>      The trunc functions round their argument to the integer value, in
>>      floating format, nearest to but no larger in magnitude than the
>>      argument.
>
> I forget to check other compilers which do have it.
>
> But it's not part of msvcrt.dll which my bcc relies on.

In my setup, trunc() resides in ucrtbase.dll
(C:\Windows\System32\ucrtbase.dll). UCRT stands for "universal C
run-time" which is an optional component when installing Visual Studio
(I gather "universal" means it does not depend on the MSVC version).

Bonita Montero

unread,
Apr 5, 2021, 3:49:36 AM4/5/21
to
That's the code from glibc:

/* @(#)e_fmod.c 5.1 93/09/24 */
/*
* ====================================================
* Copyright (C) 1993 by Sun Microsystems, Inc. All rights reserved.
*
* Developed at SunPro, a Sun Microsystems, Inc. business.
* Permission to use, copy, modify, and distribute this
* software is freely granted, provided that this notice
* is preserved.
* ====================================================
*/

#if defined(LIBM_SCCS) && !defined(lint)
static char rcsid[] = "$NetBSD: e_fmod.c,v 1.8 1995/05/10 20:45:07 jtc Exp $";
#endif

/*
* __ieee754_fmod(x,y)
* Return x mod y in exact arithmetic
* Method: shift and subtract
*/

#include "math.h"
#include "math_private.h"

#ifdef __STDC__
static const double one = 1.0, Zero[] = { 0.0, -0.0, };
#else
static double one = 1.0, Zero[] = { 0.0, -0.0, };
#endif

#ifdef __STDC__
double __ieee754_fmod( double x, double y )
#else
double __ieee754_fmod( x, y )
double x, y;
#endif
{
int32_t n, hx, hy, hz, ix, iy, sx, i;
u_int32_t lx, ly, lz;

EXTRACT_WORDS( hx, lx, x );
EXTRACT_WORDS( hy, ly, y );
sx = hx & 0x80000000; /* sign of x */
hx ^= sx; /* |x| */
hy &= 0x7fffffff; /* |y| */

/* purge off exception values */
    if( (hy | ly) == 0 || (hx >= 0x7ff00000) || /* y=0, or x not finite */
((hy | ((ly | -ly) >> 31)) > 0x7ff00000) ) /* or y is NaN */
return (x * y) / (x * y);
if( hx <= hy )
{
if( (hx < hy) || (lx < ly) ) return x; /* |x|<|y| return x */
if( lx == ly )
return Zero[(u_int32_t)sx >> 31]; /* |x|=|y| return x*0*/
}

/* determine ix = ilogb(x) */
if( hx < 0x00100000 )
{ /* subnormal x */
if( hx == 0 )
{
for( ix = -1043, i = lx; i > 0; i <<= 1 ) ix -= 1;
}
else
{
for( ix = -1022, i = (hx << 11); i > 0; i <<= 1 ) ix -= 1;
}
}
else ix = (hx >> 20) - 1023;

/* determine iy = ilogb(y) */
if( hy < 0x00100000 )
{ /* subnormal y */
if( hy == 0 )
{
for( iy = -1043, i = ly; i > 0; i <<= 1 ) iy -= 1;
}
else
{
for( iy = -1022, i = (hy << 11); i > 0; i <<= 1 ) iy -= 1;
}
}
else iy = (hy >> 20) - 1023;

/* set up {hx,lx}, {hy,ly} and align y to x */
if( ix >= -1022 )
hx = 0x00100000 | (0x000fffff & hx);
else
{ /* subnormal x, shift x to normal */
n = -1022 - ix;
if( n <= 31 )
{
hx = (hx << n) | (lx >> (32 - n));
lx <<= n;
}
else
{
hx = lx << (n - 32);
lx = 0;
}
}
if( iy >= -1022 )
hy = 0x00100000 | (0x000fffff & hy);
else
{ /* subnormal y, shift y to normal */
n = -1022 - iy;
if( n <= 31 )
{
hy = (hy << n) | (ly >> (32 - n));
ly <<= n;
}
else
{
hy = ly << (n - 32);
ly = 0;
}
}

/* fix point fmod */
n = ix - iy;
while( n-- )
{
hz = hx - hy; lz = lx - ly; if( lx < ly ) hz -= 1;
if( hz < 0 )
{
hx = hx + hx + (lx >> 31); lx = lx + lx;
}
else
{
if( (hz | lz) == 0 ) /* return sign(x)*0 */
return Zero[(u_int32_t)sx >> 31];
hx = hz + hz + (lz >> 31); lx = lz + lz;
}
}
hz = hx - hy; lz = lx - ly; if( lx < ly ) hz -= 1;
if( hz >= 0 )
{
hx = hz; lx = lz;
}

/* convert back to floating value and restore the sign */
if( (hx | lx) == 0 ) /* return sign(x)*0 */
return Zero[(u_int32_t)sx >> 31];
while( hx < 0x00100000 )
{ /* normalize x */
hx = hx + hx + (lx >> 31); lx = lx + lx;
iy -= 1;
}
if( iy >= -1022 )
{ /* normalize output */
hx = ((hx - 0x00100000) | ((iy + 1023) << 20));
INSERT_WORDS( x, hx | sx, lx );
}
else
{ /* subnormal output */
n = -1022 - iy;
if( n <= 20 )
{
lx = (lx >> n) | ((u_int32_t)hx << (32 - n));
hx >>= n;
}
else if( n <= 31 )
{
lx = (hx << (32 - n)) | (lx >> n); hx = sx;
}
else
{
lx = hx >> (n - 32); hx = sx;
}
INSERT_WORDS( x, hx | sx, lx );
x *= one; /* create necessary signal */
}
return x; /* exact output */
}

Whatever it does ...

Kenny McCormack

unread,
Apr 5, 2021, 4:54:48 AM4/5/21
to
In article <874kgl5...@nosuchdomain.example.com>,
Keith Thompson <Keith.S.T...@gmail.com> wrote:
...
>> But it's not part of msvcrt.dll which my bcc relies on.
>
>Wikipedia says that's the C standard library for the Visual C++ (MSVC)
>compiler from version 4.2 to 6.0. I have that file on my Windows 10
>laptop, and it appears that it doesn't provide the trunc() function --
>which means that the compiler has to find it somewhere else.

Missing the point as usual.

The point is that you can use msvcrt.dll without using Microsoft's C
compiler at all. And, in fact, many programs and systems do exactly that.

Among them are:
1) MinGW (at least the original versions of it; it seems to have
transmogrified itself into something else lately)
2) Bart's language(s)
3) My own programming using a certain Windows scripting language; when
I need to do "Unix-y" things from the scripting language, I make
calls to msvcrt.dll.

None of these programs/systems have anything to do with having or using
MS's compiler.

--
No puppet.
No puppet.
You're a puppet.

Bart

unread,
Apr 5, 2021, 5:35:26 AM4/5/21
to
Thanks, if I link with that, then it works. (Also msvcr120.dll.)

ucrtbase.dll has double the number of functions of msvcrt, and is newer,
so may be a better default library. I don't recall it, but if I look
inside my compiler, it had:

searchlibs[1]:="ucrtbase"
searchlibs[1]:="msvcrt"

So it has been tried in the past, but here the names are being stored in
the same place!

However, if I try it as the sole library, then it doesn't appear to have
printf. So perhaps the intention was to have both.

Paavo Helde

unread,
Apr 5, 2021, 7:02:48 AM4/5/21
to
05.04.2021 12:35 Bart kirjutas:
> ucrtbase.dll has double the number of functions of msvcrt, and is newer,
> so may be a better default library. I don't recall it, but if I look
> inside my compiler, it had:
>
>    searchlibs[1]:="ucrtbase"
>    searchlibs[1]:="msvcrt"
>
> So it has been tried in the past, but here the names are being stored in
> the same place!
>
> However, if I try it as the sole library, then it doesn't appear to have
> printf. So perhaps the intention was to have both.

With my minimal VS2019 app calling printf(), fmod() and trunc() it loads
the following DLL-s:

advapi32.dll
kernel32.dll
KernelBase.dll
msvcrt.dll
ntdll.dll
rpcrt4.dll
sechost.dll
ucrtbase.dll
vcruntime140.dll
version.dll

This indeed includes both msvcrt and ucrtbase.

Alf P. Steinbach

unread,
Apr 5, 2021, 7:34:58 AM4/5/21
to
Maybe not, but as I understand it that's the spec.

It should not be difficult to make it more complex. ;-)


- Alf

Manfred

unread,
Apr 5, 2021, 12:38:21 PM4/5/21
to
On 4/5/2021 9:49 AM, Bonita Montero wrote:
> That's the code from glibc:
>
> /* @(#)e_fmod.c 5.1 93/09/24 */
> /*
> * ====================================================
> * Copyright (C) 1993 by Sun Microsystems, Inc. All rights reserved.
> *
> * Developed at SunPro, a Sun Microsystems, Inc. business.
> * Permission to use, copy, modify, and distribute this
> * software is freely granted, provided that this notice
> * is preserved.
> * ====================================================
> */
>
> #if defined(LIBM_SCCS) && !defined(lint)
> static char rcsid[] = "$NetBSD: e_fmod.c,v 1.8 1995/05/10 20:45:07 jtc
> Exp $";
> #endif
>
> /*
> * __ieee754_fmod(x,y)
> * Return x mod y in exact arithmetic
> * Method: shift and subtract
> */
[...]
>
> Whatever it does ...

It does what it says it does: "shift and subtract".

In general, fmod is pretty close to a division operation (as it happens,
shift and subtract is a common method for binary division).

More to the point of your question, the answer depends on the context.
For one thing, as you can see floating point arithmetic is heavily
dependent on its representation (obviously ieee754 in this case).

In other words, one might ask what is the purpose of your request, so
that its context might shed some light on the path to the solution.

It is no surprise that a naive approach gives incorrect results when
pushing close to the accuracy limits of the representation.

BTW, thanks for posting this.
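
For readers who have not seen it, here is a toy version of "shift and subtract" on plain unsigned integers; the glibc routine applies the same idea to the extracted mantissas of x and y. This is only an illustration with an invented name, not the library code:

#include <cstdint>
#include <cstdio>

// Binary long division, remainder only: repeatedly subtract y shifted left
// as far as it still fits under the running remainder. Assumes y != 0.
uint64_t mod_shift_subtract(uint64_t x, uint64_t y)
{
    for (int k = 63; k >= 0; --k)
        if ((x >> k) >= y)   // y * 2^k still fits into the remainder
            x -= y << k;     // shift and subtract
    return x;
}

int main()
{
    std::printf("%llu\n", (unsigned long long)mod_shift_subtract(52909127494233416ULL, 10)); // 6
    std::printf("%llu\n", (unsigned long long)mod_shift_subtract(150, 27));                  // 15
}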

James Kuyper

unread,
Apr 5, 2021, 10:44:13 PM4/5/21
to
On 4/5/21 12:06 AM, Chris M. Thomasson wrote:
...
> I do not think that standard C/C++ gives any hardcore guarantees on the
> level of precision its going to give a fmod. You should reap existing
> code in compilers.

C:
"The accuracy of the floating-point operations (+, -, *, /) and of the
library functions in <math.h> and <complex.h> that return floating-point
results is implementation-defined, as is the accuracy of the conversion
between floating-point internal representations and string
representations performed by the library functions in <stdio.h>,
<stdlib.h>, and <wchar.h>. The implementation may state that the
accuracy is unknown." (5.2.4.2.2p6)

This means that, should an implementation's documentation of the
accuracy of those operations so specify, the subtractions in the
expression LDBL_MAX - LDBL_MIN < LDBL_MIN - LDBL_MAX could be evaluated
with sufficient inaccuracy to give a result of "true". Despite that
fact, C is not in fact useless for portable floating point math.

"The following macro names are conditionally defined by the implementation:
...
__STDC_IEC_559__ The integer constant 1, intended to indicate
conformance to the specifications in annex F (IEC 60559 floating-point
arithmetic)." (6.10.8.3p1)

While annex F never directly addresses the accuracy requirements, it
does say:

"An implementation that defines _ _STDC_IEC_559_ _ shall conform to the
specifications in this annex. 356) Where a binding between the C
language and IEC 60559 is indicated, the IEC 60559-specified behavior is
adopted by reference, unless stated otherwise." (F.1p[)

IEC 60559 (== IEC/IEEE 754) specifies the behavior of those operations
far more tightly than the C standard itself does. In fact, its accuracy
requirements are pretty much the tightest requirements that it's
feasible to mandate. If you're pushing the limits of floating point
accuracy, make sure to check whether __STDC_IEC_559__ is defined.

C++:
"Note: This document imposes no requirements on the accuracy of
floating-point operations; see also 17.3. — end note" (6.8.1p12).

Re: std::numeric_limits
"static constexpr bool is_iec559;
true if and only if the type adheres to ISO/IEC/IEEE 60559." (17.3.5.1p57)

So the C++ standard says pretty much the same thing as the C standard,
though more succinctly and in less detail, and it doesn't require
documentation of the accuracy. If you're pushing the limits of floating
point accuracy in C++, you should always check the value of is_iec559.
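
A minimal way to perform those checks (it merely reports what the implementation claims; nothing here is mandated):

#include <cstdio>
#include <limits>

int main()
{
#ifdef __STDC_IEC_559__
    std::puts("__STDC_IEC_559__ is defined");
#else
    std::puts("__STDC_IEC_559__ is not defined");
#endif
    std::printf("std::numeric_limits<double>::is_iec559 = %d\n",
                std::numeric_limits<double>::is_iec559);
}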

James Kuyper

unread,
Apr 5, 2021, 10:44:37 PM4/5/21
to
On 4/5/21 7:34 AM, Alf P. Steinbach wrote:
> On 05.04.2021 08:46, Bonita Montero wrote:
>>> <<
>>> The double version of fmod behaves as if implemented as follows
>>>
>>> double fmod(double x, double y)
>>> {
>>> #pragma STDC FENV_ACCESS ON
>>>     double result = std::remainder(std::fabs(x), (y = std::fabs(y)));
>>>     if (std::signbit(result)) result += y;
>>>     return std::copysign(result, x);
>>> }

That is essentially the same as the code from the C standard,
F.10.7.1p3, except for the addition of "std::". As such, it only applies
when __STDC_IEC_559__ is pre#defined by the implementation.

>> Sorry, that doesn't represent the difficulty in implenting fmod().
>
> Maybe not, but as I understand it that's the spec.

Note that (as indicated by my response on another thread), unless
__STDC_IEC_559__ is pre#defined by the implementation, the accuracy of
fmod() can be arbitrarily bad, so long as the implementation documents
that it is that bad. However, since remainder()'s behavior is defined by
a cross-reference to IEC 60559, its accuracy is required to match the
much tighter specifications of that standard, whether or not
__STDC_IEC_559__ is pre#defined.

The C++ standard says nothing of its own about this issue, merely
cross-referencing the C standard.

Juha Nieminen

unread,
Apr 6, 2021, 1:22:09 AM4/6/21
to
In comp.lang.c++ Bonita Montero <Bonita....@gmail.com> wrote:
> Sorry, you're naive;

I don't understand why you feel the need to do this.
Are you really such a petty person that you can't just respond normally
to people, and instead you just have to belittle people?

Juha Nieminen

unread,
Apr 6, 2021, 1:30:29 AM4/6/21
to
In comp.lang.c++ Bonita Montero <Bonita....@gmail.com> wrote:
> This won't work for:
> x = 5290912749423342.0 and y = 10.0

A double-precision floating point has a 53-bit mantissa, which is enough
to represent accurately 15 significant decimal digits. If you try a value
with 16 significant decimal digits, the last digit will not be accurate
(because there aren't enough bits to represent it completely).

This means that a double cannot store 5290912749423342.0 accurately.
The last digit will be wrong.

Bonita Montero

unread,
Apr 6, 2021, 2:50:51 AM4/6/21
to
> A double-precision floating point has a 53-bit mantissa, which is enough
> to represent accurately 15 significant decimal digits. If you try a value
> with 16 significant decimal digits, the last digit will not be accurate
> (because there aren't enough bits to represent completely).

That was clear from the start and there is no need
to emphasize it again. I just asked for a solution.

MrSpook_...@u3_eza.net

unread,
Apr 6, 2021, 4:05:59 AM4/6/21
to
If you don't like fmod() then implement it yourself using some big number
library. The maths is trivial:

ie: printf("%f = %f\n",fmod(f1,f2),f1 - f2 * (int)(f1 / f2));

Bonita Montero

unread,
Apr 6, 2021, 4:22:10 AM4/6/21
to
> If you don't like fmod() then implement it yourself using some big number
> library. ...

That would be too slow.
The solution given in the glibc source isn't fast either,
but it's for sure an order of magnitude faster than using a bignum lib.

MrSpook...@59hx8krs2bo3s8a.com

unread,
Apr 6, 2021, 4:59:43 AM4/6/21
to
You can have speed or you can have unlimited floating point accuracy, pick one.
Fast floating point is limited by the abilities of the processors FPU.


Ian Collins

unread,
Apr 6, 2021, 5:15:01 AM4/6/21
to
On 05/04/2021 10:38, Ian Collins wrote:
> On 05/04/2021 04:23, Bart wrote:
>>
>> I don't have trunc() on Windows. If I emulate it with a simple version
>> the converts to int64 then I get the same result as you.
>
>
> https://code.woboq.org/userspace/glibc/sysdeps/ieee754/dbl-64/s_trunc.c.html
>
> It's usually implement with a builtin (__builtin_trunc) in gcc.
>

For

double f( double d )
{
return trunc(d);
}

gcc generates

f:
.LFB0:
.cfi_startproc
endbr64
movsd .LC1(%rip), %xmm2
movsd .LC0(%rip), %xmm4
movapd %xmm0, %xmm3
movapd %xmm0, %xmm1
andpd %xmm2, %xmm3
ucomisd %xmm3, %xmm4
jbe .L2
cvttsd2siq %xmm0, %rax
pxor %xmm0, %xmm0
andnpd %xmm1, %xmm2
cvtsi2sdq %rax, %xmm0
orpd %xmm2, %xmm0
.L2:
ret

--
Ian.

Bonita Montero

unread,
Apr 6, 2021, 6:37:49 AM4/6/21
to
>> That would be too slow.
>> The solution given from the glibc-source isn't fast as well,
>> but it's for sure a magnitude faster than using a bignum-lib.

> You can have speed or you can have unlimited floating point accuracy, pick one.

No one needs unlimited floating-point accuracy.
Bignum libraries are only suitable for cryptographic purposes,
and these don't need floating-point but simple integer arithmetic.

Bonita Montero

unread,
Apr 6, 2021, 6:42:42 AM4/6/21
to
> For
>
> double f( double d )
> {
>  return trunc(d);
> }
>
> gcc generates
>
> f:
> .LFB0:
>        .cfi_startproc
>        endbr64
>        movsd   .LC1(%rip), %xmm2
>        movsd   .LC0(%rip), %xmm4
>        movapd  %xmm0, %xmm3
>        movapd  %xmm0, %xmm1
>        andpd   %xmm2, %xmm3
>        ucomisd %xmm3, %xmm4
>        jbe     .L2
>        cvttsd2siq      %xmm0, %rax
>        pxor    %xmm0, %xmm0
>        andnpd  %xmm1, %xmm2
>        cvtsi2sdq       %rax, %xmm0
>        orpd    %xmm2, %xmm0
> .L2:
>        ret

If I may translate this: the code stores the value as a 64-bit integer
and checks whether the value fits in 64 bits. If not, the value is
larger than a 64-bit integer, and such values have an integral-only
mantissa. Otherwise, if the value did fit in a 64-bit integer, it loads
the chopped, converted value back as a floating-point value.

MrSpook...@j1w2mu8nr6agvg.biz

unread,
Apr 6, 2021, 6:45:36 AM4/6/21
to
On Tue, 6 Apr 2021 12:37:34 +0200
Bonita Montero <Bonita....@gmail.com> wrote:
>>> That would be too slow.
>>> The solution given from the glibc-source isn't fast as well,
>>> but it's for sure a magnitude faster than using a bignum-lib.
>
>> You can have speed or you can have unlimited floating point accuracy, pick
>one.
>
>No one needs unlimtited floating-point accuracy.

I meant in the sense you can dictate how much accuracy you need.

>Bignum-libaries are only suitable for cryptographic purposes,
>and these don't need floating-point- but simple integer-arithmetics.

That's all well and good, but if the accuracy you want exceeds the maximum
mantissa length the CPU can provide then you're going to need some sort of
software algorithm to do your calculations, even if it's invisibly built into
one of the standard libraries.

Juha Nieminen

unread,
Apr 6, 2021, 6:57:34 AM4/6/21
to
In comp.lang.c++ Bonita Montero <Bonita....@gmail.com> wrote:
If the last digit of your number isn't being stored accurately in the
variable, then there is no solution. The original digit has been lost.
You can't perform an operation on something that doesn't exist.

If you want to use digits of that magnitude, try long double instead.

Bonita Montero

unread,
Apr 6, 2021, 7:10:29 AM4/6/21
to
> Thats all well and good, but if the accuracy you want ...

Where did I write that I have these precision concerns?
I simply talked about a solution for valid values.

Bonita Montero

unread,
Apr 6, 2021, 7:12:34 AM4/6/21
to
>> That was clear from the start and there is no need
>> to emphasize it again. I just asked for a solution.

> If the last digit if your number isn't being stored accurately in the
> variable, then there is no solution. The original digit has been lost.
> You can't perform an operation on something that doesn't exist.

Look at the glibc code I already posted. It is slow but it gives
sufficiently accurate results with double precision.

> If you want to use digits of that magnitude, try long double instead.

long double will have the same issue for values where x and y have
a larger distance in their exponents.


Richard Damon

unread,
Apr 6, 2021, 7:46:10 AM4/6/21
to
Actually, if I am doing my math right, 53 bits can represent integers up
to 9,007,199,254,740,991 (2**53 - 1) exactly with the units bit
expressed, so that number is expressible exactly.

50 bits is about 15 digits, so the 3 more bits and the rounding of 10
bits = 3 digits give us most of an extra digit.

The issue is that after the divide, the results can't be expressed
precisely enough and the round off errors cause the issues.

If you think of Floating point numbers as representing ranges then
5290912749423342.0 represents the interval 5290912749423341.5 to
5290912749423342.5 and thus the answer to fmod should be the interval
1.5 to 2.5 (but float can't express that level of imprecision itself)

If you think of them as precise numbers that are an approximation to
reality, then fmod would be precisely 2.000, but you might get wrong
values from the math due to round off.

In this case the formula x/y - trunc(x/y) gave 0.1875 instead of 0.2
because 0.1875 is 3/16, and x/y will be expressed to the nearest 1/16th due to
precision round-off.
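
A quick way to see this on a typical IEEE-754 round-to-nearest system (the values in the comments are what such a system is expected to print, not guarantees of the language standard):

#include <cmath>
#include <cstdio>

int main()
{
    double x = 5290912749423342.0, y = 10.0;
    double q = x / y;  // the exact quotient would be ...334.2
    std::printf("x/y        = %.4f\n", q);                               // 529091274942334.1875
    std::printf("fraction   = %.4f\n", q - std::trunc(q));               // 0.1875 (3/16)
    std::printf("ulp at x/y = %.4f\n", std::nextafter(q, INFINITY) - q); // 0.0625 (1/16)
}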

Bonita Montero

unread,
Apr 6, 2021, 8:29:42 AM4/6/21
to
> In this case the formula x/y - trunc(x/y) gave 0.1875 instead of 0.2
> because is 3/16 and x/y will be expressed to the nearest 1/16th due to
> precision round off.

I'm not concerned about imprecise results, but about invalid results
with the naive implementations
x - trunc( x / y ) * y
and
x / y - trunc( x / y )
If x is very high and y is very small, you might get results which
are outside the modulo range.

Bonita Montero

unread,
Apr 6, 2021, 8:50:03 AM4/6/21
to
Because this algorithm is so similar to the usual FP divide algorithms
in FPUs, I'm asking myself why the latter don't integrate a path for
fmod() operations. It wouldn't be a big deal to integrate an option
for fmod() into these circuits for a particular FP instruction.

Richard Damon

unread,
Apr 6, 2021, 9:02:01 AM4/6/21
to
Floating Point should ALWAYS be treated as 'imprecise', and sometimes
imprecise can create errors big enough to cause other issues.


You need to remember that when you compute x / y, that value will have a
finite precision; taking the trunc implies an assumption that the precision is
well better than counting units, which it might not be (and isn't in
your cases).

The answer to this is to add a bit of complexity to the method to make
sure that you adjust the numbers to bring things into a better range
accurately.

In general, this is not simple, especially if both x and y have lots of
significant bits and are of very different magnitudes.

Note, that if the LSB of x is bigger than the MSB of y, then the
question actually doesn't really make sense, and you are getting close
to that point, and that is why the system is very sensitive to errors.

Bonita Montero

unread,
Apr 6, 2021, 9:14:25 AM4/6/21
to
> Floating Point should ALWAYS be treated is 'imprecise', ...

You're telling things which are common knowledge.

Richard Damon

unread,
Apr 6, 2021, 10:07:04 AM4/6/21
to
On 4/6/21 9:14 AM, Bonita Montero wrote:
>> Floating Point should ALWAYS be treated is 'imprecise', ...
>
> You're telling things which are common knowledge.

The problem is that the cases you're complaining about are the cases where
that imprecision becomes large compared to the numbers you are dealing with.

trunc(z) when z is on the order of 2**53 has no precision left to express.

x/y will have a rounding error in it. If it rounds up and that round-up
causes a 'bump' in trunc, then trunc might have amplified the error to a
full integer count. Take a number just a count or two below 1000 and
divide it by 10: you will get 100, but if done to infinite precision it
would be 99.99999... and trunc would round down. Thus trunc(x/y)
always has a chance, if the division is close enough to an integer
value, to round up and cause that range error.

All the problems you are complaining about are ultimately rooted in the
fact that floating point math is imprecise, and it is hard, and slow, to
do the math so you get the absolute best answer you can.

James Kuyper

unread,
Apr 6, 2021, 10:11:54 AM4/6/21
to
You don't need bignums to do that. There may be more clever ways to do
it, but as proof of principle, an algorithm has already been sketched
out in this thread. It depends upon the fact that multiplication by
powers of FLT_RADIX can be done exactly, as is also true of finding the
difference between two numbers of the same sign and the same exponent,
and that fmod(difference, f2) is equal to either fmod(f1,f2) or
f2-fmod(f1,f2), depending upon what you did to give them the same sign.
fmod() is required to give the exact result #ifdef __STDC_IEC_559__.

Bonita Montero

unread,
Apr 6, 2021, 10:19:18 AM4/6/21
to
> The problem is that the cases your complaining about are the cases where
> that impression becomes large compared to the numbers you are dealing with.

I'm not complaining about that, I'm just saying that the problems
resulting from inaccuracy aren't as obvious as in many other cases.

Rest of your text not read.

Bart

unread,
Apr 6, 2021, 10:34:43 AM4/6/21
to
Your OP gave the impression that you were looking for the correct
formula for fmod, and gave two variations.

The correct one was the second (x-y*trunc(x/y)) (the first gives the
wrong results).

You suggested that neither was correct because they gave the wrong
results on certain inputs.

That is due to the limitations of 64-bit floats.

The actual implementation of fmod may use a /better/ method which
preserves more precision, which is not to say that the trunc approach is wrong.

fmod will start giving invalid values too if you add a few more
significant digits.

I've used the x-y*trunc(x/y) on a bigfloat library and it will always
give the correct results (subject to whatever precision cap you apply),
so it is not wrong. But just not tuned to ieee754 64-bit format.

Bonita Montero

unread,
Apr 6, 2021, 10:42:41 AM4/6/21
to
> fmod will start giving invalid values too if you add a few more
> significant digits.

That's wrong. The results become more inaccurate the larger the
exponent differences are, but the modulo results are always in the
correct range.

Manfred

unread,
Apr 6, 2021, 10:51:19 AM4/6/21
to
On 4/6/2021 4:06 PM, Richard Damon wrote:
> On 4/6/21 9:14 AM, Bonita Montero wrote:
>>> Floating Point should ALWAYS be treated is 'imprecise', ...
>>
>> You're telling things which are common knowledge.
>
> The problem is that the cases your complaining about are the cases where
> that impression becomes large compared to the numbers you are dealing with.
>
> trunc(z) when z is the order of 2**53 has no precision left to express.
>
> x/y will have a rounding error in it. If it rounds up and that round up
> causes a 'bump' in trunc, then trunc might have amplified the error to a
> full integer count. Take a number just a count or two below 1000, and
> divide it by 10, you will get 100, but if done to infinite precision, it
> would be 99.99999.. and trunc would round down. Thus the trunc(/y)
> always has a chance that if the division is close enough to an integer
> value, to round up and cause that range error.

To be more precise, in common implementations the largest loss in
precision is with addition and subtraction. Trunc() by itself is
trivially exact, since the number of significant digits is well within
the exponent range. Multiplication, division and trunc() do have full
precision, meaning that the result has the same number of significant
digits as the operands.

This is not true for addition and, specifically for this case,
subtraction; the problem in x/y - trunc(x/y) is the "-", not the trunc
per se.
In extreme simplification: if a and b have N decimal significant digits,
and (a-b) is one order of magnitude (1/10th) smaller than a and b, then
(a-b) has N-1 decimal significant digits. If the difference is two
orders of magnitude smaller, the result has N-2 significant digits and
so on.

The key feature of the "shift and subtract" algorithm is that it takes
care that the subtractions performed involve quantities that yield exact
results in the target representation.

Richard Damon

unread,
Apr 6, 2021, 10:59:47 AM4/6/21
to
Nope, once you get an error in the value of trunc due to round-offs,
then you get a number out of range. This can occur even for cases with
numbers with very little exponent difference.

Bonita Montero

unread,
Apr 6, 2021, 11:52:03 AM4/6/21
to
> Nope, once you get an error in the value of trunc due to round offs,
> ...

Wrong, the results are just inaccurate, but always in the correct range,
i.e. they're never equal to or over the expected magnitude of a modulo
operation, or below it.

> ... then you get an number out of range.

Not from fmod().

Bonita Montero

unread,
Apr 6, 2021, 11:53:21 AM4/6/21
to
> The key feature of the "shift and subtract" algorithm is that it takes
> care that the subtractions performed involve quantities that yield exact
> results in the target representation.

Not always exact, but the results always fit in the expected range.

Richard Damon

unread,
Apr 6, 2021, 12:07:28 PM4/6/21
to
Except as you pointed out they could be negative, and if the division of
x/y isn't accurate to the 1s place the results can easily be larger than
y as the round off in the division becomes an error in the answer scaled
by y.

Bonita Montero

unread,
Apr 6, 2021, 12:11:17 PM4/6/21
to
> Except as you pointed out they could be negative, and if the division of
> x/y isn't accurate to the 1s place the results can easily be larger than
> y as the round off in the division becomes an error in the answer scaled
> by y.

That's not how fmod is specified and actually implemented.

Manfred

unread,
Apr 6, 2021, 12:48:35 PM4/6/21
to
I'd have to check the details, but I'm pretty confident the result is
exact with respect to the operands /after/ conversion in the target
representation.

Bonita Montero

unread,
Apr 6, 2021, 12:50:04 PM4/6/21
to
> I'd have to check the details, but I'm pretty confident the result is
> exact with respect to the operands /after/ conversion in the target
> representation.

The algorithm in the code I gave couldn't give an out of range result.

Chris M. Thomasson

unread,
Apr 6, 2021, 4:47:22 PM4/6/21
to
On 4/6/2021 3:37 AM, Bonita Montero wrote:
>>> That would be too slow.
>>> The solution given from the glibc-source isn't fast as well,
>>> but it's for sure a magnitude faster than using a bignum-lib.
>
>> You can have speed or you can have unlimited floating point accuracy,
>> pick one.
>
> No one needs unlimtited floating-point accuracy.
> Bignum-libaries are only suitable for cryptographic purposes,

And deep fractal zooms!

Juha Nieminen

unread,
Apr 8, 2021, 2:09:47 AM4/8/21
to
In comp.lang.c++ Richard Damon <Ric...@damon-family.org> wrote:
> Floating Point should ALWAYS be treated is 'imprecise', and sometimes
> imprecise can create errors big enough to cause other issues.

That's actually one of the most common misconceptions about floating point
arithmetic.

I know that you didn't mean it like this, but nevertheless, quite often
people talk about floating point values as if they were some kind of
nebulous quantum mechanics Heisenberg uncertainty entities that have no
precise well-defined values, as if they were hovering around the
"correct" value, only randomly being exact if you are lucky, but most
of the time being off, and you can never be certain that it will have
a particular value, as if they were affected by quantum fluctuations
and random transistor noise.

Or that's the picture one easily gets when people talk about floating
point values.

In actuality, floating point values have very precise determined
values, specified by an IEEE standard. They, naturally, suffer from
*rounding errors* when the value to be represented can't fit in the
mantissa, but that's it. That's not uncertainty, that's just well-defined
rounding (which can be predicted and calculated, if you really wanted).

Most certainly if two floating point variables have been assigned the
same exact value, you can certainly assume them to be bit-by-bit
identical.

double a = 1.25;
double b = 1.25;
assert(a == b); // Will never, ever, ever fail

If two floating point values have been calculated using the exact
same operations, the results will also be bit-by-bit identical.
There is no randomness in floating point arithmetic. There are no
quantum fluctuations affecting them.

double a = std::sqrt(1.5) / 2.5;
double b = std::sqrt(1.5) / 2.5;
assert(a == b); // Will never, ever, ever fail

Also, there are many values that can be represented 100% accurately
with floating point. Granted that them being in base-2 makes it
sometimes slightly unintuitive what these accurately-representable
values are, because we think in base-10, but there are many.

double a = 123456.0; // Is stored with 100% accuracy
double b = 0.25; // Likewise.

In general, all integers up to about 2^52 (positive or negative)
can be represented with 100% accuracy in a double-precision floating
point variable. Decimal numbers have to be thought in base-2 when
considering whether they can be represented accurately. For example
0.5 can be represented accurately, 0.1 cannot (because 0.1 has an
infinite representation in base-2, and thus will be rounded to
the nearest value that can be represented with a double.)
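
A quick check of that last point, assuming IEEE-754 doubles:

#include <cstdio>

int main()
{
    std::printf("%.20f\n", 0.5);  // 0.50000000000000000000 : exact
    std::printf("%.20f\n", 0.25); // 0.25000000000000000000 : exact
    std::printf("%.20f\n", 0.1);  // 0.10000000000000000555 : rounded, not exactly 0.1
}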

Fred. Zwarts

unread,
Apr 8, 2021, 3:30:57 AM4/8/21
to
Op 08.apr..2021 om 08:09 schreef Juha Nieminen:
Indeed. It is often claimed that floating point is less precise than
integer, but the rounding errors of integers are in many cases much
larger. For example, (1/2)*2 gives the exact value 1 in floating point,
whereas in integer arithmetic it results in 0 instead of 1 (due to rounding
errors in the division), which is 100% off. 😉

--
Paradoxes in the relation between Creator and creature.
<http://www.wirholt.nl/English>.

David Brown

unread,
Apr 8, 2021, 3:47:27 AM4/8/21
to
/Should/ never, ever, ever fail.

Not all floating point implementations follow IEEE standards exactly.
In particular, there are implementations which use intermediary formats
with greater precision than the formats used for memory storage. The
most common examples would be floating point coprocessors for the 68K
and x86 worlds, where 80-bit internal formats had greater precision than
64-bit doubles usually used by language implementations. A variable
that happens to be stored on the FPU's stack might then fail to compare
equal to the value of a variable stored in memory, despite being
"obviously" equal. (It's not going to happen with these particular
examples, being values with precise binary representations - I am
referring to general principles.)

If you need precisely controlled and replicable floating point
behaviour, make sure you use a compiler and flags that enforce IEEE
standards strictly. Many compilers do that by default, but can generate
significantly faster code if they are more relaxed (like gcc's
"-ffast-math" flag). Such relaxed behaviour is fine for a lot of code
(I believe, without reference or statistics, the great majority of
code). But you should then treat your floating point as being a little
imprecise, and be wary of using equality tests. (There is a reason gcc
has a "-Wfloat-equal" warning available.)


Juha Nieminen

unread,
Apr 8, 2021, 6:31:54 AM4/8/21
to
In comp.lang.c++ David Brown <david...@hesbynett.no> wrote:
>> double a = 1.25;
>> double b = 1.25;
>> assert(a == b); // Will never, ever, ever fail
>>
> /Should/ never, ever, ever fail.
>
> Not all floating point implementations follow IEEE standards exactly.

I can't think of any rational implementation, even if it used some custom
floating point representation and calculations, that could give false to
that particular comparison.

The thing that one should be aware of is that if two values have been
calculated differently, they may give values that differ in one or
more of the least-significant digits of the mantissa, even when the
two expressions are mathematically equivalent.

For example,

a = x / y;
b = x * (1.0/y);
assert(a == b); // will probably fail for many values of x and y

Heck, even this:

a = (x + y) + z;
b = x + (y + z);
assert(a == b); // may fail

Bart

unread,
Apr 8, 2021, 7:05:49 AM4/8/21
to
On 08/04/2021 11:31, Juha Nieminen wrote:
> In comp.lang.c++ David Brown <david...@hesbynett.no> wrote:
>>> double a = 1.25;
>>> double b = 1.25;
>>> assert(a == b); // Will never, ever, ever fail
>>>
>> /Should/ never, ever, ever fail.
>>
>> Not all floating point implementations follow IEEE standards exactly.
>
> I can't think of any rational implementation, even if it used some custom
> floating point representation and calculations, that could give false to
> that paticular comparison.
>
> The thing that one should be aware of is that if two values have been
> calculated differently, they may give values that differ in one or
> more of the least-significant digits of the mantissa, even when the
> two expressions are mathematically equivalent.

In your example, 'a' can be an external variable defined in a different
module and using a different compiler.

1.25 is unlikely to be compiled differently, but other constants could
well be. When there is no exact binary that matches the decimal, then
it's not so easy to get the most accurate (and therefore consistent) binary.

I had a lot of trouble with this until I used strtod() in my compiler to
convert text to float. If I disable that, then these two programs:

===============================
c.c
-------------------------------
#include <stdio.h>
#include "c.h"

extern double a;
double b=X;

int main(void) {
if (a==b)
printf("%20.20e == %20.20e\n",a,b);
else
printf("%20.20e != %20.20e\n",a,b);
}

-------------------------------

===============================
d.c
-------------------------------
#include "c.h"
double a=X;
-------------------------------

Where c.h contains:
#define X 0.00000000000000000001

Produces this output when c.c is compiled with bcc and d.c with gcc:

9.99999999999999950000e-021 != 1.00000000000000010000e-020

The 9.999... value is gcc's, and presumably is closer to X.

David Brown

unread,
Apr 8, 2021, 7:38:10 AM4/8/21
to
On 08/04/2021 12:31, Juha Nieminen wrote:
> In comp.lang.c++ David Brown <david...@hesbynett.no> wrote:
>>> double a = 1.25;
>>> double b = 1.25;
>>> assert(a == b); // Will never, ever, ever fail
>>>
>> /Should/ never, ever, ever fail.
>>
>> Not all floating point implementations follow IEEE standards exactly.
>
> I can't think of any rational implementation, even if it used some custom
> floating point representation and calculations, that could give false to
> that paticular comparison.

As I said, it is hard to imagine those particular values leading to
failure, as they have exact representations in any sane floating point
representation.

It's a lot easier to see how you might have failures for other cases
where rounding to different precisions can affect the values.

>
> The thing that one should be aware of is that if two values have been
> calculated differently, they may give values that differ in one or
> more of the least-significant digits of the mantissa, even when the
> two expressions are mathematically equivalent.

Yes, indeed.

>
> For example,
>
> a = x / y;
> b = x * (1.0/y);
> assert(a == b); // will probably fail for many values of x and y
>
> Heck, even this:
>
> a = (x + y) + z;
> b = x + (y + z);
> assert(a == b); // may fail
>

Correct. IEEE-defined floating point operations are definitely not
associative - I don't think they are even always commutative, but I
could be wrong there (I am no expert in this area). One of the points
of the gcc "-ffast-math" flags is to tell the compiler it can assume
floating point arithmetic /is/ associative - either the expressions for
a and b give the same answer, or you don't care which it takes (in both
these examples). That can lead to far more efficient code.

Richard Damon

unread,
Apr 8, 2021, 8:01:25 AM4/8/21
to
On 4/8/21 2:09 AM, Juha Nieminen wrote:
> In comp.lang.c++ Richard Damon <Ric...@damon-family.org> wrote:
>> Floating Point should ALWAYS be treated as 'imprecise', and sometimes
>> imprecise can create errors big enough to cause other issues.
>
> That's actually one of the most common misconceptions about floating point
> arithmetic.

And I would say that THAT is one of the most common misconceptions about
floating point.

Yes, there are LIMITED situations where floating point will be exact, or
at least consistent, but unless you absolutely KNOW that you are in one
of those cases, thinking you are is the source of most problems.

And, if you are in one of those cases, very often floating point is not
the best answer by many measures (it may be the simplest, but that
doesn't always mean best).

Double precision floating point numbers represent only slightly less
than 2**64 discrete values. Of all the real values in that range, the
odds that one chosen truly at random is exactly representable are
minuscule.

They do tend to be 'close enough', as few things actually need more
precision than doubles provide. So the way to go is to acknowledge that
they ARE imprecise, but precise enough for most cases, while staying
alert for the cases where that imprecision becomes important, like the
case talked about previously.

>
> I know that you didn't mean it like this, but nevertheless, quite often
> people talk about floating point values as if they were some kind of
> nebulous quantum mechanics Heisenberg uncertainty entities that have no
> precise well-defined values, as if they were hovering around the
> "correct" value, only randomly being exact if you are lucky, but most
> of the time being off, and you can never be certain that it will have
> a particular value, as if they were affected by quantum fluctuations
> and random transistor noise.
>

Yes, imprecise does not mean 'random'; conflating the two shows a
misunderstanding of the terms.

> Or that's the picture one easily gets when people talk about floating
> point values.
>
> In actuality, floating point values have very precise determined
> values, specified by an IEEE standard. They, naturally, suffer from
> *rounding errors* when the value to be represented can't fit in the
> mantissa, but that's it. That's not uncertainty, that's just well-defined
> rounding (which can be predicted and calculated, if you really wanted).

Yes, a particular bit combination has an exact value that it represents,
and the operations on those bits have fairly precise rules for how they
are to be performed.


>
> Most certainly if two floating point variables have been assigned the
> same exact value, you can certainly assume them to be bit-by-bit
> identical.
>
> double a = 1.25;
> double b = 1.25;
> assert(a == b); // Will never, ever, ever fail

That just shows repeatability, not accuracy.

It also works for the values of 0.1 and 0.2 even though they are NOT
exactly represented. As can be seen by:
double a = 0.1;
double b = 0.2;
double c = 0.3;
assert(a + b == c); /* Will fail on most binary floating point machines */

>
> If two floating point values have been calculated using the exact
> same operations, the results will also be bit-by-bit identical.
> There is no randomness in floating point arithmetic. There are no
> quantum fluctuations affecting them.
>
> double a = std::sqrt(1.5) / 2.5;
> double b = std::sqrt(1.5) / 2.5;
> assert(a == b); // Will never, ever, ever fail

Again, that shows consistency, not accuracy.

For almost all values of a we will find that
std::sqrt(a) * std::sqrt(a) != a

(Yes, this will succeed for some values of a, likely ones gotten by
squaring a value with less than 26 bits of precision.)
>
> Also, there are many values that can be represented 100% accurately
> with floating point. Granted that them being in base-2 makes it
> sometimes slightly unintuitive what these accurately-representable
> values are, because we think in base-10, but there are many.
>
> double a = 123456.0; // Is stored with 100% accuracy
> double b = 0.25; // Likewise.
>
> In general, all integers up to about 2^52 (positive or negative)
> can be represented with 100% accuracy in a double-precision floating
> point variable. Decimal numbers have to be thought in base-2 when
> considering whether they can be represented accurately. For example
> 0.5 can be represented accurately, 0.1 cannot (because 0.1 has an
> infinite representation in base-2, and thus will be rounded to
> the nearest value that can be represented with a double.)
>
Yes, and if you really are dealing with just integers, keep them as
integers. If you have to stop to decide if your number happens to be one
that is exact, you probably are doing something wrong and making
problems for yourself, and will likely make a mistake at some point.
Much better to either reformulate so they ALWAYS are exact or to treat
them as always possibly inexact.
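
A small self-contained illustration of both points (exact digits will
vary with the platform's libm, but on typical IEEE-754 systems both
comparisons fail):

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* 0.1, 0.2 and 0.3 are each rounded to the nearest double, and
       rounded-0.1 + rounded-0.2 does not land on rounded-0.3. */
    double a = 0.1, b = 0.2, c = 0.3;
    printf("0.1 + 0.2 == 0.3 ? %s\n", a + b == c ? "yes" : "no");
    printf("0.1 + 0.2 = %.17g\n0.3       = %.17g\n", a + b, c);

    /* sqrt(x) is rounded once and the multiplication rounds again,
       so sqrt(x)*sqrt(x) is usually not exactly x. */
    double x = 2.0, s = sqrt(x);
    printf("sqrt(2)*sqrt(2) == 2 ? %s (%.17g)\n",
           s * s == x ? "yes" : "no", s * s);
    return 0;
}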

Juha Nieminen

unread,
Apr 8, 2021, 1:35:46 PM4/8/21
to
In comp.lang.c++ David Brown <david...@hesbynett.no> wrote:
> Correct. IEEE-defined floating point operations are definitely not
> associative - I don't think they are even always commutative, but I
> could be wrong there (I am no expert in this area).

IIRC strict IEEE compliance requires summation and multiplication to be
commutative (i.e. always give bit-for-bit the same result regardless of
operand order), but don't quote me on that.

F Russell

unread,
Apr 8, 2021, 4:52:25 PM4/8/21
to
On Thu, 08 Apr 2021 08:01:06 -0400, Richard Damon wrote:

>
> Double Precision floating point numbers represent only slightly less
> than 2**64 discrete values.
>

That's not true.

An IEEE standard double has a 54 bit mantissa (53 + 1 normalized bit)
and therefore there are 2^54-1 numbers per exponent value.

The "per exponent value" is important.

As the exponent value increases the density of doubles per binary
"decade" decreases.

IOW, for larger and larger values, the density of exactly representable
numbers decreases. There are more and more "gaps" per binary "decade."
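
One way to make that visible is to print the gap to the next
representable double at a few magnitudes (a minimal sketch using
nextafter() from <math.h>; the sample points are arbitrary):

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* The spacing between adjacent doubles (one ulp) doubles every
       time the exponent goes up, so big values sit in wide gaps. */
    double samples[] = { 1.0, 1e3, 1e6, 1e9, 1e12, 1e15, 1e18 };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; ++i)
        printf("near %-6.0e the next double is %g away\n",
               samples[i],
               nextafter(samples[i], INFINITY) - samples[i]);
    return 0;
}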

But we can always go to higher precision FP in software.

Why was the 64-bit hardware FP chosen?

William Kahan, the father of the IEEE-754 standard, said that 64 bits
were chosen so that the average programmer would not have to hire
an expensive error analyst to ensure that his code was correct.

But error analysis is mandatory for all critical FP computation.

Bart

unread,
Apr 8, 2021, 5:07:43 PM4/8/21
to
On 08/04/2021 21:51, F Russell wrote:
> On Thu, 08 Apr 2021 08:01:06 -0400, Richard Damon wrote:
>
>>
>> Double Precision floating point numbers represent only slightly less
>> than 2**64 discrete values.
>>
>
> That's not true.
>
> An IEEE standard double has a 54 bit mantissa (53 + 1 normalized bit)
> and therefore there are 2^54-1 numbers per exponent value.
>

The mantissa is 53 bits including an implicit bit.

2**54 numbers times approx 2**11 exponent values, and adding in the sign
bit, would come to 2**66 different values representable in 64 bits?

64 bits clearly can represent 2**64 different values, but many will be
invalid floating point numbers (loads of NaNs for example).

Keith Thompson

unread,
Apr 8, 2021, 5:56:15 PM4/8/21
to
F Russell <f...@random.info> writes:
> On Thu, 08 Apr 2021 08:01:06 -0400, Richard Damon wrote:
>> Double Precision floating point numbers represent only slightly less
>> than 2**64 discrete values.
>
> That's not true.

How exactly is it not true?

> An IEEE standard double has a 54 bit mantissa (53 + 1 normalized bit)
> and therefore there are 2^54-1 numbers per exponent value.
>
> The "per exponent value" is important.
>
> As the exponent value increases the density of doubles per binary
> "decade" decreases.
>
> IOW, for larger and larger values, the density of exactly representable
> numbers decreases. There are more and more "gaps" per binary "decade."

All that is correct -- but how does it refute the original statement?

There are clearly 2**64 representations for 64-bit double-precision
floating-point. My understanding is that most of those represent unique
real values, which implies that

Double Precision floating point numbers represent only slightly less
than 2**64 discrete values.

So where is the incorrect statement? (If you're quibbling about the
meaning of "discrete", perhaps "distinct" would have been clearer.)

--
Keith Thompson (The_Other_Keith) Keith.S.T...@gmail.com
Working, but not speaking, for Philips Healthcare
void Void(void) { Void(); } /* The recursive call of the void */

Richard Damon

unread,
Apr 8, 2021, 9:29:34 PM4/8/21
to
My memory is that IEEE 64 bits has:
1 Sign Bit
11 Exponent Bits
52 Mantissa Bits

For exponents other than 0, the 52 bits have an assumed leading 1, but
there are still only 2**52 values per exponent value (not 2**52 - 1,
since all 2**52 bit patterns are valid).

The 11 exponent bits give 2048 possible values. The highest is reserved
for NaNs and Inf, but the rest represent numbers (exponent 0 gives
denormals, but there are still 2**52 of those), and the sign bit doubles
each of those (except for 0, sort of), so we actually have:

2 * (2**11 - 1) * 2**52 - 1 distinct (finite) values that can be
represented. This is just slightly less than 2**64 distinct values.

The exact value is 2**64 - 2**53 - 1
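
For what it's worth, that bookkeeping is easy to check mechanically
(a quick sketch; the constants simply restate the counting above):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* 2 signs * (2**11 - 1 usable exponent fields) * 2**52 significand
       patterns, minus 1 because +0 and -0 denote the same value. */
    uint64_t counted = 2 * 2047 * ((uint64_t)1 << 52) - 1;
    /* Claimed closed form: 2**64 - 2**53 - 1. */
    uint64_t claimed = UINT64_MAX - ((uint64_t)1 << 53);
    printf("counted = %llu\nclaimed = %llu\n%s\n",
           (unsigned long long)counted, (unsigned long long)claimed,
           counted == claimed ? "they agree" : "they differ");
    return 0;
}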

64 bits is big enough that, as long as you avoid subtracting two almost
equal numbers, most calculations will just be good enough. The trick is
to watch your math to make sure that form of cancellation never occurs.

Juha Nieminen

unread,
Apr 10, 2021, 3:22:04 AM4/10/21
to
In comp.lang.c++ F Russell <f...@random.info> wrote:
> Why was the 64-bit hardware FP chosen?
>
> William Kahan, the father of the IEEE-754 standard, said that 64 bits
> were chosen so that the average programmer would not have to hire
> an expensive error analyst to ensure that his code was correct.

That explanation sounds completely nonsensical, at least without
further explanation.

(What does the size of the floating point value have anything to do
with "ensuring that the average programmer's code is correct"?
Implying that any other number of bits would have made it impossible,
and exactly 64 bits somehow magically ensures program correctness?
What?)

Richard Damon

unread,
Apr 10, 2021, 9:03:30 AM4/10/21
to
15-16 digits of precision are enough that, for most practical
calculations based on actual measurements, the precision lost in typical
computation is still smaller than the uncertainty of the measurements
themselves.

More would just be a waste of computation resources (which were still
expensive when the standards were forming). The next logical step down
was 32 bits, and that often shows numerical-accuracy problems with real
measurements.

48 bits might have worked, but can be close enough to the computation
limit that you would need to be carefull.

At 64 bits, the only real problem you need to watch out for is precision
cancellation, where you subtract two nearly equal numbers or add two
numbers that are almost equal in magnitude but with different signs. You
can normally ignore things like adding the small-magnitude numbers first
and then getting to the larger ones, because you are unlikely to be
adding enough small numbers that order makes a significant difference.

Perhaps the biggest problem with that idea is that since there still are
a few spots where it makes a difference, the fact that you don't need to
normally worry about the other cases lulls you into a false sense of
security.
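
A tiny sketch of that cancellation effect (the value of x is just an
illustration):

#include <stdio.h>

int main(void)
{
    /* Forming 1 + x already rounds away part of x.  The subtraction
       (1 + x) - 1 is itself exact, but it exposes that earlier
       rounding error, which is large relative to the tiny result. */
    double x = 1e-15;
    double y = (1.0 + x) - 1.0;
    printf("x           = %.17g\n", x);
    printf("(1 + x) - 1 = %.17g\n", y);
    printf("relative error: %.2g\n", (y - x) / x);
    return 0;
}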

James Kuyper

unread,
Apr 10, 2021, 2:25:28 PM4/10/21
to
On 4/10/21 3:21 AM, Juha Nieminen wrote:
He's not implying that 64 bits is somehow magical. Nor that 64 bits is
optimal. Nor that it ensures program correctness.
What he's suggesting is that 32 bit floating point is often sufficiently
imprecise to be a serious problem, and that average programmers don't
have the skills needed to figure out how to restructure their algorithms
to produce acceptable accuracy when carried out using 32-bit floating point.
However, for most purposes, 64 bit floating point is enough to avoid
serious problems, so ordinary programmers can use it with a fair amount
of confidence that it will produce acceptable accuracy.
That doesn't mean that program correctness is assured when using 64
bits, only that code using 64 bits is substantially less likely to
result in accuracy problems than the same code using 32 bits. On the
other hand, further increases in the size produce only marginal
reductions in the accuracy problems. 64 bits isn't an exactly determined
optimal length. Most real floating point formats occupy a number of
bytes that is a power of 2, so the reasonable options tend to be 32
bits, 64 bits, and 128 bits. Basically, if you need 128-bit floating
point, you probably also need a numerical analyst anyway.
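
To see why 32 bits runs into trouble so much sooner, here is a minimal
sketch (the exact figures depend on the platform, but the drift in the
float sum is typically obvious):

#include <stdio.h>

int main(void)
{
    /* Accumulate 0.1 a million times.  Once the float sum is large,
       each added 0.1 loses most of its bits, so error builds up
       visibly; the double sum stays much closer to 100000. */
    float  fsum = 0.0f;
    double dsum = 0.0;
    for (int i = 0; i < 1000000; ++i) {
        fsum += 0.1f;
        dsum += 0.1;
    }
    printf("float  sum: %.6f\n", fsum);
    printf("double sum: %.6f\n", dsum);
    return 0;
}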

And, of course, even 128 bit floating point with the help of numerical
analyst is not sufficient to ensure program correctness, and he said
nothing to suggest that it would. It's not even sufficient to ensure
that the floating point math is correct, and it has no bearing at all on
those aspects of code correctness that don't involve floating point math.