On 02/09/2017 07:52 PM, Stefan Ram wrote:
> Newsgroups: comp.lang.c,comp.lang.c++
>
> #include <float.h>
>
> Some implementations have this property:
>
> A double has at least 14 significant digits (about 14-16) and
>
> 9007199254740991.
>
> is the largest double value that can be added to 1.0
> giving the "next" double value
Do you really mean "next double value" or "next integer"? The next
integer is 1 higher than Omega, but in the range where the distinction
matters, consecutive doubles are FLT_RADIX apart, so the "next double
value" is not, in general, the next integer. I'm going to explain below
why "next double value" is problematic. Then I'll go ahead on the
assumption that you meant "next integer".
> 9007199254740992.
>
> (when one adds 1.0 again, the value will not change
> anymore).
>
> I don't know whether there is a common name for such a
> value. I will call it (just for this post) "omega".
> (In the above example, the omega is 9007199254740991.)
>
> Now, the standard does not require 14 significant
> digits, but 10 (DBL_DIG).
>
> I wonder whether one can somehow find the smallest
> omega for the type double that is required by the
> standard.
>
> Clearly, omega >= 0 and omega <= DBL_MAX.
You've defined Omega in terms of what happens when you add 1.0 to the
number. However, that will depend upon the rounding mode.
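(For reference, here is the behavior you describe, under the default
rounding mode. This little test assumes an IEEE-style double with a
53-bit significand, so that your 9007199254740991 is 2^53 - 1; on other
implementations the printed values would differ.)

#include <stdio.h>

int main(void)
{
    double omega = 9007199254740991.0;   /* 2^53 - 1, exactly representable */
    double next  = omega + 1.0;          /* 2^53, also exactly representable */

    printf("omega       = %.0f\n", omega);
    printf("omega + 1.0 = %.0f\n", next);         /* 9007199254740992 */
    printf("next + 1.0  = %.0f\n", next + 1.0);   /* still 9007199254740992
                                                      under round-to-nearest */
    return 0;
}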
Consider the smallest integer for which the next integer has no
representation as a double; let's call it Omega1, to keep it distinct
from your Omega.
Omega1 and Omega1+FLT_RADIX will both be representable. If FLT_RADIX==2,
those values will both be equally far from the mathematical value of
Omega1 + 1. The C standard imposes NO requirements of its own on the
accuracy of such floating point expressions (5.2.4.2.2p6). None
whatsoever. In particular, this means that a fully conforming
implementation of C is allowed to implement floating point math so
inaccurately that DBL_MAX-DBL_MIN < DBL_MIN - DBL_MAX.
However, if __STDC_IEC_559__ is pre#defined by the implementation, then
the requirements of annex F apply (6.10.8.3p1), which are essentially
equivalent to IEC 60559:1989, which is in turn equivalent to ANSI/IEEE
754-1985 (F.1p1). In that case, if FLT_RADIX == 2 and you add 1.0 to
Omega1, the exact result lies halfway between Omega1 and Omega1+2, so it
gets rounded to one of those two values, depending upon the current
rounding mode. Under FE_TONEAREST (ties go to the neighbor whose
low-order significand digit is even), FE_DOWNWARD, or FE_TOWARDZERO the
result is Omega1; under FE_UPWARD it is Omega1+2, which is the next
double value. More generally, for values between Omega1 and 2*Omega1,
adding 1.0 yields the next double value always under FE_UPWARD, never
under FE_DOWNWARD or FE_TOWARDZERO, and under FE_TONEAREST only when the
low-order significand digit happens to be odd. So the largest value for
which adding 1.0 gives the "next double value" differs from one rounding
mode to the next.
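Here's a sketch of how one can watch that mode dependence directly. It
assumes an IEEE-style double (so Omega1 == 2^53 == 9007199254740992), an
implementation that defines all four FE_* rounding-direction macros
(they're optional), and a compiler that honors FENV_ACCESS (some ignore
the pragma, which is why the operands are volatile):

#include <stdio.h>
#include <fenv.h>

#pragma STDC FENV_ACCESS ON

static void try_mode(int mode, const char *name)
{
    volatile double omega1 = 9007199254740992.0;  /* 2^53, assuming IEEE double */
    volatile double sum;

    if (fesetround(mode) != 0) {
        printf("%-13s: could not set this rounding mode\n", name);
        return;
    }
    sum = omega1 + 1.0;
    printf("%-13s: Omega1 + 1.0 == Omega1 + %.0f\n", name, sum - omega1);
}

int main(void)
{
    try_mode(FE_TONEAREST,  "FE_TONEAREST");
    try_mode(FE_DOWNWARD,   "FE_DOWNWARD");
    try_mode(FE_TOWARDZERO, "FE_TOWARDZERO");
    try_mode(FE_UPWARD,     "FE_UPWARD");
    fesetround(FE_TONEAREST);   /* restore the default mode */
    return 0;
}

With correct IEEE rounding you should see "+ 0" for the first three
modes and "+ 2" for FE_UPWARD.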
If you meant "next integer", that ambiguity disappears. Regardless of
rounding mode, Omega1 is the smallest integer to which you can add 1.0
without getting the value of the next integer, because, by definition,
that next integer cannot be represented. Under that reading, the
quantity you call omega is simply Omega1 - 1.
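Under that reading, Omega1 can also be found at run time, regardless of
rounding mode, by walking up the powers of FLT_RADIX until adding 1.0
stops producing the next integer. This is only a sketch, and it relies
on Omega1 being a power of FLT_RADIX (it is; as shown below, it equals
b^p):

#include <stdio.h>
#include <float.h>

int main(void)
{
    /* Walk up the powers of FLT_RADIX until x + 1.0 no longer yields
       the next integer.  The volatile temporary forces the sum to be
       rounded to double even on implementations that evaluate double
       expressions in a wider format (FLT_EVAL_METHOD == 2).          */
    double x = 1.0;
    for (;;) {
        volatile double sum = x + 1.0;
        if (sum - x != 1.0)
            break;
        x *= FLT_RADIX;
    }
    printf("Omega1 = %.0f\n", x);   /* 9007199254740992 for an IEEE double */
    return 0;
}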
Section 5.2.4.2.2 specifies a model for how floating point types are
represented. That model need not actually be followed, but the required
behavior of floating point operations and the ranges of representable
values are specified in terms of that model; a floating point
representation for which that model is a sufficiently poor fit might not
have
any way of correctly implementing the standard's requirements. So I'll
restrict my discussion to implementations for which that model is
perfectly accurate.
The standard uses subscripts, superscripts, and Greek letters in its
description of the model, which don't necessarily display meaningfully
in a usenet message. I'll indicate subscripts and superscripts by
preceding them with _ and ^, respectively, and I'll replace the
summation with Sum(variable, lower limit, upper limit, expression).
> The following parameters are used to
> define the model for each floating-point type:
> s sign (+/-1)
> b base or radix of exponent representation (an integer > 1)
> e exponent (an integer between a minimum e_min
> and a maximum e_max )
> p precision (the number of base-b digits in the significand)
> f_k nonnegative integers less than b (the significand digits)
>
> A floating-point number (x) is defined by the following model:
> x = s b^e Sum(k, 1, p, f_k b^-k) e_min <= e <= e_max
The term in that sum with the smallest value is the one for k==p: f_p
gets multiplied by both b^e and b^-p, and therefore by b^(e-p). Every
integer up to b^p - 1 can be represented exactly using e <= p, because
the spacing b^(e-p) between adjacent representable values is then no
greater than 1. Representing b^p itself requires e == p+1, at which
point that spacing is b > 1, so b^p + 1 is the lowest integer that
cannot be represented exactly. Therefore, Omega1 must have e == p+1,
f_1 == 1, and f_k == 0 for all other values of k, giving
Omega1 = (+1) * b^(p+1) * 1*b^(-1) == b^p
      == b/b^(1-p) == FLT_RADIX/DBL_EPSILON
(the last step uses the fact that DBL_EPSILON is b^(1-p) for type
double, per 5.2.4.2.2).
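If you want to check that identity on a particular implementation, a few
lines do it (the values in the comment are what an IEEE double produces;
other formats will print other numbers):

#include <stdio.h>
#include <float.h>

int main(void)
{
    double omega1 = FLT_RADIX / DBL_EPSILON;   /* b / b^(1-p) == b^p */

    printf("FLT_RADIX / DBL_EPSILON = %.0f\n", omega1);
    printf("omega (Omega1 - 1)      = %.0f\n", omega1 - 1.0);
    /* For an IEEE double (b == 2, p == 53) this prints
       9007199254740992 and 9007199254740991 respectively.  */
    return 0;
}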
Note: even on implementations that pre#define __STDC_IEC_559__, a
certain amount of inaccuracy is still permitted both in the
interpretation of floating point constants and in the evaluation of
floating point expressions. Just barely enough, in fact, to allow
FLT_RADIX/DBL_EPSILON to come out as Omega1-1. However, that shouldn't
happen if DBL_EPSILON is defined as a hexadecimal floating point
constant, since such constants must be converted exactly whenever their
values can be represented exactly.
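One way to see exactly what your implementation gives you is the %a
conversion specifier, which prints a value as a hexadecimal floating
constant, so any inaccuracy in DBL_EPSILON or in the division is
directly visible. The output shown in the comment is what an IEEE double
would typically produce; the exact spelling of %a output is
implementation-defined:

#include <stdio.h>
#include <float.h>

int main(void)
{
    printf("DBL_EPSILON             = %a\n", DBL_EPSILON);
    printf("FLT_RADIX / DBL_EPSILON = %a\n", FLT_RADIX / DBL_EPSILON);
    /* Typical output for an IEEE double:
           DBL_EPSILON             = 0x1p-52
           FLT_RADIX / DBL_EPSILON = 0x1p+53
       If the second line is not an exact power of FLT_RADIX, the
       inaccuracy described above is what you're seeing.            */
    return 0;
}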
The smallest permitted value for FLT_RADIX is 2, and the maximum
permitted value for DBL_EPSILON is 1e-9, so Omega1 cannot be less than
2/1e-9. A value for DBL_EPSILON of exactly 1e-9 is not possible if
FLT_RADIX is 2; the closest you can get is with p == 31, so DBL_EPSILON
== 2^(-30) == 1.0/1,073,741,824. The smallest permitted value for Omega1
is therefore 2/(2^-30) == 2^31 == 2,147,483,648, which makes
2,147,483,647 the smallest permitted value for the quantity you call
omega.