On 02/09/2017 07:52 PM, Stefan Ram wrote:
> Newsgroups: comp.lang.c,comp.lang.c++
>
> #include <float.h>
>
> Some implementations have this property:
>
> A double has at least 14 significant digits (about 14-16) and
>
> 9007199254740991.
>
> is the largest double value that can be added to 1.0
> giving the "next" double value
Do you really mean "next double value" or "next integer"? The next
integer is 1 higher than Omega, but in the range where the distinction
matters, consecutive doubles are FLT_RADIX apart, so the "next double
value" is not, in general, the next integer. I'm going to explain below
why "next double value" is problematic. Then I'll go ahead on the
assumption that you meant "next integer".
> 9007199254740992.
>
> (when one adds 1.0 again, the value will not change
> anymore).
>
> I don't know whether there is a common name for such a
> value. I will call it (just for this post) "omega".
> (In the above example, the omega is 9007199254740991.)
>
> Now, the standard does not require 14 significant
> digits, but 10 (DBL_DIG).
>
> I wonder whether one can somehow find the smallest
> omega for the type double that is required by the
> standard.
>
> Clearly, omega >= 0 and omega <= DBL_MAX.
You've defined Omega in terms of what happens when you add 1.0 to the
number. However, that will depend upon the rounding mode.
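(For reference, here is the behavior you describe, under the default
rounding mode. This little test assumes an IEEE-style double with a
53-bit significand, so that your 9007199254740991 is 2^53 - 1; on other
implementations the printed values would differ.)

#include <stdio.h>

int main(void)
{
    double omega = 9007199254740991.0;   /* 2^53 - 1, exactly representable */
    double next  = omega + 1.0;          /* 2^53, also exactly representable */

    printf("omega       = %.0f\n", omega);
    printf("omega + 1.0 = %.0f\n", next);         /* 9007199254740992 */
    printf("next + 1.0  = %.0f\n", next + 1.0);   /* still 9007199254740992
                                                      under round-to-nearest */
    return 0;
}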
Consider the smallest integer for which the next integer has no
representation as a double; let's call it Omega1, to keep it distinct
from your Omega.
Omega1 and Omega1+FLT_RADIX will both be representable. If FLT_RADIX==2,
those values will both be equally far from the mathematical value of
Omega1 + 1. The C standard imposes NO requirements of its own on the
accuracy of such floating point expressions (5.2.4.2.2p6). None
whatsoever. In particular, this means that a fully conforming
implementation of C is allowed to implement floating point math so
inaccurately that DBL_MAX-DBL_MIN < DBL_MIN - DBL_MAX.
However, if __STDC_IEC_559__ is pre#defined by the implementation, then
the requirements of annex F apply (6.10.8.3p1), which are essentially
equivalent to IEC 60559:1989, which is in turn equivalent to ANSI/IEEE
754-1985 (F.1p1). In that case, if FLT_RADIX == 2 and you add 1.0 to
Omega1, the exact result lies halfway between Omega1 and Omega1+2, so it
gets rounded to one of those two values, depending upon the current
rounding mode. Under FE_TONEAREST (ties go to the neighbor whose
low-order significand digit is even), FE_DOWNWARD, or FE_TOWARDZERO the
result is Omega1; under FE_UPWARD it is Omega1+2, which is the next
double value. More generally, for values between Omega1 and 2*Omega1,
adding 1.0 yields the next double value always under FE_UPWARD, never
under FE_DOWNWARD or FE_TOWARDZERO, and under FE_TONEAREST only when the
low-order significand digit happens to be odd. So the largest value for
which adding 1.0 gives the "next double value" differs from one rounding
mode to the next.
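Here's a sketch of how one can watch that mode dependence directly. It
assumes an IEEE-style double (so Omega1 == 2^53 == 9007199254740992), an
implementation that defines all four FE_* rounding-direction macros
(they're optional), and a compiler that honors FENV_ACCESS (some ignore
the pragma, which is why the operands are volatile):

#include <stdio.h>
#include <fenv.h>

#pragma STDC FENV_ACCESS ON

static void try_mode(int mode, const char *name)
{
    volatile double omega1 = 9007199254740992.0;  /* 2^53, assuming IEEE double */
    volatile double sum;

    if (fesetround(mode) != 0) {
        printf("%-13s: could not set this rounding mode\n", name);
        return;
    }
    sum = omega1 + 1.0;
    printf("%-13s: Omega1 + 1.0 == Omega1 + %.0f\n", name, sum - omega1);
}

int main(void)
{
    try_mode(FE_TONEAREST,  "FE_TONEAREST");
    try_mode(FE_DOWNWARD,   "FE_DOWNWARD");
    try_mode(FE_TOWARDZERO, "FE_TOWARDZERO");
    try_mode(FE_UPWARD,     "FE_UPWARD");
    fesetround(FE_TONEAREST);   /* restore the default mode */
    return 0;
}

With correct IEEE rounding you should see "+ 0" for the first three
modes and "+ 2" for FE_UPWARD.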
If you meant "next integer", that ambiguity disappears. Regardless of
rounding mode, Omega1 is the smallest integer to which you can add 1.0
without getting the value of the next integer, because, by definition,
that next integer cannot be represented. Under that reading, the
quantity you call omega is simply Omega1 - 1.
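Under that reading, Omega1 can also be found at run time, regardless of
rounding mode, by walking up the powers of FLT_RADIX until adding 1.0
stops producing the next integer. This is only a sketch, and it relies
on Omega1 being a power of FLT_RADIX (it is; as shown below, it equals
b^p):

#include <stdio.h>
#include <float.h>

int main(void)
{
    /* Walk up the powers of FLT_RADIX until x + 1.0 no longer yields
       the next integer.  The volatile temporary forces the sum to be
       rounded to double even on implementations that evaluate double
       expressions in a wider format (FLT_EVAL_METHOD == 2).          */
    double x = 1.0;
    for (;;) {
        volatile double sum = x + 1.0;
        if (sum - x != 1.0)
            break;
        x *= FLT_RADIX;
    }
    printf("Omega1 = %.0f\n", x);   /* 9007199254740992 for an IEEE double */
    return 0;
}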
Section 5.2.4.2.2 specifies a model for how floating point types are
represented. That model need not actually be followed, but the required
behavior of floating point operations and the ranges of representable
values are specified in terms of that model; a floating point
representation for which that model is a sufficiently poor fit might not
have
any way of correctly implementing the standard's requirements. So I'll
restrict my discussion to implementations for which that model is
perfectly accurate.
The standard uses subscripts, superscripts, and Greek letters in its
description of the model, which don't necessarily display meaningfully
in a usenet message. I'll indicate subscripts and superscripts by
preceding them with _ and ^, respectively, and I'll replace the
summation with Sum(variable, lower limit, upper limit, expression).
> The following parameters are used to
> define the model for each floating-point type:
> s sign (+/-1)
> b base or radix of exponent representation (an integer > 1)
> e exponent (an integer between a minimum e_min
> and a maximum e_max )
> p precision (the number of base-b digits in the significand)
> f_k nonnegative integers less than b (the significand digits)
>
> A floating-point number (x) is defined by the following model:
> x = s b^e Sum(k, 1, p, f_k b^-k) e_min <= e <= e_max
The term in that sum with the smallest value is the one for k==p: f_p
gets multiplied by both b^e and b^-p, and therefore by b^(e-p). Every
integer up to b^p - 1 can be represented exactly using e <= p, because
the spacing b^(e-p) between adjacent representable values is then no
greater than 1. Representing b^p itself requires e == p+1, at which
point that spacing is b > 1, so b^p + 1 is the lowest integer that
cannot be represented exactly. Therefore, Omega1 must have e == p+1,
f_1 == 1, and f_k == 0 for all other values of k, giving
Omega1 = (+1) * b^(p+1) * 1*b^(-1) == b^p
      == b/b^(1-p) == FLT_RADIX/DBL_EPSILON
(the last step uses the fact that DBL_EPSILON is b^(1-p) for type
double, per 5.2.4.2.2).
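If you want to check that identity on a particular implementation, a few
lines do it (the values in the comment are what an IEEE double produces;
other formats will print other numbers):

#include <stdio.h>
#include <float.h>

int main(void)
{
    double omega1 = FLT_RADIX / DBL_EPSILON;   /* b / b^(1-p) == b^p */

    printf("FLT_RADIX / DBL_EPSILON = %.0f\n", omega1);
    printf("omega (Omega1 - 1)      = %.0f\n", omega1 - 1.0);
    /* For an IEEE double (b == 2, p == 53) this prints
       9007199254740992 and 9007199254740991 respectively.  */
    return 0;
}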
Note: even on implementations that pre#define __STDC_IEC_559__, a
certain amount of inaccuracy is still permitted both in the
interpretation of floating point constants and in the evaluation of
floating point expressions. Just barely enough, in fact, to allow
FLT_RADIX/DBL_EPSILON to come out as Omega1-1. However, that shouldn't
happen if DBL_EPSILON is defined as a hexadecimal floating point
constant, since such constants must be converted exactly whenever their
values can be represented exactly.
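One way to see exactly what your implementation gives you is the %a
conversion specifier, which prints a value as a hexadecimal floating
constant, so any inaccuracy in DBL_EPSILON or in the division is
directly visible. The output shown in the comment is what an IEEE double
would typically produce; the exact spelling of %a output is
implementation-defined:

#include <stdio.h>
#include <float.h>

int main(void)
{
    printf("DBL_EPSILON             = %a\n", DBL_EPSILON);
    printf("FLT_RADIX / DBL_EPSILON = %a\n", FLT_RADIX / DBL_EPSILON);
    /* Typical output for an IEEE double:
           DBL_EPSILON             = 0x1p-52
           FLT_RADIX / DBL_EPSILON = 0x1p+53
       If the second line is not an exact power of FLT_RADIX, the
       inaccuracy described above is what you're seeing.            */
    return 0;
}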
The smallest permitted value for FLT_RADIX is 2, and the maximum
permitted value for DBL_EPSILON is 1e-9, so Omega1 cannot be less than
2/1e-9. A value for DBL_EPSILON of exactly 1e-9 is not possible if
FLT_RADIX is 2; the closest you can get is with p == 31, so DBL_EPSILON
== 2^(-30) == 1.0/1,073,741,824. The smallest permitted value for Omega1
is therefore 2/(2^-30) == 2^31 == 2,147,483,648, which makes
2,147,483,647 the smallest permitted value for the quantity you call
omega.