In article <se9ui6$m1q$1...@dont-email.me>,
James Kuyper <james...@alumni.caltech.edu> wrote:
> On 8/2/21 3:05 PM, Vincent Lefevre wrote:
> > In article <se79uh$ejo$1...@dont-email.me>,
> > James Kuyper <james...@alumni.caltech.edu> wrote:
> >
> >> On 8/1/21 7:08 AM, Vincent Lefevre wrote:
> ...
> >>> Until now, *_MANT_DIG has always meant that all FP numbers from the
> >>> model are representable, AFAIK. ...
> >
> >> When you use floating point format that is not a good fit to the
> >> standard's model, you're inherently going to be breaking some
> >> assumptions that are based upon that model - the only question is, which
> >> ones.
> >
> > This is FUD. Double-double is a good fit to the standard's model.
> > There are additional numbers (as allowed by the standard), so that
> > this won't follow the same behavior as IEEE FP, but so does
> > contraction of FP expressions.
> We're clearly using "good fit" in different senses. I would consider a
> floating point format to be a "perfect fit" to the C standard's model if
> every number representable in that format qualifies as a floating-point
> number, and every number that qualifies as a floating point number is
> representable in that format. "Goodness of fit" would depend upon how
> closely any given format approaches that ideal. There's no single value
> of the model parameters that makes both of those statements even come
> close to being true for double-double format.
The C standard does not have a notion of "goodness of fit", and it is
certainly not in the rationale. Strict IEEE floating point is nice as
it allows one to use specific algorithms based on its semantics. One
goal is to provide more precision and more accuracy; this is exactly
what double-double does, providing at least 106-bit precision, with
accurate algorithms based on the IEEE 754 semantics of double-precision
numbers. But for that, the IEEE semantics must be followed strictly;
anything that changes those semantics to improve the precision of
individual operations (though this is a good thing in some contexts,
e.g. contraction) will make these algorithms fail.
Otherwise, the goal is not to fit the floating-point model, but to
get as much accuracy as possible, with bounds from a conventional
error-analysis model. For instance, the goal of double-double is to
provide at least 106-bit precision (107-bit precision almost
everywhere), following the FP model. There are additional
representable numbers, but they are not a problem: for the error
analysis, they will typically be ignored (though in some cases one
can take advantage of them), i.e. they won't have much influence on
the error bounds (if any); for the actual result, you get these
additional values for free compared to a strict 106-bit FP arithmetic.
The choice of double-double as the "long double" format on PowerPC
gives better precision than plain double precision (which the C
standard would allow), and it is faster than a software implementation
of binary128 (a.k.a. quadruple precision). These were regarded as
better criteria than "goodness of fit".
> >> Error analysis for double-double will necessarily be quite different
> >> from the error analysis that applies to a format that fits the
> >> standard's model, regardless of what value you choose for LDBL_MANT_DIG.
> >
> > Assuming some error bound in ulp (which must be done whatever the
> > format), the error analysis will be the same.
> Given a number in double-double format represented by a+b, where
> fabs(a) > fabs(b) && fabs(b) > 0, with x being the largest power of
> FLT_RADIX that is no larger than b, 1 ulp will be DBL_EPSILON*x.
> That expression will vary over a very large dynamic range while the
> number being represented by a+b changes only negligibly.
[...]
The notion of ulp is defined (and usable in practice) only for a
floating-point format. The best way to apply it to double-double
is to define it with respect to a floating-point format that is a
subset of the double-double format. If double has 53-bit precision,
this gives 106-bit precision, hence the choice LDBL_MANT_DIG = 106.
> >> The consequences of this decision are therefore relatively limited
> >> outside of 5.2.4.2.2. Many standard library functions that take floating
> >> point arguments have defined behavior only when passed a floating-point
> >> number.
> >
> > The main ones, like expl, have a defined behavior for other real
> > values: "The exp functions compute the base-e exponential of x.
> > A range error occurs if the magnitude of x is too large."
> > Note that x is *not* required to be a floating-point number.
> The frexp(), ldexp(), and fabs() functions have specified behavior only
> when the double argument qualifies as a floating point number.
Well, in the particular case of double-double, they can easily be
generalized to non-floating-point numbers, with some drawbacks:
frexp and ldexp may introduce rounding. In practice, they can just
be defined by the implementation. This is no worse than the fact
that the standard doesn't define the accuracy of the floating-point
operations and functions.
> The printf() family of functions have specified behavior only for
> floating-point numbers, NaN's and infinities, when using f,F,e,E,g,G,a,
> or A formats.
> The ceil() and floor() functions have a return value which is required
> to be floating point number.
> The scanf() family of functions (when using f,F,e,E,g,G,a, or A formats)
> and strtod() are required to return floating-point numbers, NaN's or
> infinities,
> and, of course, this also applies to the float, long double, and complex
> versions of all of those functions, and to the wide-character version of
> the text functions.
Ditto, they can be defined by the implementation.
> >> In some cases, the behavior is also defined if they are passed a
> >> NaN or an infinity.
> >
> > Only in Annex F, AFAIK.
> No, the descriptions of printf() and scanf() families of functions also
> allow for NaNs and infinities.
OK, but this is not very useful, as you need Annex F or definitions
by the implementation to know how infinities and NaNs are handled
by most operations, starting with the basic arithmetic operations.
> ...
> >>> But since f_1 is defined by the formula in 5.2.4.2.2p2, this means that
> >>> it is defined only with no more than p digits, where p = LDBL_MANT_DIG
> >>> for long double.
> >
> >> No, f_1 can trivially be identified as the leading base-b digit,
> >> regardless of whether or not there are too many base-b digits for the
> >> representation to qualify as a floating-point number.
> >
> > It can, but this is not what the standard says. This can be solved
> > if p is replaced by the infinity over the sum symbol in the formula.
> The C standard defines what normalized, subnormal, and unnormalized
> floating-point numbers are; it does not define what those terms mean for
> things that could qualify as floating-point numbers, but only if
> LDBL_MANT_DIG were larger. However, the extension of those concepts to
> such representations is trivial. I think that increasing the value of
> LDBL_MANT_DIG to allow them to qualify as floating-point numbers is the
> more appropriate approach.
> If p is replaced by infinity, it no longer defines a floating point
> format. Such formats are defined, in part, by their finite maximum value
> of p.
So, you mean that to follow the definition of "normal" used with
double-double on PowerPC, LDBL_MANT_DIG needs to be increased to
a large value (something like e_max - e_min + 53), even though
not all floating-point values would be representable?
But then, various specifications would be incorrect, such as:
* LDBL_MAX, as (1 - b^(-p)) b^emax would not be representable
(p would be too large).
* LDBL_EPSILON would no longer make any sense and would not be
representable, as b^(1-p) would be too small.
* frexp, because the condition "value equals x times 2^(*exp)" could
not always be satisfied (that's probably why the standard says "If
value is not a floating-point number [...]", which is OK if all
floating-point numbers are assumed to be exactly representable).
In summary, you're just replacing one problem with several problems.