Oct 1, 2021, 11:37:33 AMOct 1

to

// numeric_limits<long double>::digits=64.

//

typedef long double FType;

FType x=numeric_limits<FType>::max();

int iexp;

int64_t mint;

x=::frexpl(x,&iexp);

x=::ldexpl(x,numeric_limits<FType>::digits);

mint= static_cast<int64_t>(x);

Result (mint) is a negative number, something not right!!!

Oct 1, 2021, 12:09:29 PMOct 1

to

Oct 1, 2021, 1:19:03 PMOct 1

to

#include <iostream>

#include <cstdint>

#include <limits>

#include <utility>

#include <iomanip>

using namespace std;

using mantissa_pair = pair<bool, uint64_t>;

mantissa_pair getMantissa( double value );

int main()

{

double v = numeric_limits<double>::min();

do

{

mantissa_pair mp = getMantissa( v );

cout << "value: " << v;

if( mp.first )

cout << " mantissa: " << hex << mp.second << endl;

else

cout << " invalid mantissa (Inf, S(NaN))" << endl;

} while( (v *= 2.0) != numeric_limits<double>::infinity() );

}

static_assert(numeric_limits<double>::is_iec559, "must be standard fp");

mantissa_pair getMantissa( double value )

{

union

{

uint64_t binary;

double value;

} u;

u.value = value;

unsigned exponent = (unsigned)(u.binary >> 52) & 0x7FF;

if( exponent == 0 )

return pair<bool, uint64_t>( true, u.binary & 0xFFFFFFFFFFFFFu |

0x10000000000000u );

if( exponent == 0x7FF )

return pair<bool, uint64_t>( false, 0 );

return pair<bool, uint64_t>( true, u.binary & 0xFFFFFFFFFFFFFu |

0x10000000000000u );

}

Oct 1, 2021, 1:21:12 PMOct 1

to

return pair<bool, uint64_t>( true, u.binary & 0xFFFFFFFFFFFFFu );

Oct 1, 2021, 1:25:19 PMOct 1

to

>> mantissa_pair getMantissa( double value )

>> {

>> union

>> {

>> uint64_t binary;

>> double value;

>> } u;

>> u.value = value;

>> unsigned exponent = (unsigned)(u.binary >> 52) & 0x7FF;

>> if( exponent == 0 )

>> return pair<bool, uint64_t>( true, u.binary &

>> 0xFFFFFFFFFFFFFu | 0x10000000000000u );

>

> Should be:

> return pair<bool, uint64_t>( true, u.binary &

> 0xFFFFFFFFFFFFFu );

>

>

>> if( exponent == 0x7FF )

>> return pair<bool, uint64_t>( false, 0 );

>> return pair<bool, uint64_t>( true, u.binary & 0xFFFFFFFFFFFFFu |

>> 0x10000000000000u );

>> }

Now with maximum efficiency:

mantissa_pair getMantissa( double value )

{

union

{

uint64_t binary;

double value;

} u;

u.value = value;

unsigned exponent = (unsigned)(u.binary >> 52) & 0x7FF;

if( exponent == 0x7FF )

return pair<bool, uint64_t>( false, 0 );

uint64_t hiBit = (uint64_t)(exponent != 0) << 52;

return pair<bool, uint64_t>( true, u.binary & 0xFFFFFFFFFFFFFu | hiBit );

}

Oct 1, 2021, 1:53:58 PMOct 1

to

On 01.10.2021 at 19:25, Bonita Montero wrote:

> mantissa_pair getMantissa( double value )

> {

> union

> {

> uint64_t binary;

> double value;

> } u;

> u.value = value;

> unsigned exponent = (unsigned)(u.binary >> 52) & 0x7FF;

> if( exponent == 0x7FF )

> return pair<bool, uint64_t>( false, 0 );

> uint64_t hiBit = (uint64_t)(exponent != 0) << 52;

> return pair<bool, uint64_t>( true, u.binary & 0xFFFFFFFFFFFFFu |

> hiBit );

> }

Oh, I think this would be the best solution for 0x7ff-exponents:

mantissa_pair getMantissa( double value )

{

union

{

uint64_t binary;

double value;

} u;

u.value = value;

unsigned exponent = (unsigned)(u.binary >> 52) & 0x7FF;

uint64_t mantissa = u.binary & 0xFFFFFFFFFFFFFu;

if( exponent == 0x7FF )

return pair<bool, uint64_t>( false, mantissa );

uint64_t hiBit = (uint64_t)(exponent != 0) << 52;

return pair<bool, uint64_t>( true, mantissa | hiBit );
}

This also returns the mantissa, with false in .first, for Inf and (S)NaN.

Oct 1, 2021, 2:05:56 PMOct 1

to

On 10/1/21 12:09 PM, Radica...@theburrow.co.uk wrote:

> On Fri, 1 Oct 2021 08:37:21 -0700 (PDT)

> wij <wyn...@gmail.com> wrote:

>> // numeric_limits<long double>::digits=64.

>> //

>> typedef long double FType;

>> FType x=numeric_limits<FType>::max();

>> int iexp;

>> int64_t mint;

>> x=::frexpl(x,&iexp);

>> x=::ldexpl(x,numeric_limits<FType>::digits);

>> mint= static_cast<int64_t>(x);

>>

>> Result (mint) is a negative number, something not right!!!

I can't duplicate this problem: I get mint:9223372036854775807.

> It might help if you set iexp to some value.

frexpl() comes from the C standard library.

Section 7.12.6.7p1 of the C standard says:

> long double frexpl(long double value, int *p);

The following paragraph says:

> The frexp functions break a floating-point number into a normalized fraction and an integer exponent. They store the integer in the int object pointed to by p.

So iexp should be set. When I ran the code, it got set to a value of

16384. That's not the problem.
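The claim that frexpl always stores the exponent can be checked directly. The following is only a minimal sketch, assuming a binary (radix-2) long double; `FracExp` and `fraction_and_exponent` are hypothetical names that do not appear in the thread.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical result type: x == fraction * 2^exponent, fraction in [0.5, 1).
struct FracExp {
    long double fraction;
    int exponent;
};

// Hypothetical helper wrapping the long double overload of std::frexp.
FracExp fraction_and_exponent(long double x) {
    int iexp = -12345;                     // deliberately bogus; frexp overwrites it
    long double f = std::frexp(x, &iexp);  // long double overload from <cmath>
    assert(iexp != -12345);                // the exponent was stored by the call
    return FracExp{f, iexp};
}
```

For the largest finite long double the stored exponent should equal `numeric_limits<long double>::max_exponent`, which is 16384 for the x87 80-bit format and so matches the value reported above.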

Oct 1, 2021, 2:09:54 PMOct 1

to

On 01/10/2021 18:18, Bonita Montero wrote:

> On 01.10.2021 at 17:37, wij wrote:

>> // numeric_limits<long double>::digits=64.

>> //

>> typedef long double FType;

>> FType x=numeric_limits<FType>::max();

>> int iexp;

>> int64_t mint;

>> x=::frexpl(x,&iexp);

>> x=::ldexpl(x,numeric_limits<FType>::digits);

>> mint= static_cast<int64_t>(x);

>>

>> Result (mint) is a negative number, something not right!!!

>

> Take this:

>

> #include <iostream>

> #include <cstdint>

> #include <limits>

> #include <utility>

> #include <iomanip>

>

> using namespace std;

>

> using mantissa_pair = pair<bool, uint64_t>;

>

> mantissa_pair getMantissa( double value );

What happened to *long* double?

Oct 1, 2021, 2:17:04 PMOct 1

to

Oct 1, 2021, 2:53:54 PMOct 1

to

long double max = numeric_limits<long double>::max();

long mantissa = max; // implicit conversion

--

7-77-777

Evil Sinner!

Oct 1, 2021, 8:19:54 PMOct 1

to

On 10/1/21 2:53 PM, Branimir Maksimovic wrote:

...

> long double max = numeric_limits<long double>::max();

> long mantissa = max; // implicit conversion

"A prvalue of a floating-point type can be converted to a prvalue of an

integer type. The conversion truncates; that is, the fractional part is

discarded. The behavior is undefined if the truncated value cannot be

represented in the destination type." (7.3.10p1).

While it's not required to be the case, on most implementations

std::numeric_limits<long double>::max() is WAY too large to be

represented by a long, so the behavior of such code is undefined. The

minimum value of LDBL_MAX (set by the C standard, inherited by the C++

standard) is 1e37, which would require long to have at least 123 bits in

order for that conversion to have defined behavior.

And even when it has defined behavior, I can't imagine how you would

reach the conclusion that this conversion should be the value of the

mantissa.
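Because an out-of-range floating-to-integer conversion is undefined, one way to stay on defined ground is to test the range before converting. This is only a sketch of that idea, not code from the thread; `to_int64_checked` is a hypothetical helper, and the lower bound is deliberately conservative.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical helper: converts x to int64_t only when the conversion is
// certainly defined. Returns false (leaving 'out' untouched) for values
// outside [-2^63, 2^63) and for NaN, where every comparison is false.
bool to_int64_checked(long double x, std::int64_t& out) {
    const long double two63 = std::ldexp(1.0L, 63);  // 2^63 is exactly representable
    if (!(x >= -two63 && x < two63))
        return false;
    out = static_cast<std::int64_t>(x);              // truncates toward zero
    return true;
}
```

With this check, `numeric_limits<long double>::max()` is rejected instead of producing an implementation-specific value, while something like 42.9L converts to 42.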

Oct 1, 2021, 8:23:19 PMOct 1

to

#include <math.h>

#include <limits>

#include <iostream>

using namespace std;

#define ENDL endl

template<typename T>

int64_t get_mant(T x) {

int iexp;

x=frexp(x,&iexp);

x=ldexp(x,numeric_limits<T>::digits);

return static_cast<int64_t>(x);

};

int main()

{

cout << dec << get_mant(numeric_limits<float>::max()) << ", "

<< hex << get_mant(numeric_limits<float>::max()) << ENDL;

cout << dec << get_mant(numeric_limits<double>::max()) << ", "

<< hex << get_mant(numeric_limits<double>::max()) << ENDL;

cout << dec << get_mant(numeric_limits<long double>::max()) << ", "

<< hex << get_mant(numeric_limits<long double>::max()) << ENDL;

return 0;

};

// end file t.cpp -----

$ g++ t.cpp

]$ ./a.out

16777215, ffffff

9007199254740991, 1fffffffffffff

-9223372036854775808, 8000000000000000

Oct 1, 2021, 10:16:30 PMOct 1

to

ok correct is long long :P

--

7-77-777

Evil Sinner!

Oct 2, 2021, 1:29:01 AMOct 2

to

in my quote from the standard, it's whether "the truncated value cannot

be represented in the destination type". std::numeric_limits<long

double>::max() doesn't have a fractional part, so the truncated value is

the same as the actual value. On my system, for instance, that value is

1.18973e+4932.

> ok correct is long long :P

On my system, changing it to long long doesn't make any difference - the

maximum value representable by long long is still 9223372036854775807,

the same as the maximum value for long; it's still far too small to

represent 1.18973e+4932, so the behavior of the conversion is undefined.

The actual behavior on my system appears to be saturating at LLONG_MAX

== 9223372036854775807. If I change the second line to

long long mantissa = 0.75*max;

0.75*max is 8.92299e+4931, which should certainly not have the same

mantissa as max itself, but the value loaded into "mantissa" is still

9223372036854775807.

Oct 2, 2021, 1:56:20 AMOct 2

to

I accidentally sent this message first to Branimir by e-mail, and he

responded in kind.

On 10/2/21 1:31 AM, Branimir Maksimovic wrote:

>

>> On 02.10.2021., at 07:27, James Kuyper <james...@alumni.caltech.edu> wrote:

>>

>> On 10/1/21 10:16 PM, Branimir Maksimovic wrote:

...

>> ok correct is long long :P

>> On my system, changing it to long long doesn't make any different - the

>> maximum value representable by long long is still 9223372036854775807,

>> the same as the maximum value for long; it's still far too small to

>> represent 1.18973e+4932, so the behavior of the conversion is undefined.

>> The actual behavior on my system appears to be saturating at LLONG_MAX

>> == 9223372036854775807. If I change the second line to

>>

>> long long mantissa = 0.75*max;

>>

>> 0.75*max is 8.92299e+4931, which should certainly not have the same

>> mantissa as max itself, but the value loaded into "mantissa" is still

>> 9223372036854775807.

> Problem is that neither long double nor long is defined how small

> or large can be…

> so it can fit or not…

The C++ standard cross-references the C standard for such purposes, and

the C standard imposes strict limits on how small those things can be:

LLONG_MAX is required to be at least 9223372036854775807, and LDBL_MAX

is supposed to be at least 1e37.

You are right, however, about there being no limits on how large they

can be. It is therefore permissible for an implementation to have

LLONG_MAX >= LDBL_MAX, but do you know of any such implementation?

In any event, the relevant issue is not the limits imposed on those

values by the standard, but the actual values of LLONG_MAX and LDBL_MAX

for the particular implementation you're using, and it's perfectly

feasible to determine those values from <climits>, <cfloat>, or

std::numeric_limits<>::max. What are those values on the implementation

you're using?

> but question is how to extract mantissa which was answer :P

No, that is not the answer. If max did have a value small enough to make

the conversion to long long have defined behavior, the result of that

conversion would be the truncated value itself (7.3.10p1), NOT the

mantissa of the truncated value. What makes you think otherwise?
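The limits mentioned above can be inspected on any given implementation. A minimal sketch follows; the two static_asserts restate the minimums quoted from the C standard, and `long_long_covers_long_double` is a hypothetical helper name.

```cpp
#include <cfloat>
#include <climits>
#include <limits>

// Minimum magnitudes required by the C standard, restated as compile-time checks.
static_assert(LLONG_MAX >= 9223372036854775807LL, "LLONG_MAX below the required minimum");
static_assert(LDBL_MAX >= 1e37L, "LDBL_MAX below the required minimum");

// Hypothetical helper: true only on an implementation where LLONG_MAX >= LDBL_MAX,
// i.e. where converting any finite long double to long long would be defined.
bool long_long_covers_long_double() {
    return static_cast<long double>(std::numeric_limits<long long>::max())
           >= std::numeric_limits<long double>::max();
}
```

On the common implementations discussed in this thread the helper is expected to return false, which is why the conversion of max is undefined there.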

Oct 2, 2021, 6:09:23 AMOct 2

to

On 01.10.2021 at 17:37, wij wrote:

parts and double parts to double conversion:

#pragma once

#include <limits>

#include <cstdint>

#include <cassert>

struct dbl_parts

{

static_assert(std::numeric_limits<double>::is_iec559, "must be standard fp");

dbl_parts( double d );

dbl_parts &operator =( double d );

dbl_parts() = default;

operator double();

bool getSign();

std::uint16_t getBiasedExponent();

std::int16_t getExponent();

std::uint64_t getMantissa();

void setSign( bool sign );

void setBiasedExponent( uint16_t exp );

void setExponent( int16_t exp );

void setMantissa( uint64_t mantissa );

private:

union

{

double value;

std::uint64_t binary;

};

};

inline

dbl_parts::dbl_parts( double d ) :

value( d )

{

}

inline

dbl_parts &dbl_parts::operator =( double d )

{

value = d;

return *this;

}

inline

dbl_parts::operator double()

{

return value;

}

inline

bool dbl_parts::getSign()

{

return (int64_t)binary < 0;

}

inline

std::uint16_t dbl_parts::getBiasedExponent()

{

return (std::uint16_t)(binary >> 52) & 0x7FF;

}

inline

int16_t dbl_parts::getExponent()

{

return (int16_t)getBiasedExponent() - 0x3FF;

}

inline

std::uint64_t dbl_parts::getMantissa()

{

std::uint16_t bExp = getBiasedExponent();

std::uint64_t hiBit = (uint64_t)(bExp && bExp != 0x7FF) << 52;

return binary & 0xFFFFFFFFFFFFFu | hiBit;

}

inline

void dbl_parts::setSign( bool sign )

{

binary = binary & 0x7FFFFFFFFFFFFFFFu | (std::uint64_t)sign << 63;

}

inline

void dbl_parts::setBiasedExponent( std::uint16_t exp )

{

assert(exp <= 0x7FF);

binary = binary & 0x800FFFFFFFFFFFFFu | (std::uint64_t)exp << 52;

}

inline

void dbl_parts::setExponent( std::int16_t exp )

{

assert(exp >= -0x3FF && exp <= 0x400);

setBiasedExponent( (uint16_t)(exp + 0x3FF) );

}

inline

void dbl_parts::setMantissa( std::uint64_t mantissa )

{

assert((getBiasedExponent() != 0 && getBiasedExponent() != 0x7FF) ||

!(mantissa & -0x10000000000000));

assert(getBiasedExponent() == 0 || getBiasedExponent() == 0x7FF ||

mantissa <= 0x1FFFFFFFFFFFFFu);

binary = binary & 0xFFF0000000000000u | mantissa & 0xFFFFFFFFFFFFFu;

}

Oct 2, 2021, 6:59:22 AMOct 2

to

On 2021-10-02, James Kuyper <james...@alumni.caltech.edu> wrote:

> I accidentally sent this message first to Branimir by e-mail, and he

>

>

> No, that is not the answer. If max did have a value small enough to make

> the conversion to long long have defined behavior, the result of that

> conversion would be the truncated value itself (7.3.10p1), NOT the

> mantissa of the truncated value. What makes you think otherwise?

Because mantissa is whole part of floating point number?

--

7-77-777

Evil Sinner!

Oct 2, 2021, 12:59:19 PMOct 2

to

On 10/2/2021 2:23 AM, wij wrote:

> On Saturday, 2 October 2021 at 02:05:56 UTC+8, james...@alumni.caltech.edu wrote:

>> On 10/1/21 12:09 PM, Radica...@theburrow.co.uk wrote:

>>> On Fri, 1 Oct 2021 08:37:21 -0700 (PDT)

>>> wij <wyn...@gmail.com> wrote:

>>>> // numeric_limits<long double>::digits=64.

>>>> //

>>>> typedef long double FType;

>>>> FType x=numeric_limits<FType>::max();

>>>> int iexp;

>>>> int64_t mint;

>>>> x=::frexpl(x,&iexp);

>>>> x=::ldexpl(x,numeric_limits<FType>::digits);

>>>> mint= static_cast<int64_t>(x);

>>>>

>>>> Result (mint) is a negative number, something not right!!!

>> I can't duplicate this problem: I get mint:9223372036854775807.

<snip>

1)

The templated function uses 'frexp' and 'ldexp', both of which take double

arguments (not *long* double), hence UB occurs at those calls for the

'long double' type whenever this type is actually larger than 'double'.

2)

On my Linux box numeric_limits<*long* double>::digits is 64

(numeric_limits<double>::digits is 53), so the static_cast<int64_t>(x)

yields UB again.
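Point 2 can be illustrated with a templated variant that returns an unsigned 64-bit value. This is a hedged sketch rather than the code posted in the thread: it assumes a binary floating-point type whose significand has at most 64 bits and a finite, positive, normal argument, and it reuses the name `get_mant` only for continuity.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Sketch: significand of a finite, positive, normal x as an integer with
// numeric_limits<T>::digits bits. The result is at most 2^digits - 1, which
// fits in uint64_t when digits <= 64 but can exceed INT64_MAX when digits == 64.
template <typename T>
std::uint64_t get_mant(T x) {
    static_assert(std::numeric_limits<T>::radix == 2, "binary floating point assumed");
    static_assert(std::numeric_limits<T>::digits <= 64, "significand must fit in 64 bits");
    int iexp;
    x = std::frexp(x, &iexp);                           // overload matching T is selected
    x = std::ldexp(x, std::numeric_limits<T>::digits);  // integer-valued, below 2^digits
    return static_cast<std::uint64_t>(x);
}
```

For a long double with 64 digits, the value for max is 2^64 - 1, which is representable in uint64_t but not in int64_t; that is the conversion described as undefined above.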

=========

#include <cmath>

#include <cstdint>

#include <limits>

#include <iostream>

using namespace std;

#define ENDL endl

uint64_t get_mantf(float x)

{

int iexp;

x=frexp(x,&iexp);

x=ldexp(x,numeric_limits<float>::digits);

return static_cast<uint64_t>(x);

};

uint64_t get_mant(double x)

{

int iexp;

x=frexp(x,&iexp);

x=ldexp(x,numeric_limits<double>::digits);

return static_cast<uint64_t>(x);

};

uint64_t get_mantl(long double x)

{

int iexp;

x=frexp(x,&iexp);

x=ldexp(x,numeric_limits<long double>::digits);

return static_cast<uint64_t>(x);

};

int main()

{

cout << dec << numeric_limits<float>::digits << ", " <<

get_mantf(numeric_limits<float>::max()) << ", "

<< hex << get_mantf(numeric_limits<float>::max()) << ENDL;

cout << dec << numeric_limits<double>::digits << ", " <<

get_mant(numeric_limits<double>::max()) << ", "

<< hex << get_mant(numeric_limits<double>::max()) << ENDL;

cout << dec << numeric_limits<long double>::digits << ", " <<

get_mantl(numeric_limits<long double>::max()) << ", "

<< hex << get_mantl(numeric_limits<long double>::max()) << ENDL;

return 0;

}

===================

$ c++ -std=c++11 -O2 -Wall mant.cc && ./a.out

24, 16777215, ffffff

53, 9007199254740991, 1fffffffffffff

64, 18446744073709551615, ffffffffffffffff

Oct 2, 2021, 1:21:47 PMOct 2

to

=====

uint64_t get_mantf(float x)

{

int iexp;

x=frexpf(x,&iexp);

x=ldexpf(x,numeric_limits<float>::digits);

return static_cast<uint64_t>(x);

};

uint64_t get_mant(double x)

{

int iexp;

x=frexp(x,&iexp);

x=ldexp(x,numeric_limits<double>::digits);

return static_cast<uint64_t>(x);

};

uint64_t get_mantl(long double x)

{

int iexp;

x=frexpl(x,&iexp);

x=ldexpl(x,numeric_limits<long double>::digits);

return static_cast<uint64_t>(x);

};

=====

However, the three distinct functions with frexp and ldexp in all three

still work on my box (I'm guessing gcc is still able to compile the

correct implementation in this case), but the template doesn't.
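One plausible explanation, offered here as an assumption rather than something confirmed in the thread, is that in C++ `<cmath>` declares overloads of frexp and ldexp for float, double and long double, so the unsuffixed names already select the long double versions. If so, the difference between the two programs would be the int64_t versus uint64_t return type rather than the function names. The sketch below checks the overload selection; `round_trips` is a hypothetical helper.

```cpp
#include <cmath>
#include <limits>
#include <type_traits>

// The unsuffixed names resolve to overloads that preserve the argument type.
static_assert(std::is_same<decltype(std::frexp(1.0L, static_cast<int*>(nullptr))),
                           long double>::value, "frexp(long double) returns long double");
static_assert(std::is_same<decltype(std::ldexp(1.0L, 0)),
                           long double>::value, "ldexp(long double) returns long double");
static_assert(std::is_same<decltype(std::frexp(1.0f, static_cast<int*>(nullptr))),
                           float>::value, "frexp(float) returns float");

// Hypothetical helper: true when splitting and rebuilding x loses nothing.
bool round_trips(long double x) {
    int iexp;
    long double f = std::frexp(x, &iexp);
    return std::ldexp(f, iexp) == x;
}
```

If the call went through a double-only function, the largest long double would not survive the round trip on a platform where long double is wider than double.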

Oct 2, 2021, 2:57:53 PMOct 2

to

an answer in two parts:

1. The mantissa of a floating point number (also called the significand,

which is the term used in the C standard in text that is incorporated by

reference into the C++ standard) is something quite different from the

whole part of that number. See

<https://en.m.wikipedia.org/wiki/Significand> for details.

2. When a floating point number is greater than LLONG_MAX+1.0, the code

you provided has undefined behavior. That makes sense, since a long long

object cannot represent the whole part of such a number. The actual

behavior can vary from one implementation to another, but on my system,

it does NOT contain the mantissa, either.

Consider the following program:

#include <iostream>

#include <limits>

int main(void)

{

typedef long double FType;

FType max= std::numeric_limits<FType>::max();

FType large = 0xA.BCDEF0123456789p+16380L;

FType middling = 0xABCDEF01.23456789p0L;

long long maxll = max;

long long largell = large;

long long middlingll = middling;

std::cout << std::fixed << "max: " << max << std::endl;

std::cout << "large: " << large << std::endl;

std::cout << "middling: " << middling << std::endl;

std::cout << "maxll: " << maxll << std::endl;

std::cout << "largell: " << largell << std::endl;

std::cout << "middlingll: " << middlingll << std::endl;

std::cout << std::hexfloat << std::showbase << std::showpoint <<

std::hex << "max: " << max << std::endl;

std::cout << "large: " << large << std::endl;

std::cout << "middling: " << middling << std::endl;

std::cout << "maxll: " << maxll << std::endl;

std::cout << "largell: " << largell << std::endl;

std::cout << "middlingll: " << middlingll << std::endl;

}

The output from that program on my system is:

max:



large:



middling: 2882400001.137778

maxll: 9223372036854775807

largell: 9223372036854775807

middlingll: 2882400001

max: 0xf.fffffffffffffffp+16380

large: 0xa.bcdef0123456789p+16380

middling: 0xa.bcdef0123456789p+28

maxll: 0x7fffffffffffffff

largell: 0x7fffffffffffffff

middlingll: 0xabcdef01

I'll use hexadecimal notation in the following comments:

The whole part of "max" has the same value as "max" itself. The same is

true of "large". The whole part of middling is 0xABCDEF01. The

fractional part of middling is 0x0.23456789.

The mantissas are as follows:

max: 0xffffffffffffffff

large: 0xabcdef0123456789

middling: 0xabcdef0123456789

The values stored in "maxll" and "largell" do not match either the whole

part or the mantissa of the corresponding floating point number. Because

"middling" is the only one of the three long double values that is

smaller than LLONG_MAX, the value stored in middlingll is the whole part

of "middling", as it should be, but is quite different from the mantissa

of "middling".

Oct 2, 2021, 3:00:58 PMOct 2

to

double` number, represented as an `int64_t` value.

A `long double` is in practice either IEEE 754 64-bit, or IEEE 754

80-bit. In Windows that choice depends on the compiler. With Visual C++

(and hence probably also Intel) it's 64-bit, same as type `double`,

while with MinGW g++ (and hence probably also clang) it's 80-bit,

originally the x86-family's math coprocessor's extended format. For

80-bit IEEE 754 the mantissa part is 64 bits.

With a 64-bit mantissa there is a high chance of setting the sign bit of

an `int64_t` to 1, resulting in a negative value. I believe that that

will only /not/ happen for a denormal value, but, I'm tired and might be

wrong about that. Anyway, instead use unsigned types for bit handling.

For example, in this case, use `uint64_t`.

However, instead of the shenanigans with `frexpl` and `ldexpl` I'd just

use `memcpy`.

Due to the silliness in gcc regarding the standard's "strict aliasing"

rule, I'd not use a reinterpretation pointer cast.

- Alf
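The memcpy approach might look like the following for a plain double. This is a sketch under stated assumptions, not code from the thread: it assumes an IEEE 754 64-bit double whose byte order matches that of uint64_t, and `DoubleBits` and `split_double` are hypothetical names. The 80-bit long double layout is platform specific and is not covered here.

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(std::numeric_limits<double>::is_iec559, "IEEE 754 double assumed");
static_assert(sizeof(double) == sizeof(std::uint64_t), "64-bit double assumed");

// Hypothetical holder for the three fields of a binary64 value.
struct DoubleBits {
    bool sign;
    unsigned biased_exponent;  // 11 bits
    std::uint64_t fraction;    // 52 stored bits, without the hidden bit
};

DoubleBits split_double(double d) {
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);  // copies the object representation
    DoubleBits r;
    r.sign = (bits >> 63) != 0;
    r.biased_exponent = static_cast<unsigned>((bits >> 52) & 0x7FF);
    r.fraction = bits & 0xFFFFFFFFFFFFFull;
    return r;
}
```

Using unsigned types throughout avoids the sign-bit problem described above, and memcpy sidesteps the aliasing concern that a pointer cast would raise.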

Oct 2, 2021, 3:12:21 PMOct 2

to

guarantee that any integer type is large enough to hold the mantissa,

but FType is.

> However, instead of the shenanigans with `frexpl` and `ldexpl` I'd just

> use `memcpy`.

The advantage of the code as written is that (if you change mint to have

FType) it will give the correct result even if your assumption about

IEEE 754 is false; that's not the case with memcpy().
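Keeping the result in FType, as suggested, might look like the sketch below. It assumes only a binary radix, not any particular IEEE 754 layout; `mantissa_as_ftype` is a hypothetical name.

```cpp
#include <cmath>
#include <limits>

typedef long double FType;

static_assert(std::numeric_limits<FType>::radix == 2, "binary floating point assumed");

// Returns the significand of a finite x scaled to an integer value, stored in
// FType itself, so no integer type has to be wide enough to hold it.
FType mantissa_as_ftype(FType x) {
    int iexp;
    x = std::frexp(x, &iexp);
    return std::ldexp(x, std::numeric_limits<FType>::digits);
}
```

For the largest finite value the result is 2^digits - 1, which FType can always hold exactly, whatever the width of its significand.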

Oct 2, 2021, 3:23:41 PMOct 2

to

If so, agreed.

It's late in the day for me, sorry.

>> However, instead of the shenanigans with `frexpl` and `ldexpl` I'd just

>> use `memcpy`.

>

> The advantage of the code as written is that (if you change mint to have

> FType) it will give the correct result even if your assumption about

> IEEE 754 is false; that's not the case with memcpy().

(numeric_limits::is_iec559). Dealing with the bits of just about any

representation seems to me a hopelessly daunting task. :-o

- Alf

Oct 2, 2021, 3:27:34 PMOct 2

to

There AFAIK is no suitable type name for the integer type with

sufficient bits to represent the mantissa, or generally >N bits.

Unfortunately the standard library doesn't provide a mapping from number

of bits as a value, to integer type with that many bits. It can be done,

on the assumption that all types in `<stdint.h>` are present. And

perhaps one can then define a name like `FType` in terms of that mapping.
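Such a mapping could be sketched with std::conditional. The names `uint_for_bits` and `double_mantissa_t` are hypothetical, and the sketch assumes that the exact-width types of `<cstdint>` up to 64 bits all exist.

```cpp
#include <cstdint>
#include <limits>
#include <type_traits>

// Maps a bit count to the smallest exact-width unsigned type that can hold it.
template <int Bits>
struct uint_for_bits {
    static_assert(Bits > 0 && Bits <= 64, "no standard fixed-width type is wide enough");
    typedef typename std::conditional<(Bits <= 8),  std::uint8_t,
            typename std::conditional<(Bits <= 16), std::uint16_t,
            typename std::conditional<(Bits <= 32), std::uint32_t,
                                      std::uint64_t>::type>::type>::type type;
};

// An integer type wide enough for the significand of double.
typedef uint_for_bits<std::numeric_limits<double>::digits>::type double_mantissa_t;
```

The static_assert makes the limitation explicit: for a floating-point type with more than 64 significand bits there is no standard integer type to map to.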

Oct 2, 2021, 3:37:56 PMOct 2

to

On 10/2/21 2:59 PM, Branimir Maksimovic wrote:

>

>

>> On 02.10.2021., at 20:51, James Kuyper

>> <james...@alumni.caltech.edu

>> <mailto:james...@alumni.caltech.edu>> wrote:

>>

>> int main(void)

>> {

>> typedef long double FType;

>> FType max= std::numeric_limits<FType>::max();

>> FType large = 0xA.BCDEF0123456789p+16380L;

>> FType middling = 0xABCDEF01.23456789p0L;

>> long long maxll = max;

>> long long largell = large;

>> long long middlingll = middling;

>>

>> std::cout << std::fixed << "max: " << max << std::endl;

>> std::cout << "large: " << large << std::endl;

>> std::cout << "middling: " << middling << std::endl;

>>

>> std::cout << "maxll: " << maxll << std::endl;

>> std::cout << "largell: " << largell << std::endl;

>> std::cout << "middlingll: " << middlingll << std::endl;

>>

>> std::cout << std::hexfloat << std::showbase << std::showpoint <<

>> std::hex << "max: " << max << std::endl;

>> std::cout << "large: " << large << std::endl;

>> std::cout << "middling: " << middling << std::endl;

>>

>> std::cout << "maxll: " << maxll << std::endl;

>> std::cout << "largell: " << largell << std::endl;

>> std::cout << "middlingll: " << middlingll << std::endl;

>> }

> mantissa.cpp:7:18: warning: magnitude of floating-point constant too

> large for type 'long double'; maximum is 1.7976931348623157E+308

> [-Wliteral-range]

> Sorry your program is not correct..

The value of "large" was chosen to be almost as large as "max" on my

machine. It's apparently larger than LDBL_MAX on your system. A more

portable initializer would be

FType large = 0x0.ABCDEF0123456789p0L * max;

Depending upon the value of LDBL_EPSILON on the target implementation,

that definition might result in the mantissa of "large" having fewer

significant digits than it has on mine, but that's not particularly

important.

> This is output on my system:

> bmaxa@Branimirs-Air projects % ./a.out

> max:

> 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000

> large: inf

> middling: 2882400001.137778

> maxll: 9223372036854775807

> largell: 9223372036854775807

> middlingll: 2882400001

> max: 0x1.fffffffffffffp+1023

> large: inf

> middling: 0x1.579bde02468adp+31

> maxll: 0x7fffffffffffffff

> largell: 0x7fffffffffffffff

> middlingll: 0xabcdef01

>

> Greetings, Branimir.

With that change, the value of "large" on my machine changes to

0xa.bcdef0123456788p+16380, with a corresponding (HUGE) change to the

less significant digits of the decimal output, and the mantissa is

therefore 0xabcdef0123456788.

You'll get much different values for "max" and "large" on your system,

since LDBL_MAX is much smaller, but the qualitative comments I made

about those values should still be accurate. There are no guarantees,

since the behavior is undefined, but I would expect that the values of

"maxll" and "largell" will be unchanged.

Oct 2, 2021, 5:20:07 PMOct 2

to

point type of the number to be analyzed.

Oct 2, 2021, 8:45:33 PMOct 2

to

On 2021-10-02, James Kuyper <james...@alumni.caltech.edu> wrote:

--

7-77-777

Evil Sinner!

to weak you should be meek, and you should brainfuck stronger

https://github.com/rofl0r/chaos-pp

Oct 2, 2021, 8:48:09 PMOct 2

to

On 2021-10-02, James Kuyper <james...@alumni.caltech.edu> wrote:

bmaxa@Branimirs-Air projects % ./a.out

max: 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000

large: 120645172288001372126619919515947355653853135095820093742269387851763173210613194516902557021349374744036403756647462602101944733127725809392434319981526284828181273773403605513307431121904981243948320057805774787281290926848565000291282225047507294705767371139696722108291516984908752371556782562287095382016.000000

middling: 2882400001.137778

maxll: 9223372036854775807

largell: 9223372036854775807

middlingll: 2882400001

max: 0x1.fffffffffffffp+1023

large: 0x1.579bde02468acp+1023

middling: 0x1.579bde02468adp+31

maxll: 0x7fffffffffffffff

largell: 0x7fffffffffffffff

middlingll: 0xabcdef01

>

Oct 3, 2021, 2:10:11 AMOct 3

to

So, this should be the most elegant code:

#pragma once
#include <limits>
#include <cstdint>
#include <cassert>

// Accessors for the sign, exponent and mantissa fields of an IEEE 754 double.
struct dbl_parts
{
    static_assert(std::numeric_limits<double>::is_iec559, "must be standard fp");
    dbl_parts( double d );
    dbl_parts &operator =( double d );
    dbl_parts() = default;
    operator double();
    bool getSign();
    std::uint16_t getBiasedExponent();
    std::int16_t getExponent();
    std::uint64_t getMantissa();
    void setSign( bool sign );
    void setBiasedExponent( std::uint16_t exp );
    void setExponent( std::int16_t exp );
    void setMantissa( std::uint64_t mantissa );
private:
    static unsigned const
        MANTISSA_BITS = 52;
    using i64 = std::int64_t;
    using ui64 = std::uint64_t;
    static ui64 const
        SIGN_MASK = (ui64)1 << 63,
        EXP_MASK = (ui64)0x7FF << MANTISSA_BITS,
        MANTISSA_MASK = ~-((i64)1 << MANTISSA_BITS),   // low 52 bits set
        MANTISSA_MAX = ((ui64)1 << MANTISSA_BITS) | MANTISSA_MASK;
    using ui16 = std::uint16_t;
    using i16 = std::int16_t;
    static ui16 const
        BEXP_DENORMAL = 0,
        BEXP_BASE = 0x3FF,
        BEXP_MAX = 0x7FF;
    static i16 const
        EXP_MIN = 0 - BEXP_BASE,
        EXP_MAX = BEXP_MAX - BEXP_BASE;
    // Both members share storage; the bits are read through 'binary'.
    union
    {
        double value;
        ui64 binary;
    };
};

inline
dbl_parts::dbl_parts( double d ) :
    value( d )
{
}

inline
dbl_parts &dbl_parts::operator =( double d )
{
    value = d;
    return *this;
}

inline
dbl_parts::operator double()
{
    return value;
}

inline
bool dbl_parts::getSign()
{
    return (i64)binary < 0;
}

inline
std::uint16_t dbl_parts::getBiasedExponent()
{
    return (ui16)(binary >> MANTISSA_BITS) & BEXP_MAX;
}

inline
std::int16_t dbl_parts::getExponent()
{
    return (i16)(getBiasedExponent() - BEXP_BASE);
}

inline
std::uint64_t dbl_parts::getMantissa()
{
    ui16 bExp = getBiasedExponent();
    // The hidden bit is set only for normal numbers (not denormals, Inf or NaN).
    ui64 hiBit = (ui64)(bExp && bExp != BEXP_MAX) << MANTISSA_BITS;
    return (binary & MANTISSA_MASK) | hiBit;
}

inline
void dbl_parts::setSign( bool sign )
{
    binary = (binary & ~SIGN_MASK) | ((ui64)sign << 63);
}

inline
void dbl_parts::setBiasedExponent( std::uint16_t exp )
{
    assert(exp <= BEXP_MAX);
    binary = (binary & (SIGN_MASK | MANTISSA_MASK)) | ((ui64)exp << MANTISSA_BITS);
}

inline
void dbl_parts::setExponent( std::int16_t exp )
{
    exp += BEXP_BASE;
    setBiasedExponent( exp );
}

inline
void dbl_parts::setMantissa( std::uint64_t mantissa )
{
#if !defined(NDEBUG)
    ui64 mantissaMax = MANTISSA_MASK | ((ui64)(getBiasedExponent() != BEXP_DENORMAL
        && getBiasedExponent() != BEXP_MAX) << MANTISSA_BITS);
    assert(mantissa <= mantissaMax);
#endif
    binary = (binary & (SIGN_MASK | EXP_MASK)) | (mantissa & MANTISSA_MASK);
}
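
For comparison, here is a minimal sketch of the same field extraction done with std::memcpy instead of a union, since reading the inactive member of a union is not guaranteed to work in C++. This is not the code posted above; the names DoubleFields and splitDouble are made up for illustration, and it assumes an IEEE 754 double whose byte order matches that of std::uint64_t.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical result type: the three fields of an IEEE 754 binary64 value.
struct DoubleFields
{
    bool          negative;        // sign bit
    std::uint16_t biasedExponent;  // 0 .. 0x7FF
    std::uint64_t mantissa;        // 52 stored bits, plus the hidden bit for normal numbers
};

inline DoubleFields splitDouble(double d)
{
    static_assert(std::numeric_limits<double>::is_iec559
                  && sizeof(double) == sizeof(std::uint64_t),
                  "sketch assumes a 64-bit IEEE 754 double");
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);   // copy the object representation

    DoubleFields f;
    f.negative       = (bits >> 63) != 0;
    f.biasedExponent = static_cast<std::uint16_t>((bits >> 52) & 0x7FF);
    f.mantissa       = bits & ((std::uint64_t{1} << 52) - 1);
    // Normal numbers carry an implicit leading 1; denormals, Inf and NaN do not.
    if (f.biasedExponent != 0 && f.biasedExponent != 0x7FF)
        f.mantissa |= std::uint64_t{1} << 52;
    return f;
}
```

Calling splitDouble(1.0) should give a biased exponent of 0x3FF and a mantissa with only bit 52 set.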

Oct 3, 2021, 3:36:07 AMOct 3

to

(in the original post, the include file <math.h> should be <cmath>)

Oct 3, 2021, 6:15:52 AMOct 3

to

If you know you're running on an x86/x64 device (or even an 8088 with

8087 co-processor!), then this inline code expands a 64-bit double 'x64'

to its constituent parts:

fld qword [x64]

fstp tword [a80] ; sometimes, 'tbyte'

And for a long double 'x80' known to use 80-bit format:

fld tword [x80]

fstp tword [a80]

The former works because on the x87, all loads expand to an 80-bit

internal format with no hidden parts.

'a80' needs to be an instance of a type like this (here assumes

little-endian memory format, which is typical for anything with x87):

typedef struct {

uint64_t mantissa;

uint16_t sign_and_exponent; // sign is top bit

} ldformat;

I don't know how to reliably split those last 16 bits into 15- and 1-bit

fields using C's bitfields. (I used a different test language.)

ASM code shown may need adapting to gcc-style assembly.
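
The 15-bit/1-bit split can be done with masks rather than bitfields. The following is only a sketch under stated assumptions: long double is the x87 80-bit extended format, stored little-endian in the first 10 bytes of the object. The names X87Parts and split_x87 are invented for this example, and the function reports failure instead of guessing when the target does not look like that.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical result type mirroring the 'ldformat' layout described above.
struct X87Parts
{
    std::uint64_t mantissa;  // 64 bits, explicit integer bit on top
    std::uint16_t exponent;  // 15-bit biased exponent
    bool          negative;  // sign, the top bit of the last 16 bits
};

// Returns false when long double does not look like little-endian x87 extended.
inline bool split_x87(long double x, X87Parts& out)
{
    using L = std::numeric_limits<long double>;
    if (L::digits != 64 || L::max_exponent != 16384 || sizeof(long double) < 10)
        return false;

    // Runtime little-endian check (assumption: integer and FP byte order agree).
    const std::uint16_t one = 1;
    unsigned char firstByte;
    std::memcpy(&firstByte, &one, 1);
    if (firstByte != 1)
        return false;

    unsigned char raw[sizeof(long double)];
    std::memcpy(raw, &x, sizeof raw);

    std::uint16_t signAndExponent;
    std::memcpy(&out.mantissa, raw, 8);
    std::memcpy(&signAndExponent, raw + 8, 2);
    out.exponent = signAndExponent & 0x7FFF;        // low 15 bits
    out.negative = (signAndExponent >> 15) != 0;    // top bit
    return true;
}
```

On a target that passes the checks, 1.0L should come out as mantissa 0x8000000000000000 with biased exponent 0x3FFF.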

Oct 3, 2021, 6:19:45 AMOct 3

to

Am 03.10.2021 um 12:15 schrieb Bart:

> Jesus. And I think this still doesn't do an actual long double!

long double isn't supported by many compilers for x86-64.

long double should be avoided when possible because loads

and stores are slow with long double.

Oct 3, 2021, 10:02:25 AMOct 3

to

I've not been following the details of the thread but it seems to have

strayed from the original question (as happens of course).

wij <wyn...@gmail.com> writes:

> // numeric_limits<long double>::digits=64.

> //

> typedef long double FType;

> FType x=numeric_limits<FType>::max();

> int iexp;

> int64_t mint;

> x=::frexpl(x,&iexp);

> x=::ldexpl(x,numeric_limits<FType>::digits);

> mint= static_cast<int64_t>(x);

>

> Result (mint) is a negative number, something not right!!!

frexp (et al.) return a signed result with a magnitude in the interval

[1/2, 1). If you want the 64 most significant digits of the mantissa of

a long double (it may have many more digits than that) then I'd go for

something like this:

uint_least64_t ms64bits(long double d)

{

int exp;

return (uint_least64_t)std::scalbn(std::fabs(std::frexp(d, &exp)), 64);

}

If, as your code sketch suggests, you want all of them, then you are out

of luck as far as portable code goes because there may not be an integer

type wide enough.

But why do you want an integer value? For most mathematical uses the

result of std::frexp is what you really want.
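
As for the negative result in the original post: the scaled value is at least 2^63, which int64_t cannot represent, so that conversion has undefined behaviour; an unsigned 64-bit target avoids it. Below is a sketch along the lines of ms64bits that also hands back the matching exponent and rejects zero, infinities and NaN. The name top64 and its interface are made up for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical helper: on success, |x| is approximately bits * 2^exp2, where
// 'bits' holds the 64 most significant bits of the significand (top bit set).
inline bool top64(long double x, std::uint_least64_t& bits, int& exp2)
{
    if (!std::isfinite(x) || x == 0.0L)
        return false;                                   // no usable significand
    int e = 0;
    long double frac = std::fabs(std::frexp(x, &e));    // magnitude in [1/2, 1)
    // frac * 2^64 lies in [2^63, 2^64): too big for int64_t, fine for 64 unsigned bits.
    bits = static_cast<std::uint_least64_t>(std::ldexp(frac, 64));
    exp2 = e - 64;
    return true;
}
```

For 1.0L this should give bits == 2^63 and exp2 == -63. If long double has more than 64 significand bits, the lower ones are truncated.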

--

Ben.

Oct 3, 2021, 9:04:10 PMOct 3

to

OP wanted to get the mantissa of long double.

Oct 3, 2021, 9:47:02 PMOct 3

to

explicit VLFloat(long double x) try {

typedef long double FType;

typedef uint64_t UIntMant;

WY_ENSURE(Limits<FType>::Bits<=CHAR_BIT*sizeof(UIntMant));

if(Wy::isfinite(x)==false) {

throw Errno(EINVAL);

}

Errno r;

int iexp;

if(x<0) {

x=-x;

m_neg=true;

} else {

m_neg=false;

}

x=Wy::frexp(x,&iexp);

x=Wy::ldexp(x,Limits<FType>::Bits);

m_exp=iexp-Limits<FType>::Bits;

if((r=m_mant.set_num(MantType::itv(static_cast<UIntMant>(x))))!=Ok) {

throw r;

}

if((r=_finalize())!=Ok) {

throw r;

}

}

catch(const Errno& e) {

WY_THROW( Reply(e) );

};

--

Most of the time while asking questions in comp.lang.c/c++, I need to rewrite a bit.

Oct 3, 2021, 11:32:13 PMOct 3

to

characteristics as the type for which it is a synonym. What I said is

true because FType is a synonym for long double, the same type as x

itself. What I was asserting is that for any given floating point

object, the mantissa or significand stored in that object has a value

that is guaranteed to be representable in the same floating point type

as the object itself. There's no guarantee that any integer type is big

enough to represent it.

I've since thought this over, and checked carefully, and that's not

quite true - though it's true enough for most practical purposes. Let me

explain.

The C++ standard defines <cfloat>, and defines its contents mostly by

cross-referencing the C standard's definition of <float.h>. Section

5.2.4.2.2 of the C standard defines a parameterized model for floating

point representations, that is used as a basis for describing the

meaning of the macros defined in <float.h>, so that model is inherited

by C++.

I will need to refer to the following parameters of that model:

> b - base or radix of exponent representation (an integer > 1)

> e - exponent (an integer between a minimum e_min and a maximum e_max )

> p - precision (the number of base-b digits in the significand)

Note that b, e_min, e_max, and p are constants for any specific floating

point type.

In terms of that model, the value of the significand of a floating point

value x, interpreted as an integer, can be represented in the same

floating point type by a number with exactly the same significand, and e

= p.

The key issue is whether such a representation is allowed, and it turns

out that there can be floating point representations which fit the C

standard's model, for which e_max < p, preventing some significands from

being representable in such a type.

The macro LDBL_MAX (corresponding to std::numeric_limits<long

double>::max()) is defined as expanding to the value of

(1 - b^(-p))*b^e_max, and is required to be at least 1e37. If, for

example, e_max == p-1 and b=2, then this means that for such a type, p

must be at least 124.

Every floating point format I'm familiar with has an e_max value much

larger than p, so I think, as a practical matter, that it's safe to

assume that significands can be represented in the same floating point

type, but strictly speaking, it's not guaranteed.
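
The e_max >= p condition can be checked at compile time, since std::numeric_limits exposes both parameters as max_exponent and digits. This is a sketch only; significand_fits_in_same_type and integer_significand are names invented here, and the second function assumes a radix-2 type.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// True when e_max >= p for F, i.e. a p-digit integer significand
// (which needs exponent e = p in the model) is representable in F.
template <class F>
constexpr bool significand_fits_in_same_type()
{
    return std::numeric_limits<F>::max_exponent >= std::numeric_limits<F>::digits;
}

// Magnitude of the significand of x as an integer value held in type F.
// Note: frexp normalizes, so for subnormal x this is the normalized significand.
template <class F>
F integer_significand(F x)
{
    static_assert(std::numeric_limits<F>::radix == 2, "sketch assumes a binary format");
    static_assert(significand_fits_in_same_type<F>(),
                  "e_max < p: the significand may not be representable");
    int e = 0;
    F frac = std::fabs(std::frexp(x, &e));                    // [1/2, 1) for finite nonzero x
    return std::ldexp(frac, std::numeric_limits<F>::digits);  // scale by 2^p
}
```

With an IEEE 754 double (p = 53), integer_significand(1.0) should be 2^52 and integer_significand of the largest double should be 2^53 - 1.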

Oct 3, 2021, 11:42:40 PMOct 3

to

perfectly normal. With that difference in mind, all of those results are

consistent with what I said. None of the long long values are the same

as the mantissa of the corresponding float value, and only middlingll is

the same as the whole part of corresponding float value.

Do you now concede that your approach is not guaranteed to result in a

long long value containing the mantissa?
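
Since converting an out-of-range floating point value to long long is undefined behaviour, the range can be tested before the cast. A minimal sketch, assuming a binary long double; fits_in_long_long and to_long_long are hypothetical helper names, and the lower bound is slightly conservative.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// True when v, truncated toward zero, is certain to fit in long long.
inline bool fits_in_long_long(long double v)
{
    // 2^digits is one past LLONG_MAX (2^63 for a 64-bit long long) and is
    // exactly representable in a binary floating point type.
    const long double limit = std::ldexp(1.0L, std::numeric_limits<long long>::digits);
    return v >= -limit && v < limit;   // NaN fails both comparisons
}

// Converts only when the result is defined; leaves 'out' untouched otherwise.
inline bool to_long_long(long double v, long long& out)
{
    if (!fits_in_long_long(v))
        return false;
    out = static_cast<long long>(v);   // truncates toward zero
    return true;
}
```

With this check, the "max" and "large" values from the test program are rejected, while "middling" converts to 2882400001.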

Oct 3, 2021, 11:43:28 PMOct 3

to

On 10/2/21 3:23 PM, Alf P. Steinbach wrote:

> On 2 Oct 2021 21:12, James Kuyper wrote:

...

>> It would be safer and more portable to use FType; there's no portable

>> guarantee that any integer type is large enough to hold the mantissa,

>> but FType is.

>

> I believe you intended to write `uintptr_t`, not `FType`.

No, uintptr_t is not guaranteed to be able to hold the mantissa; nor is

any other integer type. FType is guaranteed to be able to hold it.

>>> However, instead of the shenanigans with `frexpl` and `ldexpl` I'd just

>>> use `memcpy`.

>>

>> The advantage of the code as written is that (if you change mint to have

>> FType) it will give the correct result even if your assumption about

>> IEEE 754 is false; that's not the case with memcpy().

>

> Uhm, I'd rather assert IEEE 754 representation

> (numeric_limits::is_iec559). Dealing with the bits of just about any

> representation seems to me a hopelessly daunting task. :-o

having to worry about the bits.

Oct 4, 2021, 12:34:39 AMOct 4

to

On 2021-10-04, James Kuyper <james...@alumni.caltech.edu> wrote:

> Note that b, e_min, e_max, and p are constants for any specific floating

> point type.

>

> In terms of that model, the value of the significand of a floating point

> value x, interpreted as an integer, can be represented in the same

> floating point type by a number with exactly the same significand, and e

> = p.

>

> The key issue is whether such a representation is allowed, and it turns

> out that there can be floating point representations which fit the C

> standard's model, for which e_max < p, preventing some signficands from

> being representable in such a type.

>

> The macro LDBL_MAX (corresponding to std::numeric_limits<long

> double>::max()) is defined as expanding to the value of

> (1 - b^(-p))*b^e_max, and is required to be at least 1e37. If, for

> example, e_max == p-1 and b=2, then this means that for such a type, p

> must be at least 124.

>

> Every floating point format I'm familiar with has an e_max value much

> larger than p, so I think, as a practical matter, that it's safe to

> assume that signficands can be represented in the same floating point

> type, but strictly speaking, it's not guaranteed.

Great!!!

So instead of int we use float type and truncate?

--

7-77-777

Evil Sinner!

to weak you should be meek, and you should brainfuck stronger

Oct 4, 2021, 12:35:40 AMOct 4

to

> Do you now concede that your approach is not guaranteed to result in a

> long long value containing the mantissa?

of course, you convinced me.

Oct 4, 2021, 3:52:20 AMOct 4

to

"long double" is supported by any C++ or C compiler, for any target,

that tries to come close to any current standards compliance.

What you probably mean is that on some targets, "long double" is the

same size as "double". That's true for most 32-bit (and smaller)

targets. Most 64-bit targets support larger "long double".

As happens so often, there is /one/ major exception to the common

practices used by (AFAICS) every other OS, every other processor, every

other compiler manufacturer - on Windows, and with MSVC, "long double"

is 64-bit.

"long double" should not be used where "double" will do, because it can

be a great deal slower on many platforms (the load and saves are

irrelevant). You also have to question whether "long double" does what

you want, on any given target. On smaller targets (or more limited

compilers, like MSVC), it gives you no benefits in accuracy or range

compared to "double". On some, such as x86-64 with decent tools, it

generally gives you 80-bit types. On others - almost any other 64-bit

system - it gives you 128-bit quad double, but it is likely to be

implemented in software rather than hardware.

However, it /is/ a valid type on all (reasonable) C and C++ systems, and

it is a perfectly reasonable question to ask how to handle it. Some

aspects can be handled in a portable manner, others require

implementation-dependent details.
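
What long double actually provides on a given target can be queried portably through std::numeric_limits. A small sketch follows; LongDoubleInfo and describeLongDouble are names invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Hypothetical summary of what long double offers on the current target.
struct LongDoubleInfo
{
    std::size_t sizeBytes;        // storage size, possibly including padding
    int         digits;           // significand digits in base 'radix'
    int         maxExponent;      // e_max in the C standard's model
    bool        widerThanDouble;  // more significand digits than double?
};

inline LongDoubleInfo describeLongDouble()
{
    using LD = std::numeric_limits<long double>;
    using D  = std::numeric_limits<double>;
    return { sizeof(long double), LD::digits, LD::max_exponent, LD::digits > D::digits };
}
```

Note that sizeof alone can mislead, because an 80-bit format is often padded to 12 or 16 bytes; digits is the more reliable indicator.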

Oct 4, 2021, 5:59:32 AMOct 4

to

David Brown <david...@hesbynett.no> wrote:

> What you probably mean is that on some targets, "long double" is the

> same size as "double". That's true for most 32-bit (and smaller)

> targets. Most 64-bit targets support larger "long double".

That might perhaps be true for non-x86 32-bit targets. However, in x86

architectures long double is most typically 80-bit, and has been so

for pretty much as long as the x86 (and its x87 math coprocessor)

has existed, ie. all the way from the 8086 (which was a 16-bit

processor), especially if you had the 8087 FPU coprocessor.

The reason why long double is 80-bit in x86 architectures is that

that's the native floating point format used by the x87 FPU.

(While you can load 32-bit floats and 64-bit doubles, internally

the FPU uses 80-bit floating point. And you can load and store

full 80-bit floating point values, of course.)

However, what has happened since then is the introduction of SSE

(and later AVX), which is much simpler to use and more efficient

than the FPU, and is a direct replacement of the FPU (well, with

the exception of trigonometric functions, which SSE/AVX for some

reason do not support natively). However, SSE/AVX only supports

64-bit floating point, not 80-bit, which is why 80-bit floating

point has been "soft-deprecated" for like two decades now.

You can still use 80-bit long doubles with compilers like gcc

and clang, even when targeting the latest x86-64 CPUs. The compilers

will generate FPU opcodes to handle them (and, more often than not,

they will be less efficient. Also, if I understand correctly, using

the FPU and SSE at the same time is not very efficient because they

share logic circuitry and interfere with each other. In other words,

the SSE unit needs to wait for the FPU logic to do its thing, which

wastes clock cycles. I might be wrong in this, though.)

80-bit long doubles are also considered justifiably deprecated

because they don't really add all that much precision. Sure, they

have an 11-bit larger mantissa, and a slightly bigger range,

but all in all that's not a huge advantage in most calculations.

If your calculations are hitting the precision limits of double,

chances are that they are going to hit the precision limits of

long double as well. There's a relatively small range where

long double beats double in practical use.

Oct 4, 2021, 7:50:33 AMOct 4

to

Am 04.10.2021 um 09:52 schrieb David Brown:

> "long double" is supported by any C++ or C compiler, for any target,

> that tries to come close to any current standards compliance.

Most compilers map long double to IEEE-754 double precision and not

extended precision. Even with Intel C++ you must supply a compiler

switch to have long double as an extended precision type.

> What you probably mean is that on some targets, "long double" is the

> same size as "double". That's true for most 32-bit (and smaller)

> targets. Most 64-bit targets support larger "long double".

Oct 4, 2021, 9:07:02 AMOct 4

to

On 04/10/2021 08:52, David Brown wrote:

> On 03/10/2021 12:19, Bonita Montero wrote:

>> Am 03.10.2021 um 12:15 schrieb Bart:

>>

>>> Jesus. And I think this still doesn't do an actual long double!

>>

>> long double isn't supported by many compilers for x86-64.

>> long double should be avoided when possible because loads

>> and stores are slow with long double.

>>

>

> "long double" is supported by any C++ or C compiler, for any target,

> that tries to come close to any current standards compliance.

>

> What you probably mean is that on some targets, "long double" is the

> same size as "double". That's true for most 32-bit (and smaller)

> targets. Most 64-bit targets support larger "long double".

Any code running on x86 can choose to use x87 instructions for floating

point.

Then, even calculations involving 64-bit variables, will use 80-bit

intermediate results.

> As happens so often, there is /one/ major exception to the common

> practices used by (AFAICS) every other OS, every other processor, every

> other compiler manufacturer - on Windows, and with MSVC, "long double"

> is 64-bit.

Compiler   sizeof(long double)
MSVC       8 bytes
gcc        16
clang      8
DMC        10
lccwin     16
tcc        8
bcc        8

So, some compilers do manage to use a wider long double type, even on

Windows, showing that the choice is nothing to do with the OS.

And at least one 'big' compiler that is not MSVC also uses a 64-bit type

for long double.

You just like having a bash at Windows (it doesn't stop the use of

float80), and at MSVC (Clang doesn't support float80 either).

Oct 4, 2021, 10:13:10 AMOct 4

to

Am 04.10.2021 um 15:06 schrieb Bart:

> DMC 10

> lccwin 16

Irrelevant.

Oct 4, 2021, 10:38:03 AMOct 4

to

David Brown <david...@hesbynett.no> writes:

>On 03/10/2021 12:19, Bonita Montero wrote:

>

>"long double" should not be used where "double" will do, because it can

>be a great deal slower on many platforms (the load and saves are

>irrelevant). You also have to question whether "long double" does what

>you want, on any given target. On smaller targets (or more limited

>compilers, like MSVC), it gives you no benefits in accuracy or range

>compared to "double". On some, such as x86-64 with decent tools, it

>generally gives you 80-bit types. On others - almost any other 64-bit

>system - it gives you 128-bit quad double, but it is likely to be

>implemented in software rather than hardware.

ARMv8 has 128-bit floating point, in hardware.

FWIW, it also has 512 to 2048 bit floating point, in hardware, when

the SVE extension is implemented.

From the ARMv8 ARM (Architecture Reference Manual):

The architecture also supports the following floating-point data types:

· Half-precision, see Half-precision floating-point formats

on page A1-44 for details.

· Single-precision, see Single-precision floating-point format

on page A1-46 for details.

· Double-precision, see Double-precision floating-point format

on page A1-47 for details.

· BFloat16, see BFloat16 floating-point format on page A1-48

for details.

It also supports:

· Fixed-point interpretation of words and doublewords. See

Fixed-point format on page A1-50.

· Vectors, where a register holds multiple elements, each of

the same data type. See Vector formats on

page A1-41 for details.

The Armv8 architecture provides two register files:

· A general-purpose register file.

· A SIMD&FP register file.

In each of these, the possible register widths depend on the Execution state.

In AArch64 state:

· A general-purpose register file contains thirty-two 64-bit registers:

-- Many instructions can access these registers as 64-bit

registers or as 32-bit registers, using only the

bottom 32 bits.

· A SIMD&FP register file contains thirty-two 128-bit registers:

-- The quadword integer data types only apply to the SIMD&FP

register file.

-- The floating-point data types only apply to the SIMD&FP

register file.

-- While the AArch64 vector registers support 128-bit vectors,

the effective vector length can be 64-bits

or 128-bits depending on the A64 instruction encoding used,

see Instruction Mnemonics on page C1-195.

If the SVE extension is implemented, an additional register file of 32

512-bit to 2048-bit registers (implementation choice) is available.

Oct 4, 2021, 10:53:30 AMOct 4

to

Am 04.10.2021 um 16:37 schrieb Scott Lurndal:

> ARMv8 has 128-bit floating point, in hardware.

> FWIW, it also has 512 to 2048 bit floating point,

> in hardware, when the SVE extension is implemented.

I bet that there isn't even an implementation for 128 bit fp,

but just a specification.

Oct 4, 2021, 11:31:01 AMOct 4

to

On 10/4/21 12:34 AM, Branimir Maksimovic wrote:

> On 2021-10-04, James Kuyper <james...@alumni.caltech.edu> wrote:

...

>> In terms of that model, the value of the significand of a floating point

>> value x, interpreted as an integer, can be represented in the same

>> floating point type by a number with exactly the same significand, and e

>> = p.

>>

>> The key issue is whether such a representation is allowed, and it turns

>> out that there can be floating point representations which fit the C

>> standard's model, for which e_max < p, preventing some signficands from

>> being representable in such a type.

>>

>> The macro LDBL_MAX (corresponding to std::numeric_limits<long

>> double>::max()) is defined as expanding to the value of

>> (1 - b^(-p))*b^e_max, and is required to be at least 1e37. If, for

>> example, e_max == p-1 and b=2, then this means that for such a type, p

>> must be at least 124.

>>

>> Every floating point format I'm familiar with has an e_max value much

>> larger than p, so I think, as a practical matter, that it's safe to

>> assume that signficands can be represented in the same floating point

>> type, but strictly speaking, it's not guaranteed.

> Great!!!

> So intead of int we use float type and truncate?

You're half right. If you want the mantissa (== significand), you should

not truncate - that might lose you the parts of the significand that

represent the fractional part of the number. The OP had it right at the

second to last step in his original code:

x = std::ldexp(x, numeric_limits<FType>::digits);

At this point, x already contains the significand; there's no further

need to convert it to an integer type - in fact, in most contexts you'd

want it in floating point format for later steps in the processing.
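A minimal sketch of that step, assuming a binary floating point type with e_max >= p so that the scaled significand is representable (the static_assert checks this). The helper name integer_significand is made up for illustration, and infinities and NaNs are simply passed through:

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper: returns the significand of x as an integer-valued
// number of the same floating point type, and stores in *exp2 the power of
// two for which x == result * 2^(*exp2).
template <typename F>
F integer_significand(F x, int* exp2)
{
    using lim = std::numeric_limits<F>;
    static_assert(lim::radix == 2, "sketch assumes a binary format");
    // The scaled significand can be as large as 2^digits - 1, which needs
    // an exponent of `digits`; that is only available if e_max >= p.
    static_assert(lim::max_exponent >= lim::digits,
                  "significand not representable in this type");

    int e = 0;
    F frac = std::frexp(x, &e);            // |frac| in [0.5, 1) for finite nonzero x
    *exp2 = e - lim::digits;
    return std::ldexp(frac, lim::digits);  // only the exponent changes
}
```

Called with numeric_limits<long double>::max() on a platform where digits is 64, this should return 2^64 - 1. That value is outside the range of int64_t, so the final static_cast in the OP's code has undefined behaviour, which would explain the negative result.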

Oct 4, 2021, 12:15:29 PM

On 2021-10-04, Bart <b...@freeuk.com> wrote:

>

> Any code running on x86 can choose to use x87 instructions for floating

> point.

Only in assembler...

Oct 4, 2021, 12:15:54 PM

Oct 4, 2021, 12:19:40 PM

Greets, branimir


Oct 4, 2021, 1:08:26 PM

sc...@slp53.sl.home (Scott Lurndal) writes:

>David Brown <david...@hesbynett.no> writes:

>>On 03/10/2021 12:19, Bonita Montero wrote:

>

>>

>>"long double" should not be used where "double" will do, because it can

>>be a great deal slower on many platforms (the load and saves are

>>irrelevant). You also have to question whether "long double" does what

>>you want, on any given target. On smaller targets (or more limited

>>compilers, like MSVC), it gives you no benefits in accuracy or range

>>compared to "double". On some, such as x86-64 with decent tools, it

>>generally gives you 80-bit types. On others - almost any other 64-bit

>>system - it gives you 128-bit quad double, but it is likely to be

>>implemented in software rather than hardware.

>

>

>ARMv8 has 128-bit floating point, in hardware.

^ (that should have read "128-bit floating point registers")

It does not support 128-bit floating point types.

Oct 4, 2021, 2:04:37 PM

On 04/10/2021 13:50, Bonita Montero wrote:

> On 04.10.2021 at 09:52, David Brown wrote:

>

>> "long double" is supported by any C++ or C compiler, for any target,

>> that tries to come close to any current standards compliance.

>

> Most compilers map long double to IEEE-754 double precision and not

> extended precision. Even with Intel C++ you must supply a compiler

> switch to get long double as extended precision.

No, you don't.

>

>> What you probably mean is that on some targets, "long double" is the

>> same size as "double". That's true for most 32-bit (and smaller)

>> targets. Most 64-bit targets support larger "long double".

>

> No, most don't.

>

Again, you are confusing "Windows" with "everything".

MS, for reasons known only to them, seem to have decided that "long

double" should be 64-bit in the Windows ABI (noting that there never was

a real ABI for 32-bit Windows). So Intel's compiler /on Windows/ will

use 64-bit "long double" by default. It uses 80-bit "long double" on

other x86-64 targets (they are actually 16 bytes in size, for alignment

purposes, but 80 bits of data).

MSVC for ARM has 64-bit "long doubles" even on 64-bit ARM; other 64-bit

ARM targets have 128-bit (I don't know off-hand if they are IEEE quad

precision). For RISC-V, even 32-bit targets have 128-bit "long double".

Oct 4, 2021, 3:23:20 PM

sounded far-fetched.

But if it's just a matter of registers of 128+ bits (which can store a

vector of smaller float types), then that's old hat with x86/x64.

Oct 5, 2021, 1:13:11 AM

On 04.10.2021 at 20:04, David Brown wrote:

> MSVC for ARM has 64-bit "long doubles" even on 64-bit ARM, other 64-bit

> ARM targets have 128-bit (I don't know off-hand if they are IEEE quad

> precision). For RISC-V, even 32-bit targets have 128-bit "long double".

There isn't any ARM-implementation with 128 bit FP.

Oct 5, 2021, 1:19:37 AM

Bart <b...@freeuk.com> wrote:

> Here's a quick survey of Windows C compilers:

>

> Compiler   sizeof(long double)

>

> MSVC       8 bytes

> gcc        16

> clang      8

> DMC        10

> lccwin     16

> tcc        8

> bcc        8

Note that sizeof() doesn't tell how large the floating point number is,

only how much storage space the compiler is reserving for it. Some

compilers may well over-reserve space for 80-bit floating point, for

alignment reasons (the extra bytes will be unused).

sizeof() is telling only if it gives less than 10 for long double.
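To see the difference between storage size and actual precision, one can report sizeof next to numeric_limits<T>::digits. The following is only a sketch: the names FloatLayout, layout_of, Kind and classify are invented here, and the classification assumes a binary format and only covers the common significand widths.

```cpp
#include <climits>
#include <cstddef>
#include <limits>

// Hypothetical summary of how a floating point type is stored.
struct FloatLayout {
    std::size_t storage_bits;  // what sizeof reports, converted to bits
    int significand_bits;      // numeric_limits<T>::digits
};

template <typename T>
constexpr FloatLayout layout_of()
{
    return { sizeof(T) * CHAR_BIT, std::numeric_limits<T>::digits };
}

// Rough classification by significand width (binary formats only).
enum class Kind { ieee_single, ieee_double, x87_extended, ieee_quad, other };

constexpr Kind classify(int significand_bits)
{
    return significand_bits == 24  ? Kind::ieee_single
         : significand_bits == 53  ? Kind::ieee_double
         : significand_bits == 64  ? Kind::x87_extended
         : significand_bits == 113 ? Kind::ieee_quad
         : Kind::other;
}
```

With gcc on x86-64, layout_of<long double>() would be expected to report 128 storage bits but only 64 significand bits, i.e. the 80-bit x87 format padded for alignment.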

Oct 5, 2021, 3:48:16 AM

floating point with floating point in general?

There are very few /hardware/ implementations of quad precision floating

point - I think perhaps Power is the only architecture that actually has

it in practice. (Some architectures, such as SPARC and RISC-V, have

defined them in the architecture but have no physical devices supporting

them.)

But /software/ implementations of quad precision floating point are not

hard to find. And a lot of toolchains support them - just as lots of

toolchains have software support for other kinds of floating point or

integer arithmetic that are part of C or C++, but do not have hardware

implementations on a given target.

And yes, we all know that software floating point is usually a lot

slower than hardware. People don't use 128-bit floating point for speed

- they use it because they need the range or precision, and speed is a

secondary concern.

Oct 5, 2021, 3:48:53 AM

On 04/10/2021 18:15, Branimir Maksimovic wrote:

> On 2021-10-04, Bart <b...@freeuk.com> wrote:

>>

>> Any code running on x86 can choose to use x87 instructions for floating

>> point.

>

> Only in assembler...

>

Or with a good compiler.

Oct 5, 2021, 4:01:00 AM

where different sizes are common. "long double" is typically 80-bit

floating point on x86 (except MSVC and weaker tools), but for alignment

purposes it is often stored in 16-byte blocks (on 64-bit) or 12-byte

blocks (on 32-bit). On other targets (or gcc for x86 with particular

flags), "long double" is often quad precision, almost always implemented

in software.

A useful value to look at is "LDBL_DIG" in <float.h>. For 32-bit IEEE

floats, the number of digits is 6. For 64-bit, it is 15. For 80-bit,

it is 18. For 128-bit, it is 33.
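Those figures follow from the significand width: for a binary type with p significand bits, the guaranteed number of decimal digits is floor((p - 1) * log10(2)). A small sketch of the arithmetic (the function name is made up, and 30103/100000 is an integer approximation of log10(2) that is adequate for the widths listed here):

```cpp
#include <cfloat>
#include <limits>

// Hypothetical helper: decimal digits guaranteed by p binary significand bits.
constexpr int decimal_digits(int p)
{
    return (p - 1) * 30103 / 100000;  // floor((p - 1) * log10(2)), approximately
}

static_assert(decimal_digits(24)  == 6,  "32-bit IEEE float");
static_assert(decimal_digits(53)  == 15, "64-bit IEEE double");
static_assert(decimal_digits(64)  == 18, "x87 80-bit extended");
static_assert(decimal_digits(113) == 33, "IEEE quad");

// On a binary implementation the formula should agree with LDBL_DIG.
static_assert(std::numeric_limits<long double>::radix != 2 ||
              decimal_digits(LDBL_MANT_DIG) == LDBL_DIG,
              "formula does not match LDBL_DIG");
```

If this compiles, the formula matches LDBL_DIG on the target, and comparing LDBL_DIG with DBL_DIG shows whether "long double" is really wider than "double" there.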

Oct 5, 2021, 6:19:22 AM