How to get mantissa of long double?

wij

unread,

Oct 1, 2021, 11:37:33 AM10/1/21

to

// numeric_limits<long double>::digits=64.
//
typedef long double FType;
FType x=numeric_limits<FType>::max();
int iexp;
int64_t mint;
x=::frexpl(x,&iexp);
x=::ldexpl(x,numeric_limits<FType>::digits);
mint= static_cast<int64_t>(x);

Result (mint) is a negative number, something not right!!!

Radica...@theburrow.co.uk

unread,

Oct 1, 2021, 12:09:29 PM10/1/21

to

It might help if you set iexp to some value.

Bonita Montero

unread,

Oct 1, 2021, 1:19:03 PM10/1/21

to

Take this:

#include <iostream>
#include <cstdint>
#include <limits>
#include <utility>
#include <iomanip>

using namespace std;

using mantissa_pair = pair<bool, uint64_t>;

mantissa_pair getMantissa( double value );

int main()
{
double v = numeric_limits<double>::min();
do
{
mantissa_pair mp = getMantissa( v );
cout << "value: " << v;
if( mp.first )
cout << " mantissa: " << hex << mp.second << endl;
else
cout << " invalid mantissa (Inf, S(NaN))" << endl;
} while( (v *= 2.0) != numeric_limits<double>::infinity() );
}

static_assert(numeric_limits<double>::is_iec559, "must be standard fp");

mantissa_pair getMantissa( double value )
{
union
{
uint64_t binary;
double value;
} u;
u.value = value;
unsigned exponent = (unsigned)(u.binary >> 52) & 0x7FF;
if( exponent == 0 )
return pair<bool, uint64_t>( true, u.binary & 0xFFFFFFFFFFFFFu |
0x10000000000000u );
if( exponent == 0x7FF )
return pair<bool, uint64_t>( false, 0 );
return pair<bool, uint64_t>( true, u.binary & 0xFFFFFFFFFFFFFu |
0x10000000000000u );
}

Bonita Montero

unread,

Oct 1, 2021, 1:21:12 PM10/1/21

to

Should be:
return pair<bool, uint64_t>( true, u.binary & 0xFFFFFFFFFFFFFu );

Bonita Montero

unread,

Oct 1, 2021, 1:25:19 PM10/1/21

to

>> mantissa_pair getMantissa( double value )
>> {
>>      union
>>      {
>>          uint64_t binary;
>>          double value;
>>      } u;
>>      u.value = value;
>>      unsigned exponent = (unsigned)(u.binary >> 52) & 0x7FF;
>>      if( exponent == 0 )
>>          return pair<bool, uint64_t>( true, u.binary &
>> 0xFFFFFFFFFFFFFu | 0x10000000000000u );
>
> Should be:
>           return pair<bool, uint64_t>( true, u.binary &
> 0xFFFFFFFFFFFFFu );
>
>
>>      if( exponent == 0x7FF )
>>          return pair<bool, uint64_t>( false, 0 );
>>      return pair<bool, uint64_t>( true, u.binary & 0xFFFFFFFFFFFFFu |
>> 0x10000000000000u );
>> }

Now with maximum efficiency:

mantissa_pair getMantissa( double value )
{
union
{
uint64_t binary;
double value;
} u;
u.value = value;
unsigned exponent = (unsigned)(u.binary >> 52) & 0x7FF;

if( exponent == 0x7FF )
return pair<bool, uint64_t>( false, 0 );

uint64_t hiBit = (uint64_t)(exponent != 0) << 52;
return pair<bool, uint64_t>( true, u.binary & 0xFFFFFFFFFFFFFu | hiBit );
}

Bonita Montero

unread,

Oct 1, 2021, 1:53:58 PM10/1/21

to

Am 01.10.2021 um 19:25 schrieb Bonita Montero:

> mantissa_pair getMantissa( double value )
> {
>     union
>     {
>         uint64_t binary;
>         double value;
>     } u;
>     u.value = value;
>     unsigned exponent = (unsigned)(u.binary >> 52) & 0x7FF;
>     if( exponent == 0x7FF )
>         return pair<bool, uint64_t>( false, 0 );
>     uint64_t hiBit = (uint64_t)(exponent != 0) << 52;
>     return pair<bool, uint64_t>( true, u.binary & 0xFFFFFFFFFFFFFu |
> hiBit );
> }

Oh, I think this would be the best solution for 0x7ff-exponents:

mantissa_pair getMantissa( double value )
{
union
{
uint64_t binary;
double value;
} u;
u.value = value;
unsigned exponent = (unsigned)(u.binary >> 52) & 0x7FF;

uint64_t mantissa = u.binary & 0xFFFFFFFFFFFFFu;

if( exponent == 0x7FF )

return pair<bool, uint64_t>( false, mantissa );

uint64_t hiBit = (uint64_t)(exponent != 0) << 52;

return pair<bool, uint64_t>( true, mantissa | hiBit );
}

Return the mantisssa also with false in .first for Inf and (S)NaN.

James Kuyper

unread,

Oct 1, 2021, 2:05:56 PM10/1/21

to

On 10/1/21 12:09 PM, Radica...@theburrow.co.uk wrote:
> On Fri, 1 Oct 2021 08:37:21 -0700 (PDT)
> wij <wyn...@gmail.com> wrote:
>> // numeric_limits<long double>::digits=64.
>> //
>> typedef long double FType;
>> FType x=numeric_limits<FType>::max();
>> int iexp;
>> int64_t mint;
>> x=::frexpl(x,&iexp);
>> x=::ldexpl(x,numeric_limits<FType>::digits);
>> mint= static_cast<int64_t>(x);
>>
>> Result (mint) is a negative number, something not right!!!

I can't duplicate this problem: I get mint:9223372036854775807.

> It might help if you set iexp to some value.

frexpl() is from the part of the C++ standard library that is copied
from the C standard library.

Section 7.12.6.7p1 of the C standard says:
> long double frexpl(long double value, int *p);

The following paragraph says:
> The frexp functions break a floating-point number into a normalized fraction and an integer exponent. They store the integer in the int object pointed to by p .

So iexp should be set. When I ran the code, it got set to a value of
16384. That's not the problem.

Bart

unread,

Oct 1, 2021, 2:09:54 PM10/1/21

to

On 01/10/2021 18:18, Bonita Montero wrote:
> Am 01.10.2021 um 17:37 schrieb wij:
>> // numeric_limits<long double>::digits=64.
>> //
>> typedef long double FType;
>> FType x=numeric_limits<FType>::max();
>> int iexp;
>> int64_t mint;
>> x=::frexpl(x,&iexp);
>> x=::ldexpl(x,numeric_limits<FType>::digits);
>> mint= static_cast<int64_t>(x);
>>
>> Result (mint) is a negative number, something not right!!!
>
> Take this:
>
> #include <iostream>
> #include <cstdint>
> #include <limits>
> #include <utility>
> #include <iomanip>
>
> using namespace std;
>
> using mantissa_pair = pair<bool, uint64_t>;
>
> mantissa_pair getMantissa( double value );

What happened to *long* double?

Bonita Montero

unread,

Oct 1, 2021, 2:17:04 PM10/1/21

to

Is idiocracy.

Branimir Maksimovic

unread,

Oct 1, 2021, 2:53:54 PM10/1/21

to

long double max = cumeric_limits<long double>::max();
long mantissa = max; // impicit conversion

--

7-77-777
Evil Sinner!

James Kuyper

unread,

Oct 1, 2021, 8:19:54 PM10/1/21

to

On 10/1/21 2:53 PM, Branimir Maksimovic wrote:
...

> long double max = cumeric_limits<long double>::max();
> long mantissa = max; // impicit conversion

"A prvalue of a floating-point type can be converted to a prvalue of an
integer type. The conversion truncates; that is, the fractional part is
discarded. The behavior is undefined if the truncated value cannot be
represented in the destination type." (7.3.10p1).

While it's not required to be the case, on most implementations
std::numeric_limits<long double>::max() is WAY too large to be
represented by a long, so the behavior of such code is undefined. The
minimum value of LDBL_MAX (set by the C standard, inherited by the C++
standard) is 1e37, which would require long to have at least 123 bits in
order for that conversion to have defined behavior.
And even when it has defined behavior, I can't imagine how you would
reach the conclusion that this conversion should be the value of the
mantissa.

wij

unread,

Oct 1, 2021, 8:23:19 PM10/1/21

to

// ----- file t.cpp -----
#include <math.h>
#include <limits>
#include <iostream>

using namespace std;
#define ENDL endl

template<typename T>
int64_t get_mant(T x) {
int iexp;
x=frexp(x,&iexp);
x=ldexp(x,numeric_limits<T>::digits);
return static_cast<int64_t>(x);
};

int main()
{
cout << dec << get_mant(numeric_limits<float>::max()) << ", "
<< hex << get_mant(numeric_limits<float>::max()) << ENDL;
cout << dec << get_mant(numeric_limits<double>::max()) << ", "
<< hex << get_mant(numeric_limits<double>::max()) << ENDL;
cout << dec << get_mant(numeric_limits<long double>::max()) << ", "
<< hex << get_mant(numeric_limits<long double>::max()) << ENDL;
return 0;
};
// end file t.cpp -----

$ g++ t.cpp
]$ ./a.out
16777215, ffffff
9007199254740991, 1fffffffffffff
-9223372036854775808, 8000000000000000

Branimir Maksimovic

unread,

Oct 1, 2021, 10:16:30 PM10/1/21

to

It is not undefined as long is larger then mantissa part.
ok correct is long long :P

--

7-77-777
Evil Sinner!

James Kuyper

unread,

Oct 2, 2021, 1:29:01 AM10/2/21

to

No, the relevant issue isn't the size of the mantissa. As stated above
in my quote from the standard, it's whether "the truncated value cannot
be represented in the destination type". std::numeric_limits<long
double>::max() doesn't have a fractional part, so the truncated value is
the same as the actual value. On my system, for instance, that value is
1.18973e+4932.

> ok correct is long long :P

On my system, changing it to long long doesn't make any different - the
maximum value representable by long long is still 9223372036854775807,
the same as the maximum value for long; it's still far too small to
represent 1.18973e+4932, so the behavior of the conversion is undefined.
The actual behavior on my system appears to be saturating at LLONG_MAX
== 9223372036854775807. If I change the second line to

long long mantissa = 0.75*max;

0.75*max is 8.92299e+4931, which should certainly not have the same
mantissa as max itself, but the value loaded into "mantissa" is still
9223372036854775807.

James Kuyper

unread,

Oct 2, 2021, 1:56:20 AM10/2/21

to

I accidentally sent this message first to Branimir by e-mail, and he
responded in kind.

On 10/2/21 1:31 AM, Branimir Maksimovic wrote:
>
>> On 02.10.2021., at 07:27, James Kuyper <james...@alumni.caltech.edu> wrote:

>>
>> On 10/1/21 10:16 PM, Branimir Maksimovic wrote:
...

>> ok correct is long long :P

>> On my system, changing it to long long doesn't make any different - the
>> maximum value representable by long long is still 9223372036854775807,
>> the same as the maximum value for long; it's still far too small to
>> represent 1.18973e+4932, so the behavior of the conversion is undefined.
>> The actual behavior on my system appears to be saturating at LLONG_MAX
>> == 9223372036854775807. If I change the second line to
>>
>> long long mantissa = 0.75*max;
>>
>> 0.75*max is 8.92299e+4931, which should certainly not have the same
>> mantissa as max itself, but the value loaded into "mantissa" is still
>> 9223372036854775807.

> Problem is that neither long double nor long is defined how small
> or large can be…
> so it can fit or not…
The C++ standard cross-references the C standard for such purposes, and
the C standard imposes strict limits on how small those things can be:
LLONG_MAX is required to be at least 9223372036854775807, and LDBL_MAX
is supposed to be at least 1e37.
You are right, however, about there being no limits on how large they
can be. It is therefore permissible for an implementation to have
LLONG_MAX >= LDBL_MAX, but do you know of any such implementation?
In any event, the relevant issue is not the limits imposed on those
values by the standard, but the actual values of LLONG_MAX and LDBL_MAX
for the particular implementation you're using, and it's perfectly
feasible to determine those values from <climits>, <cfloat>, or
std::numeric_limits<>::max. What are those values on the implementation
you're using?
> but question is how to extract mantissa which was answer :P

No, that is not the answer. If max did have a value small enough to make
the conversion to long long have defined behavior, the result of that
conversion would be the truncated value itself (7.3.10p1), NOT the
mantissa of the truncated value. What makes you think otherwise?

Bonita Montero

unread,

Oct 2, 2021, 6:09:23 AM10/2/21

to

Am 01.10.2021 um 17:37 schrieb wij:

Here, exactly 100 lines with a very convenient double to double
parts and double parts to double conversion:

#pragma once
#include <limits>
#include <cstdint>
#include <cassert>

struct dbl_parts
{
static_assert(std::numeric_limits<double>::is_iec559, "must be standard
fp");
dbl_parts( double d );
dbl_parts &operator =( double d );
dbl_parts() = default;
operator double();
bool getSign();
std::uint16_t getBiasedExponent();
std::int16_t getExponent();
std::uint64_t getMantissa();
void setSign( bool sign );
void setBiasedExponent( uint16_t exp );
void setExponent( int16_t exp );
void setMantissa( uint64_t mantissa );
private:
union
{
double value;
std::uint64_t binary;
};
};

inline
dbl_parts::dbl_parts( double d ) :
value( d )
{
}

inline
dbl_parts &dbl_parts::operator =( double d )
{
value = d;
return *this;
}

inline
dbl_parts::operator double()
{
return value;
}

inline
bool dbl_parts::getSign()
{
return (int64_t)binary < 0;
}

inline
std::uint16_t dbl_parts::getBiasedExponent()
{
return (std::uint16_t)(binary >> 52) & 0x7FF;
}

inline
int16_t dbl_parts::getExponent()
{
return (int16_t)getBiasedExponent() - 0x3FF;
}

inline
std::uint64_t dbl_parts::getMantissa()
{
std::uint16_t bExp = getBiasedExponent();
std::uint64_t hiBit = (uint64_t)(bExp && bExp != 0x7FF) << 52;
return binary & 0xFFFFFFFFFFFFFu | hiBit;
}

inline
void dbl_parts::setSign( bool sign )
{
binary = binary & 0x7FFFFFFFFFFFFFFFu | (std::uint64_t)sign << 63;
}

inline
void dbl_parts::setBiasedExponent( std::uint16_t exp )
{
assert(exp <= 0x7FF);
binary = binary & 0x800FFFFFFFFFFFFFu | (std::uint64_t)exp << 52;
}

inline
void dbl_parts::setExponent( std::int16_t exp )
{
assert(exp >= -0x3FF && exp <= 400);
setBiasedExponent( (uint16_t)(exp - 0x3FF) );
}

inline
void dbl_parts::setMantissa( std::uint64_t mantissa )
{
assert((getBiasedExponent() == 0 || getBiasedExponent() == 0x7FF) &&
!(mantissa & -0x10000000000000));
assert(getBiasedExponent() != 0 && getBiasedExponent() != 0x7FF ||
mantissa <= 0x1FFFFFFFFFFFFFu);
binary = binary & 0xFFF0000000000000u | mantissa & 0xFFFFFFFFFFFFFu;
}

Branimir Maksimovic

unread,

Oct 2, 2021, 6:59:22 AM10/2/21

to

On 2021-10-02, James Kuyper <james...@alumni.caltech.edu> wrote:

> I accidentally sent this message first to Branimir by e-mail, and he
>

> No, that is not the answer. If max did have a value small enough to make
> the conversion to long long have defined behavior, the result of that
> conversion would be the truncated value itself (7.3.10p1), NOT the
> mantissa of the truncated value. What makes you think otherwise?

Because mantissa is whole part of floating point number?

--

7-77-777
Evil Sinner!

Manfred

unread,

Oct 2, 2021, 12:59:19 PM10/2/21

to

On 10/2/2021 2:23 AM, wij wrote:
> On Saturday, 2 October 2021 at 02:05:56 UTC+8, james...@alumni.caltech.edu wrote:
>> On 10/1/21 12:09 PM, Radica...@theburrow.co.uk wrote:
>>> On Fri, 1 Oct 2021 08:37:21 -0700 (PDT)
>>> wij <wyn...@gmail.com> wrote:
>>>> // numeric_limits<long double>::digits=64.
>>>> //
>>>> typedef long double FType;
>>>> FType x=numeric_limits<FType>::max();
>>>> int iexp;
>>>> int64_t mint;
>>>> x=::frexpl(x,&iexp);
>>>> x=::ldexpl(x,numeric_limits<FType>::digits);
>>>> mint= static_cast<int64_t>(x);
>>>>
>>>> Result (mint) is a negative number, something not right!!!
>> I can't duplicate this problem: I get mint:9223372036854775807.

<snip>

There are two problems:

1)
The templated function uses 'frexp' and 'ldexp', which take both double
arguments (not *long* double), hence UB occurs at those calls for the
'long double' type whenever this type is actually larger than 'double'.

2)
On my Linux box numeric_limits<*long* double>::digits is 64
(numeric_limits<double>::digits is 53), so the static_cast<int64_t>(x)
yields UB again.

=========
#include <cmath>

#include <limits>
#include <iostream>

using namespace std;
#define ENDL endl

uint64_t get_mantf(float x)
{
int iexp;
x=frexp(x,&iexp);
x=ldexp(x,numeric_limits<float>::digits);
return static_cast<uint64_t>(x);
};

uint64_t get_mant(double x)
{
int iexp;
x=frexp(x,&iexp);
x=ldexp(x,numeric_limits<double>::digits);
return static_cast<uint64_t>(x);
};

uint64_t get_mantl(long double x)
{
int iexp;
x=frexp(x,&iexp);
x=ldexp(x,numeric_limits<long double>::digits);
return static_cast<uint64_t>(x);
};

int main()
{
cout << dec << numeric_limits<float>::digits << ", " <<
get_mantf(numeric_limits<float>::max()) << ", "
<< hex << get_mantf(numeric_limits<float>::max()) << ENDL;
cout << dec << numeric_limits<double>::digits << ", " <<

get_mant(numeric_limits<double>::max()) << ", "
<< hex << get_mant(numeric_limits<double>::max()) << ENDL;

cout << dec << numeric_limits<long double>::digits << ", " <<
get_mantl(numeric_limits<long double>::max()) << ", "
<< hex << get_mantl(numeric_limits<long double>::max()) << ENDL;
return 0;
}
===================
$ c++ -std=c++11 -O2 -Wall mant.cc && ./a.out
24, 16777215, ffffff
53, 9007199254740991, 1fffffffffffff
64, 18446744073709551615, ffffffffffffffff

Manfred

unread,

Oct 2, 2021, 1:21:47 PM10/2/21

to

Sorry, that should have been:

=====

uint64_t get_mantf(float x)
{
int iexp;

x=frexpf(x,&iexp);
x=ldexpf(x,numeric_limits<float>::digits);

return static_cast<uint64_t>(x);
};

uint64_t get_mant(double x)
{
int iexp;
x=frexp(x,&iexp);
x=ldexp(x,numeric_limits<double>::digits);
return static_cast<uint64_t>(x);
};

uint64_t get_mantl(long double x)
{
int iexp;

x=frexpl(x,&iexp);
x=ldexpl(x,numeric_limits<long double>::digits);
return static_cast<uint64_t>(x);
};
=====

However, the three distinct functions with frexp and ldexp in all three
still work on my box (I'm guessing gcc is still able to compile the
correct implementation in in this case), but the template doesn't.

James Kuyper

unread,

Oct 2, 2021, 2:57:53 PM10/2/21

to

Well, at least I'm clear now about what your confusion is. I'll give you
an answer in two parts:
1. The mantissa of a floating point number (also called the significand,
which is the term used in the C standard in text that is incorporated by
reference into the C++ standard) is something quite different from the
whole part of that number. See
<https://en.m.wikipedia.org/wiki/Significand> for details.

2. When a floating point number is greater than LLONG_MAX+1.0, the code
you provided has undefined behavior. That makes sense, since a long long
object cannot represent the whole part of such a number. The actual
behavior can vary from one implementation to another, but on my system,
it does NOT contain the mantissa, either.

Consider the following program:

#include <iostream>

int main(void)
{
typedef long double FType;
FType max= std::numeric_limits<FType>::max();
FType large = 0xA.BCDEF0123456789p+16380L;
FType middling = 0xABCDEF01.23456789p0L;
long long maxll = max;
long long largell = large;
long long middlingll = middling;

std::cout << std::fixed << "max: " << max << std::endl;
std::cout << "large: " << large << std::endl;
std::cout << "middling: " << middling << std::endl;

std::cout << "maxll: " << maxll << std::endl;
std::cout << "largell: " << largell << std::endl;
std::cout << "middlingll: " << middlingll << std::endl;

std::cout << std::hexfloat << std::showbase << std::showpoint <<
std::hex << "max: " << max << std::endl;
std::cout << "large: " << large << std::endl;
std::cout << "middling: " << middling << std::endl;

std::cout << "maxll: " << maxll << std::endl;
std::cout << "largell: " << largell << std::endl;
std::cout << "middlingll: " << middlingll << std::endl;
}

The output from that program on my system is:

max:
1189731495357231765021263853030970205169063322294624200440323733891737005522970722616410290336528882853545697807495577314427443153670288434198125573853743678673593200706973263201915918282961524365529510646791086614311790632169778838896134786560600399148753433211454911160088679845154866512852340149773037600009125479393966223151383622417838542743917838138717805889487540575168226347659235576974805113725649020884855222494791399377585026011773549180099796226026859508558883608159846900235645132346594476384939859276456284579661772930407806609229102715046085388087959327781622986827547830768080040150694942303411728957777100335714010559775242124057347007386251660110828379119623008469277200965153500208474470792443848545912886723000619085126472111951361467527633519562927597957250278002980795904193139603021470997035276467445530922022679656280991498232083329641241038509239184734786121921697210543484287048353408113042573002216421348917347174234800714880751002064390517234247656004721768096486107994943415703476320643558624207443504424380566136017608837478165389027809576975977286860071487028287955567141404632615832623602762896316173978484254486860609948270867968048078702511858930838546584223040908805996294594586201903766048446790926002225410530775901065760671347200125846406957030257138960983757998926954553052368560758683179223113639519468850880771872104705203957587480013143131444254943919940175753169339392366881856189129931729104252921236835159922322050998001677102784035360140829296398115122877768135706045789343535451696539561254048846447169786893211671087229088082778350518228857646062218739702851655083720992349483334435228984751232753726636066213902281264706234075352071724058665079518217303463782631353393706774901950197841690441824738063162828586857741432581165364040218402724913393320949219498422442730427019873044536620350262386957804682003601447291997123095530057206141866974852846856186514832715974481203121946751686379343096189615107330065552421485195201762858595091051839472502863871632494167613804996319791441870254302706758495192008837915169401581740046711477877201459644461175204059453504764721807975761111720846273639279600339670470037613374509553184150073796412605047923251661354841291884211340823015473304754067072818763503617332908005951896325207071673904547777129682265206225651439919376804400292380903112437912614776255964694221981375146967079446870358004392507659451618379811859392049544036114915310782251072691486979809240946772142727012404377187409216756613634938900451232351668146089322400697993176017805338191849981933008410985993938760292601390911414526003720284872132411955424282101831204216104467404621635336900583664606591156298764745525068145003932941404131495400677602951005962253022823003631473824681059648442441324864573137437595096416168048024129351876204668135636877532814675538798871771836512893947195335061885003267607354388673368002074387849657014576090349857571243045102038730494854256702479339322809110526041538528994849203991091946129912491633289917998094380337879522093131466946149705939664152375949285890960489916121944989986384837022486672249148924678410206183364627416969576307632480235587975245253737035433882960862753427740016333434055083537048507374544819754722228975281083020898682633020285259923084168054539687911418297629988964576482765287504562854924265165217750799516259669229114977788962356670956627138482018191348321687995863652637620978285070099337294396784639879024914514222742527006363942327998483976739987154418554201562244154926653014515504685489258620276085761837129763358761215382565129633538141663949516556000264159186554850057052611431952919918807954522394649627635630178580896692226406235382898535867595990647008385687123810329591926494846250768992258419305480763620215089022149220528069842018350840586938493815498909445461977893029113576516775406232278298314033473276603952231603422824717528181818844304880921321933550869873395861276073670866652375555675803171490108477320096424318780070008797346032906278943553743564448851907191616455141155761939399690767415156402826543664026760095087523945507341556135867933066031744720924446513532366647649735400851967040771103640538150073486891798364049570606189535005089840913826869535090066783324472578712196604415284924840041850932811908963634175739897166596000759487800619164094854338758520657116541072260996288150123144377944008749301944744330784388995701842710004808305012177123560622895076269042856800047718893158089358515593863176652948089031267747029662545110861548958395087796755464137944895960527975209874813839762578592105756284401759349324162148339565350189196811389091843795734703269406342890087805846940352453479398080674273236297887100867175802531561302356064878709259865288416350972529537091114317204887747405539054009425375424119317944175137064689643861517718849867010341532542385911089624710885385808688837777258648564145934262121086647588489260031762345960769508849149662444156604419552086811989770240.000000
large:
798441950131984170628030460666810333557348275190265576078984094186809314595841573532599033606882733242214503184515646114757854809285221996356228289554376635019344753920776702760669187940817514531317086140383510733112508649664623633139528050777110074618005617509580146833305060924324246754342390819803294975489244294690457181615916380846787220310100995994303945777913450178570581981686198498193907782547173273698279385392870067122516401380661309429194377571288675510592121362933128690216937551164723903033967732288194633304806974443077506079764506140348104093213176088240569763532433031309628120758051299847190906531282735560740981504351645262405988024022915769015155291537373039452757287714224413271925722965109196412960233450108997556205354431107378518341642987291604090752426939027264213019440145124490661477726747318914574881469961518134515108880857245240185572718139826633083153405462806445652888413410638188733734515336089461690188769941296000310237908472871534924022605781553137758150233789260358532907145882036161660126945661711839735421711825178177760957818009605928677177600206353608487447867000007661334384127783042012294562593528684725810884617737609254169149626757295679706675640253674305137724007576430362146351195481618750152400954313053997866464669799426096587456831642162987414724896100262164539763660362382092573961783134501772282908098290593289630152876770300417744973458556470142688766464671575130409003788303712422629946652772747989412570452719563049061781235816001067752251851519827096684941366242307880857459992859988995242358876699073496204946189012116094636726007751024562501070868616158003530658455857341162526439165492140517609950096619412732907944896057390196270057177242728156339638540867610924105857412556444579107776816889674631366413677833047161379884532164198993276772896425495083000505555449504361780772635099950677280195245017300352256591786535445649376601829488196382898175912034787015428923592784580276113369666903167358253633530441611097653596975461248061662339365488473119600829523500459203619456964603129018988931194328288813647101701632839512858929869478394804296698394260954809692417234406205371106893496236967789991565505735586236384677950809842953217931584749260769959701438782478571501334539385640439880396793372493213016627745357811930479029023541095769455142488296966879072936425173785823423491784831896194556031386766634408387982622337724480135007299914446877152530780342928097086987514633283924856437263144768383859171111988014305350935374816502702333251093923757732462792355307789905174719099488759733393438698689634230361403084258533397265120127568961174111327297320782260488730893216700453978837717137622769084406167566152759623181706028976375164646094475146647369281575972522853549552541720923007665948643646011282736873309526428963407898352464481165847664515121092198236329555997024089181776782273133008347683849048553390800664712980340475996260480072720565715887831207483955807832767939883392425546385536716077432630898214899484281107511073792175413829068138175373955787942264464985983774969783604562637885180871016551868231242254625289486930794202391413454134460390851822972288028367994924347840148793791361918282891665866879658491796934476927641493412928958649240130825652229115711491817767450106062789615541320400721098889852085898259254162735307793046490526388026316295280894798927076625172952033631816193510840287691268927294098058192308372212486275935018916579164964951401297805003291270040660302706760922674620757844937737189874735833048611876587062084398065184497992254250923170482005379707778025178444614040206924597320627466243597003790395509022310055565959173599974814957486832121747812898110468472757711344935141581619768179027277300745026178548690842027389022573413486719086424376549995531862953495262810743844967518912745788410885417320073667346399372144729172804316265474393461925206370016332837493346298714231599348699028709728862988664452830290361646545475793248617512310081372368799772054350231317371686452672996009981551057400080425522396446487229356262303917938229664789869208363238136397068893440807223464451417926302339445880535875195702287684615206433710292963594791392187167500246785808792158380039590563187800668321919054700778993223664127005039562879931963775644174414234242371897053417500956186032208506564827824530909432144378661326313617577661894233125412875713957392594237250272900479523399925200381753270457862573831453382693219682510773999798436059405397109340096099851440470543315842651194313580726690246948309173037070102422697928865543593593980166075781220064584748135949528803322043763798528128793152883647487108192854012453988821565817611465474933891481140768062268834838833326673555808237357184051468989117126978502141404852191804301542607646946130977410809967278476698460181222465259056386246982267558528987894457868410385101006422099287064483189598631389250604017773829174392421160930626885766469862665970968575978122938895191153280286720.000000
middling: 2882400001.137778
maxll: 9223372036854775807
largell: 9223372036854775807
middlingll: 2882400001
max: 0xf.fffffffffffffffp+16380
large: 0xa.bcdef0123456789p+16380
middling: 0xa.bcdef0123456789p+28
maxll: 0x7fffffffffffffff
largell: 0x7fffffffffffffff
middlingll: 0xabcdef01

I'll use hexadecimal notation in the following comments:

The whole part of "max" has the same value as "max" itself. The same is
true of "large". The whole part of middling is 0xABCDEF01. The
fractional part of middling is 0x0.23456789.

The mantissas are as follows:
max: 0xffffffffffffffff
large: 0xabcdef0123456789
middling: 0xabcdef0123456789

The values stored in "maxll" and "largell" do not match either the whole
part or the mantissa of the corresponding floating point number. Because
"middling" is the only one of the three long double values that is
smaller than LLONG_MAX, the value stored in middlingll is the whole part
of "middling", as it should be, but is quite different from the mantissa
of "middling".

Alf P. Steinbach

unread,

Oct 2, 2021, 3:00:58 PM10/2/21

to

Apparently you're trying to obtain the bits of the mantissa of a `long
double` number, represented as an `int64_t` value.

A `long double` is in practice either IEEE 754 64-bit, or IEEE 754
80-bit. In Windows that choice depends on the compiler. With Visual C++
(and hence probably also Intel) it's 64-bit, same as type `double`,
while with MinGW g++ (and hence probably also clang) it 80-bit,
originally the x86-family's math coprocessor's extended format. For
80-bit IEEE 754 the mantissa part is 64 bits.

With 64-bits mantissa there is a high chance of setting the sign bit of
an `int64_t` to 1, resulting in a negative value. I believe that that
will only /not/ happen for a denormal value, but, I'm tired and might be
wrong about that. Anyway, instead use unsigned types for bit handling.
For example, in this case, use `uint64_t`.

However, instead of the shenanigans with `frexpl` and `ldexpl` I'd just
use `memcpy`.

Due to the silliness in gcc regarding the standard's "strict aliasing"
rule, I'd not use a reinterpretation pointer cast.

- Alf

James Kuyper

unread,

Oct 2, 2021, 3:12:21 PM10/2/21

to

It would be safer and more portable to use FType; there's no portable
guarantee that any integer type is large enough to hold the mantissa,
but FType is.

> However, instead of the shenanigans with `frexpl` and `ldexpl` I'd just
> use `memcpy`.

The advantage of the code as written is that (if you change mint to have
FType) it will give the correct result even if your assumption about
IEEE 754 is false; that's not the case with memcpy().

Alf P. Steinbach

unread,

Oct 2, 2021, 3:23:41 PM10/2/21

to

I believe you intended to write `uintptr_t`, not `FType`.

If so, agreed.

It's late in the day for me, sorry.

>> However, instead of the shenanigans with `frexpl` and `ldexpl` I'd just
>> use `memcpy`.
>
> The advantage of the code as written is that (if you change mint to have
> FType) it will give the correct result even if your assumption about
> IEEE 754 is false; that's not the case with memcpy().

Uhm, I'd rather assert IEEE 754 representation
(numeric_limits::is_iec559). Dealing with the bits of just about any
representation seems to me a hopelessly daunting task. :-o

- Alf

Alf P. Steinbach

unread,

Oct 2, 2021, 3:27:34 PM10/2/21

to

It's /very/ late.

There AFAIK is no suitable type name for the integer type with
sufficient bits to represent the mantissa, or generally >N bits.

Unfortunately the standard library doesn't provide a mapping from number
of bits as a value, to integer type with that many bits. It can be done,
on the assumption that all types in `<stdint.h>` are present. And
perhaps one can then define a name like `FType` in terms of that mapping.

James Kuyper

unread,

Oct 2, 2021, 3:37:56 PM10/2/21

to

On 10/2/21 2:59 PM, Branimir Maksimovic wrote:
>
>
>> On 02.10.2021., at 20:51, James Kuyper
>> <james...@alumni.caltech.edu
>> <mailto:james...@alumni.caltech.edu>> wrote:
>>
>> int main(void)
>> {
>> typedef long double FType;

>>    FType max= std::numeric_limits<FType>::max();
>>    FType large = 0xA.BCDEF0123456789p+16380L;
>>    FType middling = 0xABCDEF01.23456789p0L;
>>    long long maxll = max;
>>    long long largell = large;
>>    long long middlingll = middling;
>>
>>    std::cout << std::fixed << "max: " << max << std::endl;
>>    std::cout << "large: " << large << std::endl;
>>    std::cout << "middling: " << middling << std::endl;
>>
>>    std::cout << "maxll: " << maxll << std::endl;
>>    std::cout << "largell: " << largell << std::endl;
>>    std::cout << "middlingll: " << middlingll << std::endl;
>>
>>    std::cout << std::hexfloat << std::showbase << std::showpoint <<
>>        std::hex << "max: " << max << std::endl;
>>    std::cout << "large: " << large << std::endl;
>>    std::cout << "middling: " << middling << std::endl;
>>
>>    std::cout << "maxll: " << maxll << std::endl;
>>    std::cout << "largell: " << largell << std::endl;
>>    std::cout << "middlingll: " << middlingll << std::endl;
>> }

> mantissa.cpp:7:18: warning: magnitude of floating-point constant too
> large for type 'long double'; maximum is 1.7976931348623157E+308
> [-Wliteral-range]
> Sorry your program is not correct..

The value of "large" was chosen to be almost as large as "max" on my
machine. It's apparently larger than LDBL_MAX on your system. A more
portable initializer would be

FType large = 0x0.ABCDEF0123456789p0L * max;

Depending upon the value of LDBL_EPSILON on the target implementation,
that definition might result in the mantissa of "large" having fewer
significant digits than it has on mine, but that's not particularly
important.

> This is output on my system:
> bmaxa@Branimirs-Air projects % ./a.out
> max:
> 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
> large: inf

> middling: 2882400001.137778
> maxll: 9223372036854775807
> largell: 9223372036854775807
> middlingll: 2882400001

> max: 0x1.fffffffffffffp+1023
> large: inf
> middling: 0x1.579bde02468adp+31

> maxll: 0x7fffffffffffffff
> largell: 0x7fffffffffffffff
> middlingll: 0xabcdef01
>

> Greetings, Branimir.

With that change, the value of "large" on my machine changes to
0xa.bcdef0123456788p+16380, with a corresponding (HUGE) change to the
less significant digits of the decimal output, and the mantissa is
therefore 0xabcdef0123456788.
You'll get much different values for "max" and "large" on your system,
since LDBL_MAX is much smaller, but the qualitative comments I made
about those values should still be accurate. There's no guarantees,
since the behavior is undefined, but I would expect that the values of
"maxll" and "largell" will be unchanged.

Manfred

unread,

Oct 2, 2021, 5:20:07 PM10/2/21

to

He meant the FType that is in the original post. I.e. the same floating
point type of the number to be analyzed.

Branimir Maksimovic

unread,

Oct 2, 2021, 8:45:33 PM10/2/21

to

On 2021-10-02, James Kuyper <james...@alumni.caltech.edu> wrote:

How so if FType is just typedef?

--

7-77-777
Evil Sinner!
to weak you should be meek, and you should brainfuck stronger
https://github.com/rofl0r/chaos-pp

Branimir Maksimovic

unread,

Oct 2, 2021, 8:48:09 PM10/2/21

to

On 2021-10-02, James Kuyper <james...@alumni.caltech.edu> wrote:

Now outputs:

bmaxa@Branimirs-Air projects % ./a.out
max: 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000

large: 120645172288001372126619919515947355653853135095820093742269387851763173210613194516902557021349374744036403756647462602101944733127725809392434319981526284828181273773403605513307431121904981243948320057805774787281290926848565000291282225047507294705767371139696722108291516984908752371556782562287095382016.000000

middling: 2882400001.137778
maxll: 9223372036854775807
largell: 9223372036854775807
middlingll: 2882400001
max: 0x1.fffffffffffffp+1023

large: 0x1.579bde02468acp+1023

middling: 0x1.579bde02468adp+31
maxll: 0x7fffffffffffffff
largell: 0x7fffffffffffffff
middlingll: 0xabcdef01

>

Bonita Montero

unread,

Oct 3, 2021, 2:10:11 AM10/3/21

to

So, this should be the most elegant code:

#pragma once
#include <limits>
#include <cstdint>
#include <cassert>

struct dbl_parts
{

static_assert(std::numeric_limits<double>::is_iec559, "must be standard
fp");

dbl_parts( double d );
dbl_parts &operator =( double d );
dbl_parts() = default;
operator double();
bool getSign();
std::uint16_t getBiasedExponent();
std::int16_t getExponent();
std::uint64_t getMantissa();
void setSign( bool sign );
void setBiasedExponent( uint16_t exp );
void setExponent( int16_t exp );
void setMantissa( uint64_t mantissa );
private:

static unsigned const
MANTISSA_BITS = 52;
using i64 = std::int64_t;
using ui64 = std::uint64_t;
static ui64 const
SIGN_MASK = (ui64)1 << 63,
EXP_MASK = (ui64)0x7FF << MANTISSA_BITS,
MANTISSA_MASK = ~-((i64)1 << MANTISSA_BITS),
MANTISSA_MAX = ((ui64)1 << MANTISSA_BITS) | MANTISSA_MASK;
using ui16 = std::uint16_t;
using i16 = std::int16_t;
static ui16 const
BEXP_DENORMAL = 0,
BEXP_BASE = 0x3FF,
BEXP_MAX = 0x7FF;
static i16 const
EXP_MIN = 0 - BEXP_BASE,
EXP_MAX = BEXP_MAX - BEXP_BASE;
union
{
double value;
ui64 binary;

};
};

inline
dbl_parts::dbl_parts( double d ) :
value( d )
{
}

inline
dbl_parts &dbl_parts::operator =( double d )
{
value = d;
return *this;
}

inline
dbl_parts::operator double()
{
return value;
}

inline
bool dbl_parts::getSign()
{

return (i64)binary < 0;

}

inline
std::uint16_t dbl_parts::getBiasedExponent()
{

return (ui16)(binary >> MANTISSA_BITS) & BEXP_MAX;
}

inline
int16_t dbl_parts::getExponent()
{
return (i16)(getBiasedExponent() - BEXP_BASE);

}

inline
std::uint64_t dbl_parts::getMantissa()
{

ui16 bExp = getBiasedExponent();
ui64 hiBit = (ui64)(bExp && bExp != BEXP_MAX) << MANTISSA_BITS;
return binary & MANTISSA_MASK | hiBit;

}

inline
void dbl_parts::setSign( bool sign )
{

binary = binary & ~SIGN_MASK | (ui64)sign << 63;

}

inline
void dbl_parts::setBiasedExponent( std::uint16_t exp )
{

assert(exp <= BEXP_MAX);
binary = binary & (SIGN_MASK | MANTISSA_MASK) | (ui64)exp << MANTISSA_BITS;

}

inline
void dbl_parts::setExponent( std::int16_t exp )
{

exp += BEXP_BASE;
setBiasedExponent( exp );

}

inline
void dbl_parts::setMantissa( std::uint64_t mantissa )
{

#if !defined(NDEBUG)
ui64 mantissaMax = MANTISSA_MASK | (ui64)(getBiasedExponent() !=
BEXP_DENORMAL && getBiasedExponent() != BEXP_MAX) << MANTISSA_BITS;
assert(mantissa <= mantissaMax);
#endif
binary = binary & (SIGN_MASK | EXP_MASK) | mantissa & MANTISSA_MASK;
}

wij

unread,

Oct 3, 2021, 3:36:07 AM10/3/21

to

I changed int64_t to uint64_t, my problem is solved, thanks.
(in the original post, the include file <math.h> should be <cmath>)

Bart

unread,

Oct 3, 2021, 6:15:52 AM10/3/21

to

Jesus. And I think this still doesn't do an actual long double!

If you know you're running on an x86/x64 device (or even an 8088 with
8087 co-processor!), then this inline code expands a 64-bit double 'x64'
to its constituent parts:

fld qword [x64]
fstp tword [a80] ; sometimes, 'tbyte'

And for a long double 'x80' known to use 80-bit format:

fld tword [x80]
fstp tword [a80]

The former works because on the x87, all loads expand to an 80-bit
internal format with no hidden parts.

'a80' needs to be an instance of a type like this (here assumes
little-endian memory format, which is typical for anything with x87):

typedef struct {
uint64_t mantissa;
uint16_t sign_and_exponent; // sign is top bit
} ldformat;

I don't know how to reliably split those last 16 bits into 15- and 1-bit
fields using C's bitfields. (I used a different test language.)

ASM code shown may need adapting to gcc-style assembly.

Bonita Montero

unread,

Oct 3, 2021, 6:19:45 AM10/3/21

to

Am 03.10.2021 um 12:15 schrieb Bart:

> Jesus. And I think this still doesn't do an actual long double!

long double isn't supported by many compilers for x86-64.
long double should be avoided when possible because loads
and stores are slow with long double.

Ben Bacarisse

unread,

Oct 3, 2021, 10:02:25 AM10/3/21

to

I've not been following the details of the thread but it seems to have
strayed from the original question (as happens of course).

wij <wyn...@gmail.com> writes:

> // numeric_limits<long double>::digits=64.
> //
> typedef long double FType;
> FType x=numeric_limits<FType>::max();
> int iexp;
> int64_t mint;
> x=::frexpl(x,&iexp);
> x=::ldexpl(x,numeric_limits<FType>::digits);
> mint= static_cast<int64_t>(x);
>
> Result (mint) is a negative number, something not right!!!

frexp (et. al.) return a signed result with a magnitude in the interval
[1/2, 1). If you want the 64 most significant digits of the mantissa of
a long double (it may have many more digits than that) then I'd go for
something like this:

uint_least64_t ms64bits(long double d)
{
int exp;
return (uint_least64_t)std::scalbn(std::fabs(std::frexp(d, &exp)), 64);
}

If, as your code sketch suggests, you want all of them, then you are out
of luck as far as portable code goes because there may not be an integer
type wide enough.

But why do you want an integer value? For most mathematical uses the
result of std::frexp is what you really want.

--
Ben.

red floyd

unread,

Oct 3, 2021, 9:04:10 PM10/3/21

to

FFS, Bonita, READ THE FRICKIN' SUBJECT LINE!!!

OP wanted to get the mantissa of long double.

wij

unread,

Oct 3, 2021, 9:47:02 PM10/3/21

to

// --- Here is the real codes
explicit VLFloat(long double x) try {
typedef long double FType;
typedef uint64_t UIntMant;
WY_ENSURE(Limits<FType>::Bits<=CHAR_BIT*sizeof(UIntMant));
if(Wy::isfinite(x)==false) {
throw Errno(EINVAL);
}
Errno r;
int iexp;
if(x<0) {
x=-x;
m_neg=true;
} else {
m_neg=false;
}
x=Wy::frexp(x,&iexp);
x=Wy::ldexp(x,Limits<FType>::Bits);
m_exp=iexp-Limits<FType>::Bits;
if((r=m_mant.set_num(MantType::itv(static_cast<UIntMant>(x))))!=Ok) {
throw r;
}
if((r=_finalize())!=Ok) {
throw r;
}
}
catch(const Errno& e) {
WY_THROW( Reply(e) );
};
--
Most of the time while asking questions in comp.lang.c/c++, I need to rewrite a bit.

James Kuyper

unread,

Oct 3, 2021, 11:32:13 PM10/3/21

to

A typedef is just a synonym for a type, it has exactly the same
characteristics as the type for which it is a synonym. What I said is
true because FType is a synonym for long double, the same type as x
itself. What I was asserting is that for any given floating point
object, the mantissa or significand stored in that object has a value
that is guaranteed to be representable in the same floating point type
as the object itself. There's no guarantee that any integer type is big
enough to represent it.

I've since thought this over, and checked carefully, and that's not
quite true - though it's true enough for most practical purposes. Let me
explain.

The C++ standard defines <cfloat>, and defines it's contents mostly by
cross-referencing the C standard's definition of <float.h>. Section
5.2.4.2.2 the C standard defines a parameterized model for floating
point representations, that is used as a basis for describing the
meaning of the macros defined in <float.h>, so that model is inherited
by C++.

I will need to refer to the following parameters of that model:
> b - base or radix of exponent representation (an integer > 1)
> e - exponent (an integer between a minimum e_min and a maximum e_max )
> p - precision (the number of base-b digits in the significand)

Note that b, e_min, e_max, and p are constants for any specific floating
point type.

In terms of that model, the value of the significand of a floating point
value x, interpreted as an integer, can be represented in the same
floating point type by a number with exactly the same significand, and e
= p.

The key issue is whether such a representation is allowed, and it turns
out that there can be floating point representations which fit the C
standard's model, for which e_max < p, preventing some signficands from
being representable in such a type.

The macro LDBL_MAX (corresponding to std::numeric_limits<long
double>::max()) is defined as expanding to the value of
(1 - b^(-p))*b^e_max, and is required to be at least 1e37. If, for
example, e_max == p-1 and b=2, then this means that for such a type, p
must be at least 124.

Every floating point format I'm familiar with has an e_max value much
larger than p, so I think, as a practical matter, that it's safe to
assume that signficands can be represented in the same floating point
type, but strictly speaking, it's not guaranteed.

James Kuyper

unread,

Oct 3, 2021, 11:42:40 PM10/3/21

to

Your system has a different value for LDBL_MAX than mine, which is
perfectly normal. With that difference in mind, all of those results are
consistent with what I said. None of the long long values are the same
as the mantissa of the corresponding float value, and only middlingll is
the same as the whole part of corresponding float value.

Do you now concede that your approach is not guaranteed to result in a
long long value containing the mantissa?

James Kuyper

unread,

Oct 3, 2021, 11:43:28 PM10/3/21

to

On 10/2/21 3:23 PM, Alf P. Steinbach wrote:
> On 2 Oct 2021 21:12, James Kuyper wrote:

...

>> It would be safer and more portable to use FType; there's no portable
>> guarantee that any integer type is large enough to hold the mantissa,
>> but FType is.
>
> I believe you intended to write `uintptr_t`, not `FType`.

No, uintptr_t is not guaranteed to be able to hold the mantissa; nor is
any other integer type. FType is guaranteed to be able to hold it.

>>> However, instead of the shenanigans with `frexpl` and `ldexpl` I'd just
>>> use `memcpy`.
>>
>> The advantage of the code as written is that (if you change mint to have
>> FType) it will give the correct result even if your assumption about
>> IEEE 754 is false; that's not the case with memcpy().
>
> Uhm, I'd rather assert IEEE 754 representation
> (numeric_limits::is_iec559). Dealing with the bits of just about any
> representation seems to me a hopelessly daunting task. :-o

Well, that's the purpose of frexpl() and ldexpl() - they save you from
having to worry about the bits.

Branimir Maksimovic

unread,

Oct 4, 2021, 12:34:39 AM10/4/21

to

On 2021-10-04, James Kuyper <james...@alumni.caltech.edu> wrote:
> Note that b, e_min, e_max, and p are constants for any specific floating
> point type.
>
> In terms of that model, the value of the significand of a floating point
> value x, interpreted as an integer, can be represented in the same
> floating point type by a number with exactly the same significand, and e
>= p.
>
> The key issue is whether such a representation is allowed, and it turns
> out that there can be floating point representations which fit the C
> standard's model, for which e_max < p, preventing some signficands from
> being representable in such a type.
>
> The macro LDBL_MAX (corresponding to std::numeric_limits<long
> double>::max()) is defined as expanding to the value of
> (1 - b^(-p))*b^e_max, and is required to be at least 1e37. If, for
> example, e_max == p-1 and b=2, then this means that for such a type, p
> must be at least 124.
>
> Every floating point format I'm familiar with has an e_max value much
> larger than p, so I think, as a practical matter, that it's safe to
> assume that signficands can be represented in the same floating point
> type, but strictly speaking, it's not guaranteed.

Great!!!
So intead of int we use float type and truncate?

--

7-77-777
Evil Sinner!
to weak you should be meek, and you should brainfuck stronger

ettps://github.com/rofl0r/chaos-pp

Branimir Maksimovic

unread,

Oct 4, 2021, 12:35:40 AM10/4/21

to

On 2021-10-04, James Kuyper <james...@alumni.caltech.edu> wrote:
>

> Do you now concede that your approach is not guaranteed to result in a
> long long value containing the mantissa?

of course, you convinced me.

David Brown

unread,

Oct 4, 2021, 3:52:20 AM10/4/21

to

"long double" is supported by any C++ or C compiler, for any target,
that tries to come close to any current standards compliance.

What you probably mean is that on some targets, "long double" is the
same size as "double". That's true for most 32-bit (and smaller)
targets. Most 64-bit targets support larger "long double".

As happens so often, there is /one/ major exception to the common
practices used by (AFAICS) every other OS, every other processor, every
other compiler manufacturer - on Windows, and with MSVC, "long double"
is 64-bit.

"long double" should not be used where "double" will do, because it can
be a great deal slower on many platforms (the load and saves are
irrelevant). You also have to question whether "long double" does what
you want, on any given target. On smaller targets (or more limited
compilers, like MSVC), it gives you no benefits in accuracy or range
compared to "double". On some, such as x86-64 with decent tools, it
generally gives you 80-bit types. On others - almost any other 64-bit
system - it gives you 128-bit quad double, but it is likely to be
implemented in software rather than hardware.

However, it /is/ a valid type on all (reasonable) C and C++ systems, and
it is a perfectly reasonable question to ask how to handle it. Some
aspects can be handled in a portable manner, others require
implementation-dependent details.

Juha Nieminen

unread,

Oct 4, 2021, 5:59:32 AM10/4/21

to

David Brown <david...@hesbynett.no> wrote:
> What you probably mean is that on some targets, "long double" is the
> same size as "double". That's true for most 32-bit (and smaller)
> targets. Most 64-bit targets support larger "long double".

That might perhaps be true for non-x86 32-bit targets. However, in x86
architectures long double is most typically 80-bit, and has been so
for pretty much as long as the x86 (and its x87 math coprocessor)
has existed, ie. all the way from the 8086 (which was a 16-bit
processor), especially if you had the 8087 FPU coprocessor.

The reason why long double is 80-bit in x86 architectures is that
that's the native floating point format used by the x87 FPU.
(While you can load 32-bit floats and 64-bit doubles, internally
the FPU uses 80-bit floating point. And you can load and store
full 80-bit floating point values, of course.)

However, what has happened since then is the introduction of SSE
(and later AVX), which is much simpler to use and more efficient
than the FPU, and is a direct replacement of the FPU (well, with
the exception of trigonometric functions, which SSE/AVX for some
reason do not support natively). However, SSE/AVX only supports
64-bit floating point, not 80-bit, which is why 80-bit floating
point has been "soft-deprecated" for like two decades now.

You can still use 80-bit long doubles with compilers like gcc
and clang, even when targeting the latest x86-64 CPUs. The compilers
will generate FPU opcodes to handle them (and, most often than not,
they will be less efficient. Also, if I understand correctly, using
the FPU and SSE at the same time is not very efficient because they
share logic circuitry and interfere with each other. In other words,
the SSE unit needs to wait for the FPU logic to do its thing, which
wastes clock cycles. I might be wrong in this, though.)

80-bit long doubles are also considered justifiably deprecated
because they don't really all that much precision. Sure, they
have an 11 bits larger mantissa, and a slightly bigger range,
but all in all that's not a huge advantage in most calculations.
If you calculations are hitting the precision limits of double,
chances are that they are going to hit the precision limits of
long double as well. There's a relatively small range where
long double beats double in practical use.

Bonita Montero

unread,

Oct 4, 2021, 7:50:33 AM10/4/21

to

Am 04.10.2021 um 09:52 schrieb David Brown:

> "long double" is supported by any C++ or C compiler, for any target,
> that tries to come close to any current standards compliance.

Most compilers map long double to IEEE-754 double precision and not
extended precision. Even with Intel C++ you must supply a compiler
-switch that you have double as a extended precision.

> What you probably mean is that on some targets, "long double" is the
> same size as "double". That's true for most 32-bit (and smaller)
> targets. Most 64-bit targets support larger "long double".

No, most don't.

Bart

unread,

Oct 4, 2021, 9:07:02 AM10/4/21

to

On 04/10/2021 08:52, David Brown wrote:
> On 03/10/2021 12:19, Bonita Montero wrote:
>> Am 03.10.2021 um 12:15 schrieb Bart:
>>
>>> Jesus. And I think this still doesn't do an actual long double!
>>
>> long double isn't supported by many compilers for x86-64.
>> long double should be avoided when possible because loads
>> and stores are slow with long double.
>>
>
> "long double" is supported by any C++ or C compiler, for any target,
> that tries to come close to any current standards compliance.
>
> What you probably mean is that on some targets, "long double" is the
> same size as "double". That's true for most 32-bit (and smaller)
> targets. Most 64-bit targets support larger "long double".

Any code running on x86 can choose to use x87 instructions for floating
point.

Then, even calculations involving 64-bit variables, will use 80-bit
intermediate results.

> As happens so often, there is /one/ major exception to the common
> practices used by (AFAICS) every other OS, every other processor, every
> other compiler manufacturer - on Windows, and with MSVC, "long double"
> is 64-bit.

Here's quick survey of Windows C compilers:

Compiler sizeof(long double)

MSCV 8 bytes
gcc 16
clang 8
DMC 10
lccwin 16
tcc 8
bcc 8

So, some compilers do manage to use a wider long double type, even on
Windows, showing that the choice is nothing do with the OS.

And at least one 'big' compiler that is not MSVC also uses a 64-bit type
for long double.

You just like having a bash at Windows (it doesn't stop the use of
float80), and at MSVC (Clang doesn't support float80 either).

Bonita Montero

unread,

Oct 4, 2021, 10:13:10 AM10/4/21

to

Am 04.10.2021 um 15:06 schrieb Bart:

> DMC 10
> lccwin 16

Irrelevant.

Scott Lurndal

unread,

Oct 4, 2021, 10:38:03 AM10/4/21

to

David Brown <david...@hesbynett.no> writes:
>On 03/10/2021 12:19, Bonita Montero wrote:

>
>"long double" should not be used where "double" will do, because it can
>be a great deal slower on many platforms (the load and saves are
>irrelevant). You also have to question whether "long double" does what
>you want, on any given target. On smaller targets (or more limited
>compilers, like MSVC), it gives you no benefits in accuracy or range
>compared to "double". On some, such as x86-64 with decent tools, it
>generally gives you 80-bit types. On others - almost any other 64-bit
>system - it gives you 128-bit quad double, but it is likely to be
>implemented in software rather than hardware.

ARMv8 has 128-bit floating point, in hardware.

FWIW, it also has 512 to 2048 bit floating point, in hardware, when
the SVE extension is implemented.

From the ARMv8 ARM (Architecture Reference Manual):

The architecture also supports the following floating-point data types:
· Half-precision, see Half-precision floating-point formats
on page A1-44 for details.
· Single-precision, see Single-precision floating-point format
on page A1-46 for details.
· Double-precision, see Double-precision floating-point format
on page A1-47 for details.
· BFloat16, see BFloat16 floating-point format on page A1-48
for details.

It also supports:
· Fixed-point interpretation of words and doublewords. See
Fixed-point format on page A1-50.
· Vectors, where a register holds multiple elements, each of
the same data type. See Vector formats on
page A1-41 for details.

The Armv8 architecture provides two register files:
· A general-purpose register file.
· A SIMD&FP register file.

In each of these, the possible register widths depend on the Execution state.
In AArch64 state:
· A general-purpose register file contains thirty-two 64-bit registers:
-- Many instructions can access these registers as 64-bit
registers or as 32-bit registers, using only the
bottom 32 bits.
· A SIMD&FP register file contains thirty-two 128-bit registers:
-- The quadword integer data types only apply to the SIMD&FP
register file.
-- The floating-point data types only apply to the SIMD&FP
register file.
-- While the AArch64 vector registers support 128-bit vectors,
the effective vector length can be 64-bits
or 128-bits depending on the A64 instruction encoding used,
see Instruction Mnemonics on page C1-195.

If the SVE extension is implemented, an additional register file of 32
512-bit to 2048-bit registers (implementation choice) is available).

Bonita Montero

unread,

Oct 4, 2021, 10:53:30 AM10/4/21

to

Am 04.10.2021 um 16:37 schrieb Scott Lurndal:

> ARMv8 has 128-bit floating point, in hardware.
> FWIW, it also has 512 to 2048 bit floating point,
> in hardware, when the SVE extension is implemented.

I bet that ther even isn't an implementation for 128 bit fp,
but just a specification.

James Kuyper

unread,

Oct 4, 2021, 11:31:01 AM10/4/21

to

On 10/4/21 12:34 AM, Branimir Maksimovic wrote:
> On 2021-10-04, James Kuyper <james...@alumni.caltech.edu> wrote:

...

>> In terms of that model, the value of the significand of a floating point
>> value x, interpreted as an integer, can be represented in the same
>> floating point type by a number with exactly the same significand, and e
>> = p.
>>
>> The key issue is whether such a representation is allowed, and it turns
>> out that there can be floating point representations which fit the C
>> standard's model, for which e_max < p, preventing some signficands from
>> being representable in such a type.
>>
>> The macro LDBL_MAX (corresponding to std::numeric_limits<long
>> double>::max()) is defined as expanding to the value of
>> (1 - b^(-p))*b^e_max, and is required to be at least 1e37. If, for
>> example, e_max == p-1 and b=2, then this means that for such a type, p
>> must be at least 124.
>>
>> Every floating point format I'm familiar with has an e_max value much
>> larger than p, so I think, as a practical matter, that it's safe to
>> assume that signficands can be represented in the same floating point
>> type, but strictly speaking, it's not guaranteed.
> Great!!!
> So intead of int we use float type and truncate?

You're half right. If you want the mantissa (== significand), you should
not truncate - that might lose you the parts of the significand that
represent the fractional part of the number. The OP had it right at the
second to last step in his original code:

x = std::ldexp(x, numeric_limits<FType>::digits);

At this point, x already contains the significand; there's no further
need to convert it to an integer type - in fact, in most contexts you'd
want it in floating point format for later steps in the processing.

Branimir Maksimovic

unread,

Oct 4, 2021, 12:15:29 PM10/4/21

to

On 2021-10-04, Bart <b...@freeuk.com> wrote:
>
> Any code running on x86 can choose to use x87 instructions for floating
> point.

Only in assembler...

Branimir Maksimovic

unread,

Oct 4, 2021, 12:15:54 PM10/4/21

to

Irrelevant to whom?

Branimir Maksimovic

unread,

Oct 4, 2021, 12:19:40 PM10/4/21

to

Thanks for explanaition.

Greets, branimir

--

7-77-777
Evil Sinner!
to weak you should be meek, and you should brainfuck stronger

https://github.com/rofl0r/chaos-pp

Scott Lurndal

unread,

Oct 4, 2021, 1:08:26 PM10/4/21

to

sc...@slp53.sl.home (Scott Lurndal) writes:
>David Brown <david...@hesbynett.no> writes:
>>On 03/10/2021 12:19, Bonita Montero wrote:
>
>>
>>"long double" should not be used where "double" will do, because it can
>>be a great deal slower on many platforms (the load and saves are
>>irrelevant). You also have to question whether "long double" does what
>>you want, on any given target. On smaller targets (or more limited
>>compilers, like MSVC), it gives you no benefits in accuracy or range
>>compared to "double". On some, such as x86-64 with decent tools, it
>>generally gives you 80-bit types. On others - almost any other 64-bit
>>system - it gives you 128-bit quad double, but it is likely to be
>>implemented in software rather than hardware.
>
>
>ARMv8 has 128-bit floating point, in hardware.

^ registers

It does not support 128-bit float point types.

David Brown

unread,

Oct 4, 2021, 2:04:37 PM10/4/21

to

On 04/10/2021 13:50, Bonita Montero wrote:
> Am 04.10.2021 um 09:52 schrieb David Brown:
>
>> "long double" is supported by any C++ or C compiler, for any target,
>> that tries to come close to any current standards compliance.
>
> Most compilers map long double to IEEE-754 double precision and not
> extended precision. Even with Intel C++ you must supply a compiler
> -switch that you have double as a extended precision.

No, you don't.

>
>> What you probably mean is that on some targets, "long double" is the
>> same size as "double". That's true for most 32-bit (and smaller)
>> targets. Most 64-bit targets support larger "long double".
>
> No, most don't.
>

Yes, they do.

Again, you are confusing "Windows" with "everything".

MS, for reasons known only to them, seem to have decided that "long
double" should be 64-bit in the Windows ABI (noting that there never was
a real ABI for 32-bit Windows). So Intel's compiler /on windows/ will
use 64-bit "long double" by default. It uses 80-bit "long double" on
other x86-64 targets (they are actually 16 bytes in size, for alignment
purposes, but 80 bits of data).

MSVC for ARM has 64-bit "long doubles" even on 64-bit ARM, other 64-bit
ARM targets have 128-bit (I don't know off-hand if they are IEEE quad
precision). For RISC-V, even 32-bit targets have 128-bit "long double".

Bart

unread,

Oct 4, 2021, 3:23:20 PM10/4/21

to

So it doesn't support float512 to float2048 in hardware? I thought that
sounded far-fetched.

But if it's just a matter of registers of 128+ bits (which can store a
vector of smaller float types), then that's old hat with x86/x64.

Bonita Montero

unread,

Oct 5, 2021, 1:13:11 AM10/5/21

to

Am 04.10.2021 um 20:04 schrieb David Brown:

> MSVC for ARM has 64-bit "long doubles" even on 64-bit ARM, other 64-bit
> ARM targets have 128-bit (I don't know off-hand if they are IEEE quad
> precision). For RISC-V, even 32-bit targets have 128-bit "long double".

There isn't any ARM-implementation with 128 bit FP.

Juha Nieminen

unread,

Oct 5, 2021, 1:19:37 AM10/5/21

to

Bart <b...@freeuk.com> wrote:
> Here's quick survey of Windows C compilers:
>
> Compiler sizeof(long double)
>
> MSCV 8 bytes
> gcc 16
> clang 8
> DMC 10
> lccwin 16
> tcc 8
> bcc 8

Note that sizeof() doesn't tell how large the floating point number is,
only how much storage space the compiler is reserving for it. Some
compilers may well over-reserve space for 80-bit floating point, for
alignment reasons (the extra bytes will be unused).

sizeof() is telling only if it gives less than 10 for long double.

David Brown

unread,

Oct 5, 2021, 3:48:16 AM10/5/21

to

And you know that because ... what? Because you are confusing hardware
floating point with floating point in general?

There are very few /hardware/ implementations of quad precision floating
point - I think perhaps Power is the only architecture that actually has
it in practice. (Some architectures, such as SPARC and RISC-V, have
defined them in the architecture but have no physical devices supporting
them.)

But /software/ implementations of quad precision floating point are not
hard to find. And a lot of toolchains support them - just as lots of
toolchains have software support for other kinds of floating point or
integer arithmetic that are part of C or C++, but do not have hardware
implementations on a given target.

And yes, we all know that software floating point is usually a lot
slower than hardware. People don't use 128-bit floating point for speed
- they use it because they need the range or precision, and speed is a
secondary concern.

David Brown

unread,

Oct 5, 2021, 3:48:53 AM10/5/21

to

On 04/10/2021 18:15, Branimir Maksimovic wrote:
> On 2021-10-04, Bart <b...@freeuk.com> wrote:
>>
>> Any code running on x86 can choose to use x87 instructions for floating
>> point.
>
> Only in assembler...
>

Or with a good compiler.

David Brown

unread,

Oct 5, 2021, 4:01:00 AM10/5/21

to

Yes, that is an important distinction - especially in the x86 world
where different sizes are common. "long double" is typically 80-bit
floating point on x86 (except MSVC and weaker tools), but for alignment
purposes it is often stored in 16-byte blocks (on 64-bit) or 12-byte
blocks (on 32-bit). On other targets (or gcc for x86 with particular
flags), "long double" is often quad precision, almost always implemented
in software.

A useful value to look at is "LDBL_DIG" in <float.h>. For 32-bit IEEE
floats, the number of digits is 6. For 64-bit, it is 15. For 80-bit,
it is 18. For 128-bit, it is 33.

Branimir Maksimovic

unread,

Oct 5, 2021, 6:19:22 AM10/5/21

to

not on Apple, as well...

Bonita Montero

unread,

Oct 5, 2021, 10:46:19 AM10/5/21

to

Am 05.10.2021 um 09:47 schrieb David Brown:

>> There isn't any ARM-implementation with 128 bit FP.

> And you know that because ... what? Because you are confusing hardware
> floating point with floating point in general?

There isn't even an ISA-spec. Its just an ABI.

> There are very few /hardware/ implementations of quad precision floating
> point - I think perhaps Power is the only architecture that actually has
> it in practice. (Some architectures, such as SPARC and RISC-V, have
> defined them in the architecture but have no physical devices supporting
> them.)

https://en.wikipedia.org/wiki/Quadruple-precision_floating-point_format

Bonita Montero

unread,

Oct 5, 2021, 10:47:57 AM10/5/21

to

Am 05.10.2021 um 07:19 schrieb Juha Nieminen:
> Bart <b...@freeuk.com> wrote:
>> Here's quick survey of Windows C compilers:
>>
>> Compiler sizeof(long double)
>>
>> MSCV 8 bytes
>> gcc 16
>> clang 8
>> DMC 10
>> lccwin 16
>> tcc 8
>> bcc 8
>
> Note that sizeof() doesn't tell how large the floating point number is,
> only how much storage space the compiler is reserving for it. Some
> compilers may well over-reserve space for 80-bit floating point, for
> alignment reasons (the extra bytes will be unused).

If you have a) an x86-Platform and b) sizeof(long double)
> 8 then the bits stored inside the long double are 80.

Keith Thompson

unread,

Oct 5, 2021, 12:04:44 PM10/5/21

to

David Brown <david...@hesbynett.no> writes:
> On 05/10/2021 07:12, Bonita Montero wrote:
>> Am 04.10.2021 um 20:04 schrieb David Brown:
>>
>>> MSVC for ARM has 64-bit "long doubles" even on 64-bit ARM, other 64-bit
>>> ARM targets have 128-bit (I don't know off-hand if they are IEEE quad
>>> precision). For RISC-V, even 32-bit targets have 128-bit "long double".
>>
>> There isn't any ARM-implementation with 128 bit FP.
>
> And you know that because ... what? Because you are confusing hardware
> floating point with floating point in general?

$ uname -m
aarch64
$ cat c.c
#include <stdio.h>
#include <limits.h>
#include <float.h>

int main(void) {
printf("sizeof (long double) = %zu (%zu bits), LDBL_DIG = %d\n",
sizeof (long double),
CHAR_BIT * sizeof (long double),
LDBL_DIG);
}
$ gcc c.c -o c && ./c
sizeof (long double) = 16 (128 bits), LDBL_DIG = 33
$

--
Keith Thompson (The_Other_Keith) Keith.S.T...@gmail.com
Working, but not speaking, for Philips
void Void(void) { Void(); } /* The recursive call of the void */

David Brown

unread,

Oct 5, 2021, 1:05:14 PM10/5/21

to

On 05/10/2021 16:46, Bonita Montero wrote:
> Am 05.10.2021 um 09:47 schrieb David Brown:
>
>>> There isn't any ARM-implementation with 128 bit FP.
>
>> And you know that because ... what? Because you are confusing hardware
>> floating point with floating point in general?
>
> There isn't even an ISA-spec. Its just an ABI.

And ... so what? For all your effort trying to move the goalposts, you
are still wrong - "long double" is supported, and on all larger systems
with decent compilers, it is greater range and precision than "double".
It's that simple.

>
>> There are very few /hardware/ implementations of quad precision floating
>> point - I think perhaps Power is the only architecture that actually has
>> it in practice. (Some architectures, such as SPARC and RISC-V, have
>> defined them in the architecture but have no physical devices supporting
>> them.)
>
> https://en.wikipedia.org/wiki/Quadruple-precision_floating-point_format

Yes, I have read that. Have you? Did you notice where it contradicted
your claims?

Bonita Montero

unread,

Oct 5, 2021, 1:13:32 PM10/5/21

to

Am 05.10.2021 um 19:04 schrieb David Brown:
> On 05/10/2021 16:46, Bonita Montero wrote:
>> Am 05.10.2021 um 09:47 schrieb David Brown:
>>
>>>> There isn't any ARM-implementation with 128 bit FP.
>>
>>> And you know that because ... what? Because you are confusing hardware
>>> floating point with floating point in general?
>>
>> There isn't even an ISA-spec. Its just an ABI.
>
> And ... so what? For all your effort trying to move the goalposts,

> you are still wrong - "long double" is supported, ...

But not on the ISA-level. So it's all up to the compiler and runtime.

Bonita Montero

unread,

Oct 5, 2021, 1:14:31 PM10/5/21

to

Am 05.10.2021 um 18:04 schrieb Keith Thompson:
> David Brown <david...@hesbynett.no> writes:
>> On 05/10/2021 07:12, Bonita Montero wrote:
>>> Am 04.10.2021 um 20:04 schrieb David Brown:
>>>
>>>> MSVC for ARM has 64-bit "long doubles" even on 64-bit ARM, other 64-bit
>>>> ARM targets have 128-bit (I don't know off-hand if they are IEEE quad
>>>> precision). For RISC-V, even 32-bit targets have 128-bit "long double".
>>>
>>> There isn't any ARM-implementation with 128 bit FP.
>>
>> And you know that because ... what? Because you are confusing hardware
>> floating point with floating point in general?
>
> $ uname -m
> aarch64
> $ cat c.c
> #include <stdio.h>
> #include <limits.h>
> #include <float.h>
>
> int main(void) {
> printf("sizeof (long double) = %zu (%zu bits), LDBL_DIG = %d\n",
> sizeof (long double),
> CHAR_BIT * sizeof (long double),
> LDBL_DIG);
> }
> $ gcc c.c -o c && ./c
> sizeof (long double) = 16 (128 bits), LDBL_DIG = 33
> $

That's pure software - your CPU doesn't natively support 128 bit FP.

Bart

unread,

Oct 5, 2021, 2:21:41 PM10/5/21

to

Does it matter?

Suppose there /is/ a long double type which is wider than double,
however many bits it is, or whether it's implemented in hardware or
software: how do you then extract its mantissa?

I showed a non-portable way of doing when it was Intel's 80-bit type. My
method used x87 instructions, but it's actually even simpler without it:
just take the first 64 bits of the 80-bit little-endian representation.
(That only works with that 80-bit format.)

But for anything beyond 80 bits, or for something that works on any size
of float, you will need a different approach, and possibly a uint128 type.

Branimir Maksimovic

unread,

Oct 5, 2021, 3:34:10 PM10/5/21

to

On 2021-10-05, Keith Thompson <Keith.S.T...@gmail.com> wrote:
> David Brown <david...@hesbynett.no> writes:
>> On 05/10/2021 07:12, Bonita Montero wrote:
>>> Am 04.10.2021 um 20:04 schrieb David Brown:
>>>
>>>> MSVC for ARM has 64-bit "long doubles" even on 64-bit ARM, other 64-bit
>>>> ARM targets have 128-bit (I don't know off-hand if they are IEEE quad
>>>> precision). For RISC-V, even 32-bit targets have 128-bit "long double".
>>>
>>> There isn't any ARM-implementation with 128 bit FP.
>>
>> And you know that because ... what? Because you are confusing hardware
>> floating point with floating point in general?
>
> $ uname -m
> aarch64
> $ cat c.c
> #include <stdio.h>
> #include <limits.h>
> #include <float.h>
>
> int main(void) {
> printf("sizeof (long double) = %zu (%zu bits), LDBL_DIG = %d\n",
> sizeof (long double),
> CHAR_BIT * sizeof (long double),
> LDBL_DIG);
> }
> $ gcc c.c -o c && ./c
> sizeof (long double) = 16 (128 bits), LDBL_DIG = 33
> $
>

Apple M1:
bmaxa@Branimirs-Air projects % gcc -O2 c.c
bmaxa@Branimirs-Air projects % ./a.out
sizeof (long double) = 8 (64 bits), LDBL_DIG = 15
bmaxa@Branimirs-Air projects % uname -m
arm64

Branimir Maksimovic

unread,

Oct 5, 2021, 3:34:58 PM10/5/21

to

On 2021-10-05, Bonita Montero <Bonita....@gmail.com> wrote:

Exactly.

Bonita Montero

unread,

Oct 6, 2021, 1:58:28 AM10/6/21

to

Yes, It's rare that you need 128 bit FP and it's even more rare that
the speed doesn't matter.

David Brown

unread,

Oct 6, 2021, 2:12:05 AM10/6/21

to

Until a couple of days ago, you didn't think 128-bit floating point
existed. Now you think you know how it is used and what is important to
users?

There was a time when /all/ floating point, at least on "normal"
computers, was in software. Yet people used it - including 64-bit
doubles which were much slower than 32-bit float. Then floating point
co-processors became common, and made floating point somewhat faster
(though still slow). And people used them. There are a great many
processors that still don't have hardware floating point, and people use
software floats if they need floating point. You do so because getting
the correct answer is more important than getting a fast answer.

If you need the greater precision or range of quad precision floating
point, then you use that. If it is slow, you accept that - or you pay
the price for faster hardware.

There does not seem to be much call for quad precision - not enough for
many cpu manufacturers to spend the significant die space needed to
implement it in hardware. But there is enough call for it to be
implemented in software on many tools (gcc and clang for x86 have it as
__float128, and I'd be slightly surprised if MS doesn't have some
support through a similar extension), it is specified in a number of cpu
architecture documents for future implementation, and there are many
library implementations available.

Bonita Montero

unread,

Oct 6, 2021, 4:25:30 AM10/6/21

to

Am 06.10.2021 um 08:11 schrieb David Brown:

> Until a couple of days ago, you didn't think 128-bit floating point

> existed. ...

Where did I say that ?
I only said that ARM 128 bit FP is just an ABI-issue.

Bart

unread,

Oct 6, 2021, 7:12:53 AM10/6/21

to

You keep avoiding my point

Suppose you HAVE 80-bit or 128-FP, and need to extract the mantissa, how
would you do it?

It's like someone asking how to work out the 5th root of a number, and
you give a method for the square root. Then when pressed, you say that
is is rare to need a 5th root!

Bonita Montero

unread,

Oct 6, 2021, 8:49:06 AM10/6/21

to

As the 1 high-bit is alwas included in the 64 bit mantissa of a 80 bit
FP-value that's much easier. In contrast to smaller FP-values where the
one bit implicity except from denormal values or values with the maximum
exponent.

union
{
long double value;
struct
{
uint64_t mantissa;
uint16_t exponent : 15,
sign : 1;
};
}

Just store your value into value and extract the mantissa from mantissa.

Scott Lurndal

unread,

Oct 6, 2021, 10:25:33 AM10/6/21

to

David Brown <david...@hesbynett.no> writes:
>On 06/10/2021 07:58, Bonita Montero wrote:
>> Am 05.10.2021 um 20:21 schrieb Bart:
>>> On 05/10/2021 18:14, Bonita Montero wrote:
>
>>>> That's pure software - your CPU doesn't natively support 128 bit FP.
>>>>
>>>
>>> Does it matter?
>>
>> Yes, It's rare that you need 128 bit FP and it's even more rare that
>> the speed doesn't matter.
>>
>
>Until a couple of days ago, you didn't think 128-bit floating point
>existed. Now you think you know how it is used and what is important to
>users?

It seems the most popular new floating point format today is 16-bit
floating point (bfloat16 in ARM) - important in machine learning
applications.

Bonita Montero

unread,

Oct 6, 2021, 10:54:09 AM10/6/21

to

Am 06.10.2021 um 16:25 schrieb Scott Lurndal:

> It seems the most popular new floating point format today is 16-bit
> floating point (bfloat16 in ARM) - important in machine learning
> applications.

I think that's popular only for KI-puposes.
Even for computer graphics that's not sufficient.

red floyd

unread,

Oct 6, 2021, 12:44:48 PM10/6/21

to

On 10/6/2021 5:48 AM, Bonita Montero wrote:

>
> As the 1 high-bit is alwas included in the 64 bit mantissa of a 80 bit
> FP-value that's much easier. In contrast to smaller FP-values where the
> one bit implicity except from denormal values or values with the maximum
> exponent.
>
> union
> {
>     long double value;
>     struct
>     {
>         uint64_t mantissa;
>         uint16_t exponent : 15,
>                  sign : 1;
>     };
> }
>
> Just store your value into value and extract the mantissa from mantissa.

Does anyone know if type punning through a union is still undefined
behavior?

Also, I believe bitfield allocation is implementation defined.

Bonita Montero

unread,

Oct 6, 2021, 12:52:52 PM10/6/21

to

It's a de-facto standard for all compilers because such constructs
are needed very often.
Otherwise it's specified that you can alias anything as a array of
chars and an array of chars as anything - I think that's specified
because this is nice to have for serialization-purposes. So you can
cast a type to a char * and then further to the destination type to
get the same effect. This definitely prevents any optimization tricks
of the compiler which would sabotage this.

> Also, I believe bitfield allocation is implementation defined.

The way than I did is at least portable among x86-compilers.
On platforms where you have a differnt byte-order this won't
work, but these platforms don't have a 80 bit FP type anyway.

Bonita Montero

unread,

Oct 6, 2021, 1:02:45 PM10/6/21

to

>>> union
>>> {
>>>      long double value;
>>>      struct
>>>      {
>>>          uint64_t mantissa;
>>>          uint16_t exponent : 15,
>>>                   sign : 1;
>>>      };
>>> }
>>>

>> Also, I believe bitfield allocation is implementation defined.

> The way than I did is at least portable among x86-compilers.
> On platforms where you have a differnt byte-order this won't
> work, but these platforms don't have a 80 bit FP type anyway.

To be more precise here: there are some differences in how x86
compilers handle bitfields, but there are also commonalities.
And the binary layout of the above structure is definitely the
same for all x86-compilers.

Chris M. Thomasson

unread,

Oct 6, 2021, 4:57:56 PM10/6/21

to

This does seem to be a pattern for her. Humm...

Juha Nieminen

unread,

Oct 7, 2021, 1:35:29 AM10/7/21

to

red floyd <no.spa...@its.invalid> wrote:
>> union
>> {
>>     long double value;
>>     struct
>>     {
>>         uint64_t mantissa;
>>         uint16_t exponent : 15,
>>                  sign : 1;
>>     };
>> }
>>
>> Just store your value into value and extract the mantissa from mantissa.
>
> Does anyone know if type punning through a union is still undefined
> behavior?
>
> Also, I believe bitfield allocation is implementation defined.

Also, the above union assumes that the long double value is stored in the
same byte order as the members of the struct. It also assumes that the long
double occupies exactly 10 bytes and isn't padded in the wrong end.

And, as you say, it's not defined whether that sign bit will be the lowest
or highest bit of that 16-bit bitfield. (I don't think the standard even
guarantees that the 1-bit will be stored in the same bytes as the 15-bits.)

You could make some compile-time checks to see that the union will have
the proper internal structure (and refuse to compile otherwise), although
IIRC there's no way to check endianess at compile time (unless C++20
added some new features to do that).

David Brown

unread,

Oct 7, 2021, 2:17:34 AM10/7/21

to

Yes, it is. It's not uncommon to for compilers to generate code as
though it /were/ defined in C++ (it is defined behaviour in C),
especially for unions of simple types. But compilers don't (IME)
guarantee it.

The standard method for accessing "raw" bit patterns that is always
safe, is memcpy(). In C++20, there is now std::bit_cast<>, which also
lets you do get type-punning effects.

(And if anyone tells you type-punning unions "work" in C++, then ask why
std::bit_cast<> was added to the language.)

David Brown

unread,

Oct 7, 2021, 2:20:31 AM10/7/21

to

C++20 added std::endian for convenience and standardisation of common
compiler features for endianness:

<https://en.cppreference.com/w/cpp/types/endian>

Bonita Montero

unread,

Oct 7, 2021, 2:38:12 AM10/7/21

to

Am 07.10.2021 um 07:35 schrieb Juha Nieminen:
> red floyd <no.spa...@its.invalid> wrote:
>>> union
>>> {
>>>     long double value;
>>>     struct
>>>     {
>>>         uint64_t mantissa;
>>>         uint16_t exponent : 15,
>>>                  sign : 1;
>>>     };
>>> }
>>>
>>> Just store your value into value and extract the mantissa from mantissa.
>>
>> Does anyone know if type punning through a union is still undefined
>> behavior?
>>
>> Also, I believe bitfield allocation is implementation defined.
>
> Also, the above union assumes that the long double value is stored in the
> same byte order as the members of the struct. It also assumes that the long
> double occupies exactly 10 bytes and isn't padded in the wrong end.

It actually fits for all x86-compilers.

Bonita Montero

unread,

Oct 7, 2021, 2:38:46 AM10/7/21

to

Am 07.10.2021 um 08:20 schrieb David Brown:

> C++20 added std::endian for convenience and standardisation of common
> compiler features for endianness:

Why an endianess check here ?
There are no big-endian machines with 80 bit FP.

Bonita Montero

unread,

Oct 7, 2021, 2:42:23 AM10/7/21

to

Oh, I just remembered the 68K FPUs. These have 80 bit FP.
But 68K is dead today.

David Brown

unread,

Oct 7, 2021, 4:23:49 AM10/7/21

to

The 68k ISA lives on in the microcontroller world, as the ColdFire
processors, though I don't believe any new ColdFire microcontrollers
have been released since NXP bought Freescale in 2015. Even at that
stage it was no longer a major development line for Freescale. A decade
or so before that, ColdFire was one of the most popular architectures in
small network devices such as SOHO NAT routers, and was the main target
for the MMU-less Linux version ucLinux. This was all long after the 68k
was gone from desktops, workstations, home computers, etc.

(Because ucLinux does not support fork(), this lead to an increased use
of spawn() instead of fork-exec, which then made porting to Windows
easier. So if you use any *nix-heritage software on your Windows
machine, that may be partly thanks to the 68k architecture.)

Some of the 68k-based microcontrollers are immortal. People still make
devices using the renowned 68332, though it is around 30 years old.

However, few of these embedded uses used floating point, and the the
bigger ColdFire processors with hardware floating point support all used
64-bit floats. 80-bit floats were used in the 68882 FPU co-processor,
and the 68040 and 68060 processors.

The first serious electronics board I designed had a 68332 with a 68882
co-processor, many years ago. (The current version of the same system
is, unsurprisingly, ARM based.)

Juha Nieminen

unread,

Oct 8, 2021, 12:37:51 AM10/8/21

to

ARM is becoming more and more common.

Juha Nieminen

unread,

Oct 8, 2021, 12:39:18 AM10/8/21

to

David Brown <david...@hesbynett.no> wrote:
> (And if anyone tells you type-punning unions "work" in C++, then ask why
> std::bit_cast<> was added to the language.)

Why hasn't type-punning been standardized in C++, given that it's
standardized in C?

red floyd

unread,

Oct 8, 2021, 1:44:16 AM10/8/21

to

Back in the early '80s, there was something called "all-the-world's-
a-vax-running-BSD Syndrome". You appear to have the modern variant,
known as "All-the-world's-an-x86-running-Windows Syndrome."

Bonita Montero

unread,

Oct 8, 2021, 1:54:09 AM10/8/21

to

ARM hasn't 80 bit FP, so there's no need for the above code on ARM.

David Brown

unread,

Oct 8, 2021, 3:25:51 AM10/8/21

to

That's a fair question. First, I think you'd have to ask why was it
standardised in C? AFAIUI (and this is from memory rather than
reference), in C90 and before, it was quite unclear whether the
standards defined type-punning union behaviour. However, code relied on
it. It's not something that turns up very often - you very rarely have
need for this kind of thing. But because it turns up in low-level code
that is in turn used by lots of other code - such as in the heart of a
software floating point library - it becomes important.

A few compilers, such as gcc, actually documented that type-punning
unions worked - most did not, though people expected them to support
them. I've never been clear as to whether the C standards committee
changed their mind to make type-punning fully defined, or if they simply
changed the wording of the standard to make it less ambiguous.

In C++, however, the idea of type-punning did not fit in. The original
aim was that most C++ code would be about objects that get created using
a constructor, maintain an invariant by using methods to access the
private data, and get destroyed cleanly with a destructor. Messing
directly with the internals of an object using different types - that's
the kind of bad behaviour C programmers use, and stops you relying on
your types and their safety and correctness.

C++ certainly could have made it defined behaviour for unions that are
limited to POD types. But they have never done so - maybe there are
complicating factors that I haven't thought of. The current idea in
C++20 is std::bit_cast<>. I don't know how useful or convenient that
will be.

Scott Lurndal

unread,

Oct 8, 2021, 10:34:47 AM10/8/21

to

David Brown <david...@hesbynett.no> writes:
.
>
>A few compilers, such as gcc, actually documented that type-punning
>unions worked - most did not, though people expected them to support
>them. I've never been clear as to whether the C standards committee
>changed their mind to make type-punning fully defined, or if they simply
>changed the wording of the standard to make it less ambiguous.
>

Reading through the Unisys C compiler manual is interesting, due
to the radically different underlying architecture, for example
in the porting section:

The comparison of negative numbers with unsigned types may behave differently on
A Series systems than on other platforms. Platforms that store arithmetic values in
one's or two's complement form store the sign when assigning to an unsigned type.
On A Series systems, the sign is not stored when assigning to an unsigned type. On
A Series systems, comparing a negative value with the value in an unsigned type will
never be true unless the CHAR2 or UNSIGNED subordinate options of the PORT
compiler control option are used to emulate two's complement arithmetic on
unsigned types. Another workaround is to change negative literals to their one's
complement values (for example, -1 becomes 0xFF).

The representation of scalar types differs between A Series systems and other
platforms. On A Series systems, integer types are 6-bytes in length and start on word
boundaries. Any calculations that assume the size of an integer type to be anything
other than 6 bytes will result in problems when portation occurs.

long Ary[ (sizeof(SomeObject)+3) / sizeof(long)];

The addition of the numeric literal 3 to the size of the type of SomeObject assumes that
sizeof(long) returns 4. In A Series C, sizeof(long) returns 6, causing the calculation
to return unexpected results. The following code fragment illustrates how you can
avoid this type of portation problem:

struct{ unsigned short Type;
unsigned char SeqNum;

} Info;
char *InfoPtr = &Info;
seq = InfoPtr[2];

Using InfoPtr[2] to access SeqNum assumes that bytes 0 and 1 contain the structure
member Type and that byte 2 contains the structure member SeqNum. On A Series
systems, Type occupies bytes 0 to 5, and SeqNum occupies byte 6, which means that
InfoPtr[2] accesses the wrong data.

When a pointer is implicitly or explicitly cast from a pointer to an object of a given
alignment to a pointer to an object of a more strict alignment, the resulting pointer
may not be aligned with the original pointer. For example, casting a pointer to char to a
pointer to int causes the pointer index to be divided by 6 to change from character to
word units. The divide by 6 operation causes the pointer to int to point to the
beginning of the word. If the char pointer points to the middle of the word, the two
pointers do not point to the same data. To catch pointer alignment errors of this type
at run time, use the $BOUNDS(ALIGNMENT) compiler control option.

Two's complement arithmetic is used on many platforms. On A Series systems,
arithmetic is performed on data in signed-magnitude form. This can cause
discrepancies in algorithms that depend on the two's complement representation. For
example, C applications that use encryption algorithms to match data, such as
passwords, between client and server must perform the encryption the same way on
the client-end and the server-end. The differences between the two's complement
and signed-magnitude representation may result in different values when fragments
of data are extracted, encrypted, and reinserted into the data.

To obtain matching results, you can define macros that return arithmetic results in
two's complement form. The following example illustrates macros for two's
complement addition and subtraction:

#define tc_add(arg1, arg2) (((arg1) + (arg2)) & 0xFF)
#define tc_sub(arg1, arg2) (((arg1) + (0x100 - (arg2))) & 0xFF)

The macro errno is defined as an array indexed by the unique stack number of the
execution process when $SET CONCURRENTEXECUTION or
$SHARING=SHAREDBYALL. As a result, each client of a C library has a unique errno
location. [ed. Note stack number in this context is equivalent to task/process id]

Although each errno location starts at zero, the location is not reset to zero when a
client delinks from a library or when another client links with the same stack number.
In order to avoid this issue, the C library should explicitly set errno to zero before
calling a function that sets errno.

Note: The macro errno cannot be used as a declarator when
$CONCURRENTEXECUTION or $SHARING=SHAREDBYALL and any of the standard
heads have been included.

Bonita Montero

unread,

Oct 8, 2021, 2:24:04 PM10/8/21

to

Am 07.10.2021 um 10:23 schrieb David Brown:

> The 68k ISA lives on in the microcontroller world, as the ColdFire
> processors, though I don't believe any new ColdFire microcontrollers
> have been released since NXP bought Freescale in 2015. Even at that
> stage it was no longer a major development line for Freescale. A
> decade or so before that, ColdFire was one of the most popular

> architectures in small network devices such as SOHO NAT routers, ...

MIPS was dominant in routers before it was replaced with ARM.
68K had a straighter instruction-set than x86, but the two times
indirect addressing-modes introduced with the 68020 were totally
brain-damaged.

David Brown

unread,

Oct 9, 2021, 4:23:46 AM10/9/21

to

On 08/10/2021 20:23, Bonita Montero wrote:
> Am 07.10.2021 um 10:23 schrieb David Brown:
>
>> The 68k ISA lives on in the microcontroller world, as the ColdFire
>> processors, though I don't believe any new ColdFire microcontrollers
>> have been released since NXP bought Freescale in 2015. Even at that
>> stage it was no longer a major development line for Freescale. A
>> decade or so before that, ColdFire was one of the most popular
>> architectures in small network devices such as SOHO NAT routers, ...
>
> MIPS was dominant in routers before it was replaced with ARM.

MIPS was dominant in high-end routers and fast switches (with PowerPC
being the main competitor). ColdFire had a substantial share of the
small device market for quite a while, as ucLinux became a popular
choice for the system, though MIPS was also used there (mostly with
proprietary OSes). Later, MIPS was also used with Linux in such
routers, and as you say, ARM is now the most common choice (though MIPS
is still used in many low-end and high-end network devices, and PowerPC
is still found in some big systems).

> 68K had a straighter instruction-set than x86, but the two times
> indirect addressing-modes introduced with the 68020 were totally
> brain-damaged.

68k was a better ISA than x86 in almost every way imaginable. But it
did get a few overly complicated addressing modes - some of these were
dropped in later 68k devices. And the /implementation/ in the 68k
family was not as good - Motorola didn't have as many smart people and
as big budgets as Intel or even AMD. On the 68030, IIRC, someone
discovered that a software division routine worked faster than the
hardware division instruction.

The ColdFire was a clean slate re-implementation of a simplified and
modernised 68k ISA. It dropped the more advanced addressing modes and
some of the rarely used complex instructions, but added a few new useful
ones and streamlined the primary ones. It was marketed as a "variable
instruction length RISC architecture".

When you look back at the original 68000 compared to the 8086, it is
clear that technically the 68000 ISA was a modern and forward-looking
architecture while the 8086 was outdated and old-fashioned before the
first samples were made. The story of the IBM PC shows how technical
brilliance is not enough to succeed - the worst cpu architecture around,
combined with the worst OS ever hacked together, ended up dominant.

Bonita Montero

unread,

Oct 9, 2021, 4:50:50 AM10/9/21

to

Am 09.10.2021 um 10:23 schrieb David Brown:

>> MIPS was dominant in routers before it was replaced with ARM.

> MIPS was dominant in high-end routers and fast switches (with PowerPC

> being the main competitor). ...

I don't know about former high-end routers, but MIPS was by far
the most dominant architecture on SOHO-Routers before ARM. 68k
played almost no role then. F.e. in Germany almost anyone uses
the Fritz!Box routers and they all were MIPS-based before AVM
switched to ARM.

Branimir Maksimovic

unread,

Oct 9, 2021, 5:02:47 AM10/9/21

to

Well I have router with 1GB of RAM and 4 cores :P
running as server :P

--

7-77-777
Evil Sinner!
with software, you repeat same experiment, expecting different results...

Bart

unread,

Oct 9, 2021, 7:16:30 AM10/9/21

to

On 09/10/2021 09:23, David Brown wrote:
> On 08/10/2021 20:23, Bonita Montero wrote:

>> 68K had a straighter instruction-set than x86, but the two times
>> indirect addressing-modes introduced with the 68020 were totally
>> brain-damaged.
>
> 68k was a better ISA than x86 in almost every way imaginable. But it
> did get a few overly complicated addressing modes - some of these were
> dropped in later 68k devices. And the /implementation/ in the 68k
> family was not as good - Motorola didn't have as many smart people and
> as big budgets as Intel or even AMD. On the 68030, IIRC, someone
> discovered that a software division routine worked faster than the
> hardware division instruction.
>

> When you look back at the original 68000 compared to the 8086, it is
> clear that technically the 68000 ISA was a modern and forward-looking
> architecture while the 8086 was outdated and old-fashioned before the
> first samples were made.

That was my initial impression when I first looked at 68000 in the 80s.

Until I had a closer look at the instruction set, with a view to
generating code for it from a compiler. Then it had almost as much lack
of orthogonality as the 8086.

The obvious one is in have two lots of integer registers, 8 Data
registers and 8 Address registers, instead of just 16 general registers,
so that you are constantly thinking about which register set your
operands and intermediate results should go in.

(I never got round to using the chip; my company had moved on to doing
stuff for the IBM PC, instead of developing own hardware products.)

> The story of the IBM PC shows how technical
> brilliance is not enough to succeed - the worst cpu architecture around,
> combined with the worst OS ever hacked together, ended up dominant.

I was looking forward to the Zilog Z80000, but unfortunately that never
happened.

David Brown

unread,

Oct 9, 2021, 9:10:46 AM10/9/21

to

On 09/10/2021 13:16, Bart wrote:
> On 09/10/2021 09:23, David Brown wrote:
>> On 08/10/2021 20:23, Bonita Montero wrote:
>
>>> 68K had a straighter instruction-set than x86, but the two times
>>> indirect addressing-modes introduced with the 68020 were totally
>>> brain-damaged.
>>
>> 68k was a better ISA than x86 in almost every way imaginable. But it
>> did get a few overly complicated addressing modes - some of these were
>> dropped in later 68k devices. And the /implementation/ in the 68k
>> family was not as good - Motorola didn't have as many smart people and
>> as big budgets as Intel or even AMD. On the 68030, IIRC, someone
>> discovered that a software division routine worked faster than the
>> hardware division instruction.
>>
>
>> When you look back at the original 68000 compared to the 8086, it is
>> clear that technically the 68000 ISA was a modern and forward-looking
>> architecture while the 8086 was outdated and old-fashioned before the
>> first samples were made.
>
> That was my initial impression when I first looked at 68000 in the 80s.
>
> Until I had a closer look at the instruction set, with a view to
> generating code for it from a compiler. Then it had almost as much lack
> of orthogonality as the 8086.
>

It is not as orthogonal as most RISC architectures, but a world ahead of
x86.

> The obvious one is in have two lots of integer registers, 8 Data
> registers and 8 Address registers, instead of just 16 general registers,
> so that you are constantly thinking about which register set your
> operands and intermediate results should go in.

Sure. But you have just two sorts of registers - A0..7 and D0..7. Some
common instructions can use either set, while ALU and address operations
tend to be limited to one set. On the 8086, every register - A, B, C,
D, SI, DI, BP, SP - has wildly different uses and instructions. (Later
x86 devices got more regular.)

Scott Lurndal

unread,

Oct 9, 2021, 10:57:25 AM10/9/21

to

There are still very large numbers of MIPS-based chips being produced (at
legacy nodes like 65, 45 and 22nm) today. Not new designs, mind,
but rather ten to fifteen year old designs.

This third generation ARM processor for routers (high-end) is now sampling:

https://semiaccurate.com/2021/06/28/marvell-announces-their-5nm-octeon-10-dpu/
https://www.theregister.com/2021/10/06/marvell_ai_chip/