Portable IEEE754 write routine

Malcolm McLean

unread,

Feb 17, 2010, 12:04:16 PM2/17/10

to

Here it is. Many thanks to all who helped, especially Michael on whose
contribution the code is mainly based.

The idea is to write a float in binary IEEE format, reagrdless of the
native representation in the machine you are on.

I've only access to a Windows PC at the moment, so I'd be grateful if
people could test it on other architectures, especially non-IEEE
native floating point units. Particularly I'm not sure about the
denormalised numbers - my printf seems to think they are identical to
zero so I'm not sure if I'm dropping precision.

int fwriteieee754(double x, FILE *fp, int bigendian)
{
int shift;
unsigned long sign, exp, hibits, hilong, lowlong;
double fnorm, significand;
int expbits = 11;
int significandbits = 52;

/* zero (can't handle signed zero) */
if(x == 0)
{
hilong = 0;
lowlong = 0;
goto writedata;
}
/* infinity */
if(x > DBL_MAX)
{
hilong = 1024 + ((1<<(expbits-1)) - 1);
hilong <<= (31 - expbits);
lowlong = 0;
goto writedata;
}
/* -infinity */
if(x < -DBL_MAX)
{
hilong = 1024 + ((1<<(expbits-1)) - 1);
hilong <<= (31-expbits);
hilong |= (1 << 31);
lowlong = 0;
goto writedata;
}
/* NaN - dodgy because many compilers optimise out this test, but
*there is no portable isnan() */
if(x != x)
{
hilong = 1024 + ((1<<(expbits-1)) - 1);
hilong <<= (31 - expbits);
lowlong = 1234;
goto writedata;
}

/* get the sign */
if(x < 0) {sign = 1; fnorm = -x;}
else {sign = 0; fnorm = x;}

/* get the normalized form of f and track the exponent */
shift = 0;
while(fnorm >= 2.0) { fnorm /= 2.0; shift++; }
while(fnorm < 1.0) { fnorm *= 2.0; shift--; }

/* check for denormalized numbers */
if(shift < -1022)
{
while(shift <-1022) {fnorm /= 2.0; shift++;}
shift = -1023;
}
else
fnorm = fnorm - 1.0; /* take the significant bit off mantissa */

/* calculate the integer form of the significand */
/* hold it in a double for now */

significand = fnorm * ((1LL<<significandbits) + 0.5f);

/* get the biased exponent */
exp = shift + ((1<<(expbits-1)) - 1); /* shift + bias */

/* put the data into tweo longs (for convenience) */
hibits = (long) (significand / 4294967296);
hilong = (sign << 31) | (exp << (31-expbits) ) | hibits;
lowlong = (long) (significand - hibits * 4294967296);

writedata:
/* write the bytes out to the stream */
if(bigendian)
{
fputc( (hilong >> 24) & 0xFF, fp);
fputc( (hilong >> 16) & 0xFF, fp);
fputc( (hilong >> 8) & 0xFF, fp);
fputc( hilong & 0xFF, fp);

fputc( (lowlong >> 24) & 0xFF, fp);
fputc( (lowlong >> 16) & 0xFF, fp);
fputc( (lowlong >> 8) & 0xFF, fp);
fputc( lowlong & 0xFF, fp);
}
else
{
fputc( lowlong & 0xFF, fp);
fputc( (lowlong >> 8) & 0xFF, fp);
fputc( (lowlong >> 16) & 0xFF, fp);
fputc( (lowlong >> 24) & 0xFF, fp);

fputc( hilong & 0xFF, fp);
fputc( (hilong >> 8) & 0xFF, fp);
fputc( (hilong >> 16) & 0xFF, fp);
fputc( (hilong >> 24) & 0xFF, fp);
}
return ferror(fp);
}

Barry Schwarz

unread,

Feb 17, 2010, 5:38:20 PM2/17/10

to

Why did you make the 0.5 a float instead of letting it default to
double. Now the integer expression must be converted to float and
then the sum to double to participate in the multiplication. Without
the f suffix, the integer expression would be converted to double
immediately and no further conversion would be needed.

--
Remove del for email

Alan Curry

unread,

Feb 17, 2010, 6:00:31 PM2/17/10

to

|
|The idea is to write a float in binary IEEE format, reagrdless of the
|native representation in the machine you are on.
|
|I've only access to a Windows PC at the moment, so I'd be grateful if
|people could test it on other architectures, especially non-IEEE

I don't have anything that exotic, but on boring old big endian PowerPC
there seems to be some trouble. I ran this test:

#include <stdio.h>
#include <float.h>

|
|int fwriteieee754(double x, FILE *fp, int bigendian)
|{

[...your code unchanged...]
|}

int main(void)
{
FILE *f;
double x;
unsigned char *p;

x = 2.00000095367431640625; /* 0x1.000008p+1 */

printf("Before:\n");
printf(" %a\n", x);
printf(" %.60f\n", x);
for(p=(unsigned char *)&x;p<(unsigned char *)(&x+1);++p)
printf(" %02X", *p);
puts("");

f=fopen("fnord", "w+");
if(!f)
return 1;
fwriteieee754(x, f, 1);
rewind(f);
if(!fread(&x, sizeof x, 1, f))
return 1;
remove("fnord");

printf("After:\n");
printf(" %a\n", x);
printf(" %.60f\n", x);
for(p=(unsigned char *)&x;p<(unsigned char *)(&x+1);++p)
printf(" %02X", *p);
puts("");

return 0;
}

And the output was:

Before:
0x1.000008p+1
2.000000953674316406250000000000000000000000000000000000000000
40 00 00 00 80 00 00 00
After:
0x1.000007fffffffp+1
2.000000953674315962160790149937383830547332763671875000000000
40 00 00 00 7F FF FF FF

I was thrown off by the "bigendian" parameter. At first I thought it
meant "native byte order is bigendian, so swaps must be performed on the
output to make it acceptable to the target little endian machine", but
it turned out to mean the opposite(?) Be clear on that when you write
the documentation.

--
Alan Curry

Malcolm McLean

unread,

Feb 18, 2010, 2:03:10 AM2/18/10

to

On Feb 18, 1:00 am, pac...@kosh.dhis.org (Alan Curry) wrote:
> |
>
> And the output was:
>
> Before:
> 0x1.000008p+1
> 2.000000953674316406250000000000000000000000000000000000000000
> 40 00 00 00 80 00 00 00
> After:
> 0x1.000007fffffffp+1
> 2.000000953674315962160790149937383830547332763671875000000000
> 40 00 00 00 7F FF FF FF
>

Thanks for running the test.
It's losing a bit. If native format is not IEEE754 (but something very
close) the loss of a bit or two is inevitable. However if the mantissa
is exactly the same size, it would be nice to preserve the exact
pattern.

Alan Curry

unread,

Feb 18, 2010, 7:50:58 AM2/18/10

to

In article <21e31c8c-eb8b-47ea...@j31g2000yqa.googlegroups.com>,

Malcolm McLean <malcolm...@btinternet.com> wrote:
|
|Thanks for running the test.
|It's losing a bit. If native format is not IEEE754 (but something very
|close) the loss of a bit or two is inevitable. However if the mantissa
|is exactly the same size, it would be nice to preserve the exact
|pattern.

PowerPC not IEEE754? News to me...

It's not just that bit though. It seems to be losing the entire lower 32
bits, replacing them with 7fffffff, whenever the high bit of those 32 bits is
1 in the input. Otherwise, the output exactly matches the input.

Here's a more complete list of values

Before After
0x1.00000feedfacep+1 0x1.000007fffffffp+1
0x1.0000f0eedfacep+1 0x1.0000f0eedfacep+1
0x1.0000000000000p+1 0x1.0000000000000p+1
0x1.0000000000001p+1 0x1.0000000000001p+1
0x1.0000000000010p+1 0x1.0000000000010p+1
0x1.0000000000100p+1 0x1.0000000000100p+1
0x1.0000000001000p+1 0x1.0000000001000p+1
0x1.0000000010000p+1 0x1.0000000010000p+1
0x1.0000000100000p+1 0x1.0000000100000p+1
0x1.0000001000000p+1 0x1.0000001000000p+1
0x1.0000002000000p+1 0x1.0000002000000p+1
0x1.0000004000000p+1 0x1.0000004000000p+1
0x1.0000008000000p+1 0x1.0000008000000p+1
0x1.0000010000000p+1 0x1.0000010000000p+1
0x1.000007fffffffp+1 0x1.000007fffffffp+1
0x1.0000080000000p+1 0x1.000007fffffffp+1
0x1.0000080000001p+1 0x1.000007fffffffp+1
0x1.0000080000010p+1 0x1.000007fffffffp+1
0x1.0000080000100p+1 0x1.000007fffffffp+1
0x1.0000080001000p+1 0x1.000007fffffffp+1
0x1.0000080010000p+1 0x1.000007fffffffp+1
0x1.0000080100000p+1 0x1.000007fffffffp+1
0x1.0000081000000p+1 0x1.000007fffffffp+1
0x1.0000082000000p+1 0x1.000007fffffffp+1
0x1.0000084000000p+1 0x1.000007fffffffp+1
0x1.0000088000000p+1 0x1.000007fffffffp+1
0x1.0000090000000p+1 0x1.000007fffffffp+1
0x1.00000ffffffffp+1 0x1.000007fffffffp+1

int main(void)
{
FILE *f;
double x;

double *p;
double testdoubles[] =
{
0x1.00000feedfacep+1, 0x1.0000f0eedfacep+1, 0x1.0000000000000p+1,
0x1.0000000000001p+1, 0x1.0000000000010p+1, 0x1.0000000000100p+1,
0x1.0000000001000p+1, 0x1.0000000010000p+1, 0x1.0000000100000p+1,
0x1.0000001000000p+1, 0x1.0000002000000p+1, 0x1.0000004000000p+1,
0x1.0000008000000p+1, 0x1.0000010000000p+1, 0x1.000007fffffffp+1,
0x1.0000080000000p+1, 0x1.0000080000001p+1, 0x1.0000080000010p+1,
0x1.0000080000100p+1, 0x1.0000080001000p+1, 0x1.0000080010000p+1,
0x1.0000080100000p+1, 0x1.0000081000000p+1, 0x1.0000082000000p+1,
0x1.0000084000000p+1, 0x1.0000088000000p+1, 0x1.0000090000000p+1,
0x1.00000ffffffffp+1, 0
};

printf("%-30s%-30s\n", "Before", "After");
for(p=testdoubles;*p;++p) {
x = *p;

printf("%-30.13a", x);

f=fopen("fnord", "w+");
if(!f)
return 1;
fwriteieee754(x, f, 1);
rewind(f);
if(!fread(&x, sizeof x, 1, f))
return 1;
remove("fnord");

printf("%-30.13a\n", x);
}

return 0;
}
--
Alan Curry

Malcolm McLean

unread,

Feb 18, 2010, 8:59:07 AM2/18/10

to

On Feb 18, 2:50 pm, pac...@kosh.dhis.org (Alan Curry) wrote:
> In article <21e31c8c-eb8b-47ea-8182-89ffb3ded...@j31g2000yqa.googlegroups.com>,

This

lowlong = (long) (significand - hibits * 4294967296);

should be
lowlong = (unsigned long) (significand hibits * 4294967296).

Well tested.

Alan Curry

unread,

Feb 18, 2010, 6:55:15 PM2/18/10

to

In article <6cd8a898-de8b-47e2...@15g2000yqa.googlegroups.com>,

After that fix, I ran a few million random bit patterns through, and the
only ones that didn't make it through identical were the NaNs.

Then I modified your function to replace all the doubles with long doubles,
to see what would happen if it ran on a system where the native double was
more precise than the IEEE double. The worst error was off-by-one in the
low bit.

Even that went away when I made sure that all the arithmetic was being done
with the larger type, like it would be if your original code was compiled
someplace where "double" was the larger type.

--
Alan Curry

Ben Pfaff

unread,

Feb 18, 2010, 7:30:03 PM2/18/10

to

pac...@kosh.dhis.org (Alan Curry) writes:

> In article <6cd8a898-de8b-47e2...@15g2000yqa.googlegroups.com>,
> Malcolm McLean <malcolm...@btinternet.com> wrote:
> |
> |This
> | lowlong = (long) (significand - hibits * 4294967296);
> |should be
> | lowlong = (unsigned long) (significand hibits * 4294967296).
> |
>
> After that fix, I ran a few million random bit patterns through, and the
> only ones that didn't make it through identical were the NaNs.

For what it's worth, it is usually pretty easy, and not too slow,
to test floating-point routines for all 2**32 possible "float"
values, to raise confidence even further.
--
Ben Pfaff
http://benpfaff.org