different rounding behavior with float and double

Raymond Li

Jan 9, 2014, 4:46:51 AM

I have encountered a problem related to floating-point rounding. I
googled a lot and found plenty of clear and helpful information, e.g.

http://www.learncpp.com/cpp-tutorial/25-floating-point-numbers/
http://support.microsoft.com/kb/125056/en-hk

Although those pages explain the cause, I need a practical
way to solve a rounding problem. My program calculates a weighted
accumulation of 3.5. When the figure is rounded to the nearest integer, it
becomes 3 (but I want it to round up to 4). I understand this is because
3.5 is approximated as 3.49999...

I found a simple fix: using float instead of double. I list the
program below and hope someone can explain why using double
incurs the rounding problem while float does not. In the code below,
fun1() uses float and the calculation is 'correct'. fun2() uses
double, and the figure 3.5 is rounded to 3.

Raymond

//######################

#include <cmath>
#include <iostream>

using std::cout;
using std::endl;

int fun1();
int fun2();

int main(int argc, char ** argv)
{
    fun1();
    fun2();
    return 0;
}

// Accumulate the weighted items in float, then round the total.
int fun1()
{
    float weighted = 10.0;
    float average = 100.0;
    float z[] = {4.0, 4.0, 4.0, 4.0, 4.0, 3.0, 3.0, 3.0, 2.0, 4.0};

    float total = 0.0;
    for (int i = 0; i < 10; i++)
    {
        float item = z[i] * weighted / average;
        total = total + item;
        cout << i << " accumulate is " << total << endl;
        // NSLog(@"z[%d] is %f, total is %f", i, z[i], total);
    }

    float answer = std::round(total);
    // NSLog(@"rounded is %f", answer);
    cout << "rounded is " << answer << endl;
    return 0;
}

// Same calculation as fun1(), but in double.
int fun2()
{
    double weighted = 10.0;
    double average = 100.0;
    double z[] = {4.0, 4.0, 4.0, 4.0, 4.0, 3.0, 3.0, 3.0, 2.0, 4.0};

    double total = 0.0;
    for (int i = 0; i < 10; i++)
    {
        double item = z[i] * weighted / average;
        total = total + item;
        cout << i << " accumulate is " << total << endl;
        // NSLog(@"z[%d] is %f, total is %f", i, z[i], total);
    }

    double answer = std::round(total);
    // NSLog(@"rounded is %f", answer);
    cout << "rounded is " << answer << endl;
    return 0;
}

0 accumulate is 0.4
1 accumulate is 0.8
2 accumulate is 1.2
3 accumulate is 1.6
4 accumulate is 2
5 accumulate is 2.3
6 accumulate is 2.6
7 accumulate is 2.9
8 accumulate is 3.1
9 accumulate is 3.5
rounded is 4
***(above is the version using float, 3.5 is rounded as 4) ***

0 accumulate is 0.4
1 accumulate is 0.8
2 accumulate is 1.2
3 accumulate is 1.6
4 accumulate is 2
5 accumulate is 2.3
6 accumulate is 2.6
7 accumulate is 2.9
8 accumulate is 3.1
9 accumulate is 3.5
rounded is 3

***(this version uses double, 3.5 is rounded as 3) ***


Alf P. Steinbach

Jan 9, 2014, 6:48:56 AM

On 09.01.2014 10:46, Raymond Li wrote:
>
> My program calculates a weighted
> accumulation of 3.5. When the figure is rounded to the nearest integer, it
> becomes 3 (but I want it to round up to 4). I understand this is because
> 3.5 is approximated as 3.49999...

round(x) rounds to the /nearest/ integer value of the type.

When x is around 3.5, as in your case, very tiny differences can make 3
or 4 the nearest integer. These tiny differences can be, well,
different, for float and double types (the latter has double the
precision, roughly).
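
To put a number on those "tiny differences", here is a minimal sketch that measures the gap between 3.5 and the next representable value in each type. The helper name gap_above is made up for this illustration, and the figures quoted in the comments assume IEEE 754 float and double, which the language itself does not guarantee.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Distance from x to the next representable value above it (hypothetical helper).
inline float gap_above(float x)
{
    assert(std::isfinite(x));
    return std::nextafter(x, std::numeric_limits<float>::infinity()) - x;
}

inline double gap_above(double x)
{
    assert(std::isfinite(x));
    return std::nextafter(x, std::numeric_limits<double>::infinity()) - x;
}

// Assuming IEEE 754:
//   gap_above(3.5f) is 2^-22 (about 2.4e-7)
//   gap_above(3.5)  is 2^-51 (about 4.4e-16)
```

So an accumulated float error has to land on a much coarser grid than a double error, which is why the two types can end up on different sides of 3.5.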

If you want to round to 1 decimal fractional digit, then simply do this:

round( 10*x )/10

Or you can define that as a function:

#include <cmath>

auto rounded( double x, int n_decimals = 2 )
    -> double
{
    double const pow_of_10 = std::pow( 10.0, n_decimals );
    return std::round( pow_of_10*x )/pow_of_10;
}

Now regarding the choice of data type, I (strongly) suggest using
double, because that's the default and least troublesome in C++.

For example, a literal such as 3.14 is a double, and C-style variadic
functions such as printf receive floating-point arguments as double values.

Cheers & hth.,

- Alf


Richard Damon

Jan 9, 2014, 8:58:08 AM

On 1/9/14, 4:46 AM, Raymond Li wrote:
> I have encountered a problem related to floating-point rounding. I
> googled a lot and found plenty of clear and helpful information, e.g.
>
> http://www.learncpp.com/cpp-tutorial/25-floating-point-numbers/
> http://support.microsoft.com/kb/125056/en-hk
>
>
> Although those pages explain the cause, I need a practical
> way to solve a rounding problem. My program calculates a weighted
> accumulation of 3.5. When the figure is rounded to the nearest integer, it
> becomes 3 (but I want it to round up to 4). I understand this is because
> 3.5 is approximated as 3.49999...
>
> I found a simple fix: using float instead of double. I list the
> program below and hope someone can explain why using double
> incurs the rounding problem while float does not. In the code below,
> fun1() uses float and the calculation is 'correct'. fun2() uses
> double, and the figure 3.5 is rounded to 3.
>
> Raymond
>
>

This program is effectively calculating something like

.4 + .4 + .4 + .4 + .4 + .3 + .3 + .3 + .2 + .4

Note that none of these values can be exactly expressed as a float or
double, so we get numerous round-off errors.

If instead you computed it as

(4.0 + 4.0 + 4.0 + 4.0 + 4.0 + 3.0 + 3.0 + 3.0 + 2.0 + 4.0)/10.0

Then each number (having the value of a small integer) can be exactly
represented as a double, and the result will be exact (since the result
3.5 = 7/2 is an exact binary fraction, unlike 4/10)

You only notice this round-off error because your input numbers and
weighting factors are "nice" numbers, so you can mentally do the math
exactly. Since the answer is very near a discontinuity (the .5 break
point of round), the result is very sensitive to small errors in
calculation.

If the numbers were more realistically not so precisely "round" numbers,
you are much less apt to hit this sort of discontinuous behavior in a
noticeable way.

If the input numbers ARE expected to be this sort of nice number, then you
need to be very careful how you do your math to avoid creating "not
nice" numbers along the way, which creates round-off error. Not nice numbers
are numbers like .4, which aren't a reasonable integral multiple of a
power of 2 (note that 1 is a power of 2, being 2 to the 0th power, so
integral numbers are nice; decimal fractions are generally not nice,
unless they are things like .5, .25, .375, which are decimal equivalents
of simple binary fractions).
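
As a rough way to test that notion of "nice", the sketch below checks whether a double is a whole multiple of 2^-max_bits. The name is_nice and the default of 10 fractional bits are assumptions made for this illustration; "reasonable" has no fixed definition here.

```cpp
#include <cassert>
#include <cmath>

// True when x is an integral multiple of 2^-max_bits, i.e. it needs at most
// max_bits binary digits after the point (hypothetical helper).
inline bool is_nice(double x, int max_bits = 10)
{
    assert(max_bits >= 0 && max_bits <= 40);
    // ldexp multiplies by a power of two, which does not disturb the digits of x.
    const double scaled = std::ldexp(x, max_bits);
    return scaled == std::floor(scaled);
}
```

With this definition 4.0, 0.5, 0.25 and 0.375 pass, while the stored values of 0.4, 0.3 and 0.2 fail, because the nearest double to each of those is a long binary fraction.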

When you have "nice" numbers, you can add them, subtract them, or
multiply them (within reason) and stay "nice". Dividing will likely
break nice, unless you are dividing by a power of 2, so algorithms that
maximize accuracy for "nice" numbers need to avoid dividing as much as
possible, and do it as late as possible.

Wouter van Ooijen

Jan 9, 2014, 9:14:14 AM

> This program is effectively calculating something like
>
> .4 + .4 + .4 + .4 + .4 + .3 + .3 + .3 + .2 + .4
>
> Note that none of these values can be exactly expressed as a float or
> double, so we get numerous round-off errors.
>
> If instead you computed it as
>
> (4.0 + 4.0 + 4.0 + 4.0 + 4.0 + 3.0 + 3.0 + 3.0 + 2.0 + 4.0)/10.0
>
> Then each number (having the value of a small integer) can be exactly
> represented as a double, and the result will be exact (since the result
> 3.5 = 7/2 is an exact binary fraction, unlike 4/10)

If this is indeed about such a set of numbers the obvious solution would
be to scale all values up to integers.
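
A minimal sketch of that idea, assuming the scores and weights really are whole numbers as in the posted program. The function name weighted_round is made up here; weight and weight_total stand in for the program's `weighted` and `average`.

```cpp
#include <cassert>
#include <cstddef>

// Weighted sum of integer scores, rounded to the nearest integer with halves
// rounded up, using integer arithmetic only (hypothetical helper).
inline long weighted_round(const int* z, std::size_t n, long weight, long weight_total)
{
    assert(weight_total > 0);
    long sum = 0;
    for (std::size_t i = 0; i < n; ++i)
        sum += z[i] * weight;          // exact: no division has happened yet
    assert(sum >= 0);                  // this sketch only handles non-negative sums
    // Adding half the divisor before truncating gives round-half-up.
    return (2 * sum + weight_total) / (2 * weight_total);
}
```

For the original data the sum is 350, so the result is (700 + 100) / 200 = 4 with no floating point involved at any step.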

Wouter

Stuart

Jan 9, 2014, 2:11:28 PM

IMHO, the obvious solution would be to use a programming language that
takes care of such messy conversions, like Ada95. Ada lets you define
fixed-point number types so that the actual computation is done
through integer arithmetic. Ada has a real type system even for
primitive types like int and float, not the C typedef crap. I wish that
this existed for C++, too, but I'm afraid that such a feature will
never make it into the C++ standard (*sigh*).

Regards,
Stuart

Wouter van Ooijen

Jan 9, 2014, 4:54:45 PM

> IMHO, the obvious solution would be to use a programming language that
> takes care of such messy conversions, like Ada95. Ada lets you define
> fixed-point number types so that the actual computation is done
> through integer arithmetic. Ada has a real type system even for
> primitive types like int and float, not the C typedef crap. I wish that
> this existed for C++, too, but I'm afraid that such a feature will
> never make it into the C++ standard (*sigh*).

It doesn't need to, C++ allows you to define such types yourself.
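
For illustration only, a very small user-defined fixed-point type could look like the sketch below. The class name Hundredths and its interface are invented for this example; a production type would need more operators, overflow checks and negative-value handling.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical fixed-point type: the value is stored as a whole number of 0.01 units.
class Hundredths {
public:
    explicit Hundredths(std::int64_t units = 0) : units_(units) {}

    Hundredths& operator+=(Hundredths other) { units_ += other.units_; return *this; }
    friend Hundredths operator+(Hundredths a, Hundredths b) { a += b; return a; }
    friend bool operator==(Hundredths a, Hundredths b) { return a.units_ == b.units_; }

    // Nearest whole number, halves rounded up (non-negative values only in this sketch).
    std::int64_t rounded() const
    {
        assert(units_ >= 0);
        return (units_ + 50) / 100;
    }

    std::int64_t raw() const { return units_; }

private:
    std::int64_t units_;
};
```

Adding Hundredths(40) five times, Hundredths(30) three times, Hundredths(20) once and Hundredths(40) once gives a raw value of 350, and rounded() returns 4.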

Wouter

Raymond Li

Jan 9, 2014, 10:51:54 PM

Thanks for your replies. I hope to stick with double too. But the users
have implemented the logic in a legacy system, and I need to convince them
if I do something different from them. They claim that the interim
calculations (z[i]*weighted/average) are used too, and they would feel
uncomfortable if I made any adjustment. The worst problem I face is
that they claim the legacy system (which is not really legacy; it
is running Oracle PL/SQL) does not have the rounding error.

So I investigated and found it weird that the rounding problem could be
avoided by using float. I am uncomfortable with this workaround (using
float), as I am afraid the rounding issue would recur in other
scenarios. So I would really like someone to explain why the
float datatype rounds correctly in the above case, while
double rounds 'incorrectly'.

If I were free to rewrite the code, after learning from you, I would
rewrite it as follows. The problem is that I have to convince my
users, and their legacy system is already implemented.

Regards,
Raymond

#include <cmath>
#include <iomanip>
#include <iostream>

using std::cout;
using std::endl;
using std::setprecision;

int fun3();

int main(int argc, const char * argv[])
{
    fun3();
    return 0;
}

int fun3()
{
    cout << setprecision(17);

    double weighted = 10.0;
    double average = 100.0;
    double z[] = {4.0, 4.0, 4.0, 4.0, 4.0, 3.0, 3.0, 3.0, 2.0, 4.0};

    double total = 0.0;
    for (int i = 0; i < 10; i++)
    {
        //double item=z[i]*weighted/average;
        double item = z[i] * weighted; // defer the division
        total = total + item;
        cout << "in loop " << i << ", accumulate is " << total << endl;
        // NSLog(@"z[%d] is %f, total is %f", i, z[i], total);
    }

    total = total / average; // division done last to avoid accumulating round-off error

    double answer = std::round(total);
    cout << "rounded is " << answer << " and original is " << total << endl;
    return 0;
}

output:
in loop 0, accumulate is 40
in loop 1, accumulate is 80
in loop 2, accumulate is 120
in loop 3, accumulate is 160
in loop 4, accumulate is 200
in loop 5, accumulate is 230
in loop 6, accumulate is 260
in loop 7, accumulate is 290
in loop 8, accumulate is 310
in loop 9, accumulate is 350
rounded is 4 and original is 3.5

Richard Damon

Jan 9, 2014, 11:39:11 PM

On 1/9/14, 10:51 PM, Raymond Li wrote:

> Thanks for your replies. I hope to stick with double too. But the users
> have implemented the logic in a legacy system, and I need to convince them
> if I do something different from them. They claim that the interim
> calculations (z[i]*weighted/average) are used too, and they would feel
> uncomfortable if I made any adjustment. The worst problem I face is
> that they claim the legacy system (which is not really legacy; it
> is running Oracle PL/SQL) does not have the rounding error.
>
> So I investigated and found it weird that the rounding problem could be
> avoided by using float. I am uncomfortable with this workaround (using
> float), as I am afraid the rounding issue would recur in other
> scenarios. So I would really like someone to explain why the
> float datatype rounds correctly in the above case, while
> double rounds 'incorrectly'.
>
> If I were free to rewrite the code, after learning from you, I would
> rewrite it as follows. The problem is that I have to convince my
> users, and their legacy system is already implemented.
>
> Regards,
> Raymond
>

Float eliminates the error IN THIS EXAMPLE, not in general. It is probably
a matter of which terms round down and which round up after the divide by
100.0. There will be other example number sets where double is right but
float is wrong.

Unless you run a lot of tests with a number of different cases, which
similarly stress the accuracy, it would be impossible to say "does not
have the rounding problem"; it just shows that it doesn't have it for
THAT set of numbers. The one case where I would believe it really does not
have a rounding problem would be a system that did the math in
decimal instead of binary. Those systems, while they do have round-off
errors, have them where decimal-thinking people expect them, so they
don't complain.

A question on operation: in "real" use, are the numbers going to be this
nice? If they are, then you may be able to adjust the rounding so that
0.499 rounds up instead of down. If they aren't, then you are going to run
into the fact that this is a pathological one-in-a-million case of the
sum just hitting the break point of the round function, and the round-off
error makes you cross over the line.
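
One way to sketch that adjusted rounding is to nudge the value up by a small tolerance before rounding. The function name round_with_tolerance and the default of 1e-9 are assumptions for this example; the right tolerance depends on how much error the real calculation can accumulate, and too large a value will round up numbers that genuinely belong below the break point.

```cpp
#include <cassert>
#include <cmath>

// Round to the nearest integer, treating values that fall within `tolerance`
// below a .5 break point as if they were on it (hypothetical helper).
// Halves round toward positive infinity here, which differs from std::round
// for negative inputs.
inline double round_with_tolerance(double x, double tolerance = 1e-9)
{
    assert(tolerance >= 0.0 && tolerance < 0.5);
    return std::floor(x + 0.5 + tolerance);
}
```

Applied to the double total from the posted program (about 3.4999999999999996), this returns 4, while 3.499 and 3.4 still round to 3.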

Robert Wessel

Jan 10, 2014, 12:39:05 AM

On Fri, 10 Jan 2014 11:51:54 +0800, Raymond Li <fai...@gmail.com>
wrote:

You shouldn't depend on that; it's just a coincidence of how the
rounding error happened to accumulate.

I've modified your program to display a bit more precision (attached
below). With the better display of precision, you can see the
round-off errors accumulating differently:

(float) 0 item is 0.40000000596046448 accumulate is 0.40000000596046448
(float) 1 item is 0.40000000596046448 accumulate is 0.80000001192092896
(float) 2 item is 0.40000000596046448 accumulate is 1.2000000476837158
(float) 3 item is 0.40000000596046448 accumulate is 1.6000000238418579
(float) 4 item is 0.40000000596046448 accumulate is 2
(float) 5 item is 0.30000001192092896 accumulate is 2.2999999523162842
(float) 6 item is 0.30000001192092896 accumulate is 2.5999999046325684
(float) 7 item is 0.30000001192092896 accumulate is 2.8999998569488525
(float) 8 item is 0.20000000298023224 accumulate is 3.0999999046325684
(float) 9 item is 0.40000000596046448 accumulate is 3.5
rounded is 4
(double) 0 item is 0.40000000000000002 accumulate is 0.40000000000000002
(double) 1 item is 0.40000000000000002 accumulate is 0.80000000000000004
(double) 2 item is 0.40000000000000002 accumulate is 1.2000000000000002
(double) 3 item is 0.40000000000000002 accumulate is 1.6000000000000001
(double) 4 item is 0.40000000000000002 accumulate is 2
(double) 5 item is 0.29999999999999999 accumulate is 2.2999999999999998
(double) 6 item is 0.29999999999999999 accumulate is 2.5999999999999996
(double) 7 item is 0.29999999999999999 accumulate is 2.8999999999999995
(double) 8 item is 0.20000000000000001 accumulate is 3.0999999999999996
(double) 9 item is 0.40000000000000002 accumulate is 3.4999999999999996
rounded is 3

But as I said, you can't depend on that. For example, changing the
series to:

{
9.0,
9.0,
9.0,
8.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0
};

Will cause the float version to round to 3 as well:

(float) 0 item is 0.89999997615814209 accumulate is 0.89999997615814209
(float) 1 item is 0.89999997615814209 accumulate is 1.7999999523162842
(float) 2 item is 0.89999997615814209 accumulate is 2.6999998092651367
(float) 3 item is 0.80000001192092896 accumulate is 3.4999997615814209
(float) 4 item is 0 accumulate is 3.4999997615814209
(float) 5 item is 0 accumulate is 3.4999997615814209
(float) 6 item is 0 accumulate is 3.4999997615814209
(float) 7 item is 0 accumulate is 3.4999997615814209
(float) 8 item is 0 accumulate is 3.4999997615814209
(float) 9 item is 0 accumulate is 3.4999997615814209
rounded is 3

IOW, this will vary with the exact set of inputs. So don't do that.

Even worse, you can get this to wander around based on whether or not
you tell the compiler to produce strict IEEE compliant math, and the
requested optimization level (on x86 machines you often see
intermediate results with a higher precision than you'd expect if the
code is using the x87 FPU).

Changing the values (as has been suggested) so that all of them have
exact representations can reduce the roundoff error, but cannot
eliminate it (you'll still get roundoff error on that final division,
even if you get none on the individual terms). OTOH, you probably
will get away with this so long as the only case you care about is
"xxx5.0 / 10.0", since that will have an exact result (.5 being
exactly representable in a binary FP number). This is obviously
fragile, and will go to pot the first time someone tosses in a number
with more than a single decimal place.

If this is important, I'd generally advise avoiding floating point
entirely, and use a package that allows you to do this all with scaled
integers or rationals. I'm not sure what the PL/SQL code is doing,
but it may be using a scaled type, or if it's using a floating type,
they might just be hitting one set of rounding errors that happens to
generate the expected result. And that may just be the worst scenario
- trying to duplicate the existing behavior when the existing behavior
is not what anyone is actually expecting.
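
To show what the rational option might look like without pulling in a library, here is a minimal hand-rolled sketch. The Rational struct and the helpers make_rational, add and round_half_up are invented for this example; they ignore overflow and only round non-negative values.

```cpp
#include <cassert>

// Hypothetical minimal rational number, kept in lowest terms with den > 0.
struct Rational {
    long long num;
    long long den;
};

inline long long gcd_ll(long long a, long long b)
{
    if (a < 0) a = -a;
    while (b != 0) { const long long t = a % b; a = b; b = t; }
    return a;
}

inline Rational make_rational(long long num, long long den)
{
    assert(den > 0);
    const long long g = gcd_ll(num, den);   // g >= 1 because den > 0
    return Rational{num / g, den / g};
}

inline Rational add(Rational a, Rational b)
{
    return make_rational(a.num * b.den + b.num * a.den, a.den * b.den);
}

// Nearest integer, halves rounded up; non-negative values only in this sketch.
inline long long round_half_up(Rational r)
{
    assert(r.num >= 0);
    return (2 * r.num + r.den) / (2 * r.den);
}
```

Summing z[i]*10/100 for the original series this way keeps every interim term exact (2/5, 3/10, 1/5) and ends at exactly 7/2, which rounds to 4.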

/* ----- */

#include <cmath>
#include <iostream>
#include <iomanip>

using std::cout;
using std::endl;

int fun1();
int fun2();

// Local round-half-up helper for compilers whose <cmath> lacks round().
// It has its own name so it cannot clash with the standard ::round.
inline int round_half_up(double x) { return static_cast<int>(std::floor(x + 0.5)); }

int main(int argc, char ** argv)
{
    fun1();
    fun2();
    return 0;
}

int fun1()
{
    float weighted = 10.0;
    float average = 100.0;
    float z[] = {4.0, 4.0, 4.0, 4.0, 4.0, 3.0, 3.0, 3.0, 2.0, 4.0};

    float total = 0.0;
    for (int i = 0; i < 10; i++)
    {
        float item = z[i] * weighted / average;
        total = total + item;

        // Print far more digits than the type holds so the stored values are visible.
        cout << "(float) " << i << " item is " << std::setprecision(20) << item
             << " accumulate is " << std::setprecision(20) << total << endl;
    }

    float answer = round_half_up(total);
    cout << "rounded is " << answer << endl;
    return 0;
}

int fun2()
{
    double weighted = 10.0;
    double average = 100.0;
    double z[] = {4.0, 4.0, 4.0, 4.0, 4.0, 3.0, 3.0, 3.0, 2.0, 4.0};

    double total = 0.0;
    for (int i = 0; i < 10; i++)
    {
        double item = z[i] * weighted / average;
        total = total + item;

        cout << "(double) " << i << " item is " << std::setprecision(20) << item
             << " accumulate is " << std::setprecision(20) << total << endl;
    }

    double answer = round_half_up(total);
    cout << "rounded is " << answer << endl;
    return 0;
}

Raymond Li

Jan 10, 2014, 3:19:50 AM

Thanks again for your replies! Richard and Robert explained further!
Thanks! Thanks! Thanks!

The numbers are always 'nice' because there are constraints, e.g. the
weights always add up to 100. Besides, the data values are always
between 1 and 5. Therefore, using float could indeed be 'safe' in our
cases. But I am just 'unhappy' to use float instead of double.

Unfortunately, I will stick to the legacy system's implementation (where
I actually have no access to the code) and have to do the division in
the interim.

A colleague of mine suggested using NSNumberFormatter to do the
rounding. His demonstration seems to have solved our particular case, and
the code can still use double as the datatype. It would be platform
dependent, at least. And I have to spend time studying the details and
checking for potential problems. But I could only object to the users if I
can find a case where the legacy system calculates wrongly.
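
For comparison, a platform-independent sketch in the same spirit is to snap the total to a fixed number of decimal places before the final round, keeping double and the interim divisions. This is not a reproduction of what NSNumberFormatter does; the function name round_via_decimals and the default of 9 decimals are assumptions for this example.

```cpp
#include <cassert>
#include <cmath>

// Snap x to `decimals` decimal places, then round to the nearest integer
// (hypothetical helper). The snap absorbs round-off far smaller than
// 10^-decimals, so a total just under 3.5 becomes 3.5 before the final round.
inline double round_via_decimals(double x, int decimals = 9)
{
    assert(decimals >= 0 && decimals <= 15);
    double scale = 1.0;
    for (int i = 0; i < decimals; ++i)
        scale *= 10.0;                       // small powers of ten are exact in double
    const double snapped = std::round(x * scale) / scale;
    return std::round(snapped);
}
```

With the double total from fun2() this gives 4, and ordinary values such as 3.4 or 3.6 round as before. It still needs the same testing against the legacy system's results as any other workaround.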

Raymond
