Go bug when handling large floats?

zhang...@gmail.com

Jun 24, 2016, 8:57:51 AM

to golang-dev

package main

import (
	"fmt"
	"math"
)

func main() {
	var f64 float64
	f64 = float64(math.MaxFloat32) + float64(10.2)
	if f64 > math.MaxFloat32 {
		fmt.Println("yes")
	} else {
		fmt.Println("no")
	}
}

Expected result: yes

Actual result: no

Tested with Go 1.6 and Go 1.4 on Linux; both behave the same.

Ian Lance Taylor

Jun 24, 2016, 10:04:57 AM

to zhang...@gmail.com, golang-dev

This is not a bug in Go.

https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html

Ian

Michael Jones

Jun 24, 2016, 11:32:29 AM

to Ian Lance Taylor, zhang...@gmail.com, golang-dev

The Goldberg paper is good, but a little overwhelming for this specific confusion.

Zhangsilly, as part of adding floating point values, the fractional parts must be shifted so that the exponent parts will be aligned, like this (examples are decimal, but the same in binary):

before: 1.00e3 + 1.00e4

align: 0.10e4 + 1.00e4 (shift one digit to make them both e4)

sum: 1.10e4 (add the fractional parts, use common exponent part)

before: 1.00e2 + 1.00e4

align: 0.01e4 + 1.00e4 (shift two digits to make them both e4)

sum: 1.01e4 (add the fractional parts, use common exponent part)

When adding two values that have a greater difference in exponents than the precision of the fractional part, what happens is that the alignment phase shifts the smaller value so much that it becomes zero, like this example:

before: 1.00e1 + 1.00e4
align: 0.00e4 + 1.00e4 (shift three digits to make them both e4)
sum: 1.00e4 (add the fractional parts, use common exponent part)

This last is what happens in your program.

math.MaxFloat32 is a binary floating point number with an exponent of 127 and 24 bits of fractional precision.

When you convert it to a 64-bit double-precision value, you gain extra bits of fractional precision and exponent range (24 => 53 bits of precision and 8 => 11 bits of exponent). What matters here is the 53 bits of precision that you now enjoy.

The 10.2 in double precision has an exponent of 3. (It is stored as 1.275 x 2**3 with the IEEE 754 binary encoding of 0x4024666666666666; the single-precision encoding would be 0x41233333.)

During addition, aligning the first number’s exponent of 127 with the second number’s exponent of 3 requires shifting the second number’s fraction right by 127-3 = 124 bits. Alas, since the 53 bits of precision is less than the 124 bits of shifting, the alignment results in the second number having a fraction of zero. Naturally, adding zero does not make the sum any greater than the first number.

This is your answer.

Michael Jones
michae...@gmail.com

P.S. Based on this understanding, we can ask another interesting question. How many bits must that second number move such that alignment does not result in a fraction of zero? Think it through before you look at the answer.


Michael Jones

Jun 24, 2016, 11:32:29 AM

to Ian Lance Taylor, zhang...@gmail.com, golang-dev

Edited:

The Goldberg paper is good, but a little overwhelming for this specific confusion.

Zhangsilly, as part of adding floating point values, the fractional parts must be shifted so that the exponent parts will be aligned, like this (examples are decimal, but the same in binary):

before: 1.00e3 + 1.00e4

align: 0.10e4 + 1.00e4 (shift one digit to make them both e4)

sum: 1.10e4 (add the fractional parts, use common exponent part)

before: 1.00e2 + 1.00e4

align: 0.01e4 + 1.00e4 (shift two digits to make them both e4)

sum: 1.01e4 (add the fractional parts, use common exponent part)

When adding two values that have a greater difference in exponents than the precision of the fractional part, what happens is that the alignment phase shifts the smaller value so much that it becomes zero, like this example:

before: 1.00e1 + 1.00e4
align: 0.00e4 + 1.00e4 (shift three digits to make them both e4)
sum: 1.00e4 (add the fractional parts, use common exponent part)

This last is what happens in your program.

math.MaxFloat32 is a binary floating point number with an exponent of 127 and 24 bits of fractional precision.

When you convert it to a 64-bit double-precision value, you gain extra bits of fractional precision and exponent range. (24 => 53 and 8 => 11) What matters here is the 53 bits of precision that you now enjoy.

The 10.2 in double precision has an exponent of 3. (It is stored as 1.275 x 2**3 with the IEEE 754 binary encoding of 0x4024666666666666)

During addition, aligning the first number’s exponent of 127 with the second number’s exponent of 3 requires shifting the second number’s fraction right by 127-3 = 124 bits. Alas, since the 53 bits of precision is less than the 124 bits of shifting, the alignment results in the second number having a fraction of zero. Naturally, adding zero does not make the sum any greater than the first number.
