Performance of float32


David Turnbull

Mar 2, 2014, 1:39:25 PM
to golan...@googlegroups.com
I thought it'd be fun to experiment with writing a DSP application in Go. I was excited to see complex float32 arrays in the core language. Then I started looking at the math libraries and saw they were all locked in to float64. Wondering if the 32-64 conversions were going to hurt in practice, I wrote a little test. I didn't even get to the math libraries before I ran into some puzzling results.

I don't expect SIMD optimizations, but I was expecting float32 would run at the same speed as float64 with basic addition and multiplication. Perhaps someone can find a problem with my test or at least reset my expectations for Go. 

Output:
GOARCH: amd64
float64: 138.586383ms
float32: 232.536522ms

Source:
package main

import (
    "fmt"
    "runtime"
    "time"
)

var its = (1 << 26)

func test32() {
    x1 := float32(1.234)
    x2 := float32(1.234)
    x3 := float32(0)

    t0 := time.Now()
    for i := 0; i < its; i++ {
        x3 = x3 + x1*x2
    }
    t1 := time.Now()
    fmt.Printf("%T: %v\n", x1, t1.Sub(t0))
}

func test64() {
    x1 := float64(1.234)
    x2 := float64(1.234)
    x3 := float64(0)

    t0 := time.Now()
    for i := 0; i < its; i++ {
        x3 = x3 + x1*x2
    }
    t1 := time.Now()
    fmt.Printf("%T: %v\n", x1, t1.Sub(t0))
}

func main() {
    fmt.Printf("GOARCH: %s\n", runtime.GOARCH)
    test64()
    test32()
}

Michael Jones

Mar 2, 2014, 2:10:31 PM
to David Turnbull, golang-nuts
I copied your code here to my laptop (MacBook Pro) and ran it with similar resulting ratios:

mtj-macbookpro:float32 mtj$ go run main.go
GOARCH: amd64
float64: 55.992488ms
float32: 74.576826ms

Compiling with the -S flag shows the two subroutines' implementations. The major difference in the instruction selection is this:

32-bit float code:
0x00a7 00167 (/Users/mtj/gocode/src/float32/main.go:18) MOVSS "".x1+76(SP),X0
0x00ad 00173 (/Users/mtj/gocode/src/float32/main.go:18) MULSS X3,X0
0x00b1 00177 (/Users/mtj/gocode/src/float32/main.go:18) ADDSS X2,X0
0x00b5 00181 (/Users/mtj/gocode/src/float32/main.go:18) MOVSS X0,X2

64-bit float code:
0x00a7 00167 (/Users/mtj/gocode/src/float32/main.go:31) MOVSD "".x1+80(SP),X0
0x00ad 00173 (/Users/mtj/gocode/src/float32/main.go:31) MULSD X3,X0
0x00b1 00177 (/Users/mtj/gocode/src/float32/main.go:31) ADDSD X2,X0
0x00b5 00181 (/Users/mtj/gocode/src/float32/main.go:31) MOVAPD X0,X2

The MOVAPD instruction is a faster move that not all 32-bit Intel processors support (it's SSE2). Maybe that's it. Maybe we should be more aggressive about exploiting HW capability.




--
Michael T. Jones | Chief Technology Advocate  | m...@google.com |  +1 650-335-5765

simon place

Mar 2, 2014, 3:52:40 PM
to golan...@googlegroups.com, David Turnbull
If you re-write it like this (splitting the inner maths out into separate functions)...

package main

import (
    "fmt"
    "runtime"
    "time"
)

var its = (1 << 26)

var x1 float32 = 1.234
var x2 float32 = 1.234
var x3 float32 = 0

func maths32() {
    x3 = x3 + x1*x2
}

func test32() {
    t0 := time.Now()
    for i := 0; i < its; i++ {
        maths32()
    }
    t1 := time.Now()
    fmt.Printf("%T: %v\n", x1, t1.Sub(t0))
}

var x4 float64 = 1.234
var x5 float64 = 1.234
var x6 float64 = 0

func maths64() {
    x6 = x6 + x4*x5
}

func test64() {
    t0 := time.Now()
    for i := 0; i < its; i++ {
        maths64()
    }
    t1 := time.Now()
    fmt.Printf("%T: %v\n", x4, t1.Sub(t0))
}

func main() {
    fmt.Printf("GOARCH: %s\n", runtime.GOARCH)
    test64()
    test32()
}


you get (well, me anyway) exactly the same timings (basically half speed for 32-bit).

Then use -S to look at the compiled code:

--- prog list "maths64" ---
0101 (test.go:33) TEXT    maths64+0(SB),$0-0
0102 (test.go:33) FUNCDATA $0,gcargs·2+0(SB)
0103 (test.go:33) FUNCDATA $1,gclocals·2+0(SB)
0104 (test.go:34) MOVSD   x4+0(SB),X0
0105 (test.go:34) MOVSD   x5+0(SB),X1
0106 (test.go:34) MULSD   X1,X0
0107 (test.go:34) MOVSD   x6+0(SB),X1
0108 (test.go:34) ADDSD   X1,X0
0109 (test.go:34) MOVSD   X0,x6+0(SB)
0110 (test.go:35) RET     ,


--- prog list "maths32" ---
0000 (test.go:14) TEXT    maths32+0(SB),$0-0
0001 (test.go:14) FUNCDATA $0,gcargs·0+0(SB)
0002 (test.go:14) FUNCDATA $1,gclocals·0+0(SB)
0003 (test.go:15) MOVSS   x1+0(SB),X0
0004 (test.go:15) MOVSS   x2+0(SB),X1
0005 (test.go:15) MULSS   X1,X0
0006 (test.go:15) MOVSS   x3+0(SB),X1
0007 (test.go:15) ADDSS   X1,X0
0008 (test.go:15) MOVSS   X0,x3+0(SB)
0009 (test.go:16) RET     ,


so I guess MOVAPD isn't involved.

simon place

Mar 2, 2014, 4:16:07 PM
to golan...@googlegroups.com, David Turnbull
If you get rid of the mult, then:

with the multiplication:
GOARCH: amd64
float64: 162.888488ms
float32: 268.216444ms

without the multiplication:
GOARCH: amd64
float64: 155.745693ms
float32: 155.608991ms

so it's the MULSS, which makes me wonder if just using half of MULPS (the SSE packed multiply) might not be faster on 64-bit?

David Turnbull

Mar 2, 2014, 7:16:53 PM
to golan...@googlegroups.com, David Turnbull
Thanks for the responses. I've been disappointed by too many other languages that neglect float32. Any hope of performant ARM code is quickly lost if you force developers to use float64 everywhere. Or you have to do everything in fixed point, which is less fun. I'm glad to see Go is not headed in this direction.

I mocked up a more accurate pattern for a typical FIR filter. The float32 version was just a tad faster than the float64, an amount easily attributed to accessing less memory.

GOARCH: amd64
[]float64: 40.972164ms
[]float32: 39.231442ms

That's for 18.8 million MACs on a 3GHz Core 2 Duo. Awesome! It's very strange that the code I originally posted performs as it does.
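
Roughly the shape of the loop being timed, for anyone curious (a sketch with placeholder sizes and data, not the exact code that produced those numbers):

package main

import (
    "fmt"
    "time"
)

// fir32 applies a FIR filter: each output sample is a multiply-accumulate
// (MAC) of the last len(coef) input samples against the coefficients.
func fir32(in, coef []float32) []float32 {
    out := make([]float32, len(in)-len(coef)+1)
    for n := range out {
        var acc float32
        for k, c := range coef {
            acc += c * in[n+k] // one MAC per tap
        }
        out[n] = acc
    }
    return out
}

func main() {
    in := make([]float32, 1<<20) // ~1M samples
    coef := make([]float32, 18)  // 18 taps over ~1M samples ≈ 19M MACs
    for i := range in {
        in[i] = float32(i%7) * 0.25
    }
    for i := range coef {
        coef[i] = 1.0 / float32(len(coef))
    }
    t0 := time.Now()
    out := fir32(in, coef)
    fmt.Printf("%T: %v (%d output samples)\n", out, time.Since(t0), len(out))
}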

My next test will be implementing the Parks–McClellan algorithm and comparing it to C and Ruby implementations.

simon place

Mar 12, 2014, 9:14:05 PM
to golan...@googlegroups.com, David Turnbull
I went on to play with this a bit.

Seems like the above timings weren't reliable (maybe garbage collection or concurrent printing); when I switched to the benchmark library, the surprising speed difference disappeared.
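
For reference, this is roughly what that benchmark looks like using the standard testing package (a sketch; package and function names are made up):

package mac

import "testing"

// Run with: go test -bench=.
// The testing package runs each loop b.N times until the timing is stable,
// which avoids the one-shot noise from GC, printing, etc.

var (
    sink32 float32
    sink64 float64
)

func BenchmarkMAC32(b *testing.B) {
    x1, x2, x3 := float32(1.234), float32(1.234), float32(0)
    for i := 0; i < b.N; i++ {
        x3 = x3 + x1*x2
    }
    sink32 = x3 // keep the result live so the loop isn't thrown away
}

func BenchmarkMAC64(b *testing.B) {
    x1, x2, x3 := 1.234, 1.234, 0.0
    for i := 0; i < b.N; i++ {
        x3 = x3 + x1*x2
    }
    sink64 = x3
}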

However:

I then looked at more expensive, hardware-accelerated maths functions, like math.Sqrt(float64), for which the std lib uses SSE (on amd64),

and noticed that if you roll your own assembler, float32 can be much faster, basically because Go's libs don't include access to the 32-bit versions of the SSE hardware-accelerated functions.

Also, to be complete, I wrote a 4-way SSE SIMD instruction test (where float32 is better, as float64 can only be 2-way), which was surprisingly easy.

results (timings per core):

math.Sqrt(b+1)                                       24.4 ns            (uses hardware 64-bit SSE sqrt on amd64)
float32(math.Sqrt(float64(b+1)))                     24.3 ns            (basically the same)
4 x the above on a [4]float32 array                  21.8 ns per calc
as previous, but using assembler for 32-bit SSE      11.1 ns per calc   (doubles the speed)
as previous, but using 4x parallel 32-bit SSE         3.7 ns per calc   (nearly quadruples the speed)
as previous, but using the low-accuracy SSE sqrt      1.6 ns per calc   (more than doubles again, but only good to 4-5 s.f.)

So using float32, assembler and low accuracy gets you x15!

More practically, having a Sqrt(float32) (and others) in the libs would get you at least x2 (on amd64; ARM's equivalent seems to usually be present on higher-specced chips).

Intel 'Sandy Bridge' CPUs have AVX (Advanced Vector Extensions), i.e. 256-bit registers, which means 8x float32 and 4x float64 (but I guess the assembler doesn't recognise these at the moment, since the compiler doesn't use them?).

And in the future, probably, AVX-512 will double this again (so potentially x60, and that's without increasing the number of cores).

Erwin

Mar 13, 2014, 5:47:28 AM
to simon place, golang-nuts, David Turnbull
I would like to learn SSE assembly as well, so I'd be interested in seeing your code. Is it available somewhere?

simon place

Mar 13, 2014, 4:01:06 PM
to golan...@googlegroups.com, simon place, David Turnbull
For this bodyless func header:   func f40pc(i *[4]float32)

there is this file, "f40pc_amd64.s" (in this case for the formula sqrt(x+1)):

// func f40pc(i *[4]float32)
TEXT ·f40pc+0(SB),$16-8
MOVQ    i+0(FP),AX   // get 64 bit address from first parameter
MOVAPS (AX),X0       // load 128bit, 4xfloat32,  from memory
MOVSS   $(1.0),X1       // load single precision var, 1.0 , into lower 32 bits
SHUFPS  $0x00,X1,X1  // duplicate it 4 times across the register
ADDPS   X1,X0    // parallel add 
SQRTPS  X0,X0    // parallel sqrt in-place
MOVAPS   X0,(AX)  // put 128bit back to same address 
RET     ,

HEADS-UP: the assembler used by Go reads left-to-right, so the destination is the last parameter, but most assemblers and documentation I've seen have the parameters right-to-left.
It took me a while to figure out that the shuffle instruction here (3 parameters) has the 'selector' parameter first, when all the documentation I found has it last; just an assembler choice, but confusing unless you know.
Bit inconsistent really, since inside actual Go code this paradigm is right-to-left.
Also: the code coming from -S needs a middle dot '·' prefixed to the name to make it work; don't know why.
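
For completeness, roughly what the Go side next to that file might look like (same package; the declaration has no body, so the toolchain picks up the assembly version; the file name here is made up):

// f40pc.go
package main

import "fmt"

// f40pc computes sqrt(x+1) in place on all four elements; the body is the
// assembly routine above. Note the MOVAPS there assumes the array is
// 16-byte aligned.
func f40pc(i *[4]float32)

func main() {
    v := [4]float32{0, 3, 8, 15}
    f40pc(&v)
    fmt.Println(v) // expect [1 2 3 4]
}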

David Turnbull

Mar 14, 2014, 10:14:14 PM
to golan...@googlegroups.com, simon place, David Turnbull
Simon, I just released a float32 library. Perhaps you would be interested in contributing the assembly language functions that you tested.

simon place

Mar 15, 2014, 12:16:34 PM
to golan...@googlegroups.com, simon place, David Turnbull
I was really commenting here on how advantageous, and apparently easy, it is to 'assemblify' a whole formula to use SSE (the n-body code in this thread seems like it could massively benefit from this), and that float32 has a major advantage over float64 here.

You could write code to generate the assembly (obviously this then IS a compiler), but since in a given program the formula required probably doesn't change, it's just as easy to hand-write.

A lib for applying single SSE operations to arrays, working on things like *[4]float32 (and *[2]float64), might be nice, even including some of the most common compound formulas.

And, as you have done, a float32 version of the standard math lib would be nice to work with that, although converting to float32 and back isn't as expensive as you might think. But I guess what would REALLY be nice about a float32 math lib is being able to just cut/paste float64 to float32 and have it work. (see comment on that thread
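
For illustration, the crudest version of such a float32 math lib is just thin wrappers like these (package name made up), until real 32-bit implementations exist:

// Package math32 wraps the standard math package so that float64 code can
// be ported to float32 by search and replace. Illustrative sketch only.
package math32

import "math"

func Sqrt(x float32) float32 { return float32(math.Sqrt(float64(x))) }
func Sin(x float32) float32  { return float32(math.Sin(float64(x))) }
func Cos(x float32) float32  { return float32(math.Cos(float64(x))) }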