Performance of float32


David Turnbull

Mar 2, 2014, 1:39:25 PM
to golan...@googlegroups.com
I thought it'd be fun to experiment with writing a DSP application in Go. I was excited to see complex float32 arrays in the core language. Then I started looking at the math libraries and saw they were all locked in to float64. Wondering if the 32-64 conversions were going to hurt in practice, I wrote a little test. I didn't even get to the math libraries before I ran into some puzzling results.

I don't expect SIMD optimizations, but I was expecting float32 would run at the same speed as float64 with basic addition and multiplication. Perhaps someone can find a problem with my test or at least reset my expectations for Go. 

Output:
GOARCH: amd64
float64: 138.586383ms
float32: 232.536522ms

Source:
package main

import (
    "fmt"
    "runtime"
    "time"
)

var its = (1 << 26)

func test32() {
    x1 := float32(1.234)
    x2 := float32(1.234)
    x3 := float32(0)

    t0 := time.Now()
    for i := 0; i < its; i++ {
        x3 = x3 + x1*x2
    }
    t1 := time.Now()
    fmt.Printf("%T: %v\n", x1, t1.Sub(t0))
}

func test64() {
    x1 := float64(1.234)
    x2 := float64(1.234)
    x3 := float64(0)

    t0 := time.Now()
    for i := 0; i < its; i++ {
        x3 = x3 + x1*x2
    }
    t1 := time.Now()
    fmt.Printf("%T: %v\n", x1, t1.Sub(t0))
}

func main() {
    fmt.Printf("GOARCH: %s\n", runtime.GOARCH)
    test64()
    test32()
}

Michael Jones

Mar 2, 2014, 2:10:31 PM
to David Turnbull, golang-nuts
I copied your code here to my laptop (MacBook Pro) and ran it with similar resulting ratios:

mtj-macbookpro:float32 mtj$ go run main.go
GOARCH: amd64
float64: 55.992488ms
float32: 74.576826ms

Compiling with the -S flag shows the two subroutines' implementations. The major difference in the instruction selection is this:

32-bit float code:
0x00a7 00167 (/Users/mtj/gocode/src/float32/main.go:18) MOVSS "".x1+76(SP),X0
0x00ad 00173 (/Users/mtj/gocode/src/float32/main.go:18) MULSS X3,X0
0x00b1 00177 (/Users/mtj/gocode/src/float32/main.go:18) ADDSS X2,X0
0x00b5 00181 (/Users/mtj/gocode/src/float32/main.go:18) MOVSS X0,X2

64-bit float code:
0x00a7 00167 (/Users/mtj/gocode/src/float32/main.go:31) MOVSD "".x1+80(SP),X0
0x00ad 00173 (/Users/mtj/gocode/src/float32/main.go:31) MULSD X3,X0
0x00b1 00177 (/Users/mtj/gocode/src/float32/main.go:31) ADDSD X2,X0
0x00b5 00181 (/Users/mtj/gocode/src/float32/main.go:31) MOVAPD X0,X2

The MOVAPD instruction is a faster move that not all 32-bit Intel processors support (it's SSE2). Maybe that's it. Maybe we should be more aggressive about exploiting HW capability.




--
Michael T. Jones | Chief Technology Advocate  | m...@google.com |  +1 650-335-5765

simon place

Mar 2, 2014, 3:52:40 PM
to golan...@googlegroups.com, David Turnbull
If you re-write it like this (splitting the inner maths out into separate functions)...

package main

import (
    "fmt"
    "runtime"
    "time"
)

var its = (1 << 26)

var x1 float32 = 1.234
var x2 float32 = 1.234
var x3 float32 = 0

func maths32() {
    x3 = x3 + x1*x2
}

func test32() {
    t0 := time.Now()
    for i := 0; i < its; i++ {
        maths32()
    }
    t1 := time.Now()
    fmt.Printf("%T: %v\n", x1, t1.Sub(t0))
}

var x4 float64 = 1.234
var x5 float64 = 1.234
var x6 float64 = 0

func maths64() {
    x6 = x6 + x4*x5
}

func test64() {
    t0 := time.Now()
    for i := 0; i < its; i++ {
        maths64()
    }
    t1 := time.Now()
    fmt.Printf("%T: %v\n", x4, t1.Sub(t0))
}

func main() {
    fmt.Printf("GOARCH: %s\n", runtime.GOARCH)
    test64()
    test32()
}


you get (well, me anyway) exactly the same timings (basically half speed for 32-bit).

Then use -S to look at the compiled code:

--- prog list "maths64" ---
0101 (test.go:33) TEXT    maths64+0(SB),$0-0
0102 (test.go:33) FUNCDATA $0,gcargs·2+0(SB)
0103 (test.go:33) FUNCDATA $1,gclocals·2+0(SB)
0104 (test.go:34) MOVSD   x4+0(SB),X0
0105 (test.go:34) MOVSD   x5+0(SB),X1
0106 (test.go:34) MULSD   X1,X0
0107 (test.go:34) MOVSD   x6+0(SB),X1
0108 (test.go:34) ADDSD   X1,X0
0109 (test.go:34) MOVSD   X0,x6+0(SB)
0110 (test.go:35) RET     ,


--- prog list "maths32" ---
0000 (test.go:14) TEXT    maths32+0(SB),$0-0
0001 (test.go:14) FUNCDATA $0,gcargs·0+0(SB)
0002 (test.go:14) FUNCDATA $1,gclocals·0+0(SB)
0003 (test.go:15) MOVSS   x1+0(SB),X0
0004 (test.go:15) MOVSS   x2+0(SB),X1
0005 (test.go:15) MULSS   X1,X0
0006 (test.go:15) MOVSS   x3+0(SB),X1
0007 (test.go:15) ADDSS   X1,X0
0008 (test.go:15) MOVSS   X0,x3+0(SB)
0009 (test.go:16) RET     ,


so I guess MOVAPD isn't involved.

simon place

Mar 2, 2014, 4:16:07 PM
to golan...@googlegroups.com, David Turnbull
If you get rid of the mult, then:

with the multiplication:
GOARCH: amd64
float64: 162.888488ms
float32: 268.216444ms

without the multiplication:
GOARCH: amd64
float64: 155.745693ms
float32: 155.608991ms

so it's the MULSS, which makes me wonder if just using half of MULPS (the SSE packed multiply) might not be faster on 64-bit?

David Turnbull

Mar 2, 2014, 7:16:53 PM
to golan...@googlegroups.com, David Turnbull
Thanks for the responses. I've been disappointed by too many other languages that neglect float32. Any hope of performant ARM code is quickly lost if you force developers to use float64 everywhere. Or you have to do everything in fixed point, which is less fun. I'm glad to see Go is not headed in this direction.

I mocked up a more accurate pattern for a typical FIR filter. The float32 version was just a tad faster than the float64, an amount easily attributed to accessing less memory.

GOARCH: amd64
[]float64: 40.972164ms
[]float32: 39.231442ms

That's for 18.8 million MACs on a 3GHz Core 2 Duo. Awesome! It's very strange that the code I originally posted performs as it does.
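
Roughly the shape of the loop being timed, for anyone curious (a sketch with placeholder sizes and data, not the exact code that produced those numbers):

package main

import (
    "fmt"
    "time"
)

// fir32 applies a FIR filter: each output sample is a multiply-accumulate
// (MAC) of the last len(coef) input samples against the coefficients.
func fir32(in, coef []float32) []float32 {
    out := make([]float32, len(in)-len(coef)+1)
    for n := range out {
        var acc float32
        for k, c := range coef {
            acc += c * in[n+k] // one MAC per tap
        }
        out[n] = acc
    }
    return out
}

func main() {
    in := make([]float32, 1<<20) // ~1M samples
    coef := make([]float32, 18)  // 18 taps over ~1M samples ≈ 19M MACs
    for i := range in {
        in[i] = float32(i%7) * 0.25
    }
    for i := range coef {
        coef[i] = 1.0 / float32(len(coef))
    }
    t0 := time.Now()
    out := fir32(in, coef)
    fmt.Printf("%T: %v (%d output samples)\n", out, time.Since(t0), len(out))
}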

My next test will be implementing the Parks–McClellan algorithm and comparing it to C and Ruby implementations.

simon place

Mar 12, 2014, 9:14:05 PM
to golan...@googlegroups.com, David Turnbull
I went on to play with this a bit.

Seems like the above timings weren't reliable (maybe garbage collection or concurrent printing); when I switched to the benchmark library, the surprising speed difference disappeared.
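
For reference, this is roughly what that benchmark looks like using the standard testing package (a sketch; package and function names are made up):

package mac

import "testing"

// Run with: go test -bench=.
// The testing package runs each loop b.N times until the timing is stable,
// which avoids the one-shot noise from GC, printing, etc.

var (
    sink32 float32
    sink64 float64
)

func BenchmarkMAC32(b *testing.B) {
    x1, x2, x3 := float32(1.234), float32(1.234), float32(0)
    for i := 0; i < b.N; i++ {
        x3 = x3 + x1*x2
    }
    sink32 = x3 // keep the result live so the loop isn't thrown away
}

func BenchmarkMAC64(b *testing.B) {
    x1, x2, x3 := 1.234, 1.234, 0.0
    for i := 0; i < b.N; i++ {
        x3 = x3 + x1*x2
    }
    sink64 = x3
}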

However:

I then looked at more expensive, hardware-accelerated maths functions, like math.Sqrt(float64), for which the std lib uses SSE (on amd64),

and noticed that if you roll your own assembler, float32 can be much faster, basically because Go's libs don't include access to the 32-bit versions of the SSE hardware-accelerated functions.

Also, to be complete, I wrote a 4-way SSE SIMD instruction test (where float32 is better, as float64 can only be 2-way), which was surprisingly easy.

results (timings per core):

math.Sqrt(b+1)                                       24.4 ns            (uses hardware 64-bit SSE sqrt on amd64)
float32(math.Sqrt(float64(b+1)))                     24.3 ns            (basically the same)
4 x the above on a [4]float32 array                  21.8 ns per calc
as previous, but using assembler for 32-bit SSE      11.1 ns per calc   (doubles the speed)
as previous, but using 4x parallel 32-bit SSE         3.7 ns per calc   (nearly quadruples the speed)
as previous, but using the low-accuracy SSE sqrt      1.6 ns per calc   (more than doubles again, but only good to 4-5 s.f.)

So using float32, assembler and low accuracy gets you x15!

More practically, having a Sqrt(float32) (and others) in the libs would get you at least x2 (on amd64; ARM's equivalent seems to usually be present on higher-specced chips).

Intel 'Sandy Bridge' CPUs have AVX (Advanced Vector Extensions), i.e. 256-bit registers, which means 8x float32 and 4x float64 (but I guess the assembler doesn't recognise these at the moment, since the compiler doesn't use them?).

And in the future, probably, AVX-512 will double this again (so potentially x60, and that's without increasing the number of cores).

Erwin

Mar 13, 2014, 5:47:28 AM
to simon place, golang-nuts, David Turnbull
I would like to learn SSE assembly as well, so I'd be interested in seeing your code. Is it available somewhere?

simon place

Mar 13, 2014, 4:01:06 PM
to golan...@googlegroups.com, simon place, David Turnbull
For this bodyless func header:   func f40pc(i *[4]float32)

there is this file, "f40pc_amd64.s" (in this case for the formula sqrt(x+1)):

// func f40pc(i *[4]float32)
TEXT ·f40pc+0(SB),$16-8
MOVQ    i+0(FP),AX   // get 64 bit address from first parameter
MOVAPS (AX),X0       // load 128bit, 4xfloat32,  from memory
MOVSS   $(1.0),X1       // load single precision var, 1.0 , into lower 32 bits
SHUFPS  $0x00,X1,X1  // duplicate it 4 times across the register
ADDPS   X1,X0    // parallel add 
SQRTPS  X0,X0    // parallel sqrt in-place
MOVAPS   X0,(AX)  // put 128bit back to same address 
RET     ,

HEADS-UP: the assembler used by Go reads left-to-right, so the destination is the last parameter, but most assemblers and documentation I've seen have the parameters right-to-left.
It took me a while to figure out that the shuffle instruction here (3 parameters) has the 'selector' parameter first, when all the documentation I found has it last; just an assembler choice, but confusing unless you know.
Bit inconsistent really, since inside actual Go code this paradigm is right-to-left.
Also: the code coming from -S needs a middle dot '·' prefixed to the name to make it work; don't know why.
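
For completeness, roughly what the Go side next to that file might look like (same package; the declaration has no body, so the toolchain picks up the assembly version; the file name here is made up):

// f40pc.go
package main

import "fmt"

// f40pc computes sqrt(x+1) in place on all four elements; the body is the
// assembly routine above. Note the MOVAPS there assumes the array is
// 16-byte aligned.
func f40pc(i *[4]float32)

func main() {
    v := [4]float32{0, 3, 8, 15}
    f40pc(&v)
    fmt.Println(v) // expect [1 2 3 4]
}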

David Turnbull

Mar 14, 2014, 10:14:14 PM
to golan...@googlegroups.com, simon place, David Turnbull
Simon, I just released a float32 library. Perhaps you would be interested in contributing the assembly language functions that you tested.

simon place

Mar 15, 2014, 12:16:34 PM
to golan...@googlegroups.com, simon place, David Turnbull
I was really commenting here on how advantageous, and apparently easy, it is to 'assemblify' a whole formula to use SSE (the n-body code in this thread seems like it could massively benefit from this), and that float32 has a major advantage over float64 here.

You could write code to generate the assembly (obviously this then IS a compiler), but since in a given program the formula required probably doesn't change, it's just as easy to hand-write.

A lib for applying single SSE operations to arrays, working on things like *[4]float32 (and *[2]float64), might be nice, even including some of the most common compound formulas.

And, as you have done, a float32 version of the standard math lib would be nice to work with that, although converting to float32 and back isn't as expensive as you might think. But I guess what would REALLY be nice about a float32 math lib is being able to just cut/paste float64 to float32 and have it work. (see comment on that thread
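
For illustration, the crudest version of such a float32 math lib is just thin wrappers like these (package name made up), until real 32-bit implementations exist:

// Package math32 wraps the standard math package so that float64 code can
// be ported to float32 by search and replace. Illustrative sketch only.
package math32

import "math"

func Sqrt(x float32) float32 { return float32(math.Sqrt(float64(x))) }
func Sin(x float32) float32  { return float32(math.Sin(float64(x))) }
func Cos(x float32) float32  { return float32(math.Cos(float64(x))) }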