Go toolchain floating point performance

Will Newton

Apr 25, 2015, 11:52:41 AM
to golang-nuts
Hi all,

I'm currently using the gc toolchain to build some relatively
floating-point-intensive code, and I was interested to see whether the
gccgo and llgo toolchains might provide a performance boost on this
type of code. The example I was looking at is the FFT benchmark in
go-dsp:

https://github.com/mjibson/go-dsp/blob/master/fft/fft_test.go

The results I got running this code with the three toolchains on
x86_64 are as follows:

gc 1.4.2:

BenchmarkFFT 10 174795984 ns/op
testing: BenchmarkFFT left GOMAXPROCS set to 4
ok github.com/mjibson/go-dsp/fft 2.106s

gccgo master (as of yesterday):

BenchmarkFFT 3 396895666 ns/op
testing: BenchmarkFFT left GOMAXPROCS set to 4
ok github.com/mjibson/go-dsp/fft 2.703s

llgo master (as of yesterday):

BenchmarkFFT 5 420146200 ns/op
testing: BenchmarkFFT left GOMAXPROCS set to 4
ok github.com/mjibson/go-dsp/fft 2.821s

This looks bad for gccgo and llgo. There are some goroutines in the
code, so I tried to reduce their effect by calling SetWorkerPoolSize(1)
(sketch below); the relative performance was much the same. Is this
kind of performance delta expected?
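
Here's roughly what I did to force a single worker, as a minimal
sketch (it assumes go-dsp's fft.SetWorkerPoolSize API; the TestMain
wrapper is my own addition, not part of the repository's test file):

package fft_test

import (
	"os"
	"testing"

	"github.com/mjibson/go-dsp/fft"
)

// TestMain pins the FFT worker pool to a single goroutine before any
// benchmark runs, so goroutine scheduling differences between the
// toolchains should mostly drop out of the comparison.
func TestMain(m *testing.M) {
	fft.SetWorkerPoolSize(1)
	os.Exit(m.Run())
}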

I had a brief look at the profile and the benchmark spends 80+% of its
time in Go code, so it doesn't appear to be GC or allocator overhead.
My understanding is also that the version of gccgo I used does escape
analysis, which has often been cited in the past as the big missing
feature in that toolchain.
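
For reference, I gathered the profile with something like the usual
invocation (illustrative; exact paths and file names may differ):

go test -bench=FFT -cpuprofile=cpu.out github.com/mjibson/go-dsp/fft
go tool pprof fft.test cpu.out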

Any tips or "you're doing it wrong" gratefully received. ;-)

Thanks,

--
Will Newton
Software - Cocoon (http://cocoon.life)

Ian Lance Taylor

Apr 25, 2015, 2:11:09 PM
to Will Newton, golang-nuts
On Sat, Apr 25, 2015 at 7:55 AM, Will Newton <will....@cocoon.life> wrote:
>
> I had a brief look at the profile and the benchmark spends 80+% of its
> time in Go code, so it doesn't appear to be GC or allocator overhead.
> My understanding is also that the version of gccgo I used does escape
> analysis, which has often been cited in the past as the big missing
> feature in that toolchain.

Chris has added escape analysis to gccgo, but it is not yet turned on
by default. There is at least one known bug in the code, and the escape
analysis is not yet as effective as it should eventually become.

You can turn it on by passing -fgo-optimize-allocs when you compile.

Also, of course, when using gccgo and llgo make sure you are compiling
with optimization; it's not on by default.
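
For example, something like this (illustrative; use whatever
optimization level you prefer):

go test -compiler gccgo -gccgoflags '-O2 -fgo-optimize-allocs' -bench=FFT github.com/mjibson/go-dsp/fft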

I took a look at the benchmark. The dominating element is the worker
nested function in radix2FFT in radix2.go. I see two aspects that are
slowing down the gccgo code. The first is that gc is doing much
better escape analysis on the various enclosing variables referenced
by the nested function. That reduces the number of memory loads.

The second, and larger, aspect is that gc is inlining the complex
multiplication rather than calling out to a supporting function. When
I use -ffast-math with gccgo, I get benchmark results comparable to
gc. Using -ffast-math tells gccgo that it doesn't need to worry about
getting exactly correct results for infinity and NaN values when doing
complex multiplication.
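
To make that concrete: what gc inlines is essentially the textbook
formula, along the lines of this sketch (my illustration, not the
compiler's actual output):

// mulNaive is the textbook complex multiply. It does nothing special
// about Inf/NaN cross terms, which is what makes it cheap to inline.
func mulNaive(a, b complex128) complex128 {
	ar, ai := real(a), imag(a)
	br, bi := real(b), imag(b)
	return complex(ar*br-ai*bi, ar*bi+ai*br)
}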

Ian

Will Newton

Apr 26, 2015, 12:54:07 PM
to Ian Lance Taylor, golang-nuts
On 25 April 2015 at 19:10, Ian Lance Taylor <ia...@golang.org> wrote:

Hi Ian,

> On Sat, Apr 25, 2015 at 7:55 AM, Will Newton <will....@cocoon.life> wrote:
>>
>> I had a brief look at the profile and the benchmark spends 80+% of its
>> time in Go code, so it doesn't appear to be GC or allocator overhead.
>> My understanding is also that the version of gccgo I used does escape
>> analysis, which has often been cited in the past as the big missing
>> feature in that toolchain.

Thanks for the fast and informative reply!

> Chris has added escape analysis to gccgo, but it is not yet turned on
> by default. There is at least one known bug in the code, and the escape
> analysis is not yet as effective as it should eventually become.
>
> You can turn it on by passing -fgo-optimize-allocs when you compile.
>
> Also, of course, when using gccgo and llgo make sure you are compiling
> with optimization; it's not on by default.

I wasn't aware of that. I guess it is in line with how gcc operates in
general, but it's different from how gc works by default.

> I took a look at the benchmark. The dominating element is the worker
> nested function in radix2FFT in radix2.go. I see two aspects that are
> slowing down the gccgo code. The first is that gc is doing much
> better escape analysis on the various enclosing variables referenced
> by the nested function. That reduces the number of memory loads.
>
> The second, and larger, aspect is that gc is inlining the complex
> multiplication rather than calling out to a supporting function. When
> I use -ffast-math with gccgo, I get benchmark results comparable to
> gc. Using -ffast-math tells gccgo that it doesn't need to worry about
> getting exactly correct results for infinity and NaN values when doing
> complex multiplication.

With C code I have always tended to avoid -ffast-math, as it behaves in
various non-IEEE-compliant ways. As far as I can tell, Go makes no
specific claims about IEEE 754 compliance, so I guess it is free to do
this type of optimization by default?

I tried running the code with -O3 -ffast-math -fgo-optimize-allocs for
gccgo and -O3 for llgo:

gccgo:

BenchmarkFFT 3 423200666 ns/op
testing: BenchmarkFFT left GOMAXPROCS set to 4
ok github.com/mjibson/go-dsp/fft 2.758s

llgo:

BenchmarkFFT 10 245246700 ns/op
testing: BenchmarkFFT left GOMAXPROCS set to 4
ok github.com/mjibson/go-dsp/fft 3.008s

gc for comparison:

BenchmarkFFT 3 402457353 ns/op
testing: BenchmarkFFT left GOMAXPROCS set to 4
ok github.com/mjibson/go-dsp/fft 1.682s

So it looks like llgo is either doing really well here (AFAIK it
doesn't do escape analysis?) or I made some kind of mistake...

Ian Lance Taylor

Apr 26, 2015, 2:02:54 PM
to Will Newton, golang-nuts
On Sun, Apr 26, 2015 at 9:53 AM, Will Newton <will....@cocoon.life> wrote:
>
> With C code I have always tended to avoid -ffast-math, as it behaves in
> various non-IEEE-compliant ways. As far as I can tell, Go makes no
> specific claims about IEEE 754 compliance, so I guess it is free to do
> this type of optimization by default?

Go does follow IEEE-754 for floating point operations. However, for
complex numbers the issue is more, well, complex. There is IEC 60559
and C11 Annex G and
http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1399.txt but I'm not
sure how strictly any compiler follows them.

A complex number with at least one infinite part is considered to be
an infinity, even if the other part is a NaN. The issue with complex
multiplication is that the cross terms can mean that when you multiply
two numbers that are both Infinity+NaN you can get NaN+NaN, which is a
NaN, although you should really get an infinity. I'm not 100% sure
that the gc compiler always gets this right, though it may. Gccgo does
reliably get it right when not using -ffast-math, at the cost of
calling a function for complex multiplication.
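
You can check which behavior a given toolchain gives you with a small
program (a sketch; note that math/cmplx's IsInf reports true if either
part is infinite):

package main

import (
	"fmt"
	"math"
	"math/cmplx"
)

func main() {
	x := complex(math.Inf(1), math.NaN()) // an infinity in the one-infinity model
	y := x * x
	// Textbook multiplication yields NaN+NaNi here, so IsInf reports
	// false; Annex G-style semantics would recover a complex infinity.
	fmt.Println(y, cmplx.IsInf(y))
}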

You don't actually need -ffast-math to speed up this specific test
case; you can use -fcx-fortran-rules instead, which has a much more
limited effect: it only applies to complex multiplication and division.
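
For example (again illustrative):

go test -compiler gccgo -gccgoflags '-O2 -fcx-fortran-rules' -bench=FFT github.com/mjibson/go-dsp/fft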


> I tried running the code with -O3 -ffast-math -fgo-optimize-allocs for
> gccgo and -O3 for llgo:
>
> gccgo:
>
> BenchmarkFFT 3 423200666 ns/op
> testing: BenchmarkFFT left GOMAXPROCS set to 4
> ok github.com/mjibson/go-dsp/fft 2.758s
>
> llgo:
>
> BenchmarkFFT 10 245246700 ns/op
> testing: BenchmarkFFT left GOMAXPROCS set to 4
> ok github.com/mjibson/go-dsp/fft 3.008s
>
> gc for comparison:
>
> BenchmarkFFT 3 402457353 ns/op
> testing: BenchmarkFFT left GOMAXPROCS set to 4
> ok github.com/mjibson/go-dsp/fft 1.682s
>
> So it looks like llgo is either doing really well here (AFAIK it
> doesn't do escape analysis?) or I made some kind of mistake...

Interesting. I don't have llgo installed. I wonder what it is doing?

Ian

Dan Kortschak

Apr 26, 2015, 6:14:49 PM
to Ian Lance Taylor, Will Newton, golang-nuts
Veering of-ftopic, but what's the rationale for this?

Ian Lance Taylor

Apr 26, 2015, 9:09:07 PM
to Dan Kortschak, Will Newton, golang-nuts
On Sun, Apr 26, 2015 at 3:14 PM, Dan Kortschak
<dan.ko...@adelaide.edu.au> wrote:
> Veering of-ftopic, but what's the rationale for this?

I'll quote the C99 rationale:

Textbook formulas for complex arithmetic tend to turn infinite inputs
into NaNs, often losing useful information unnecessarily. For example,

    (1+i0)×(∞+i∞) ⇒ (1×∞ − 0×∞) + i(0×∞ + 1×∞) ⇒ NaN + I*NaN

and

    cexp(∞+iNaN) ⇒ exp(∞)×(cis(NaN)) ⇒ NaN + I*NaN

but for applications modeling the Riemann sphere, result values of
infinite magnitude would be more useful (even though their phase
angles may be meaningless). In order to support the one-infinity
model, C99 regards any complex value with at least one infinite part
as a complex infinity (even if the other part is a NaN), guarantees
that operations and functions honor basic properties of infinities,
and provides the cproj function to map all infinities to a canonical
one. For example, a finite non-zero value times an infinity must be an
infinity, hence (1+i0)×(∞+i∞) must be an infinity. In the same spirit,
cexp(∞+iNaN) is an infinity and cexp(−∞+iNaN) is a complex zero, which
preserve cabs(cexp(x+iy)) = exp(x).

Ian

Dan Kortschak

Apr 26, 2015, 9:19:38 PM
to Ian Lance Taylor, Will Newton, golang-nuts
Thanks.
