Following up on this, consider how long it actually takes to compute F(40) in the normal iterative way, single threaded -- this is the natural amount of work under discussion.
iterative:
BenchmarkFibonacci40-36 51974828 21.9 ns/op 0 B/op 0 allocs/op
22 nanoseconds, roughly one 46-millionth of a second. Doing 40 evaluations takes 40 times longer (single threaded), but let's just go with the single evaluation for now.
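For concreteness, the iterative version I have in mind is something like the sketch below (my reconstruction, not the exact code under discussion; it would live in a _test.go file and run with "go test -bench ."):

package fib

import "testing"

// Fibonacci computes F(n) iteratively: n-1 additions, so 39 for n = 40.
func Fibonacci(n int) int {
	if n < 2 {
		return n
	}
	a, b := 0, 1
	for i := 2; i <= n; i++ {
		a, b = b, a+b
	}
	return b
}

var sink int // assign the result so the compiler cannot discard the call

func BenchmarkFibonacci40(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = Fibonacci(40)
	}
}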
recursive (after swapping the benchmark over to the commented-out FibonacciR call):
BenchmarkFibonacci40-36 3 353790912 ns/op 0 B/op 0 allocs/op
353,790,912 nanoseconds, about a third of a second. The program is not doing more essential work down the Fibonacci path; it is doing a great deal more redundant work, because the same subproblems are recomputed again and again by calls with the same arguments.
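The recursive version would be something like this (again my sketch, using the FibonacciR name from above; the base cases are chosen to match the call counts worked out below):

// FibonacciR is the naive doubly recursive version (valid for n >= 1).
// With base cases at n == 1 and n == 2, computing f(n) makes 2 f(n) - 1
// calls in total, and every non-base call performs exactly one addition.
func FibonacciR(n int) int {
	if n <= 2 {
		return 1
	}
	return FibonacciR(n-1) + FibonacciR(n-2)
}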
f(40) is 102,334,155. Computing it requires 39 additions the first way, and 204,668,309 recursive function calls (2 f(n) - 1) with 102,334,154 additions the second way (every call except the base-case leaves performs one addition). When we divide the 353,790,912 ns/op by the 204,668,309 function calls per op (up from a single call in the iterative case), we get 1.73 ns per function call, including the 102,334,154 required additions.
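Those counts are easy to check directly rather than trusting the closed forms; a throwaway instrumented copy (hypothetical, for counting only) prints exactly the numbers above:

package main

import "fmt"

var calls, adds int64

// fibCount is FibonacciR with counters bolted on.
func fibCount(n int) int {
	calls++
	if n <= 2 {
		return 1
	}
	adds++
	return fibCount(n-1) + fibCount(n-2)
}

func main() {
	fmt.Println(fibCount(40), calls, adds)
	// prints: 102334155 204668309 102334154
}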
Now I realize that the original question is about C performance vs Go performance under this storm of redundant function calls and additions, and that the question is further complicated by various ways and means of concurrent evaluation. Still, in any benchmark it is important to understand what is actually happening and what is being tested, and especially for this kind of microbenchmark, to understand it well enough to decode what the result might mean for a real program.
Here is my opinion of what one might reasonably take away from this single line of testing:
We can do more than 578 million real (not inlined) stack-pushing and popping function calls per second, because they take 1.73 ns each even with the if tests and the addition of stack values. That is a lot of function-call intensity, and it is hard to imagine anything other than recursive Fibonacci or Ackermann function evaluation that would make so many calls, each doing so little.
If it is true that some other language is really 2x faster here (or 10x, or whatever), then once the body of the function is more than an if-test and an add -- say an FFT or a sin() calculation, or writing to memory, or heaven forbid an I/O or OS interaction, or anything else "real" that implicitly takes 100x, 100,000x, or 100,000,000x the time of the function call itself -- the relative speeds will be immeasurably close to 1.
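To make that concrete, here is the shape of a benchmark pair that shows the effect (a hedged sketch; the names and the sin() body are mine, and the ratios will vary by machine). The first function is nearly pure call overhead; the second adds a little real work, and the per-call overhead immediately becomes a small fraction of the total:

package fib

import (
	"math"
	"testing"
)

var fsink float64

//go:noinline
func callOnly(x float64) float64 { return x } // roughly pure call overhead

//go:noinline
func callWithWork(x float64) float64 { return math.Sin(x) } // call + real work

func BenchmarkCallOnly(b *testing.B) {
	for i := 0; i < b.N; i++ {
		fsink = callOnly(float64(i))
	}
}

func BenchmarkCallWithWork(b *testing.B) {
	for i := 0; i < b.N; i++ {
		fsink = callWithWork(float64(i))
	}
}

The heavier the body, the closer any two languages converge on the cost of the body itself.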
This is the problem with the microbenchmark. It is important, but mostly to the compiler writers and CPU architects. For people who write programs to do things, the real benchmark is timing those things on machine A vs machine B, or language A vs language B, or whatever.
Also, and perhaps not obviously, for so-called cloud providers and operators of huge data centers for internal purposes, the real benchmark is even more indirect: generally the time for those things multiplied by watts or by dollars per second (both across millions of CPUs, in some cases of personal knowledge). It is just not possible to understand such things by extrapolating from microbenchmarks. This is not a defense of Go, just a word to the wise.
Michael