Following up on this, consider how long it actually takes to compute F(40) in the normal iterative way, single threaded -- this is the natural amount of work under discussion.
iterative:
BenchmarkFibonacci40-36 51974828 21.9 ns/op 0 B/op 0 allocs/op
22 nanoseconds, roughly one 46-millionth of a second. Doing 40 evaluations takes 40 times longer (single threaded), but let's just go with the single evaluation for now.
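For concreteness, the iterative version I have in mind is something like the sketch below (my reconstruction, not the exact code under discussion; it would live in a _test.go file and run with "go test -bench ."):

package fib

import "testing"

// Fibonacci computes F(n) iteratively: n-1 additions, so 39 for n = 40.
func Fibonacci(n int) int {
	if n < 2 {
		return n
	}
	a, b := 0, 1
	for i := 2; i <= n; i++ {
		a, b = b, a+b
	}
	return b
}

var sink int // assign the result so the compiler cannot discard the call

func BenchmarkFibonacci40(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = Fibonacci(40)
	}
}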
recursive (after swapping the benchmark over to the commented-out FibonacciR call):
BenchmarkFibonacci40-36 3 353790912 ns/op 0 B/op 0 allocs/op
353,790,912 nanoseconds, about a third of a second. The program is not doing more essential work down the Fibonacci path; it is doing a great deal more redundant work, because the same subproblems are recomputed again and again by calls with the same arguments.
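The recursive version would be something like this (again my sketch, using the FibonacciR name from above; the base cases are chosen to match the call counts worked out below):

// FibonacciR is the naive doubly recursive version (valid for n >= 1).
// With base cases at n == 1 and n == 2, computing f(n) makes 2 f(n) - 1
// calls in total, and every non-base call performs exactly one addition.
func FibonacciR(n int) int {
	if n <= 2 {
		return 1
	}
	return FibonacciR(n-1) + FibonacciR(n-2)
}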
f(40) is 102,334,155. Computing it requires 39 additions the first way, and 204,668,309 recursive function calls (2 f(n) - 1) with 102,334,154 additions the second way (every call except the base-case leaves performs one addition). When we divide the 353,790,912 ns/op by the 204,668,309 function calls per op (up from a single call in the iterative case), we get 1.73 ns per function call, including the 102,334,154 required additions.
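Those counts are easy to check directly rather than trusting the closed forms; a throwaway instrumented copy (hypothetical, for counting only) prints exactly the numbers above:

package main

import "fmt"

var calls, adds int64

// fibCount is FibonacciR with counters bolted on.
func fibCount(n int) int {
	calls++
	if n <= 2 {
		return 1
	}
	adds++
	return fibCount(n-1) + fibCount(n-2)
}

func main() {
	fmt.Println(fibCount(40), calls, adds)
	// prints: 102334155 204668309 102334154
}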
Now I realize that the original question is about C performance vs Go performance under this storm of redundant function calls and additions, and that the question is further complicated by various ways and means of concurrent evaluation. Still, in any benchmark it is important to understand what is actually happening and what is being tested, and especially for this kind of microbenchmark, to understand it well enough to decode what the result might mean for a real program.
Here is my opinion of what one might reasonably take away from this single line of testing:
We can do more than 578 million real (not inlined) stack-pushing and popping function calls per second, because they take 1.73 ns each even with the if tests and the addition of stack values. That is a lot of function-call intensity, and it is hard to imagine anything other than recursive Fibonacci or Ackermann function evaluation that would make so many calls, each doing so little.
If it is true that some other language is really 2x faster here (or 10x, or whatever), then once the body of the function is more than an if-test and an add -- say an FFT or a sin() calculation, or writing to memory, or heaven forbid an I/O or OS interaction, or anything else "real" that implicitly takes 100x, 100,000x, or 100,000,000x the time of the function call itself -- the relative speeds will be immeasurably close to 1.
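To make that concrete, here is the shape of a benchmark pair that shows the effect (a hedged sketch; the names and the sin() body are mine, and the ratios will vary by machine). The first function is nearly pure call overhead; the second adds a little real work, and the per-call overhead immediately becomes a small fraction of the total:

package fib

import (
	"math"
	"testing"
)

var fsink float64

//go:noinline
func callOnly(x float64) float64 { return x } // roughly pure call overhead

//go:noinline
func callWithWork(x float64) float64 { return math.Sin(x) } // call + real work

func BenchmarkCallOnly(b *testing.B) {
	for i := 0; i < b.N; i++ {
		fsink = callOnly(float64(i))
	}
}

func BenchmarkCallWithWork(b *testing.B) {
	for i := 0; i < b.N; i++ {
		fsink = callWithWork(float64(i))
	}
}

The heavier the body, the closer any two languages converge on the cost of the body itself.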
This is the problem with the microbenchmark. It is important, but mostly to the compiler writers and CPU architects. For people who write programs to do things, the real benchmark is timing those things on machine A vs machine B, or language A vs language B, or whatever.
Also, and perhaps not obviously, for so-called cloud providers and operators of huge data centers for internal purposes, the real benchmark is even more indirect: generally the time for those things multiplied by watts or by dollars per second (both across millions of CPUs, in some cases of personal knowledge). It is just not possible to understand such things by extrapolating from microbenchmarks. This is not a defense of Go, just a word to the wise.
Michael