OMG the speed!

429 views
Skip to first unread message

ikarus

unread,
Sep 10, 2011, 5:19:56 PM9/10/11
to golang-nuts
Needless to say, I don't come from compiled languages. While playing
around with go-sdl I made a simple goto loop to keep the window open
for a bit to see what I had done so far, just a simple smiley picture.
I set it to quit looping when a var reached... say over 1 billion! I
figured it would give me plenty of time to see it all... I thought I
had messed up when the window popped up and disappeared!

I checked my code and realized it's just counting too fast! Oh wow
that's fast. Just to compare I did the same loop in Lua, one of the
fastest high level languages I know, I've written this whole post and
it hasn't finished yet...

Steffen

unread,
Sep 10, 2011, 6:29:11 PM9/10/11
to golan...@googlegroups.com
Are you sure the compiler did not just optimize the loop away?

Ziad Hatahet

unread,
Sep 10, 2011, 8:06:18 PM9/10/11
to ikarus, golang-nuts
My guess is that it's being optimized to a no-op.

--
Ziad

Devon H. O'Dell

unread,
Sep 10, 2011, 8:42:30 PM9/10/11
to Ziad Hatahet, golang-nuts, ikarus

You may want to use time.Sleep instead for these purposes

Kyle Consalus

unread,
Sep 10, 2011, 8:43:50 PM9/10/11
to Ziad Hatahet, ikarus, golang-nuts
Probably not. I'm not aware of any cases in which 6g/8g optimizes a loop to a no-op. On my machine (2.53 GHz),
for i := 0; i < 1e9; i++ {
}

compiles to a tight 5 instruction loop that takes about 2s to complete.
That seems to match his description.

Gustavo Niemeyer

unread,
Sep 10, 2011, 11:45:13 PM9/10/11
to golang-nuts
> Probably not. I'm not aware of any cases in which 6g/8g optimizes a loop to
> a no-op. On my machine (2.53 GHz),
> for i := 0; i < 1e9; i++ {
> }
> compiles to a tight 5 instruction loop that takes about 2s to complete.

If the loop runs 5 instructions 1e9 times in 2 seconds, that's 2.5e9
instructions in one second, or one instruction every 0.4 nanosecond.
The speed of light is about 299.7M meters per second which is vaguely
29.97 cm per nanosecond, so while the CPU executes one instruction
light has traveled about 29.97 * 0.4 = ~12cm. Thanks to the CPU
designers for that speed, and to the Go designers for not screwing up
a simple operation.

--
Gustavo Niemeyer
http://niemeyer.net
http://niemeyer.net/plus
http://niemeyer.net/twitter
http://niemeyer.net/blog

-- I never filed a patent.

Russel Winder

unread,
Sep 11, 2011, 1:27:05 AM9/11/11
to Gustavo Niemeyer, golang-nuts
On Sun, 2011-09-11 at 00:45 -0300, Gustavo Niemeyer wrote:
> > Probably not. I'm not aware of any cases in which 6g/8g optimizes a loop to
> > a no-op. On my machine (2.53 GHz),
> > for i := 0; i < 1e9; i++ {
> > }
> > compiles to a tight 5 instruction loop that takes about 2s to complete.
>
> If the loop runs 5 instructions 1e9 times in 2 seconds, that's 2.5e9
> instructions in one second, or one instruction every 0.4 nanosecond.
> The speed of light is about 299.7M meters per second which is vaguely
> 29.97 cm per nanosecond, so while the CPU executes one instruction
> light has traveled about 29.97 * 0.4 = ~12cm. Thanks to the CPU
> designers for that speed, and to the Go designers for not screwing up
> a simple operation.

On the other hand, we could find, due to various microbenchmarks
focusing on compute bound operations, that Go is currently about 3 times
slower than C and C++. And, indeed Java.

For example, I have a number of programs computing Pi by quadrature (cf.
the Bazaar branch at http://www.russel.org.uk/Bazaar/Pi_Quadrature),
which is an embarrassingly parallel problem due to the commutativity and
associativity of addition.

In the sequential case the 10⁹ loop takes, on my twin 2.33GHz Xeon
machine running Debian Testing, about 8s in C and C++, about 9s in Java
and about 22s in Go. I have not run these to create statistically
significant results, but this could be done — I am confident the
anecdotal numbers presented here would be borne out.

Moreover, C, C++, and Java (using the appropriate parallelism frameworks
for the language) show near to optimal linear scaling with processor
count. Go using Goroutines shows clear speedups, indeed it looks fairly
linear except that it is not the near optimal linearity C, C++ and Java
achieve. There is clearly an overhead in the goroutine system affecting
the measure, i.e. the slope is not 1.

I appreciate that the core Go development team have eschewed worrying
about optimization of code generation and other optimization issues, but
in doing so they are missing appealing to those C and C++ folk who have
compute bound tasks.

--
Russel.
=============================================================================
Dr Russel Winder t: +44 20 7585 2200 voip: sip:russel...@ekiga.net
41 Buckmaster Road m: +44 7770 465 077 xmpp: rus...@russel.org.uk
London SW11 1EN, UK w: www.russel.org.uk skype: russel_winder

signature.asc

Jim Whitehead II

unread,
Sep 11, 2011, 9:01:42 AM9/11/11
to ikarus, golang-nuts

Lua still requires the virtual machine to execute. Try LuaJIT which is
likely to optimize the loop to native code. You're basically comparing
a CPU to a virtual machine, which doesn't make all that much sense =)

DisposaBoy

unread,
Sep 11, 2011, 9:47:00 AM9/11/11
to golan...@googlegroups.com, ikarus
Isn't that the whole point? otherwise there wouldn't be anything to compare.

unread,
Sep 11, 2011, 11:59:57 AM9/11/11
to golang-nuts
On Sep 10, 11:19 pm, ikarus <iris.bl...@gmail.com> wrote:
> I checked my code and realized it's just counting too fast! Oh wow
> that's fast. Just to compare I did the same loop in Lua, one of the
> fastest high level languages I know,

You probably meant "dynamic" instead of "high level". There are many
high-level languages that run faster than the Lua interpreter.

Go lacks some of the dynamic features of Lua. For example, Go does not
allow the programmer to remove or replace methods at run-time. This
lack of dynamism is one of the reasons why Go was faster than Lua in
your case. If you benchmark codes which are utilizing Lua's dynamic
features a lot, then the performance difference between Go and Lua
would be smaller - Lua might even win some benchmarks.

Minor note: The Lua interpreter does not have integer numbers at all,
it only has floating-point numbers.

TimM.

unread,
Sep 13, 2011, 3:25:51 AM9/13/11
to golang-nuts
Just out of interest, I had a look at the assembly generated by 6g and
gcc (unoptimised and -O) for your PI quadrature program. I don't claim
to be an expert, far from it since this is the first time that I've
ever looked at assembly, but below is a quick analysis that I hope
will help more experience people see why there is a difference between
the loop and whether there are any missed opportunities for
optimisation.

The loops produced and a comparison are available at https://gist.github.com/1213317

The unoptimised assembly for the loop produced by gcc (19
instructions) appears quite similar to that produced by 6g (15
instructions), confirmed by a very similar runtime. The optimised gcc
assembly (8 instructions) is much quicker.

The main difference between the gcc optimised loop and 6g is that the
constants used (0.5, 1.0, and delta) are kept in registers between
iterations rather than MOVed every time and this accounts for three of
the additional instructions.

Two of the additional instructions are to copy the accumulator (sum)
from one register (X6->X1) to another before incrementing it, and
copying it back. Following is the assembly. Uppercase is 6g, lowercase
is gcc -O
MOVSD X6,X1
ADDSD X0,X1
MOVSD X1,X6
versus
addsd %xmm6, %xmm1

The final additional instruction appears to be a superfluous MOV after
converting the index from an integer to a float.
CVTSL2SD AX,X1
MOVSD X1,X0
versus
cvtsi2sd %eax, %xmm0

Hope this is useful to someone.

TimM.

On Sep 11, 6:27 am, Russel Winder <rus...@russel.org.uk> wrote:
> On Sun, 2011-09-11 at 00:45 -0300, Gustavo Niemeyer wrote:
> > > Probably not. I'm not aware of any cases in which 6g/8g optimizes a loop to
> > > a no-op. On my machine (2.53 GHz),
> > > for i := 0; i < 1e9; i++ {
> > > }
> > > compiles to a tight 5 instruction loop that takes about 2s to complete.
>
> > If the loop runs 5 instructions 1e9 times in 2 seconds, that's 2.5e9
> > instructions in one second, or one instruction every 0.4 nanosecond.
> > The speed of light is about 299.7M meters per second which is vaguely
> > 29.97 cm per nanosecond, so while the CPU executes one instruction
> > light has traveled about 29.97 * 0.4 = ~12cm. Thanks to the CPU
> > designers for that speed, and to the Go designers for not screwing up
> > a simple operation.
>
> On the other hand, we could find, due to various microbenchmarks
> focusing on compute bound operations, that Go is currently about 3 times
> slower than C and C++.  And, indeed Java.
>
> For example, I have a number of programs computing Pi by quadrature (cf.
> the Bazaar branch athttp://www.russel.org.uk/Bazaar/Pi_Quadrature),
> which is an embarrassingly parallel problem due to the commutativity and
> associativity of addition.
>
> In the sequential case the 10⁹ loop takes, on my twin 2.33GHz Xeon
> machine running Debian Testing, about 8s in C and C++, about 9s in Java
> and about 22s in Go.  I have not run these to create statistically
> significant results, but this could be done — I am confident the
> anecdotal numbers presented here would be borne out.
>
> Moreover, C, C++, and Java (using the appropriate parallelism frameworks
> for the language) show near to optimal linear scaling with processor
> count.  Go using Goroutines shows clear speedups, indeed it looks fairly
> linear except that it is not the near optimal linearity C, C++ and Java
> achieve.  There is clearly an overhead in the goroutine system affecting
> the measure, i.e. the slope is not 1.
>
> I appreciate that the core Go development team have eschewed worrying
> about optimization of code generation and other optimization issues, but
> in doing so they are missing appealing to those C and C++ folk who have
> compute bound tasks.
>
> --
> Russel.
> =========================================================================== ==
> Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.win...@ekiga.net
> 41 Buckmaster Road    m: +44 7770 465 077   xmpp: rus...@russel.org.uk
> London SW11 1EN, UK   w:www.russel.org.uk skype: russel_winder
>
>  signature.asc
> < 1KViewDownload

roger peppe

unread,
Sep 13, 2011, 5:53:30 AM9/13/11
to TimM., golang-nuts
just a quick note (note that it makes any difference this time AFAICS):
it looks as if you're looking at the output of 6g -S. this is not
the final assembly code, as the linker does a reasonable amount
of optimisation itself. the output of 6l -a is what you want.

Jan Mercl

unread,
Sep 13, 2011, 6:17:30 AM9/13/11
to golan...@googlegroups.com
On Tuesday, September 13, 2011 11:53:30 AM UTC+2, rog wrote:
just a quick note (note that it makes any difference this time AFAICS):
it looks as if you're looking at the output of 6g -S. this is not
the final assembly code, as the linker does a reasonable amount
of optimisation itself.  the output of 6l -a is what you want.

That's new to me and interesting a lot! What kind of optimizations is 6l doing above removing unreachable code? 

roger peppe

unread,
Sep 13, 2011, 6:23:02 AM9/13/11
to golan...@googlegroups.com
it does some peephole optimisation and probably a lot more.

Rob 'Commander' Pike

unread,
Sep 13, 2011, 10:27:35 AM9/13/11
to roger peppe, golan...@googlegroups.com
It's also fair to say that 6g is much weaker at floating-point
optimization than at integer optimization.

-rob

Reply all
Reply to author
Forward
0 new messages