"Strange" ·i function on amd64

ro

unread,

Feb 19, 2013, 12:21:59 AM2/19/13

to golan...@googlegroups.com

When I compile this code using 6g

type t uint32

func (i t) f() uint32 {

return 0

}

I get this output

--- prog list "t.f·i" ---

0137 (a.go:1) TEXT t.f·i+0(SB),$16-16

0138 (a.go:1) MOVL .this+0(FP),BX

0139 (a.go:1) MOVL BX,(SP)

0140 (a.go:1) CALL ,t.f+0(SB)

0141 (a.go:1) MOVL 8(SP),AX

0142 (a.go:1) MOVL AX,.noname+8(FP)

0143 (a.go:1) RET ,

--- prog list "t.f" ---

0095 (a.go:24) TEXT t.f+0(SB),$0-16

0096 (a.go:25) MOVL $0,.noname+8(FP)

0097 (a.go:25) RET

I'm wondering what this "·i" function is.

I have a case where I write the method is assembly (because it's time critical), and pprof tells me that "·i" function is eating about 10% CPU time. The data type is a uint32 as above, strangely the uint64 version of the exact same code doesn't do a roundabout. I was wondering if there is a way to get rid of it.

Dave Cheney

unread,

Feb 19, 2013, 12:23:15 AM2/19/13

to ro, golan...@googlegroups.com

Are you profiling on a Mac ?

> --
> You received this message because you are subscribed to the Google Groups
> "golang-nuts" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to golang-nuts...@googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.
>
>

ro

unread,

Feb 19, 2013, 1:16:31 AM2/19/13

to golan...@googlegroups.com, ro

Linux.

I tried both 1.0.3 and tip.

The function I mentioned is a short assembly function, but gets called many times (depends on the input, a billion times in some cases), and it takes about 80% of the CPU time. This is why in my case, I can't afford to have useless movs and calls.

Dave Cheney

unread,

Feb 19, 2013, 1:19:55 AM2/19/13

to ro, golan...@googlegroups.com, ro

Ok, just checking, pprof on OSX is broken.

Any chance you can post a working code sample so others can investigate?

ro

unread,

Feb 19, 2013, 1:56:57 AM2/19/13

to golan...@googlegroups.com, ro

Let me check if I can post the relevant bits of the code.

I tried to build a simple example that would cause a caller to call the "·i", but I couldn't manage to do it.

Let me re-phrase my question: clearly, compiler generates an extra "·i" function for methods; what is it's purpose? Under what conditions is it called, instead of the real function (the one without "·i")?

Rémy Oudompheng

unread,

Feb 19, 2013, 2:08:05 AM2/19/13

to ro, golang-nuts

using the method from an interface will expect a pointer-sized
receiver: that means a pointer receiver for large values and a .i
method for small ones.

minux

unread,

Feb 19, 2013, 2:12:23 AM2/19/13

to ro, golan...@googlegroups.com

On Tue, Feb 19, 2013 at 2:56 PM, ro <rayneo...@yahoo.com> wrote:

Let me check if I can post the relevant bits of the code.
I tried to build a simple example that would cause a caller to call the "·i", but I couldn't manage to do it.

Let me re-phrase my question: clearly, compiler generates an extra "·i" function for methods; what is it's purpose? Under what conditions is it called, instead of the real function (the one without "·i")?

the ·i wrapper is only used during interface calls.

and Go tip has since removed the ·i wrapper in this case (because calling t.f·i is equivalent to

calling t.f directly).

ro

unread,

Feb 19, 2013, 2:43:19 AM2/19/13

to golan...@googlegroups.com, ro

In my code, I'm looping over an array of interface (a very large array), and after some preprocessing, I'm dispatching the underlying value to interface's worker function, which does the real job.

My understanding is then, interfaces are too expensive for a such tasks due to their base runtime costs. Given the lack of generics, I should either switch to a different language or give up the idea of writing a package/library and copy-paste my (lengthy) code for each different type.

I have a few more questions regarding 6g and assembly code, if I may.

-Is there any documentation (other that rsc's) regarding the internals of interfaces (such as "·i" functions)

-When I ran objdump on my binary, I noticed that 6* injected a test code which may or may not call runtime.morestack00, to the beginning of my assembly function. This is quite bizarre I think, because I think it beats the purpose of hand-writing the code in assembly.

-My other question is, are the calling conventions of gc and gccgo documented anywhere? It seems 6g in gotip requires 8-byte alignment everywhere, which broke the go1.0.3 code "MOVL AX, .noname+24(FP)". I found it out by looking at 6g -S's output.

minux

unread,

Feb 19, 2013, 8:35:24 AM2/19/13

to ro, golan...@googlegroups.com

On Tue, Feb 19, 2013 at 3:43 PM, ro <rayneo...@yahoo.com> wrote:

In my code, I'm looping over an array of interface (a very large array), and after some preprocessing, I'm dispatching the underlying value to interface's worker function, which does the real job.
My understanding is then, interfaces are too expensive for a such tasks due to their base runtime costs. Given the lack of generics, I should either switch to a different language or give up the idea of writing a package/library and copy-paste my (lengthy) code for each different type.

Have you tried your code with latest version of Go? I expect the compiler to perform

better.

I have a few more questions regarding 6g and assembly code, if I may.
-Is there any documentation (other that rsc's) regarding the internals of interfaces (such as "·i" functions)

this is suspecting to change now and then, so i think any such documentation might not be accurate

without constantly updated. follow the development is the best way.

That said, the current function calling mechanism is documented in golang.org/s/go11func.

-When I ran objdump on my binary, I noticed that 6* injected a test code which may or may not call runtime.morestack00, to the beginning of my assembly function. This is quite bizarre I think, because I think it beats the purpose of hand-writing the code in assembly.

that's for the segmented stack, if you function doesn't use the stack for temporaries, or use a small amount

(less than 100 bytes, say), you can annotate your function as non-splitstack, by filling 7 in the second field

of the TEXT statement, like this:

TEXT ·function(SB),7,$16-20

then the linker won't add any calls to runtime.morestack?? in front of your functions.

-My other question is, are the calling conventions of gc and gccgo documented anywhere? It seems 6g in gotip requires 8-byte alignment everywhere, which broke the go1.0.3 code "MOVL AX, .noname+24(FP)". I found it out by looking at 6g -S's output.

gccgo just follows whatever the target ABI specifies.

for 6g, the problem is harder. and indeed, Go 1.1 will introduce incompatible changes to assembler and C ABI.

for the record, it's not the alignment that's changed, it's the size of int that's changed (from 32-bit to 64-bit on amd64).

Nigel Tao

unread,

Feb 19, 2013, 5:02:13 PM2/19/13

to ro, golang-nuts

On Tue, Feb 19, 2013 at 6:43 PM, ro <rayneo...@yahoo.com> wrote:
> -Is there any documentation (other that rsc's) regarding the internals of
> interfaces (such as "·i" functions)

It's not documentation, but $GOROOT/src/pkg/runtime/iface.c is the how
interfaces are dispatched at runtime. Figuring out what cmd/{6g,gc}
does at compile-time, such as generating the ·i wrapper, is harder.

> -When I ran objdump on my binary, I noticed that 6* injected a test code
> which may or may not call runtime.morestack00, to the beginning of my
> assembly function. This is quite bizarre I think, because I think it beats
> the purpose of hand-writing the code in assembly.

Go's stacks are dynamically sized, which is partly why goroutines are
cheaper than system threads.

$GOROOT/src/pkg/runtime/stack.h is worth reading, to learn more.

> -My other question is, are the calling conventions of gc and gccgo
> documented anywhere?

Not really. I wrote
https://groups.google.com/d/msg/golang-nuts/j7GTMNQQxRU/PtrmfYVRFEIJ
a couple of months ago, though.

Nigel Tao

unread,

Feb 19, 2013, 5:11:21 PM2/19/13

to ro, golang-nuts

On Tue, Feb 19, 2013 at 4:21 PM, ro <rayneo...@yahoo.com> wrote:

> I have a case where I write the method is assembly (because it's time
> critical),

I forgot to mention: 6g can inline Go functions, but not asm
functions. Hand-writing your assembly functions can make your program
slower, even if you start with the output of '6g -S' and only make
obvious improvements.

Also, interfaces provide dynamic dispatch, but that indirection isn't
free. It's cheap, but if you're doing it in your inner loop, then it
could become significant, and it may be worth redesigning your code to
work with concrete types. I don't know exactly what problem you're
solving, but in general, doing "x + y" will obviously be much faster
than "x.Add(interface{}(y))".

ro

unread,

Feb 20, 2013, 1:37:54 AM2/20/13

to golan...@googlegroups.com, ro

On Tuesday, February 19, 2013 10:35:24 PM UTC+9, minux wrote:

Have you tried your code with latest version of Go? I expect the compiler to perform
better.

I tried tip as well, but unfortunately it didn't make too much of a difference.

In my case, the Go-related bottlenecks are (in decreasing impact order):

-hash_lookup

-"·i" roundabout for 32-bit structure

-runtime.ifaceeq

this is suspecting to change now and then, so i think any such documentation might not be accurate
without constantly updated. follow the development is the best way.

That said, the current function calling mechanism is documented in golang.org/s/go11func.

Thanks so much! It's been very helpful. But I was actually wondering about what are the scratch registers for each architecture, how are parameters & return values are passed (this one I know for sure from examples & disassembling the code).

that's for the segmented stack, if you function doesn't use the stack for temporaries, or use a small amount
(less than 100 bytes, say), you can annotate your function as non-splitstack, by filling 7 in the second field

of the TEXT statement, like this:
TEXT ·function(SB),7,$16-20

then the linker won't add any calls to runtime.morestack?? in front of your functions.

I didn't use stack at all, so I simply set the final parameter to $0.

I checked out the Plan 9 assembler for this, and it mentioned only first and second bits (suspend profiling & multiple definitions). Thanks!

for 6g, the problem is harder. and indeed, Go 1.1 will introduce incompatible changes to assembler and C ABI.
for the record, it's not the alignment that's changed, it's the size of int that's changed (from 32-bit to 64-bit on amd64).

I see, that must be due to the the length of the slice I was passing in the arguments.

ro

unread,

Feb 20, 2013, 2:14:51 AM2/20/13

to golan...@googlegroups.com, ro

On Wednesday, February 20, 2013 7:02:13 AM UTC+9, Nigel Tao wrote:

It's not documentation, but $GOROOT/src/pkg/runtime/iface.c is the how
interfaces are dispatched at runtime. Figuring out what cmd/{6g,gc}
does at compile-time, such as generating the ·i wrapper, is harder.

Fingers crossed for a future doc :)

Go's stacks are dynamically sized, which is partly why goroutines are
cheaper than system threads.

$GOROOT/src/pkg/runtime/stack.h is worth reading, to learn more.

Indeed in general, but writing in assembly without using the stack, I know for sure that I'm not going to cause any stack overflows.

Will do, thanks!

Not really. I wrote
https://groups.google.com/d/msg/golang-nuts/j7GTMNQQxRU/PtrmfYVRFEIJ
a couple of months ago, though.

Thanks a lot for explaining the cryptic parameters of TEXT!

On Wednesday, February 20, 2013 7:11:21 AM UTC+9, Nigel Tao wrote:

I forgot to mention: 6g can inline Go functions, but not asm
functions. Hand-writing your assembly functions can make your program
slower, even if you start with the output of '6g -S' and only make
obvious improvements.

Thanks. I can only hope for having inline assembly in Go to solve this problem.

This is usually true for C compilers so I always measure changes (gcc's -Ofast -funroll-loops was good enough, but I couldn't use it because cgo call was a LOT more expensive than I thought). But 6g was indeed much slower in my case, partly because overflow flag is only available from within Go, and partly because I can't find a way to make Go emit instructions such as rsqrtss and popcnt. I also couldn't get 6g to unroll my internal loops.

The function calls are indeed making the situation worse that it could however, because it seems there is not way to pass parameters/return values using registers at all!

If possible, I would like to write the loop code (which calls the assembly code) in C or assembly too, but that requires some knowledge about the internals of interfaces and maps.

Which brings me to the question: is there a reliable way of calling hash_lookup and runtime.ifaceeq from assembly/C code?

Also, interfaces provide dynamic dispatch, but that indirection isn't
free. It's cheap, but if you're doing it in your inner loop, then it
could become significant, and it may be worth redesigning your code to
work with concrete types. I don't know exactly what problem you're
solving, but in general, doing "x + y" will obviously be much faster
than "x.Add(interface{}(y))".

I'm using an interface that represents an entry in a database (whose structure and methods depend on the database). So it's the least abstraction I could use. The problem is, course, each item must be processed (looped over an array of the interface) separately and there's just too many entries in the DB. The method that gets called is the one written in assembly, and involves some math depending on the properties of the corresponding database table.

I have two problems due to the overhead of interfaces:

-".i" function redirection of 32-bit data

-runtime.ifaceeq

I don't know a way around the first one.

But if I can write my loop code in C/assembly, I can get around of runtime.ifaceeq.

All being said, to be on the fair side, the larger portion of my code is still in Go. That one looping function however, takes more that 90% of the CPU time (that is with hand-written assembly code), so the benefits overweigh the cons greatly in this singular case.

Elias Naur

unread,

Feb 20, 2013, 6:19:14 AM2/20/13

to golan...@googlegroups.com, ro

Also, interfaces provide dynamic dispatch, but that indirection isn't
free. It's cheap, but if you're doing it in your inner loop, then it
could become significant, and it may be worth redesigning your code to
work with concrete types. I don't know exactly what problem you're
solving, but in general, doing "x + y" will obviously be much faster
than "x.Add(interface{}(y))".

I'm using an interface that represents an entry in a database (whose structure and methods depend on the database). So it's the least abstraction I could use. The problem is, course, each item must be processed (looped over an array of the interface) separately and there's just too many entries in the DB. The method that gets called is the one written in assembly, and involves some math depending on the properties of the corresponding database table.
I have two problems due to the overhead of interfaces:
-".i" function redirection of 32-bit data
-runtime.ifaceeq
I don't know a way around the first one.
But if I can write my loop code in C/assembly, I can get around of runtime.ifaceeq.

All being said, to be on the fair side, the larger portion of my code is still in Go. That one looping function however, takes more that 90% of the CPU time (that is with hand-written assembly code), so the benefits overweigh the cons greatly in this singular case.

If you really need a tight inner loop, why not split your generic DB-entry list into multiple lists, one for each DB type? The interface abstraction would then be on the granularity of a list of entries instead of a single entry. That way you'll avoid the costly interface call, at the cost of writing the loop for every DB type, instead of only once.

- elias

Anthony Martin

unread,

Feb 20, 2013, 7:14:11 AM2/20/13

to ro, golan...@googlegroups.com

ro <rayneo...@yahoo.com> once said:
> Thanks so much! It's been very helpful. But I was actually wondering about
> what are the scratch registers for each architecture, how are parameters &
> return values are passed (this one I know for sure from examples &
> disassembling the code).

Everything is passed on the stack. Registers are caller-saved
so you can use anything except, in some cases, the registers
for the two extern register pointers (M and G).

> I didn't use stack at all, so I simply set the final parameter to $0.
> I checked out the Plan 9 assembler for this, and it mentioned only first
> and second bits (suspend profiling & multiple definitions). Thanks!

The Go assemblers have a few additions that aren't in the Plan 9
assemblers. In this case, you can see what the bits mean by checking
src/cmd/[568]l/[568.out.h.

Anthony

ro

unread,

Feb 20, 2013, 9:36:39 AM2/20/13

to golan...@googlegroups.com, ro

It's already one entry each time. I couldn't see how this would be helpful in my case, which roughly:

func (s *Algorithm) Method(x, m map[uint32][]Interface) {

// some checks

for _, s :=range m[f(x)] {

// some preprocessing

x.CPUEater(s)

// some postprocessing

}

Could you explain it a little bit more?

I could replace Interface with a solid type, but the Algorithm struct has other methods, and they're quite lengthy.
Ideally, I would like to implement this Method in C and force-inline everything, but I need to be able to access maps and do interface calls from C. I have been wondering if there's an accessible API already available.

ro

unread,

Feb 20, 2013, 9:39:41 AM2/20/13

to golan...@googlegroups.com, ro

I see ---explains the insane (in my case) cost of function calls. I wonder whether this problem will be cured in the future.

Thank a lot for the pointer!

ro

unread,

Feb 20, 2013, 10:00:51 AM2/20/13

to golan...@googlegroups.com, ro

To be more clear, the m map[uint32][]Interface is not passed as a parameter, but actually resides in the Algorithm struct. It's the result of a CPU-intensive preprocessing.

And changing it to map[uint32][]Type to avoid interfaces requires me to duplicate a huge amount of code for all types, not just one short loop.

Honestly, I'm getting more and more skeptical whether Go as it is now is a suitable language for type-agnostic/generic intensive computation code: maps are costy (both space and time-wise, comparing to sparseshash), trading generics for inferfaces for the sake of a generic code results in a serious overhead, the optimizer doesn't emit good code (requiring some hand-written assembly code), no inline-assembly and function calls are insanely expensive which takes away a measurable benefit of the hand-written assembly code. I'll try write the Method() in C+inline assembly if there is an API, but otherwise, I have to switch to C++ before I spend too much time on a potentially fruitless adventure.

Nigel Tao

unread,

Feb 20, 2013, 6:34:36 PM2/20/13

to ro, golang-nuts

On Wed, Feb 20, 2013 at 6:14 PM, ro <rayneo...@yahoo.com> wrote:
> I'm using an interface that represents an entry in a database (whose
> structure and methods depend on the database).

If you're reading from a database, I would expect the I/O time to
dominate, not computation. Or, if your CPUEater computation is
complicated enough to be comparable to I/O time, then I would expect
that the interface indirection would be insignificant.

If you can run pprof on your program, the "weblist" command should
indicate where (at the asm/instruction level) you're spending your
time; click on the blue lines of source code to see the asm-level
samples. http://blog.golang.org/2011/06/profiling-go-programs.html is
a 2011-era article on pprof'ing Go code.

If you want to send me your code (off-list), I can take a quick look at it.

ro

unread,

Feb 21, 2013, 1:45:27 AM2/21/13

to golan...@googlegroups.com, ro

I'm not reading from DB. The entries from DB are read from plain files of data array that can directly be mmap'ed.

Anyway, I'm not theoretically speaking, I actually wrote the programs and did the measurements, and already found the bottlenecks I listed above.

I will follow the disassembled code to access maps and rsc's doc on interface calls for interface method calls, and try to rewrite the function in C,

and keep the interface method in assembly.

This should let me eliminate 1) cost of Go function calls 2) runtime.ifaceeq call 3) reduce the impact of".i" indirection. It should also give me better

optimized code (I don't expect much of a difference though).

However, the space inefficiency of Go maps actually limits the scalability of my program on a single machine by an order in the worst case anyway (the overhead sparsehash with sparse_hash_map is about 2 bits per entry), and interface's itable seems to be another space problem for a large array of interface, so most likely I'll have to switch to C++ sooner or later.

Thanks, let me check if I can do it.

Nigel Tao

unread,

Feb 21, 2013, 7:03:13 PM2/21/13

to ro, golang-nuts

On Thu, Feb 21, 2013 at 2:00 AM, ro <rayneo...@yahoo.com> wrote:
> To be more clear, the m map[uint32][]Interface is not passed as a parameter,
> but actually resides in the Algorithm struct. It's the result of a
> CPU-intensive preprocessing.
> And changing it to map[uint32][]Type to avoid interfaces requires me to
> duplicate a huge amount of code for all types, not just one short loop.

If you haven't already considered it, the design space for the map key
type can be more than just []I (generic but slow) versus []T (fast but
all code is duplicated) for some interface type I and some concrete
type T.

For example, the sort package in the standard library is able to apply
the quicksort algorithm to any []T type (footnote 1). The criticial
insight is that you don't pass a slice *of* interface values, you pass
a slice *which itself implements* an interface.

Footnote 1: Or, really, anything that implements sort.Interface. I've
seen someone do the equivalent of "git bisect" using Go's sort.Search.

ro

unread,

Feb 22, 2013, 1:20:13 AM2/22/13

to golan...@googlegroups.com, ro

In my implementation I indeed use two interfaces: Interface and InterfaceSlice. InterfaceSlice has two elements, Len and Elem. Elem returns the ith element, but as an interface.

I'm sorry about my misleading statement in my previous post, it was meant to point out another general space inefficiency. I pass the actual array, because otherwise I need to do an explicit conversion to loop over the original array, and double (or even worse) the used memory for a while.

I'm afraid I'm back on C++ for this project. Thanks so much for everyone's help so far.

I can leave a few comments based on this experience, summarizing this "adventure":

-The discussions on generics mainly focus on elegance and code duplication and interfaces are pointed as a potential drop-in replacement, but there is practical issue at hand: interface function calls cannot be inlined, this is inherent in their structure. A minor side-issue is Interfaces also require some extra magic with itable, function calls become even more expensive, and the space requirements grows with the number of functions defined in the interface.

-Function calls (especially interface methods) are very expensive on Go. The first few arguments aren't passed via registers, return values are passed by memory, all registered are pushed onto the stack for any function call. When this is combined with above fact, and the lack of inline assembly, doing high performance computations in Go can become very expensive. My understanding is the designers consider i/o or network latencies to be the minimum unit of waste, and it's probably true for what the use Go for.

-maps are space-inefficient, hindering scalability.

-The optimizer is not as good and there's no way access instructions like popcnt from Go, requiring some assembly code (=no inlining=function call) (rewriting my abelit-short CPUEater function in assembly gained me at worst 10x speed thanks torsqrtss and popcnt insutrctions, even at the cost of the interface method call BTW).

-And oh the "·i" redirection, which slow things about 20%.

I don't realistically think that Go will have some sort of generics some day (averaged over real programs, I believe cons overweigh pros), but I do hope that 1) the optimizer will get better 2) maps will become more space efficient 3) function calls will become as cheap as C function calls (at the expense of extra code in the compiler).

Rémy Oudompheng

unread,

Feb 22, 2013, 1:44:44 AM2/22/13

to ro, golan...@googlegroups.com

On 2013/2/22 ro <rayneo...@yahoo.com> wrote:
> In my implementation I indeed use two interfaces: Interface and
> InterfaceSlice. InterfaceSlice has two elements, Len and Elem. Elem returns
> the ith element, but as an interface.
> I'm sorry about my misleading statement in my previous post, it was meant to
> point out another general space inefficiency. I pass the actual array,
> because otherwise I need to do an explicit conversion to loop over the
> original array, and double (or even worse) the used memory for a while.
>
> I'm afraid I'm back on C++ for this project. Thanks so much for everyone's
> help so far.

At the moment I don't find the discussion helpful at all. You didn't
show a compilable version of your program, and your observations may
be true, but we cannot interpret them without the corresponding source
code.

> I can leave a few comments based on this experience, summarizing this
> "adventure":
> -The discussions on generics mainly focus on elegance and code duplication
> and interfaces are pointed as a potential drop-in replacement, but there is
> practical issue at hand: interface function calls cannot be inlined, this is
> inherent in their structure.

This is not true, an agressive compiler may inline interface calls in situations
where the concrete type is known. In many situations where interfaces are
used to emulate generics, it is true.

> A minor side-issue is Interfaces also require
> some extra magic with itable, function calls become even more expensive, and
> the space requirements grows with the number of functions defined in the
> interface.

The magic is extremely limited, and I don't understand your space
requirement thing.

> -Function calls (especially interface methods) are very expensive on Go. The
> first few arguments aren't passed via registers, return values are passed by
> memory, all registered are pushed onto the stack for any function call.

Yes, this is true.

> When
> this is combined with above fact, and the lack of inline assembly, doing
> high performance computations in Go can become very expensive. My
> understanding is the designers consider i/o or network latencies to be the
> minimum unit of waste, and it's probably true for what the use Go for.
> -maps are space-inefficient, hindering scalability.

High performance computation make me think of heavy array processing
using float64, but obviously you aren't talking about that.
It is not true either that contributors reason at an I/O lantency granularity.

> -The optimizer is not as good and there's no way access instructions like
> popcnt from Go, requiring some assembly code (=no inlining=function call)
> (rewriting my abelit-short CPUEater function in assembly gained me at worst
> 10x speed thanks torsqrtss and popcnt insutrctions, even at the cost of the
> interface method call BTW).

Please show the original, complete Go version of your CPUEater function.

> -And oh the "·i" redirection, which slow things about 20%.

You could have solved this by using uintptr as base type or using a
pointer receiver.

> I don't realistically think that Go will have some sort of generics some day
> (averaged over real programs, I believe cons overweigh pros), but I do hope
> that 1) the optimizer will get better 2) maps will become more space
> efficient 3) function calls will become as cheap as C function calls (at the
> expense of extra code in the compiler).

Also, remember that there are more than one Go compiler.

Rémy.

Nigel Tao

unread,

Feb 22, 2013, 2:34:17 AM2/22/13

to ro, golang-nuts

On Fri, Feb 22, 2013 at 5:20 PM, ro <rayneo...@yahoo.com> wrote:
> -Function calls (especially interface methods) are very expensive on Go. The
> first few arguments aren't passed via registers, return values are passed by
> memory, all registered are pushed onto the stack for any function call.

This is true of the gc compilers at this point in time, but gccgo
(which I'm not overly familiar with) might use a different calling
convention.

Daniel Morsing

unread,

Feb 22, 2013, 3:30:26 AM2/22/13

to ro, golan...@googlegroups.com

On Fri, Feb 22, 2013 at 7:20 AM, ro <rayneo...@yahoo.com> wrote:
> -Function calls (especially interface methods) are very expensive on Go. The
> first few arguments aren't passed via registers, return values are passed by
> memory, all registered are pushed onto the stack for any function call.

There's a very good reason for this. If the go compiler passed
pointers in a register, and the called function doesn't spill the
register, you can end up with a situation where the garbage collector
cannot see that a piece of memory is referenced.

http://timetobleed.com/the-broken-promises-of-mrireeyarv/ explains the
problem, using ruby as an example.

Ian Lance Taylor

unread,

Feb 22, 2013, 8:13:19 AM2/22/13

to Nigel Tao, ro, golang-nuts

It does. It uses the C calling convention, so on amd64 the first six
pointer-sized arguments are passed in registers, etc. ro's list
combines concerns about the language (lack of generics, difficulty of
inlining method calls) with concerns about a specific implementation
(calling convention, weak optimizer).

Ian

ro

unread,

Feb 23, 2013, 9:26:16 AM2/23/13

to golan...@googlegroups.com, ro

Anyway, I don't have any attachments to Go or C++. A language is another tool for me to get the job done. Investigating and fighting against the bottlenecks of the language (and with language, I mean existing implementations) so far has lost me more time than writing a solid implementation in C++. I didn't want to offend anyone, so just take it easy.

I'm afraid I can't post the actual code. It's of course up to you how you want or don't want to interpret my comments.

Inlining interface method calls may be possible within a package, however I was assuming that packages are compiled separately, and after compilation (that should be, generating the machine code), it's not possible to do such an optimization across packages. According to your reply, this is apparently wrong.

In any case, there is no such optimization today, and I have to write my code today.

"My space requirement thing" is about what Nigel Tao just pointed out: having a slice of interface is not as efficient.

Thank you for clarifying the point about I/O latency granularity. It was a misunderstaning on my side.

I'm sorry, I'm not allowed to post the CPUEater function, or any other portion of the code.

Using uintptr or a pointer receiver defeats the purpose of using uint32 space-wise. Using 64-bit space to store 32-bits of information halves the limit of database size I can handle. x32-ABI limits the memory I can use to an unacceptable level.

I started out with gccgo because it could unroll-loops and I thought all code would be faster with -Ofast. gccgo's calling convention alleviates the problem in many cases, but it's still not free, and it wouldn't inline an assembly code or interface method call in a time-critical loop. I also noticed that standard libraries (in particular, math) are written in pure Go.

My actual reasons for moving on to 6g later are 1) with gccgo, the memory footproint was much higher (about 12x, I didn't investigate the reason but I'm guessing it may be garbage-collection related) 2) 6g benchmark numbers were partially better.

Eventually, I completely ported my code to pure C++ (without any assembly), while trying to keeping the copy-pasted Go code as intact as possible. I replaced Go maps with sparsehash and slices with C arrays or STL vectors. For those who'd be interested, here are the numbers. The function containing the CPUEater is about 2x faster on average due to inlining and using sparsehash (even though g++'s assembly output is worse than the hand-written assembly in Go version).

There is another piece of time-consuming code which I benchmarked separately. I never got to the optimization stage in Go, but the same code, (except I replaced Go slice of a non-interface type with a STL vector, thus appends with push_backs) is running 10x faster. This is the piece of code that takes up most of the total running time, so the switch paid off better that I expected. Memory usage is only 1/5 of the Go program, which I attribute to sparse_hash_map. When I have time, I will come back to this and try to figure out the potential bottlenecks, and try to make minimal examples in Go and C that will help pinpointing the bottlenecks in Go implementations in my case.

Kyle Lemons

unread,

Feb 23, 2013, 1:19:25 PM2/23/13

to ro, golang-nuts

Thank you for the details; of the points you mention, I personally would like to see inlining of assembly functions at some point. (Side note, inlining does work cross-package.) Inlining interface calls, unfortunately, requires a lot more static analysis than exists currently, and would require a lot of heuristics to decide when and how far to dive into the call tree when one knows an interface's concrete type.

--

Andy Balholm

unread,

Feb 23, 2013, 1:55:18 PM2/23/13

to golan...@googlegroups.com, ro

It sounds to me like your program was designed with C++ templates in mind, and then literally translated to Go, using interfaces for genericity. Writing C++ in Go doesn't usually work very well. As you learned, no amount of micro-optimization can overcome the runtime overhead of using interfaces that way.

But there is likely a more idiomatic way that the problem could be solved in Go that would yield much better performance—although it would still probably be somewhat slower than C++. However, it would take a major redesign of the program.

ro

unread,

Feb 26, 2013, 12:05:24 AM2/26/13

to golan...@googlegroups.com, ro

It is probably true that after years of brainwashing by C++, the way I think is shaped in that particular direction.

But the code is supposed to be library, and it is supposed to accept different possible types. Given these, I can't see much of a choice here (other than copying the code for each different instance, which is what templates do for you anyway).

Out of curiosity, I just pulled the tip again to see the impact of the recent commits, and found a serious performance regression: the function containing the loop is 1800x slower with the tip (in comparison with 1.0.3). The method is basically.

func (a *Algorithm) Method(x, maps []map[uint32][]Interface) {

// some checks

for _, m := range maps {

for _, s :=range m[f(x)] {

// some preprocessing

x.CPUEater(s)

// some postprocessing

}

Rémy Oudompheng

unread,

Feb 26, 2013, 12:20:20 AM2/26/13

to ro, golan...@googlegroups.com

On 2013/2/26 ro <rayneo...@yahoo.com> wrote:
> It is probably true that after years of brainwashing by C++, the way I think
> is shaped in that particular direction.
> But the code is supposed to be library, and it is supposed to accept
> different possible types. Given these, I can't see much of a choice here
> (other than copying the code for each different instance, which is what
> templates do for you anyway).
>
> Out of curiosity, I just pulled the tip again to see the impact of the
> recent commits, and found a serious performance regression: the function
> containing the loop is 1800x slower with the tip (in comparison with 1.0.3).
> The method is basically.
>
> func (a *Algorithm) Method(x, maps []map[uint32][]Interface) {
> // some checks
> for _, m := range maps {
> for _, s :=range m[f(x)] {
> // some preprocessing
> x.CPUEater(s)
> // some postprocessing
> }
> }
> }

Again, it is absolutely necessary that you provide us with some
compilable example so that your benchmarks become reproducible by
everybody. A 1800x slowdown is so ridiculous it can only be explained
by obvious reasons.

If for some reason your CPUEater implementation cannot be made public,
please try to provide a simple replacement for this function that
still demonstrates the regression in a standalone program: it will
help fixing it.

Rémy.

ro

unread,

Feb 26, 2013, 4:33:34 AM2/26/13

to golan...@googlegroups.com, ro

It seems this has nothing to do with the CPUEater. I commented out everything inside the loop, and still the code is 160x slow when compared against 1.0.3.

When I introduce a comparison s == x (both are of Interface type) inside the loop, I get 1200x slowness in tip vs 1.0.3.

It appears that interfaces (comparisons in particular) suddenly got a lot more expensive.

I will try to write a simple benchmark code in the weekend.

Jan Mercl

unread,

Feb 26, 2013, 4:41:04 AM2/26/13

to ro, golang-nuts

On Tue, Feb 26, 2013 at 10:33 AM, ro <rayneo...@yahoo.com> wrote:
> It seems this has nothing to do with the CPUEater. I commented out
> everything inside the loop, and still the code is 160x slow when compared
> against 1.0.3.
> When I introduce a comparison s == x (both are of Interface type) inside the
> loop, I get 1200x slowness in tip vs 1.0.3.
>
> It appears that interfaces (comparisons in particular) suddenly got a lot
> more expensive.
> I will try to write a simple benchmark code in the weekend.

Regressions happen, but 1200x is way beyond any reasonably expect-able
number. Please, please, please post the code which says 1200x or so. I
suspect a measurement methodology error is more probable than a 1200x
slowdown of almost anything.

-j

ro

unread,

Feb 26, 2013, 5:15:46 AM2/26/13

to golan...@googlegroups.com, ro

I'm terribly sorry about writing about it without examining thoroughly:

go tip broke the assembly code I wrote (return address, due to int: int32->int64 change), and it somehow ended up making the number of elements in the loop much longer. If anything, tip produces slightly faster.

Please ignore my previous 2 comments.

Reply all

Reply to author

Forward