plan for two-word funcs, take 2


Russ Cox

Feb 18, 2013, 6:32:18 PM
to golang-dev
http://golang.org/s/go11func
Please comment here (in this thread).

Russ

Nigel Tao

Feb 18, 2013, 7:56:31 PM
to Russ Cox, golang-dev
"Method values" has this motivating example:

var r io.Reader
f := r.Read // f has type func([]byte) (int, error)
n, err := f(p)

Are method values restricted to interface types, or can I also do:

var v Value
f := v.M // f has type func(int)
f(1)



IIRC, the previous document considered a dedicated context register.
Can you briefly comment on using a per-call-frame context 'gap'
instead of a dedicated context register?

Russ Cox

Feb 18, 2013, 8:45:30 PM
to Nigel Tao, golang-dev
On Mon, Feb 18, 2013 at 7:56 PM, Nigel Tao <nige...@golang.org> wrote:
"Method values" has this motivating example:

var r io.Reader
f := r.Read  // f has type func([]byte) (int, error)
n, err := f(p)

Are method values restricted to interface types, or can I also do:

var v Value
f := v.M  // f has type func(int)
f(1)

Both are fine.
 
Can you briefly comment on using a per-call-frame context 'gap'
instead of a dedicated context register?

Using a fixed register to hold the argument on entry to the function complicates the toolchain. It means teaching the optimizers about the existence of a live register on entry to the function and it means having to store it somewhere in each function across calls. It also means that C cannot make such a call directly, though of course we could write a helper in assembly. Using a register is a fair amount of work, so there has to be some reward for it.

If using a register meant that we could get complete backwards compatibility of assembly and C code between Go 1.0 and Go 1.1, I might be more inclined to do it. But we've already broken that by changing the size of int, slice, and string (and we're also changing the size of func), so I don't think it's a big deal to break things further. Also, I think that if we hook my asmlint code in somehow, perhaps as part of go vet, it won't be hard to update assembly.

It's definitely not clear cut either way.

Russ

Ian Lance Taylor

Feb 18, 2013, 8:57:41 PM
to Russ Cox, golang-dev
On Mon, Feb 18, 2013 at 3:32 PM, Russ Cox <r...@golang.org> wrote:
> http://golang.org/s/go11func
> Please comment here (in this thread).

"The only difference is that if r is itself an interface value, in
order to call the Read method both halves of the (type, word) pair
must be passed to the adapter, so a copy of the interface value must
be allocated."

type I interface { Get() int }

func F(i I) func() int {
    return i.Get
}

If I understand this correctly, you are suggesting that the compiled
form of F will

1) Allocate memory to hold an interface value, i.e., two words of
memory. Call that p.
2) Copy i into the newly allocated memory p.
3) Return a pair (F', p).
4) The compiler will create F', which will look more or less like this:
    func F'(p *I) int {
        return (*p).Get()
    }
Here the value passed for p will come from the pair (F', p), so
this way of writing the function is deceptive. From the caller's
perspective, the function will not take any arguments.


What is the performance effect of adding an extra word of stack space
to every function call? For a direct toplevel call or a direct call
of a method that word will not need to be initialized. For an
indirect call it will of course be set to the word from the function
value. This will use extra stack space in programs with a deep call
tree.

Ian

Ian Lance Taylor

Feb 18, 2013, 9:06:27 PM
to Russ Cox, Nigel Tao, golang-dev
On Mon, Feb 18, 2013 at 5:45 PM, Russ Cox <r...@golang.org> wrote:
>
> Using a fixed register to hold the argument on entry to the function
> complicates the toolchain. It means teaching the optimizers about the
> existence of a live register on entry to the function and it means having to
> store it somewhere in each function across calls. It also means that C
> cannot make such a call directly, though of course we could write a helper
> in assembly. Using a register is a fair amount of work, so there has to be
> some reward for it.

It seems that we only have to teach the optimizers about the existence
of a live register when compiling a func literal or a
compiler-generated adapter function. The other cases ignore the extra
parameter, so there is no live register even though an indirect call
will set the register before making the call. It's true that a func
literal or adapter function needs to copy the register into a local
variable on function entry, but that seems straightforward enough.
And it's true that C code can't call a Go function directly, but how
much such code is there in the runtime? And actually 6c/8c used to
support register variables for m and g, perhaps we could resurrect
that support for this.

There is a reward for this approach: we use less stack space. For the
most common case of direct calls to top-level functions or methods,
there is no penalty at all. For indirect calls we assign a value to
the special register rather than pushing it on the stack, which is if
anything faster.

It seems to me that there is a real case for using a special register.

Ian

Nigel Tao

Feb 18, 2013, 11:25:23 PM
to Russ Cox, golang-dev
"This document supercedes the "Alternatives to Dynamic Code Generation
in Go" document I circulated in September 2012."

For the record, the previous discussion is at
https://groups.google.com/forum/?fromgroups=#!topic/golang-dev/G-uGL-jpOFw

Rob Pike

Feb 19, 2013, 12:53:44 AM
to Nigel Tao, Russ Cox, golang-dev
and for the record, it's spelled 'supersede'. i'll read the doc tomorrow.

-rob

Jan Mercl

Feb 19, 2013, 3:02:18 PM
to Russ Cox, golang-dev
On Tue, Feb 19, 2013 at 12:32 AM, Russ Cox <r...@golang.org> wrote:

In the referenced document I don't see one particular option for the two-word problem, an option that seems feasible to me too, at least in theory (and if I'm not completely mistaken).

---
Function pointers could stay a single word as they are, if that word would become a pointer to a two-word structure like:

    type fp_t struct { code, data unsafe.Pointer }

Then the generated (pseudo) code is something like:

    move reg1 <- fp->code;
    move reg2 <- fp->data;
    call reg1;
or
    move reg3 <- fp->data;
    call fp->code;

1) The fp_t instance can be allocated in text segment whenever all of the values are known statically:

    func foo() {}

    ...
    f := foo

2) The fp_t instance can be allocated in stack whenever the closure doesn't leak:

    func bar() {
        var x int
        f := func() { x++ }
        ...
    }

3) In other cases the fp_t instance would have to be allocated in the heap.
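For concreteness, here is a toy simulation of the proposed layout in ordinary Go (the names and the explicit data parameter are illustrative only, not anything the runtime would actually use):

```go
package main

import "fmt"

// fpT mirrors the proposed two-word structure: the func value itself
// would stay a single word, a pointer to this pair. The "code pointer"
// is modeled as an ordinary Go func taking the context word explicitly.
type fpT struct {
    code func(data *int)
    data *int
}

func main() {
    x := 0
    // A "closure" that increments its captured variable via the data word.
    inc := &fpT{code: func(d *int) { *d++ }, data: &x}

    // What the generated indirect-call sequence amounts to:
    // load code, load data, call code passing data.
    inc.code(inc.data)
    inc.code(inc.data)
    fmt.Println(x) // 2
}
```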

---

I'm really not sure that this idea isn't actually broken (sorry for the noise if so). But in case it isn't, I would like to ask: was this approach considered too?

-j

Russ Cox

Feb 21, 2013, 10:07:35 AM
to Jan Mercl, golang-dev
On Tue, Feb 19, 2013 at 3:02 PM, Jan Mercl <0xj...@gmail.com> wrote:
Function pointers could stay a single word as they are, if that word would become a pointer to a two-word structure like:

Thank you very much for this suggestion.

Early on we used indirection like this for strings and slices, but it meant doing an allocation any time you wanted to make a string or slice value. In loops that reslice on each iteration, the garbage generated is prohibitively expensive. Instead, we made them multiword values, avoiding an indirection and the allocations. Lesson learned: don't indirect.

As a result of having learned this lesson, I never considered using an indirect for func values. The indirect hurts most for strings and slices because it is common to derive one string or slice value from another, and that operation needs to be cheap. There is no equivalent operation on funcs: pretty much all you can do is call them. So using an indirection in this case would not necessarily generate garbage like it did for strings and slices.

And if we do the indirect then the func representation remains a single pointer, meaning it doesn't change size, so lots of complications disappear. It's an excellent idea.

I've adopted your suggestion in a slightly modified form: instead of a func value being a pointer to a two-word struct, I made it a pointer to a variable-sized struct: the only constraint is that the first word be the code pointer. So I did remove one indirection. Doing so means that even corner cases like reflection can avoid allocation. The only required allocation is in closures, and then it's just the one allocation that was going to happen anyway.

I have also changed the plan to use a register for the context argument, as Nigel and Ian encouraged. This avoids any changes to the calling convention by hand-written C or assembly functions.

I've revised the doc to incorporate these changes. Everything is now much simpler.

Thanks again.
Russ

Brad Fitzpatrick

Feb 21, 2013, 10:52:24 AM
to Russ Cox, Jan Mercl, golang-dev
It's almost simpler than what exists today. Nice.

The reflection section is fun.


Ian Lance Taylor

Feb 21, 2013, 11:18:31 AM
to Russ Cox, Jan Mercl, golang-dev
LGTM

Rémy Oudompheng

Feb 21, 2013, 11:38:10 AM
to Russ Cox, Jan Mercl, golang-dev
On 2013/2/21 Russ Cox <r...@golang.org> wrote:
> I've adopted your suggestion in a slightly modified form: instead of a func
> value being a pointer to a two-word struct, I made it a pointer to a
> variable-sized struct: the only constraint is that the first word be the
> code pointer. So I did remove one indirection. Doing so means that even
> corner cases like reflection can avoid allocation. The only required
> allocation is in closures, and then it's just the one allocation that was
> going to happen anyway.
>
> I have also changed the plan to use a register for the context argument, as
> Nigel and Ian encouraged. This avoids any changes to the calling convention
> by hand-written C or assembly functions.
>
> I've revised the doc to incorporate these changes. Everything is now much
> simpler.

I like the proposal very much. Can you explain what
reflect.ValueOf(f).Pointer() will return?
Currently it is the address of the code, when f becomes a pointer to a
structure, will it return the value of that pointer?

Some people feed this pointer to runtime.FuncForPC, so it may deserve a remark.

Rémy.

Russ Cox

Feb 21, 2013, 2:48:15 PM
to Rémy Oudompheng, Jan Mercl, golang-dev
On Thu, Feb 21, 2013 at 11:38 AM, Rémy Oudompheng <remyoud...@gmail.com> wrote:
I like the proposal very much. Can you explain what
reflect.ValueOf(f).Pointer() will return?
Currently it is the address of the code, when f becomes a pointer to a
structure, will it return the value of that pointer?

That's what I had intended but...
 
Some people feed this pointer to runtime.FuncForPC, so it may deserve a remark.

In order to support this use case, I've changed it to return the code pointer.
This means that two different funcs might have the same Pointer() result, but so be it.

Russ

Russ Cox

Feb 22, 2013, 4:49:48 PM
to Rémy Oudompheng, Jan Mercl, golang-dev
The new func representation is now in. The Windows port still has some runtime code generation for use as system call callbacks, but that seems to be unavoidable. Except for that case, there is no runtime code generation left in the tree.

Receiver-curried methods are not in yet. I will probably put those off for a little while and try to make progress on the linker instead.

If anyone wants to do the experiment, I'd be curious to see benchmark numbers for any system, but particularly ARM, since we never flush the icache anymore.

Russ

Brad Fitzpatrick

Feb 22, 2013, 5:04:35 PM
to Russ Cox, Rémy Oudompheng, Jan Mercl, golang-dev
On linux/amd64 (12-core Intel(R) Xeon(R) CPU W3690 @ 3.47GHz, GOMAXPROCS=1)

From 7dc8e8103052 to tip ba6a123a1ea3

benchmark                 old ns/op    new ns/op    delta
BenchmarkBinaryTree17    5238642582   5137215264   -1.94%
BenchmarkFannkuch11      3523101027   3494765735   -0.80%
BenchmarkGobDecode         11115574     11026784   -0.80%
BenchmarkGobEncode          9345293     12108187  +29.56%
BenchmarkGzip             494476432    491324209   -0.64%
BenchmarkGunzip           119877300    119801987   -0.06%
BenchmarkJSONEncode        42208170     42178626   -0.07%
BenchmarkJSONDecode       103091814    102425879   -0.65%
BenchmarkMandelbrot200      4208892      4198138   -0.26%
BenchmarkParse              6464173      6616809   +2.36%
BenchmarkRevcomp          851281928    850824748   -0.05%
BenchmarkTemplate         113207785    118099431   +4.32%

benchmark                  old MB/s     new MB/s  speedup
BenchmarkGobDecode            69.05        69.61    1.01x
BenchmarkGobEncode            82.13        63.39    0.77x
BenchmarkGzip                 39.24        39.49    1.01x
BenchmarkGunzip              161.87       161.97    1.00x
BenchmarkJSONEncode           45.97        46.01    1.00x
BenchmarkJSONDecode           18.82        18.95    1.01x
BenchmarkParse                 8.96         8.75    0.98x
BenchmarkRevcomp             298.57       298.73    1.00x
BenchmarkTemplate             17.14        16.43    0.96x

Nigel Tao

Feb 22, 2013, 7:40:58 PM
to Brad Fitzpatrick, Russ Cox, Rémy Oudompheng, Jan Mercl, golang-dev
On Sat, Feb 23, 2013 at 9:04 AM, Brad Fitzpatrick <brad...@golang.org> wrote:
> BenchmarkGobEncode 9345293 12108187 +29.56%

Yeah, I see the same (on linux/amd64). Before/after pprof top10s:

Total: 123 samples
    23  18.7%  18.7%   23  18.7% runtime.memmove
    13  10.6%  29.3%   86  69.9% encoding/gob.(*Encoder).encodeStruct
     8   6.5%  35.8%   30  24.4% bytes.(*Buffer).Write
     8   6.5%  42.3%   15  12.2% runtime.newstack
     7   5.7%  48.0%   13  10.6% bytes.(*Buffer).WriteByte
     7   5.7%  53.7%   13  10.6% bytes.(*Buffer).grow
     6   4.9%  58.5%    6   4.9% runtime.stackalloc
     5   4.1%  62.6%   40  32.5% encoding/gob.encInt64
     5   4.1%  66.7%   16  13.0% runtime.lessstack
     5   4.1%  70.7%   11   8.9% runtime.oldstack

Total: 156 samples
    24  15.4%  15.4%   24  15.4% runtime.memmove
    17  10.9%  26.3%   21  13.5% bytes.(*Buffer).WriteByte
    15   9.6%  35.9%  107  68.6% encoding/gob.(*Encoder).encodeStruct
    12   7.7%  43.6%   58  37.2% encoding/gob.(*encoderState).encodeUint
     9   5.8%  49.4%   47  30.1% encoding/gob.encInt64
     9   5.8%  55.1%   18  11.5% runtime.oldstack
     9   5.8%  60.9%    9   5.8% runtime.stackalloc
     9   5.8%  66.7%    9   5.8% runtime.stackfree
     8   5.1%  71.8%   14   9.0% bytes.(*Buffer).grow
     8   5.1%  76.9%   16  10.3% runtime.newstack

Anthony Starks

Feb 22, 2013, 10:53:18 PM
to golan...@googlegroups.com, Rémy Oudompheng, Jan Mercl
Here's the benchmark for the Raspberry Pi:

pi@raspberrypi ~/go/test/bench $ go version
go version devel +ddb9e6365e57 Wed Feb 20 15:35:27 2013 -0800 linux/arm
pi@raspberrypi ~/go/test/bench $ go test -test.bench='.*' ./...
?       _/home/pi/go/test/bench/garbage [no test files]
testing: warning: no tests to run
PASS
BenchmarkBinaryTree17          1    131488202467 ns/op
BenchmarkFannkuch11            1     61976254131 ns/op
BenchmarkGobDecode             5       424145307 ns/op     1.81 MB/s
BenchmarkGobEncode            20       115032849 ns/op     6.67 MB/s
BenchmarkGzip                  1     13868472766 ns/op     1.40 MB/s
BenchmarkGunzip                1      1728657231 ns/op    11.23 MB/s
BenchmarkJSONEncode            1      2374700526 ns/op     0.82 MB/s
BenchmarkJSONDecode            1      5110739064 ns/op     0.38 MB/s
BenchmarkMandelbrot200        20        96515515 ns/op
BenchmarkParse                10       216403777 ns/op     0.27 MB/s
BenchmarkRevcomp              10       192566195 ns/op    13.20 MB/s
BenchmarkTemplate              1      6090635318 ns/op     0.32 MB/s
ok      _/home/pi/go/test/bench/go1 266.752s
?       _/home/pi/go/test/bench/shootout    [no test files]

pi@raspberrypi ~/go/test/bench $ go version
go version devel +e00da5e201e5 Fri Feb 22 17:16:31 2013 -0800 linux/arm
pi@raspberrypi ~/go/test/bench $ go test -test.bench='.*' ./...
?       _/home/pi/go/test/bench/garbage [no test files]
testing: warning: no tests to run
PASS
BenchmarkBinaryTree17          1    112637283111 ns/op
BenchmarkFannkuch11            1     61972329989 ns/op
BenchmarkGobDecode             5       383073401 ns/op     2.00 MB/s
BenchmarkGobEncode            20       120332484 ns/op     6.38 MB/s
BenchmarkGzip                  1     13493855517 ns/op     1.44 MB/s
BenchmarkGunzip                1      1680585482 ns/op    11.55 MB/s
BenchmarkJSONEncode            1      2377737536 ns/op     0.82 MB/s
BenchmarkJSONDecode            1      4877799205 ns/op     0.40 MB/s
BenchmarkMandelbrot200        20        96478503 ns/op
BenchmarkParse                10       214091185 ns/op     0.27 MB/s
BenchmarkRevcomp              10       196115543 ns/op    12.96 MB/s
BenchmarkTemplate              1      5572935111 ns/op     0.35 MB/s
ok      _/home/pi/go/test/bench/go1 245.640s
?       _/home/pi/go/test/bench/shootout    [no test files]

Dave Cheney

Feb 22, 2013, 10:56:50 PM
to Anthony Starks, golan...@googlegroups.com, Rémy Oudompheng, Jan Mercl
Can you feed those to $GOROOT/misc/benchcmp to make them easier to digest, please?

Anthony Starks

Feb 22, 2013, 11:19:32 PM
to golan...@googlegroups.com
Thanks to andrey mirtchovski, here is the reformatted Raspberry Pi data:

go version devel +ddb9e6365e57 Wed Feb 20 15:35:27 2013 -0800 linux/arm vs.
go version devel +e00da5e201e5 Fri Feb 22 17:16:31 2013 -0800 linux/arm



benchmark                   old ns/op     new ns/op    delta
BenchmarkBinaryTree17    131488202467  112637283111  -14.34%
BenchmarkFannkuch11       61976254131   61972329989   -0.01%
BenchmarkGobDecode          424145307     383073401   -9.68%
BenchmarkGobEncode          115032849     120332484   +4.61%
BenchmarkGzip             13868472766   13493855517   -2.70%
BenchmarkGunzip            1728657231    1680585482   -2.78%
BenchmarkJSONEncode        2374700526    2377737536   +0.13%
BenchmarkJSONDecode        5110739064    4877799205   -4.56%
BenchmarkMandelbrot200       96515515      96478503   -0.04%
BenchmarkParse              216403777     214091185   -1.07%
BenchmarkRevcomp            192566195     196115543   +1.84%
BenchmarkTemplate          6090635318    5572935111   -8.50%

benchmark                old MB/s  new MB/s  speedup
BenchmarkGobDecode           1.81      2.00    1.10x
BenchmarkGobEncode           6.67      6.38    0.96x
BenchmarkGzip                1.40      1.44    1.03x
BenchmarkGunzip             11.23     11.55    1.03x
BenchmarkJSONEncode          0.82      0.82    1.00x
BenchmarkJSONDecode          0.38      0.40    1.05x
BenchmarkParse               0.27      0.27    1.00x
BenchmarkRevcomp            13.20     12.96    0.98x
BenchmarkTemplate            0.32      0.35    1.09x

Russ Cox

Feb 22, 2013, 11:22:31 PM
to Anthony Starks, golan...@googlegroups.com
How about 

go test runtime -bench Closure

?

Anthony Starks

Feb 22, 2013, 11:26:59 PM
to Russ Cox, golan...@googlegroups.com
pi@raspberrypi ~ $ go version
go version devel +e00da5e201e5 Fri Feb 22 17:16:31 2013 -0800 linux/arm
pi@raspberrypi ~ $ go test runtime -bench Closure
PASS
BenchmarkCallClosure     50000000      48.7 ns/op
BenchmarkCallClosure1     1000000      1744 ns/op
BenchmarkCallClosure2     1000000      3198 ns/op
BenchmarkCallClosure3     1000000      2952 ns/op
BenchmarkCallClosure4     1000000      2960 ns/op
ok      runtime 111.372s

Dave Cheney

Feb 22, 2013, 11:33:51 PM
to Anthony Starks, golan...@googlegroups.com
Thanks Anthony. I'm not surprised there isn't a bigger improvement in
these benchmarks. I am of the opinion that all the low hanging fruit
wrt. closures has been expunged from any code that is part of the go1
benchmarking suite.

What might be more interesting would be to compare the test times for
go/* packages (thinking types and parser), and the runtime of the api
tool.

Jan Mercl

Feb 23, 2013, 12:41:17 PM
to Russ Cox, golan...@googlegroups.com
jnml@fsc-r630:~/tmp$ cat /proc/cpuinfo | head
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 4
model name : Intel(R) Xeon(TM) CPU 3.20GHz
stepping : 1
microcode : 0x17
cpu MHz : 3200.181
cache size : 1024 KB
physical id : 0
jnml@fsc-r630:~/tmp$ cat old-ver
go version devel +aee6d7fe395a Sat Jan 26 18:16:43 2013 -0800 linux/amd64
jnml@fsc-r630:~/tmp$ go version
go version devel +04d884e0f760 Sat Feb 23 20:24:38 2013 +0800 linux/amd64
jnml@fsc-r630:~/tmp$ benchcmp old.txt new.txt
benchmark              old ns/op  new ns/op      delta
BenchmarkCallClosure           6          6     -0.81%
BenchmarkCallClosure1          6        231  +3342.62%
BenchmarkCallClosure2        144        379   +163.19%
BenchmarkCallClosure3        141        378   +168.09%
BenchmarkCallClosure4        148        379   +156.08%

benchmark              old allocs  new allocs    delta
BenchmarkCallClosure            0           0     n/a%
BenchmarkCallClosure1           0           1     n/a%
BenchmarkCallClosure2           1           2  100.00%
BenchmarkCallClosure3           1           2  100.00%
BenchmarkCallClosure4           1           2  100.00%

benchmark              old bytes  new bytes    delta
BenchmarkCallClosure           0          0     n/a%
BenchmarkCallClosure1          0         17     n/a%
BenchmarkCallClosure2          8         25  212.50%
BenchmarkCallClosure3          8         25  212.50%
BenchmarkCallClosure4          8         25  212.50%
jnml@fsc-r630:~/tmp$

-j

Kyle Lemons

Feb 23, 2013, 1:36:19 PM
to Jan Mercl, Russ Cox, golan...@googlegroups.com
Based on a separate thread discussing the performance of calls through an interface, I was wondering if this proposal can/does improve the situation.  Unfortunately, I can't quite tell by reading the proposal...

When calling a method through an interface, even with the itab caching, we incur a lot of overhead.  It seems like the following:
for { r.Read(b) }

could be "manually hoisted" as follows:
read := r.Read
for { read(b) }

Would doing so improve performance?  Naively, it seems like the compiler could do the work up front to pull out the method from the correct itab and use a func value functionally identical to what would have been stored if the following had been written:
read := r.(*net.TCPConn).Read
for { read(b) }


Jan Mercl

Feb 23, 2013, 5:25:08 PM
to Kyle Lemons, Russ Cox, golan...@googlegroups.com
On Sat, Feb 23, 2013 at 7:36 PM, Kyle Lemons <kev...@google.com> wrote:
> Based on a separate thread discussing the performance of calls through an
> interface, I was wondering if this proposal can/does improve the situation.
> Unfortunately, I can't quite tell by reading the proposal...
>
> When calling a method through an interface, even with the itab caching, we
> incur a lot of overhead. It seems like the following:
> for { r.Read(b) }
>
> could be "manually hoisted" as follows:
> read := r.Read
> for { read(b) }
>
> Would doing so improve performance?

I wanted to know the answer too. AFAICS, the benchmarks in
closure_test.go measure closure _creation performance_. Your idea is
about closure _execution performance_, which I guess dominates the
usage cases.

The same two revisions as before, same machine.

----

jnml@fsc-r630:~/go/src/pkg/runtime$ cat closure2_test.go
// Copyright 2011 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

package runtime_test

import "testing"

func BenchmarkCall2Closure(b *testing.B) {
    f := func(ii int) int { return 2 * ii }
    for i := 0; i < b.N; i++ {
        s += f(i)
    }
}

func BenchmarkCall2Closure1(b *testing.B) {
    var j int
    f := func(ii int) int { return 2*ii + j }
    for i := 0; i < b.N; i++ {
        j = i
        s += f(i)
    }
}

func BenchmarkCall2Closure2(b *testing.B) {
    var j int
    f := func() int {
        ss = &j
        return 2
    }
    for i := 0; i < b.N; i++ {
        j = i
        s += f()
    }
}
jnml@fsc-r630:~/go/src/pkg/runtime$ benchcmp ~/tmp/old2.txt ~/tmp/new2.txt
benchmark               old ns/op  new ns/op    delta
BenchmarkCall2Closure           5          5   -7.69%
BenchmarkCall2Closure1         10          5  -44.42%
BenchmarkCall2Closure2         10          5  -47.38%

benchmark               old allocs  new allocs  delta
BenchmarkCall2Closure            0           0   n/a%
BenchmarkCall2Closure1           0           0   n/a%
BenchmarkCall2Closure2           0           0   n/a%

benchmark               old bytes  new bytes  delta
BenchmarkCall2Closure           0          0   n/a%
BenchmarkCall2Closure1          0          0   n/a%
BenchmarkCall2Closure2          0          0   n/a%
jnml@fsc-r630:~/go/src/pkg/runtime$

-j

Kyle Lemons

Feb 23, 2013, 5:50:14 PM
to Jan Mercl, Russ Cox, golan...@googlegroups.com
On Sat, Feb 23, 2013 at 2:25 PM, Jan Mercl <0xj...@gmail.com> wrote:
On Sat, Feb 23, 2013 at 7:36 PM, Kyle Lemons <kev...@google.com> wrote:
> Based on a separate thread discussing the performance of calls through an
> interface, I was wondering if this proposal can/does improve the situation.
> Unfortunately, I can't quite tell by reading the proposal...
>
> When calling a method through an interface, even with the itab caching, we
> incur a lot of overhead.  It seems like the following:
> for { r.Read(b) }
>
> could be "manually hoisted" as follows:
> read := r.Read
> for { read(b) }
>
> Would doing so improve performance?

I wanted to know the answer too. AFAICS, the benchmarks in
closure_test.go measure closure _creation performance_. Your idea is
about closure _execution performance_, which I guess dominates the
usage cases.

I think it's the part that hasn't gone in yet (or I would've tried a benchmark), though I could've missed it being committed. I think rsc called it "receiver-curried methods".

Jan Mercl

Feb 23, 2013, 6:10:55 PM
to Kyle Lemons, Russ Cox, golan...@googlegroups.com
On Sat, Feb 23, 2013 at 11:50 PM, Kyle Lemons <kev...@google.com> wrote:
> I think it's the part that hasn't gone in yet (or I would've tried a
> benchmark) though I could've missed it being committed. I think rsc caled
> it "receiver curried methods".

You're right, that's step 5, yet to come. I was focused enough on the worsened
(creation) performance of the simpler cases that I conflated the two
things when the better (execution) numbers came out.

-j

Russ Cox

Feb 25, 2013, 9:34:17 AM
to Kyle Lemons, Jan Mercl, golan...@googlegroups.com
On Sat, Feb 23, 2013 at 1:36 PM, Kyle Lemons <kev...@google.com> wrote:
When calling a method through an interface, even with the itab caching, we incur a lot of overhead.

This is news to me. Calling a method through an interface should be no more expensive than calling a virtual method in C++ or Java. It's just a few loads and an indirect call instruction. It's noise compared to what a typical function does.
 
It seems like the following:
for { r.Read(b) }

could be "manually hoisted" as follows:
read := r.Read
for { read(b) }

Would doing so improve performance?

If it did, then as you pointed out, it would be the compiler's job to do that, not the programmer's. But it wouldn't, because the curried-receiver functions are strictly more expensive to call than interface methods.
 
Naively, it seems like the compiler could do the work up front to pull out the method from the correct itab and use a func value functionally identical to what would have been stored if the following had been written:
read := r.(*net.TCPConn).Read
for { read(b) }

This is different from what you wrote above (still slower, but different). In this case the exact function being called is known, so it's a direct call, not an indirect one.

Russ

roger peppe

Feb 25, 2013, 11:00:11 AM
to Russ Cox, Kyle Lemons, Jan Mercl, golan...@googlegroups.com
It's great to see this go in. I was interested to see how the new code performed
against my favorite closure bugbear - sort.Search.

I tried this code: http://play.golang.org/p/pdnEOnV4Pa

These are the results I got:

before the change (revision 15772:6f23480a76e1):
5000000 403 ns/op 64 B/op 1 allocs/op

afterwards (revision 15891:cac5b17d8699):
10000000 173 ns/op 17 B/op 1 allocs/op

That's much better, but I'm surprised to see an allocation still
there. The compiler reports that the function does not escape,
so I'd have thought that it could avoid all allocations.

Is this an optimisation that hasn't been done yet, or an
inevitable cost?

roger peppe

Feb 25, 2013, 11:04:03 AM
to Russ Cox, Kyle Lemons, Jan Mercl, golan...@googlegroups.com
On 25 February 2013 14:34, Russ Cox <r...@golang.org> wrote:
> On Sat, Feb 23, 2013 at 1:36 PM, Kyle Lemons <kev...@google.com> wrote:
>>
>> When calling a method through an interface, even with the itab caching, we
>> incur a lot of overhead.
>
>
> This is news to me. Calling a method through an interface should be no more
> expensive than calling a virtual method in C++ or Java. It's just a few
> loads and an indirect call instruction. It's noise compared to what a
> typical function does.

ISTR doing some measurements that indicated that hoisting out
of the loop the load that fetches the function pointer could be a
significant win
in some situations, but perhaps that's changed.

That could apply both to the new style of function pointers and to
interface calls.

roger peppe

Feb 25, 2013, 11:07:35 AM
to Russ Cox, Kyle Lemons, Jan Mercl, golan...@googlegroups.com
oops ignore me please!
i hadn't seen https://codereview.appspot.com/7397056/

Dave Cheney

Feb 25, 2013, 6:59:01 PM
to roger peppe, Russ Cox, Kyle Lemons, Jan Mercl, golan...@googlegroups.com
CL 7397056 just landed, can you redo your benchmark ?

roger peppe

Feb 26, 2013, 2:47:22 AM
to Dave Cheney, Russ Cox, Kyle Lemons, Jan Mercl, golan...@googlegroups.com
20000000 100 ns/op 0 B/op 0 allocs/op

nice to see. inlining the function call
(http://play.golang.org/p/wX-shZ9ZGt) gives:

50000000 46.9 ns/op 0 B/op 0 allocs/op

2x overhead for creating and using a closure (when the
closure contains only a few instructions) seems very
reasonable to me.

marvellous!

Jan Mercl

Feb 26, 2013, 3:43:09 AM
to roger peppe, Dave Cheney, Russ Cox, Kyle Lemons, golan...@googlegroups.com
On Tue, Feb 26, 2013 at 8:47 AM, roger peppe <rogp...@gmail.com> wrote:
> 20000000 100 ns/op 0 B/op 0 allocs/op
>
> nice to see. inlining the function call
> (http://play.golang.org/p/wX-shZ9ZGt) gives:
>
> 50000000 46.9 ns/op 0 B/op 0 allocs/op
>
> 2x overhead for creating and using a closure (when the
> closure contains only a few instructions) seems very
> reasonable to me.

Nice to see! ;-)

----

Redoing some of the numbers which worried me a bit (cf. Feb 23rd):

(09:20) jnml@fsc-r550:~/tmp$ cat old-ver
go version devel +aee6d7fe395a Sat Jan 26 18:16:43 2013 -0800 linux/amd64
(09:25) jnml@fsc-r550:~/tmp$ go version
go version devel +512627a9cdb8 Mon Feb 25 22:06:58 2013 -0800 linux/amd64
(09:25) jnml@fsc-r550:~/tmp$ benchcmp old.txt new.txt
benchmark               old ns/op  new ns/op    delta
BenchmarkCall2Closure           3          3   -3.19%
BenchmarkCall2Closure1          5          4  -13.92%
BenchmarkCall2Closure2          6          3  -44.35%
BenchmarkCallClosure            2          2  -20.07%
BenchmarkCallClosure1           3          5  +57.10%
BenchmarkCallClosure2          54         60  +11.42%
BenchmarkCallClosure3          55         61  +10.95%
BenchmarkCallClosure4          57         62   +7.83%

benchmark               old allocs  new allocs  delta
BenchmarkCall2Closure            0           0   n/a%
BenchmarkCall2Closure1           0           0   n/a%
BenchmarkCall2Closure2           0           0   n/a%
BenchmarkCallClosure             0           0   n/a%
BenchmarkCallClosure1            0           0   n/a%
BenchmarkCallClosure2            1           1  0.00%
BenchmarkCallClosure3            1           1  0.00%
BenchmarkCallClosure4            1           1  0.00%

benchmark               old bytes  new bytes  delta
BenchmarkCall2Closure           0          0   n/a%
BenchmarkCall2Closure1          0          0   n/a%
BenchmarkCall2Closure2          0          0   n/a%
BenchmarkCallClosure            0          0   n/a%
BenchmarkCallClosure1           0          0   n/a%
BenchmarkCallClosure2           8          8  0.00%
BenchmarkCallClosure3           8          8  0.00%
BenchmarkCallClosure4           8          8  0.00%
(09:25) jnml@fsc-r550:~/tmp$

Does anyone have an idea why on this machine (details below) the
CallClosure benchmarks give opposite-signed deltas (except the first
one) wrt https://codereview.appspot.com/7397056/ ? I.e. the CL shows a
speedup while here there's a slowdown, though I believe I'm comparing
different revisions than the CL did.

(09:25) jnml@fsc-r550:~/tmp$ cat /proc/cpuinfo | head
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Xeon(R) CPU X5450 @ 3.00GHz
stepping : 10
microcode : 0xa07
cpu MHz : 2000.000
cache size : 6144 KB
physical id : 0
(09:31) jnml@fsc-r550:~/tmp$

-j

Alan Donovan

Mar 3, 2013, 10:34:45 PM
to Russ Cox, golang-dev
On 18 February 2013 18:32, Russ Cox <r...@golang.org> wrote:
> http://golang.org/s/go11func
> Please comment here (in this thread).

Nice doc---sorry it took me a while to read it. [And sorry for
initially posting this in the wrong thread.]

I have no quibbles with your proposed course of action, but two minor
comments on the doc:

(1) You might want to mention a sixth kind of call, even if only to
explain it away as sugar around the fifth:

type I interface { f(...) }

f := I.f  // "standalone" interface method; equivalent to
          // f := func(i I, ...) { return i.f(...) }
var i I
f(i, ...)

(2) The doc makes reference to the classic "two word" formulation of
closures, building the expectation that that's what you're going to
propose--but it isn't. The New Implementation section could
explicitly dismiss it briefly.

Feel free to ignore both of these comments if you don't intend the doc
to stick around after you've finished the work, but I hope that's not
the case, as it's a lucid explanation of and motivation for the new
(and soon to be current) implementation.

cheers
alan