Does Go optimize "switch"?


Stefan Aust

unread,
Apr 15, 2012, 11:09:24 AM4/15/12
to golang-nuts
Hi,

I'm tinkering with Go, trying to create a tiny Python interpreter
(just for fun). It was my understanding that an AST interpreter is
generally slower than a byte code interpreter. However, in Go, the
latter is 25% slower than my naive AST interpreter (and both are much
slower than CPython). Did I do something wrong here? Or is the switch
statement just not as optimized as in C, using a jump table?

This Frame type represents an "activation record" of a Python
function:

type Frame struct {
	locals  []interface{}
	names   []string
	globals map[string]interface{}
	index   int
	code    []byte
	consts  []interface{}
	stack   []interface{}
}

My AST interpreter uses structs and methods like this:

type (
	Stmt interface {
		Exec(f *Frame)
	}
	Expr interface {
		Eval(f *Frame) interface{}
	}
	While struct {
		cond Expr
		body Stmt
	}
	Literal struct {
		value interface{}
	}
	Local struct {
		index int
	}
	GtOp struct {
		left, right Expr
	}
	...
)

func (self *While) Exec(f *Frame) {
	for self.cond.Eval(f).(bool) {
		self.body.Exec(f)
	}
}

func (self *Literal) Eval(f *Frame) interface{} { return self.value }

func (self *Local) Eval(f *Frame) interface{} { return f.locals[self.index] }

func (self *GtOp) Eval(f *Frame) interface{} {
	return self.left.Eval(f).(int) > self.right.Eval(f).(int)
}

...

I implemented enough AST nodes to run a simple while-loop like this:

n = 10000000
while n > 0:
    n = n - 1

The AST interpreter needs about 2s on my notebook.

Then I created the following bytecode VM:

func (f *Frame) push(v interface{}) {
	f.stack = append(f.stack, v)
}

func (f *Frame) pop() interface{} {
	l := len(f.stack) - 1
	v := f.stack[l]
	f.stack = f.stack[:l]
	return v
}

func (f *Frame) execute() {
	for {
		c := f.code[f.index]
		f.index++
		switch c {
		case 0: // push constant
			f.push(f.consts[f.code[f.index]])
			f.index++
		case 1: // load local
			f.push(f.locals[f.code[f.index]])
			f.index++
		case 2: // store local
			f.locals[f.code[f.index]] = f.pop()
			f.index++
		case 3: // >
			r := f.pop()
			l := f.pop()
			f.push(l.(int) > r.(int))
		case 4: // -
			r := f.pop()
			l := f.pop()
			f.push(l.(int) - r.(int))
		case 5: // jump if false
			if !f.pop().(bool) {
				f.index = int(f.code[f.index])
			} else {
				f.index++
			}
		case 6: // jump
			f.index = int(f.code[f.index])
		case 7: // return
			return
		}
	}
}

I assumed better performance because of better locality and because
everything runs in a single function. However, that code needs 2.5s to
execute the bytecode for the while loop. I can do slightly better by
inlining push and pop and using a fixed-size array instead of a slice,
but I wonder whether there is a better implementation strategy.

Stefan

Evan Shaw

unread,
Apr 15, 2012, 6:37:31 PM4/15/12
to Stefan Aust, golang-nuts
The compiler doesn't optimize dense switches into jump tables. I don't
know if there's a particular reason; I think this optimization's just
not implemented (yet?). Currently switches are compiled to binary
searches.

- Evan

Sam Nardoni

unread,
Apr 18, 2012, 8:48:44 AM4/18/12
to golan...@googlegroups.com, Stefan Aust
Could anyone answer whether there is a genuine reason that this isn't implemented? Or is this on the todo list?

Jan Mercl

unread,
Apr 18, 2012, 9:27:13 AM4/18/12
to golan...@googlegroups.com, Stefan Aust
On Wednesday, April 18, 2012 2:48:44 PM UTC+2, Sam Nardoni wrote:
Could anyone answer whether there is a genuine reason that this isn't implemented?

Do you expect every possible optimization and/or feature of the compiler to be implemented automagically just out of thin air by itself? ;-)

unread,
Apr 18, 2012, 9:29:04 AM4/18/12
to golan...@googlegroups.com, Stefan Aust
On Wednesday, April 18, 2012 2:48:44 PM UTC+2, Sam Nardoni wrote:
Could anyone answer whether there is a genuine reason that this isn't implemented? Or is this on the todo list?

It is possible that having indirect jumps in the assembly code would negatively affect register allocation, so compiling switches into jump tables might require changes to the register allocation algorithm.

Sam Nardoni

unread,
Apr 18, 2012, 10:08:10 AM4/18/12
to golan...@googlegroups.com, Stefan Aust
I guess I could have worded it better :)

Ian Lance Taylor

unread,
Apr 18, 2012, 10:19:51 AM4/18/12
to Sam Nardoni, golan...@googlegroups.com, Stefan Aust
Sam Nardoni <samna...@gmail.com> writes:

> Could anyone answer whether there is a genuine reason that this isn't
> implemented? Or is this on the todo list?

There is no special reason that this optimization is not implemented.
But as far as I know nobody is actively working on it.

Ian

Ken Thompson

unread,
Apr 18, 2012, 11:03:48 AM4/18/12
to Ian Lance Taylor, Sam Nardoni, golan...@googlegroups.com, Stefan Aust
optimization of switches would be different for
different architectures. every x86, amd, intel,
and then all the non x86s. they all have very
different execution times and cache times
and pipeline times. there is no single answer.
there is usually a huge pipeline cost for
a computed jump.

when you are considering native-client
restrictions, jump tables become impossible.

also note that go switches are different than
c switches. non-constant cases have to be
tested individually no matter what.

having said that, a large (where large depends
on architecture) dense switch would be faster
than what is implemented now. BUT there are
two parameters needed to implement it. how
large and how dense. no matter what is used
for large and dense, there will be a micro-benchmark
on some machine, under some alignment, with
some branch cache that will look wrong.

what is implemented is as follows.
1. in order, all non-constant cases are compiled and tested as if-elses.
2. groups of larger than 3 constant cases are binary divided and conquered.
3. 3 or fewer cases are compared linearly.

i honestly dont think there is a much
better algorithm across all machines.

Ian Lance Taylor

unread,
Apr 18, 2012, 11:42:08 AM4/18/12
to Ken Thompson, Sam Nardoni, golan...@googlegroups.com, Stefan Aust
Ken Thompson <k...@google.com> writes:

> i honestly dont think there is a much
> better algorithm across all machines.

Agreed.

Roger Sayle did an interesting analysis of switch algorithms for C/C++
across architectures at the 2008 GCC Summit. It starts on page 103 of

http://gcc.gnu.org/wiki/HomePage?action=AttachFile&do=view&target=gcc-2008-proceedings.pdf

Of course Go is more complex because of non-constant cases.

Ian

unread,
Apr 19, 2012, 2:35:29 AM4/19/12
to golan...@googlegroups.com, Ian Lance Taylor, Sam Nardoni, Stefan Aust
On Wednesday, April 18, 2012 5:03:48 PM UTC+2, Ken Thompson wrote:

what is implemented is as follows.
1. in order, all non-constant cases are compiled and tested as if-elses.
2. groups of larger than 3 constant cases are binary divided and conquered.
3. 3 or fewer cases are compared linearly.

i honestly dont think there is a much
better algorithm across all machines.

In my opinion, a fourth option would be to eliminate branches entirely by doing compile-time or run-time analysis of the code and data. This works across all CPU architectures, even on CPUs without any branch prediction. Unfortunately, this option is very hard to implement in a compiler.

unread,
Apr 19, 2012, 2:35:47 AM4/19/12
to golan...@googlegroups.com, Ken Thompson, Sam Nardoni, Stefan Aust
On Wednesday, April 18, 2012 5:42:08 PM UTC+2, Ian Lance Taylor wrote:
Ken Thompson <k...@google.com> writes:

> i honestly dont think there is a much
> better algorithm across all machines.

Agreed.

Roger Sayle did an interesting analysis of switch algorithms for C/C++
across architectures at the 2008 GCC Summit.  It starts on page 103 of

http://gcc.gnu.org/wiki/HomePage?action=AttachFile&do=view&target=gcc-2008-proceedings.pdf

I do not see any performance measurements in the article.
