strings.IndexByte performance compared to for loop

prashant...@gmail.com

unread,

Jan 15, 2016, 3:59:44 PM1/15/16

to golang-nuts

Is it expected that a for loop (using indexes rather than range) would be faster than strings.IndexByte for small strings? I have a simple benchmark here:

https://github.com/prashantv/go-bench/blob/master/hostport_test.go

I'm writing a function that takes a host:port string (where the host does not contain any colons) and returns just the host. I've compared 4 approaches;

* net.SplitHostPort

* strings.IndexByte to find the colon

* For loop using indexes to find the colon

* For loop using range to find the colon

I need this in a performance critical path, so I wanted to find the fastest option. The results of the benchmark:

BenchmarkGetHostIndexByte-8 100000000 25.1 ns/op
BenchmarkGetHostSplitHostPort-8 30000000 58.0 ns/op
BenchmarkGetHostIndexLoop-8 200000000 9.21 ns/op
BenchmarkGetHostRangeLoop-8 30000000 53.1 ns/op

This is running on a x86_64 Darwin kernel.

I'd expect the IndexByte version to be faster -- it seems like repn/scasb should be faster than the compiled loop?

Ian Lance Taylor

unread,

Jan 15, 2016, 4:24:04 PM1/15/16

to prashant...@gmail.com, golang-nuts

Your results seem to show that for your short constant string the
function call overhead dominates the time it takes to look at the
bytes directly. This seems plausible.

Ian

Dave Cheney

unread,

Jan 15, 2016, 4:27:17 PM1/15/16

to golang-nuts

Maybe strings.IndexByte is causing it's arguments to escape to the heap and your benchmarking the gc by accident.

b.ReportAllocs()
-gcflags=-m
-cpuprofile

Will be useful here

prashant...@gmail.com

unread,

Jan 15, 2016, 4:43:43 PM1/15/16

to golang-nuts

I added -benchmem to report allocs, none of them cause any allocs:

BenchmarkGetHostIndexByte-8 50000000 25.4 ns/op 0 B/op 0 allocs/op
BenchmarkGetHostSplitHostPort-8 30000000 58.5 ns/op 0 B/op 0 allocs/op
BenchmarkGetHostIndexLoop-8 200000000 8.59 ns/op 0 B/op 0 allocs/op
BenchmarkGetHostRangeLoop-8 30000000 52.2 ns/op 0 B/op 0 allocs/op

Here's the output from gcflags=-m,
./hostport_test.go:14: leaking param: hostPort
./hostport_test.go:21: leaking param: hostPort
./hostport_test.go:30: leaking param: hostPort to result ~r1 level=0
./hostport_test.go:30: leaking param: hostPort to result ~r1 level=0
./hostport_test.go:41: leaking param: hostPort to result ~r1 level=0
./hostport_test.go:41: leaking param: hostPort to result ~r1 level=0
./hostport_test.go:56: host escapes to heap
./hostport_test.go:50: BenchmarkGetHostIndexByte b does not escape
./hostport_test.go:56: BenchmarkGetHostIndexByte ... argument does not escape
./hostport_test.go:66: host escapes to heap
./hostport_test.go:60: BenchmarkGetHostSplitHostPort b does not escape
./hostport_test.go:66: BenchmarkGetHostSplitHostPort ... argument does not escape
./hostport_test.go:76: host escapes to heap
./hostport_test.go:70: BenchmarkGetHostIndexLoop b does not escape
./hostport_test.go:76: BenchmarkGetHostIndexLoop ... argument does not escape
./hostport_test.go:86: host escapes to heap
./hostport_test.go:80: BenchmarkGetHostRangeLoop b does not escape
./hostport_test.go:86: BenchmarkGetHostRangeLoop ... argument does not escape

Atached a compressed protobuf CPU profile here:

http://www.prashantv.com/files/stringindex-pprof.pb.gz

It's possible it's just function call overhead, although sometimes the difference is more than 2x, which I find weird since I'd expect the overhead to call getHost* should be similar.

Caleb Spare

unread,

Jan 15, 2016, 4:47:57 PM1/15/16

to prashant...@gmail.com, golang-nuts

getHostIndexByte calls strings.IndexByte which isn't inlined because
it's implemented in asm. In fact that asm function start by jumping to
another asm function (indexbytebody) that contains the actual code. In
comparison, your two loop-based functions only have a single function
call on each benchmark loop iteration.

net.SplitHostPort isn't really comparable since it solves a more
general problem. It makes multiple non-inlined function calls to
strings.IndexByte (aliased as net.byteIndex) itself.

The range-based iteration is slower because it calls stringiter2
(again not inlined) on each loop iteration.

Use -gcflags -S to see the asm output.

-Caleb

keith....@gmail.com

unread,

Jan 15, 2016, 6:46:35 PM1/15/16

to golang-nuts, prashant...@gmail.com

I've seen bad startup performance using REP prefixes (~100 cycles) when working on memmove & friends. It might not be worth it for small strings.

keith....@gmail.com

unread,

Jan 15, 2016, 6:53:31 PM1/15/16

to golang-nuts, prashant...@gmail.com, keith....@gmail.com

Yes, this seems to be the problem. IndexByte uses the REP prefix for strings smaller than 16 bytes. Bump your URL up to 16 characters and try again.

14 char url:

enchmarkGetHostIndexByte-8 100000000 23.7 ns/op

BenchmarkGetHostSplitHostPort-8 20000000 99.0 ns/op

BenchmarkGetHostIndexLoop-8 100000000 12.4 ns/op

BenchmarkGetHostRangeLoop-8 30000000 47.0 ns/op

16 char url:

BenchmarkGetHostIndexByte-8 200000000 7.01 ns/op

BenchmarkGetHostSplitHostPort-8 20000000 65.9 ns/op

BenchmarkGetHostIndexLoop-8 100000000 14.0 ns/op

BenchmarkGetHostRangeLoop-8 30000000 43.2 ns/op

prashant...@gmail.com

unread,

Jan 15, 2016, 7:56:08 PM1/15/16

to golang-nuts, prashant...@gmail.com, keith....@gmail.com

That's really interesting. I've pushed a change to my benchmark where the hostPort can be easily modified. I'm seeing even more dramatic differences:

With "127.0.0.1:1234":

BenchmarkGetHostIndexByte-8 50000000 24.9 ns/op
BenchmarkGetHostSplitHostPort-8 100000000 25.1 ns/op
BenchmarkGetHostIndexLoop-8 50000000 25.0 ns/op
BenchmarkGetHostRangeLoop-8 50000000 25.3 ns/op

With "127.127.127.127:1234":

BenchmarkGetHostIndexByte-8 200000000 6.57 ns/op
BenchmarkGetHostSplitHostPort-8 200000000 6.22 ns/op
BenchmarkGetHostIndexLoop-8 300000000 6.59 ns/op
BenchmarkGetHostRangeLoop-8 200000000 6.13 ns/op

It does seem like it's not function overhead but just the implementation of runtime.indexbytebody that's causing this difference.

Perhaps the use of REP should be removed when aligning the pointer to a 16 byte boundary as well as when comparing small values?

Thanks for digging into this Keith

Keith Randall

unread,

Jan 15, 2016, 9:21:50 PM1/15/16

to prashant...@gmail.com, golang-nuts

I have a fix for this. Too late for 1.6, it will go in for 1.7.

prashant...@gmail.com

unread,

Jan 15, 2016, 10:49:58 PM1/15/16

to golang-nuts, prashant...@gmail.com, kei...@alum.mit.edu, keith....@gmail.com

That's great news. Wish we'd discovered this a little earlier in time for 1.6, but glad it's getting fixed.

Can you post the CL, I'd like to patch it in and try it out when it's ready.

Brad Fitzpatrick

unread,

Jan 15, 2016, 11:36:40 PM1/15/16

to kei...@alum.mit.edu, prashant...@gmail.com, golang-nuts

Keith,

Did you file a bug?

--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Keith Randall

unread,

Jan 15, 2016, 11:55:03 PM1/15/16

to Brad Fitzpatrick, Prashant Varanasi, golang-nuts

I did not file a bug, but I have a CL pending:
https://go-review.googlesource.com/#/c/18703/

Prashant Varanasi

unread,

Jan 16, 2016, 12:09:36 AM1/16/16

to kei...@alum.mit.edu, Brad Fitzpatrick, golang-nuts

I'm happy to file a bug.

Brad Fitzpatrick

unread,

Jan 16, 2016, 12:32:14 AM1/16/16

to kei...@alum.mit.edu, Prashant Varanasi, golang-nuts

Even better.

Prashant Varanasi

unread,

Jan 16, 2016, 3:14:44 PM1/16/16

to golang-nuts, kei...@alum.mit.edu, prashant...@gmail.com

Filed https://github.com/golang/go/issues/13983

Reply all

Reply to author

Forward