strings.IndexByte performance compared to for loop

442 views
Skip to first unread message

prashant...@gmail.com

unread,
Jan 15, 2016, 3:59:44 PM1/15/16
to golang-nuts
Is it expected that a for loop (using indexes rather than range) would be faster than strings.IndexByte for small strings? I have a simple benchmark here:

I'm writing a function that takes a host:port string (where the host does not contain any colons) and returns just the host. I've compared 4 approaches;
 * net.SplitHostPort
 * strings.IndexByte to find the colon
 * For loop using indexes to find the colon
 * For loop using range to find the colon

I need this in a performance critical path, so I wanted to find the fastest option. The results of the benchmark:
BenchmarkGetHostIndexByte-8     100000000        25.1 ns/op
BenchmarkGetHostSplitHostPort-8 30000000        58.0 ns/op
BenchmarkGetHostIndexLoop-8     200000000         9.21 ns/op
BenchmarkGetHostRangeLoop-8     30000000        53.1 ns/op

This is running on a x86_64 Darwin kernel.

I'd expect the IndexByte version to be faster -- it seems like repn/scasb should be faster than the compiled loop?

Ian Lance Taylor

unread,
Jan 15, 2016, 4:24:04 PM1/15/16
to prashant...@gmail.com, golang-nuts
Your results seem to show that for your short constant string the
function call overhead dominates the time it takes to look at the
bytes directly. This seems plausible.

Ian

Dave Cheney

unread,
Jan 15, 2016, 4:27:17 PM1/15/16
to golang-nuts
Maybe strings.IndexByte is causing it's arguments to escape to the heap and your benchmarking the gc by accident.

b.ReportAllocs()
-gcflags=-m
-cpuprofile

Will be useful here

prashant...@gmail.com

unread,
Jan 15, 2016, 4:43:43 PM1/15/16
to golang-nuts
I added -benchmem to report allocs, none of them cause any allocs:
BenchmarkGetHostIndexByte-8     50000000        25.4 ns/op       0 B/op       0 allocs/op
BenchmarkGetHostSplitHostPort-8 30000000        58.5 ns/op       0 B/op       0 allocs/op
BenchmarkGetHostIndexLoop-8     200000000         8.59 ns/op       0 B/op       0 allocs/op
BenchmarkGetHostRangeLoop-8     30000000        52.2 ns/op       0 B/op       0 allocs/op

Here's the output from gcflags=-m,
./hostport_test.go:14: leaking param: hostPort
./hostport_test.go:21: leaking param: hostPort
./hostport_test.go:30: leaking param: hostPort to result ~r1 level=0
./hostport_test.go:30: leaking param: hostPort to result ~r1 level=0
./hostport_test.go:41: leaking param: hostPort to result ~r1 level=0
./hostport_test.go:41: leaking param: hostPort to result ~r1 level=0
./hostport_test.go:56: host escapes to heap
./hostport_test.go:50: BenchmarkGetHostIndexByte b does not escape
./hostport_test.go:56: BenchmarkGetHostIndexByte ... argument does not escape
./hostport_test.go:66: host escapes to heap
./hostport_test.go:60: BenchmarkGetHostSplitHostPort b does not escape
./hostport_test.go:66: BenchmarkGetHostSplitHostPort ... argument does not escape
./hostport_test.go:76: host escapes to heap
./hostport_test.go:70: BenchmarkGetHostIndexLoop b does not escape
./hostport_test.go:76: BenchmarkGetHostIndexLoop ... argument does not escape
./hostport_test.go:86: host escapes to heap
./hostport_test.go:80: BenchmarkGetHostRangeLoop b does not escape
./hostport_test.go:86: BenchmarkGetHostRangeLoop ... argument does not escape

Atached a compressed protobuf CPU profile here:

It's possible it's just function call overhead, although sometimes the difference is more than 2x, which I find weird since I'd expect the overhead to call getHost* should be similar.

Caleb Spare

unread,
Jan 15, 2016, 4:47:57 PM1/15/16
to prashant...@gmail.com, golang-nuts
getHostIndexByte calls strings.IndexByte which isn't inlined because
it's implemented in asm. In fact that asm function start by jumping to
another asm function (indexbytebody) that contains the actual code. In
comparison, your two loop-based functions only have a single function
call on each benchmark loop iteration.

net.SplitHostPort isn't really comparable since it solves a more
general problem. It makes multiple non-inlined function calls to
strings.IndexByte (aliased as net.byteIndex) itself.

The range-based iteration is slower because it calls stringiter2
(again not inlined) on each loop iteration.

Use -gcflags -S to see the asm output.

-Caleb

keith....@gmail.com

unread,
Jan 15, 2016, 6:46:35 PM1/15/16
to golang-nuts, prashant...@gmail.com
I've seen bad startup performance using REP prefixes (~100 cycles) when working on memmove & friends.  It might not be worth it for small strings.

keith....@gmail.com

unread,
Jan 15, 2016, 6:53:31 PM1/15/16
to golang-nuts, prashant...@gmail.com, keith....@gmail.com
Yes, this seems to be the problem.  IndexByte uses the REP prefix for strings smaller than 16 bytes.  Bump your URL up to 16 characters and try again.

14 char url:

enchmarkGetHostIndexByte-8    100000000         23.7 ns/op

BenchmarkGetHostSplitHostPort-8 20000000         99.0 ns/op

BenchmarkGetHostIndexLoop-8    100000000         12.4 ns/op

BenchmarkGetHostRangeLoop-8    30000000         47.0 ns/op


16 char url:

BenchmarkGetHostIndexByte-8    200000000         7.01 ns/op

BenchmarkGetHostSplitHostPort-8 20000000         65.9 ns/op

BenchmarkGetHostIndexLoop-8    100000000         14.0 ns/op

BenchmarkGetHostRangeLoop-8    30000000         43.2 ns/op

prashant...@gmail.com

unread,
Jan 15, 2016, 7:56:08 PM1/15/16
to golang-nuts, prashant...@gmail.com, keith....@gmail.com
That's really interesting. I've pushed a change to my benchmark where the hostPort can be easily modified. I'm seeing even more dramatic differences:

BenchmarkGetHostIndexByte-8     50000000                24.9 ns/op
BenchmarkGetHostSplitHostPort-8 100000000               25.1 ns/op
BenchmarkGetHostIndexLoop-8     50000000                25.0 ns/op
BenchmarkGetHostRangeLoop-8     50000000                25.3 ns/op
BenchmarkGetHostIndexByte-8     200000000                6.57 ns/op
BenchmarkGetHostSplitHostPort-8 200000000                6.22 ns/op
BenchmarkGetHostIndexLoop-8     300000000                6.59 ns/op
BenchmarkGetHostRangeLoop-8     200000000                6.13 ns/op

It does seem like it's not function overhead but just the implementation of runtime.indexbytebody that's causing this difference.

Perhaps the use of REP should be removed when aligning the pointer to a 16 byte boundary as well as when comparing small values?

Thanks for digging into this Keith

Keith Randall

unread,
Jan 15, 2016, 9:21:50 PM1/15/16
to prashant...@gmail.com, golang-nuts
I have a fix for this. Too late for 1.6, it will go in for 1.7.

prashant...@gmail.com

unread,
Jan 15, 2016, 10:49:58 PM1/15/16
to golang-nuts, prashant...@gmail.com, kei...@alum.mit.edu, keith....@gmail.com
That's great news. Wish we'd discovered this a little earlier in time for 1.6, but glad it's getting fixed.

Can you post the CL, I'd like to patch it in and try it out when it's ready.

Brad Fitzpatrick

unread,
Jan 15, 2016, 11:36:40 PM1/15/16
to kei...@alum.mit.edu, prashant...@gmail.com, golang-nuts
Keith,

Did you file a bug?


--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Keith Randall

unread,
Jan 15, 2016, 11:55:03 PM1/15/16
to Brad Fitzpatrick, Prashant Varanasi, golang-nuts
I did not file a bug, but I have a CL pending:
https://go-review.googlesource.com/#/c/18703/

Prashant Varanasi

unread,
Jan 16, 2016, 12:09:36 AM1/16/16
to kei...@alum.mit.edu, Brad Fitzpatrick, golang-nuts

I'm happy to file a bug.

Brad Fitzpatrick

unread,
Jan 16, 2016, 12:32:14 AM1/16/16
to kei...@alum.mit.edu, Prashant Varanasi, golang-nuts
Even better.

Prashant Varanasi

unread,
Jan 16, 2016, 3:14:44 PM1/16/16
to golang-nuts, kei...@alum.mit.edu, prashant...@gmail.com
Reply all
Reply to author
Forward
0 new messages