Weekend Challenge: Fast summing of uint8 slices


Donovan Hide

Dec 15, 2012, 12:28:22 PM
to golan...@googlegroups.com
Hi,

I'm writing some code for a text search engine which has an implementation of sparsetable, and I found a bottleneck when summing large numbers of uint8 slices. I love how easy it is in Go to include assembly and dive into SSE opcodes to quickly see if speed-ups are possible. I've posted a test case here:


Would be very grateful for any advice on getting the code even faster!

Cheers,
Donovan.

bryanturley

Dec 15, 2012, 11:55:17 PM
to golan...@googlegroups.com
If you are doing many kilobytes' worth of adds and you are writing in assembly, you should use some of the SSE2 prefetch instructions.
You have to tune them per generation, though, since memory fetch latencies change every few years.

Also, since most CPUs have more than one unit executing the simple SSE2 ops, you should do a second set of adds in parallel with different regs, say X4-X6 in your case (sketched in Go below).
If you do this you have to handle the case of an odd number of passes, though.

There are tons of other little tweaks depending on your target CPU as well.
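
A pure-Go sketch of the two-accumulator idea (hypothetical names; plain adds standing in for the SSE2 ones, the point being the two independent dependency chains):

func sumTwoChains(b []byte) uint64 {
	var a0, a1 uint64
	n := len(b) &^ 1 // even prefix
	for i := 0; i < n; i += 2 {
		a0 += uint64(b[i])   // chain 0
		a1 += uint64(b[i+1]) // chain 1, independent of chain 0
	}
	if n < len(b) { // the odd-pass case mentioned above
		a0 += uint64(b[n])
	}
	return a0 + a1
}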




bryanturley

Dec 15, 2012, 11:56:50 PM
to golan...@googlegroups.com


On Saturday, December 15, 2012 10:55:17 PM UTC-6, bryanturley wrote:
> If you are doing many kilobytes' worth of adds and you are writing in assembly, you should use some of the SSE2 prefetch instructions.
> You have to tune them per generation, though, since memory fetch latencies change every few years.
>
> Also, since most CPUs have more than one unit executing the simple SSE2 ops, you should do a second set of adds in parallel with different regs, say X4-X6 in your case.
> If you do this you have to handle the case of an odd number of passes, though.

To be clear, in parallel does not mean on a different core; it means interleaved with your current adds on the same core ;)

Kevin Gillette

Dec 16, 2012, 2:49:24 AM
to golan...@googlegroups.com
Keep in mind that vector ops tend to require 16-byte-aligned blocks whose length is evenly divisible by 16. So if Donovan were given a slice of 135 uint8s, probably the fastest method is not to append zero padding, which can result in expensive (relative to this task) copies and isn't a good API choice anyway, but to round down to the nearest multiple of 16 (in this case, 128), pass that block to the fast SSE2 code, and do a naive summation of the leftover elements (7 of them in this case).
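
In Go terms, that split is roughly (fastSum16 being a hypothetical kernel that only accepts lengths divisible by 16):

func sum(b []byte) uint64 {
	n := len(b) &^ 15 // round down to a multiple of 16 (135 -> 128)
	total := fastSum16(b[:n]) // hypothetical SSE2 path
	for _, v := range b[n:] { // naive tail, at most 15 elements
		total += uint64(v)
	}
	return total
}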

⚛

Dec 16, 2012, 8:35:51 AM
to golan...@googlegroups.com
You can use PADDB in the SSE loop, and sum the 16 bytes after the SSE loop ends.
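
Modeled in pure Go, that is roughly the following (a model only: each lane wraps mod 256 exactly as PADDB does, so a real kernel must flush the lanes, e.g. with PSADBW, before 256 iterations):

func sumPaddbModel(b []byte) uint64 {
	var lanes [16]uint8 // the X-register accumulator, one byte per lane
	n := len(b) &^ 15
	for i := 0; i < n; i += 16 {
		for j := range lanes {
			lanes[j] += b[i+j] // PADDB: 16 wrapping byte adds
		}
	}
	var total uint64
	for _, v := range lanes { // sum the 16 bytes after the loop ends
		total += uint64(v)
	}
	for _, v := range b[n:] { // scalar tail
		total += uint64(v)
	}
	return total
}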

Donovan Hide

Dec 16, 2012, 9:16:14 AM
to ⚛, golan...@googlegroups.com
Hi again,

thank you Bryan, Kevin and ⚛ (!) for the hints and tips. I'm going to have a go at implementing your suggestions. I took some time to improve the test harness to get a better idea of speeds for different inputs:


Here's the current output (hope it looks OK in email form):

go test sum -v
=== RUN TestSumBytes
--- PASS: TestSumBytes (2.50 seconds)
=== RUN TestBenchmark
--- PASS: TestBenchmark (78.76 seconds)
sum_test.go:64: Benchmark Results
Length:        1 Fast:        4.68 ns/op Slow:        5.51 ns/op Improvement  17.88%
Length:        2 Fast:        6.10 ns/op Slow:        6.04 ns/op Improvement  -1.01%
Length:        4 Fast:        7.76 ns/op Slow:        7.94 ns/op Improvement   2.35%
Length:        8 Fast:       10.95 ns/op Slow:       11.20 ns/op Improvement   2.23%
Length:       15 Fast:       17.22 ns/op Slow:       19.08 ns/op Improvement  10.80%
Length:       16 Fast:        7.13 ns/op Slow:       21.26 ns/op Improvement 198.23%
Length:       31 Fast:       20.83 ns/op Slow:       34.59 ns/op Improvement  66.06%
Length:       32 Fast:        8.39 ns/op Slow:       49.60 ns/op Improvement 491.36%
Length:       63 Fast:       25.51 ns/op Slow:       73.13 ns/op Improvement 186.68%
Length:       64 Fast:       12.46 ns/op Slow:       70.75 ns/op Improvement 467.59%
Length:      100 Fast:       17.96 ns/op Slow:      106.39 ns/op Improvement 492.37%
Length:      135 Fast:       22.61 ns/op Slow:      146.08 ns/op Improvement 545.97%
Length:      256 Fast:       32.54 ns/op Slow:      265.46 ns/op Improvement 715.80%
Length:     1024 Fast:      123.45 ns/op Slow:     1006.06 ns/op Improvement 714.98%
Length:    65536 Fast:     7392.98 ns/op Slow:    62733.94 ns/op Improvement 748.56%
PASS
ok   sum 81.472s

I'm not very clued up on choosing amd64 registers and have been using:

go tool 6g -S src/sum/sum_test.go 

to try and understand the calling conventions. If anyone has some spare time and wants to submit an improved sum_amd64.s I'd be very grateful for any help! Obviously it would be great to have some code that is faster for all sizes of slice. I imagine batching up into descending chunks of 128, 64, 32, 16, 8 and 1 bytes would yield the maximal SSE register usage; a rough sketch follows. Sounds like a fun challenge if anyone fancies it :-)
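
Something like this, in Go pseudocode (fastSumN standing for a hypothetical family of fixed-size kernels, one per chunk width):

func sumChunked(b []byte) uint64 {
	var total uint64
	for _, size := range []int{128, 64, 32, 16, 8} {
		for len(b) >= size {
			total += fastSumN(b[:size]) // hypothetical fixed-size SSE kernel
			b = b[size:]
		}
	}
	for _, v := range b { // remaining single bytes
		total += uint64(v)
	}
	return total
}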

Seems like Go is a great way to test and benchmark assembly. Reminds me a lot of the BASIC interpreter on RISC OS, which had a built-in assembler. Starting to get nostalgic now :-)

Cheers,
Donovan.





Donovan Hide

Dec 16, 2012, 11:46:08 AM
to ⚛, golan...@googlegroups.com
Hi again,

If anyone did try the code: there was a bug in the return value offset which only materialised on a Sandy Bridge processor. It worked fine on a Core Duo, which was weird. I guess assembly requires a bit more care when it comes to testing!

Updated the assembly to use the pattern from here:


Speed ups for big slices are even more pronounced on Sandy Bridge :-)

go test -v
=== RUN TestSumBytes
--- PASS: TestSumBytes (1.25 seconds)
=== RUN TestBenchmark
--- PASS: TestBenchmark (69.23 seconds)
sum_test.go:64: Benchmark Results
Length:        1 Fast:        3.68 ns/op Slow:        5.50 ns/op Improvement  49.14%
Length:        2 Fast:        4.18 ns/op Slow:        6.72 ns/op Improvement  60.56%
Length:        4 Fast:        5.79 ns/op Slow:       10.07 ns/op Improvement  73.76%
Length:        8 Fast:        9.48 ns/op Slow:       15.02 ns/op Improvement  58.47%
Length:       15 Fast:       15.49 ns/op Slow:       24.41 ns/op Improvement  57.64%
Length:       16 Fast:        4.13 ns/op Slow:       26.83 ns/op Improvement 550.48%
Length:       31 Fast:       15.97 ns/op Slow:       45.91 ns/op Improvement 187.50%
Length:       32 Fast:        4.77 ns/op Slow:       49.03 ns/op Improvement 927.83%
Length:       63 Fast:       17.15 ns/op Slow:       88.13 ns/op Improvement 413.77%
Length:       64 Fast:        6.46 ns/op Slow:       90.19 ns/op Improvement 1296.49%
Length:      100 Fast:       10.48 ns/op Slow:      134.66 ns/op Improvement 1185.05%
Length:      135 Fast:       14.45 ns/op Slow:      181.04 ns/op Improvement 1152.83%
Length:      256 Fast:       16.54 ns/op Slow:      333.77 ns/op Improvement 1918.28%
Length:     1024 Fast:       54.77 ns/op Slow:     1308.52 ns/op Improvement 2289.25%
Length:    65536 Fast:     2693.70 ns/op Slow:    83022.35 ns/op Improvement 2982.10%
PASS
ok   _/usr/local/code/sum 70.491s

Donovan Hide

Dec 16, 2012, 11:58:01 AM
to golan...@googlegroups.com
Actually, the return value bug seems to be a difference between Go 1.0.3 and tip. Note the difference in the offset for the return values:


Looks like a consequence of the move to 64-bit ints (presumably the slice header's len and cap fields doubling in size, which pushes the return value further up the frame):


I guess I'll have to stick with tip!

bryanturley

Dec 16, 2012, 3:10:45 PM
to golan...@googlegroups.com

I was bored so https://github.com/bryanturley/sum_uint8

    Sizes              Sum         FastSum        FastSum2      winner
size:  65536    1009419479,       66430393,       49866691,   FastSum2 wins   20.24x vs Sum    1.33x vs FastSum 
size:  32768     504542316,       35088326,       24717340,   FastSum2 wins   20.41x vs Sum    1.42x vs FastSum 
size:     16        407093,         135181,         409694,    FastSum wins    3.01x vs Sum    3.03x vs FastSum2 
size:     32        657248,         175972,         147953,   FastSum2 wins    4.44x vs Sum    1.19x vs FastSum 
size:     34        711121,         184361,         155237,   FastSum2 wins    4.58x vs Sum    1.19x vs FastSum 
size:     63       1142627,         472246,         440665,   FastSum2 wins    2.59x vs Sum    1.07x vs FastSum 
size:     64       1158126,         209803,         164159,   FastSum2 wins    7.05x vs Sum    1.28x vs FastSum 
size:    128       2142500,         266550,         201678,   FastSum2 wins   10.62x vs Sum    1.32x vs FastSum 
size:    287       4599422,         765123,         614701,   FastSum2 wins    7.48x vs Sum    1.24x vs FastSum 
size:    543       8524335,        1219218,         880528,   FastSum2 wins    9.68x vs Sum    1.38x vs FastSum 
size:    130       2163800,         292849,         228327,   FastSum2 wins    9.48x vs Sum    1.28x vs FastSum 
size:   1024      16004908,        1297445,         969886,   FastSum2 wins   16.50x vs Sum    1.34x vs FastSum 
(from a machine running X11 and a full desktop though so...)

On tip, from an AMD Phenom II 3.2GHz (SSE up to SSE3/AMD's SSE4a), though I stayed inside SSE2, I believe. Also, this version works on unaligned data... I think; should have written a test for that...
I rewrote your test code and didn't use the benchmark framework.
There wasn't enough testing for my tastes.
Your original version is untouched, I believe...

You are using MOVHLPS on integers. PS == packed singles, aka float32. On some (most? all?) SSE units that will force a conversion, possibly slowing it down, and take up more of the background registers.
I just woke up so I might be missing something obvious in there...
Also I have no idea what the calling conventions are or what the numbers at the end of this line mean:
TEXT ·FastSum2Uint8(SB),7,$0   // 7??? $0 ??? !?!?!?!
I will read up on it more later since I am sure those are important in some way ;)

Nigel Tao

Dec 16, 2012, 7:56:01 PM
to bryanturley, golang-nuts
On Mon, Dec 17, 2012 at 7:10 AM, bryanturley <bryan...@gmail.com> wrote:
> Also I have no idea what the calling conventions are or what the numbers at
> the end of this line mean:
> TEXT ·FastSum2Uint8(SB),7,$0 // 7??? $0 ??? !?!?!?!

The caller is responsible for adjusting the stack pointer, and copying
arguments to and return values from the stack. Consider this program:

$ cat main.go
package main

func f(x, y int) int {
	return x + y
}

var z int

func main() {
	z = f(3, 4)
}



The 6g disassembly (-S) with inlining disabled (-l) is:

$ go tool 6g -S -l -o /dev/null main.go

--- prog list "f" ---
0000 (main.go:3) TEXT f+0(SB),$0-24
0001 (main.go:4) MOVQ x+0(FP),BX
0002 (main.go:4) MOVQ y+8(FP),BP
0003 (main.go:4) ADDQ BP,BX
0004 (main.go:4) MOVQ BX,.noname+16(FP)
0005 (main.go:4) RET ,

--- prog list "main" ---
0006 (main.go:9) TEXT main+0(SB),$24-0
0007 (main.go:10) MOVQ $3,(SP)
0008 (main.go:10) MOVQ $4,8(SP)
0009 (main.go:10) CALL ,f+0(SB)
0010 (main.go:10) MOVQ 16(SP),AX
0011 (main.go:10) MOVQ AX,z+0(SB)
0012 (main.go:11) RET ,


CALL is a pseudo-instruction that the linker expands to also manage
the split stacks (and call runtime.morestack if necessary). The true
disassembly is:

$ go build -gcflags '-l' main.go
$ objdump -d main

0000000000400c00 <main.f>:
400c00: 48 8b 5c 24 08 mov 0x8(%rsp),%rbx
400c05: 48 8b 6c 24 10 mov 0x10(%rsp),%rbp
400c0a: 48 01 eb add %rbp,%rbx
400c0d: 48 89 5c 24 18 mov %rbx,0x18(%rsp)
400c12: c3 retq
...

0000000000400c20 <main.main>:
400c20: 64 48 8b 0c 25 f0 ff mov %fs:0xfffffffffffffff0,%rcx
400c27: ff ff
400c29: 48 3b 21 cmp (%rcx),%rsp
400c2c: 77 05 ja 400c33 <main.main+0x13>
400c2e: e8 ed 16 01 00 callq 412320 <runtime.morestack00>
400c33: 48 83 ec 18 sub $0x18,%rsp
400c37: 48 c7 04 24 03 00 00 movq $0x3,(%rsp)
400c3e: 00
400c3f: 48 c7 44 24 08 04 00 movq $0x4,0x8(%rsp)
400c46: 00 00
400c48: e8 b3 ff ff ff callq 400c00 <main.f>
400c4d: 48 8b 44 24 10 mov 0x10(%rsp),%rax
400c52: 48 89 04 25 40 9e 45 mov %rax,0x459e40
400c59: 00
400c5a: 48 83 c4 18 add $0x18,%rsp
400c5e: c3 retq
...



In general...

TEXT f(SB),$8-16 means that on entry to the function the linker should
insert the allocation (splitting if necessary) of an 8-byte stack
frame. That is, the function can safely scribble on bytes 0(SB)
through 7(SB). The -16 means that there are 16 bytes of arguments, at
bytes 0(FP) through 15(FP), and if the stack gets split by the code
the linker inserts, those 16 bytes need to be copied over too.

TEXT f(SB),7,$x means that the linker should never split the stack for
f: instead it should assume that the frame can be grown by x bytes
without a check. (The linker also verifies this claim at link time; x
must be small, and the chain of called functions can't add up to more
than 128 bytes.) Because there's no stack split there is no need for
the -16. The function can still refer to its arguments (they're in the
stack frame of the caller), unconditionally, since there is no chance
of a stack split.

TEXT f(SB),7,$0 means that the function needs no extra space for
itself and does not want the stack split. Like above, because there's
no stack split there's no declaration of how many argument bytes there
are.

The flag bits in the 7 are defined in src/cmd/6l/6.out.h. The important one is
#define NOSPLIT (1<<2)
See also src/pkg/runtime/stack.h.

minux

Dec 17, 2012, 12:22:34 PM
to Nigel Tao, bryanturley, golang-nuts
CALL is not a pseudo-instruction. The linker won't adjust the call instruction itself,
but rather inserts stack-split code in front of every function not labeled NOSPLIT.

> In general...
>
> TEXT f(SB),$8-16 means that on entry to the function the linker should
> insert the allocation (splitting if necessary) of an 8-byte stack
> frame. That is, the function can safely scribble on bytes 0(SB)

I believe you are talking about 0(SP) to 7(SP). Just a note, however: the convention
for the ARM toolchain is different. You'd better use R13 directly.

> through 7(SB). The -16 means that there are 16 bytes of arguments, at
> bytes 0(FP) through 15(FP), and if the stack gets split by the code
> the linker inserts, those 16 bytes need to be copied over too.
>
> TEXT f(SB),7,$x means that the linker should never split the stack for
> f: instead it should assume that the frame can be grown by x bytes
> without a check. (The linker also verifies this claim at link time; x
> must be small, and the chain of called functions can't add up to more
> than 128 bytes.) Because there's no stack split there is no need for

I believe it's 120 bytes that's checked by the linker.

> the -16. The function can still refer to its arguments (they're in the
> stack frame of the caller), unconditionally, since there is no chance
> of a stack split.
>
> TEXT f(SB),7,$0 means that the function needs no extra space for
> itself and does not want the stack split. Like above, because there's
> no stack split there's no declaration of how many argument bytes there
> are.
>
> The flag bits in the 7 are defined in src/cmd/6l/6.out.h. The important one is
> #define NOSPLIT (1<<2)
> See also src/pkg/runtime/stack.h.

I have a question: why do we always use 7 for the textflag for non-stack-split
functions?
According to 6.out.h:
1 means NOPROF (why? don't show up in the pprof records?),
2 means DUPOK (why this? the linker will allow multiple definitions of this symbol.)
4 means NOSPLIT.

Is this a historic artefact?

Anthony Martin

Dec 17, 2012, 12:34:26 PM
to minux, Nigel Tao, bryanturley, golang-nuts
minux <minu...@gmail.com> once said:
> I have a question: why do we always use 7 for the textflag for non-stack-split
> functions?
> According to 6.out.h:
> 1 means NOPROF (why? don't show up in the pprof records?),
> 2 means DUPOK (why this? the linker will allow multiple definitions of this
> symbol.)
> 4 means NOSPLIT.
>
> Is this a historic artefact?

Pretty much. DUPOK was used by the Alef compilers,
but NOPROF still has a use on Plan 9. It tells the
linker not to emit profile tracing stubs at the
start and end of a function. It's usually controlled
by the "profile" pragma in C code.

Cheers,
Anthony

Donovan Hide

Dec 18, 2012, 12:18:00 PM
to Anthony Martin, minux, Nigel Tao, bryanturley, golang-nuts
Hi again,

thank you to everyone for the information about using the assembler, and to Bryan especially for a formidably parallel solution! More as an exercise in learning the ins and outs of the assembler, I did a version which always uses PSADBW and uses a lookup table for masking the short runs.


I think it's great that you can dive into ASM when you need to and solve your bottlenecks.
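
The lookup-table part, modeled in Go (hypothetical names; in the assembly the AND is a PAND against the table entry and the 16-byte sum is PSADBW):

var masks [16][16]byte // masks[n] keeps the first n bytes, zeroes the rest

func init() {
	for n := range masks {
		for j := 0; j < n; j++ {
			masks[n][j] = 0xFF
		}
	}
}

// The short run is loaded as a full 16-byte block; the mask zeroes
// the bytes past its logical end before they are summed.
func sumMaskedBlock(block *[16]byte, n int) uint64 {
	var total uint64
	for j := range block {
		total += uint64(block[j] & masks[n][j])
	}
	return total
}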

Cheers,
Donovan.





bryanturley

Dec 18, 2012, 12:50:43 PM
to golan...@googlegroups.com, Anthony Martin, minux, Nigel Tao, bryanturley


On Tuesday, December 18, 2012 11:18:00 AM UTC-6, Donovan wrote:
> Hi again,
>
> thank you to everyone for the information about using the assembler, and to Bryan especially for a formidably parallel solution! More as an exercise in learning the ins and outs of the assembler, I did a version which always uses PSADBW and uses a lookup table for masking the short runs.
>
> I think it's great that you can dive into ASM when you need to and solve your bottlenecks.
>
> Cheers,
> Donovan.


MOVHLPS X0, X1 // move high (something) packed singles (float32)
You're still using MOVHLPS on integers.

I used
  PSRLDQ $8, X1 // doesn't cause register renaming with integers (I assume...)
Packed Shift Right Logical Double Quadword, except it shifts in bytes, not bits, so $8 means 8 bytes.


I do have a question for whoever is in control of the assembler:
if Intel's MOVDQA/MOVDQU/etc. became MOVOA/MOVOU, why didn't PSRLDQ become PSRLO?

I prefer MOVOU; it took me about 10 minutes to figure out what you guys named that instruction. I finally discovered it by thinking "what would I have named it..." ;)
Greatest name ever, DOUBLE QUAD.... Intel can't perform basic reduction? At least it wasn't MOVDDDA.

 

bryanturley

Dec 18, 2012, 12:53:51 PM
to golan...@googlegroups.com, Anthony Martin, minux, Nigel Tao, bryanturley


On Tuesday, December 18, 2012 11:50:43 AM UTC-6, bryanturley wrote:
> On Tuesday, December 18, 2012 11:18:00 AM UTC-6, Donovan wrote:
> > More as an exercise in learning the ins and outs of the assembler, I did a version which always uses PSADBW and uses a lookup table for masking the short runs.
>
> MOVHLPS X0, X1 // move high (something) packed singles (float32)
> You're still using MOVHLPS on integers.
>
> I used
>   PSRLDQ $8, X1 // doesn't cause register renaming with integers (I assume...)
> Packed Shift Right Logical Double Quadword, except it shifts in bytes, not bits, so $8 means 8 bytes.

Should have said EXTRA register renaming.

minux

Dec 18, 2012, 12:56:53 PM
to bryanturley, golan...@googlegroups.com, Anthony Martin, Nigel Tao
On Wed, Dec 19, 2012 at 1:50 AM, bryanturley <bryan...@gmail.com> wrote:
> I do have a question for whoever is in control of the assembler:
> if Intel's MOVDQA/MOVDQU/etc. became MOVOA/MOVOU, why didn't PSRLDQ become PSRLO?

It might just be an oversight. The translation to the gc toolchain's mnemonics is a bit difficult to
guess (just look through the source code when in doubt, http://tip.golang.org/src/cmd/6a/lex.c#L317).
I'm not sure we can change it though.

> I prefer MOVOU; it took me about 10 minutes to figure out what you guys named that instruction. I finally discovered it by thinking "what would I have named it..." ;)
> Greatest name ever, DOUBLE QUAD.... Intel can't perform basic reduction? At least it wasn't MOVDDDA.

Haha, that's the Intel style, a trend starting with "double word" meaning 32-bit.

bryanturley

Dec 18, 2012, 1:06:35 PM
to golan...@googlegroups.com, bryanturley, Anthony Martin, Nigel Tao


On Tuesday, December 18, 2012 11:56:53 AM UTC-6, minux wrote:
> On Wed, Dec 19, 2012 at 1:50 AM, bryanturley <bryan...@gmail.com> wrote:
> > I do have a question for whoever is in control of the assembler:
> > if Intel's MOVDQA/MOVDQU/etc. became MOVOA/MOVOU, why didn't PSRLDQ become PSRLO?
>
> It might just be an oversight. The translation to the gc toolchain's mnemonics is a bit difficult to
> guess (just look through the source code when in doubt, http://tip.golang.org/src/cmd/6a/lex.c#L317).
> I'm not sure we can change it though.

Well, the Go 1.0 promise was for Go, not for the mostly undocumented assembler, so I say change it (low priority). ;)

Do the Go compilers output straight to binary or do they output to the assembler?

bryanturley

Dec 18, 2012, 1:20:27 PM
to golan...@googlegroups.com, bryanturley, Anthony Martin, Nigel Tao


On Tuesday, December 18, 2012 12:06:35 PM UTC-6, bryanturley wrote:
> On Tuesday, December 18, 2012 11:56:53 AM UTC-6, minux wrote:
> > (just look through the source code when in doubt, http://tip.golang.org/src/cmd/6a/lex.c#L317).
> > I'm not sure we can change it though.

Looks like there is some confusion already (line 889):

	"MOVNTO",	LTYPE3,	AMOVNTO,
	"MOVNTDQ",	LTYPE3,	AMOVNTO, /* syn */
 

minux

Dec 18, 2012, 1:21:35 PM
to bryanturley, golan...@googlegroups.com, Anthony Martin, Nigel Tao
On Wed, Dec 19, 2012 at 2:06 AM, bryanturley <bryan...@gmail.com> wrote:
> On Tuesday, December 18, 2012 11:56:53 AM UTC-6, minux wrote:
> > It might just be an oversight. The translation to the gc toolchain's mnemonics is a bit difficult to
> > guess (just look through the source code when in doubt, http://tip.golang.org/src/cmd/6a/lex.c#L317).
> > I'm not sure we can change it though.
> Well, the Go 1.0 promise was for Go, not for the mostly undocumented assembler, so I say change it (low priority). ;)

We can change it; do you want to propose a CL?

> Do the Go compilers output straight to binary or do they output to the assembler?

Straight to binary (an internal representation of assembly language), so if we change the 6a/8a mnemonics,
.6 and .8 object files won't be affected; you just need to update all *_386.s and *_amd64.s
files in the tree to use the new mnemonics.

minux

Dec 18, 2012, 1:22:51 PM
to bryanturley, golan...@googlegroups.com, Anthony Martin, Nigel Tao
Oh, it seems the authors don't want to change assembly mnemonics...
