treating []byte as []uint64

811 views
Skip to first unread message

Dan Adkins

unread,
Oct 11, 2011, 3:08:46 PM10/11/11
to golang-nuts
I'd like to do something unsafe but fast. I have the following simple function:

// XorEq computes a ^= b, bitwise.
func XorEq(a, b []byte) {
for i := range a {
a[i] ^= b[i]
}
}

Assumes the inputs have been validated and lengths are equal.

I'd love to be able to speed this up if I know that a & b are
multiples of 8 bytes long by operating on uint64's instead of bytes:

func XorEq64(a, b []uint64) {
for i := range a {
a[i] ^= b[i]
}
}

1) What's the best way to go about this? I don't see an obvious
(unsafe) way to convert between slice types. Pointers, yes. Slices,
no.

2) How would I go about ensuring that my byte slices are 8-byte
aligned? Cache-line aligned?

3) Can we do this optimization 16-bytes at a time with a 128-bit SSE
XOR operation? These instructions often depend on 16-byte alignment,
which is more than the 8-byte alignment that the x86_64
architecture/calling convention guarantees.

(This particular example may be amenable to compiler optimization.
gcc knows how to do this. But we can certainly come up with other
similar examples that a compiler won't optimize for us.)

-Dan

Brad Fitzpatrick

unread,
Oct 11, 2011, 3:29:05 PM10/11/11
to Dan Adkins, golang-nuts

Ian Lance Taylor

unread,
Oct 11, 2011, 4:22:01 PM10/11/11
to Dan Adkins, golang-nuts
Dan Adkins <dad...@gmail.com> writes:

> I'd like to do something unsafe but fast. I have the following simple function:
>
> // XorEq computes a ^= b, bitwise.
> func XorEq(a, b []byte) {
> for i := range a {
> a[i] ^= b[i]
> }
> }
>
> Assumes the inputs have been validated and lengths are equal.
>
> I'd love to be able to speed this up if I know that a & b are
> multiples of 8 bytes long by operating on uint64's instead of bytes:
>
> func XorEq64(a, b []uint64) {
> for i := range a {
> a[i] ^= b[i]
> }
> }
>
> 1) What's the best way to go about this? I don't see an obvious
> (unsafe) way to convert between slice types. Pointers, yes. Slices,
> no.

One approach is something like

*[100000000]uint64)(unsafe.Pointer(&b[0])

and be careful not to read past the end of the original slice.

To get a slice you can use reflect.SliceHeader, set the fields, and
convert a pointer to that to a pointer to a slice. Obviously that is
very unsafe.


> 2) How would I go about ensuring that my byte slices are 8-byte
> aligned? Cache-line aligned?

There is no way to do that at present. You would have to test the
alignment at runtime.


> 3) Can we do this optimization 16-bytes at a time with a 128-bit SSE
> XOR operation? These instructions often depend on 16-byte alignment,
> which is more than the 8-byte alignment that the x86_64
> architecture/calling convention guarantees.

You would have to write assembler code.

Ian

Dan Adkins

unread,
Oct 11, 2011, 5:17:28 PM10/11/11
to golang-nuts
Thank Brad and Ian.

I tried the SliceHeader approach to make a []uint64 out of a []byte.
It was gross but it worked pretty well.

package xor

import (
"reflect"
"unsafe"
)

func xoreq(a, b []byte) {


for i := range a {
a[i] ^= b[i]
}
}

func xoreq64(a, b []uint64) {


for i := range a {
a[i] ^= b[i]
}
}

// touint64 assumes len(x)%8 == 0
func touint64(x []byte) []uint64 {
xx := make([]uint64, 0, 0)
hdrp := (*reflect.SliceHeader)(unsafe.Pointer(&xx))
hdrp.Data = (*reflect.SliceHeader)(unsafe.Pointer(&x)).Data
hdrp.Len = len(x) / 8
hdrp.Cap = len(x) / 8
return xx
}

func XorEq(a, b []byte) {
if len(a) != len(b) {
panic("lengths not equal")
}
if len(a)%8 == 0 {
xoreq64(touint64(a), touint64(b))
} else {
xoreq(a, b)
}
}

A small benchmark shows a huge improvement.

Before:
xor.BenchmarkXor16K 100000 39490 ns/op 414.88 MB/s

After:
xor.BenchmarkXor16K 500000 3631 ns/op 4511.22 MB/s

-Dan

Eleanor McHugh

unread,
Oct 12, 2011, 4:16:35 PM10/12/11
to golang-nuts
On 11 Oct 2011, at 20:08, Dan Adkins wrote:
> I'd like to do something unsafe but fast.

Take a look at raw [0], it has code demonstrating how to resize/retype arbitrary go objects in a sane(-ish) manner.
I originally wrote it to do fast slice type conversions in my VM experiments.

Ellie

[0] http://github.com/feyeleanor/raw

Eleanor McHugh
Games With Brains
http://feyeleanor.tel
----
if _, ok := reality.(Reasonable); !ok { panic(reality) }

unread,
Oct 12, 2011, 4:47:31 PM10/12/11
to golang-nuts
On Oct 11, 9:08 pm, Dan Adkins <dadk...@gmail.com> wrote:
> 2) How would I go about ensuring that my byte slices are 8-byte
> aligned?  Cache-line aligned?

(Maybe I am repeating what Ian already wrote)

You can allocate a larger array (in case of 8-byte alignment: larger
by 7 bytes). Than investigate the expression
"uintptr(unsafe.Pointer(&slice[0]))". If the alignment is invalid,
choose a different (properly aligned) starting element for the slice.

Dan Adkins

unread,
Oct 18, 2011, 8:08:05 PM10/18/11
to golang-nuts
I was playing around with this some more. I went from a version which
hacked a []byte into a []uint64 so we could do 8 bytes at a time, to
an assembly version which does 16 at a time with SSE intstructions:

// func xoreq(a, b []byte)
TEXT ·xoreq(SB),7,$0
MOVQ a+0(FP), SI // a.data
MOVL a+8(FP), CX // a.len
MOVQ b+16(FP), DI // b.data

aligned:
// If there's less than 16 bytes to process, do it byte-by-byte.
CMPL CX, $16
JL cleanup

MOVQ (SI), M1 // a[i]
MOVQ (DI), M0 // b[i]
PXOR M0, M1
MOVQ M1, (SI) // a[i] ^= b[i]
ADDQ $16, SI
ADDQ $16, DI
SUBQ $16, CX
JMP aligned

cleanup:
// We may have some bytes left to process one at a time.
CMPL CX, $0
JE done

MOVB (DI), BX
XORB BX, (SI)
INCQ SI
INCQ DI
DECQ CX
JMP cleanup

done:
RET

No alignment hacks necessary, it just works. And it's three times faster!

Before:
xor.BenchmarkXor 500000 3209 ns/op 5104.83 MB/s

After:
xor.BenchmarkXor 1000000 1037 ns/op 15786.60 MB/s

Note: 6a seems to have non-standard instruction and register names
(where the standard is the Intel manual). I was looking for XMM0, ...
XMM7. It seems they're called M0, ..., M7 instead. Also, the double
quadword move instructions, MOVDQU and MOVDQA, weren't present.
Instead MOVQ does it all... is that right? My tests pass, it must be.
:)

-Dan

Evan Shaw

unread,
Oct 18, 2011, 8:56:10 PM10/18/11
to Dan Adkins, golang-nuts
On Wed, Oct 19, 2011 at 1:08 PM, Dan Adkins <dad...@gmail.com> wrote:
> Note: 6a seems to have non-standard instruction and register names
> (where the standard is the Intel manual).  I was looking for XMM0, ...
> XMM7.  It seems they're called M0, ..., M7 instead.  Also, the double
> quadword move instructions, MOVDQU and MOVDQA, weren't present.
> Instead MOVQ does it all... is that right?  My tests pass, it must be.
> :)

I think you may have inadvertantly written MMX instructions instead of
SSE instructions. I believe registers M0...M7 refer to MMX registers.
The SSE registers are X0...X15. MOVDQA is called MOVO (or MOVOA) and
MOVDQU is called MOVOU. objdump can dump the instructions you've
written in a more familiar dialect, so you can verify that you've
written what you think you've written.

You might find it helpful to reference amd64's bytes.IndexByte
function, which is somewhat similar:
http://code.google.com/p/go/source/browse/src/pkg/bytes/asm_amd64.s

The assembler's based on the Plan 9 assembler, which is somewhat
documented here: http://doc.cat-v.org/plan_9/4th_edition/papers/asm

The best documentation at this point is the assembler and linker code.
Somewhere on this list (or possibly golang-dev) Russ posted about
looking up instructions in the code, but I can't find the post.

- Evan

Dan Adkins

unread,
Oct 19, 2011, 2:02:54 PM10/19/11
to Evan Shaw, golang-nuts
On Tue, Oct 18, 2011 at 5:56 PM, Evan Shaw <eds...@gmail.com> wrote:
> I think you may have inadvertantly written MMX instructions instead of
> SSE instructions. I believe registers M0...M7 refer to MMX registers.
> The SSE registers are X0...X15. MOVDQA is called MOVO (or MOVOA) and
> MOVDQU is called MOVOU. objdump can dump the instructions you've
> written in a more familiar dialect, so you can verify that you've
> written what you think you've written.

I am afraid you are correct. I was only operating on 8 bytes at a time, not 16.

I would like to use some of the eight 128-bit data registers, XMM0,
..., XMM7, and the SSE instruction XORPS to XOR two of these registers
together. How would I do that?

-Dan

Dan Adkins

unread,
Oct 19, 2011, 2:22:02 PM10/19/11
to Evan Shaw, golang-nuts

MOVOU (SI), X0 // a[i]
MOVOU (DI), X1 // b[i]
XORPS X0, X1
MOVOU X1, (SI) // a[i] ^= b[i]

Thanks. The XMM registers are simply X0,...,X7, and the MOVO
instruction was the key to getting things into registers.

-Dan

Bobby Powers

unread,
Oct 19, 2011, 3:11:30 PM10/19/11
to Dan Adkins, Evan Shaw, golang-nuts
On Wed, Oct 19, 2011 at 2:22 PM, Dan Adkins <dad...@gmail.com> wrote:
> Thanks.  The XMM registers are simply X0,...,X7, and the MOVO
> instruction was the key to getting things into registers.

How does SSE compare to MMX in terms of throughput?

Dan Adkins

unread,
Oct 19, 2011, 3:33:47 PM10/19/11
to Bobby Powers, golang-nuts

MMX does 8 bytes at a time, while SSE does 16 at a time.
Unfortunately, my first version used 8-byte MMX registers and
instructions but advanced 16 bytes at a time through the data. :) It
didn't work. But for comparison, here are some correct benchmarks:

MMX:
xor.BenchmarkXor 1000000 2039 ns/op 8031.80 MB/s

SSE:
xor.BenchmarkXor 1000000 1028 ns/op 15935.73 MB/s

Of course, there's really no point to using 64-bit MMX registers or
instructions for XOR. The ordinary XOR instruction handles 64-bit
operands just fine, and at the same speed!

-Dan

Reply all
Reply to author
Forward
0 new messages