// XorEq computes a ^= b, bitwise.
func XorEq(a, b []byte) {
for i := range a {
a[i] ^= b[i]
}
}
Assumes the inputs have been validated and lengths are equal.
I'd love to be able to speed this up if I know that a & b are
multiples of 8 bytes long by operating on uint64's instead of bytes:
func XorEq64(a, b []uint64) {
for i := range a {
a[i] ^= b[i]
}
}
1) What's the best way to go about this? I don't see an obvious
(unsafe) way to convert between slice types. Pointers, yes. Slices,
no.
2) How would I go about ensuring that my byte slices are 8-byte
aligned? Cache-line aligned?
3) Can we do this optimization 16-bytes at a time with a 128-bit SSE
XOR operation? These instructions often depend on 16-byte alignment,
which is more than the 8-byte alignment that the x86_64
architecture/calling convention guarantees.
(This particular example may be amenable to compiler optimization.
gcc knows how to do this. But we can certainly come up with other
similar examples that a compiler won't optimize for us.)
-Dan
> I'd like to do something unsafe but fast. I have the following simple function:
>
> // XorEq computes a ^= b, bitwise.
> func XorEq(a, b []byte) {
> for i := range a {
> a[i] ^= b[i]
> }
> }
>
> Assumes the inputs have been validated and lengths are equal.
>
> I'd love to be able to speed this up if I know that a & b are
> multiples of 8 bytes long by operating on uint64's instead of bytes:
>
> func XorEq64(a, b []uint64) {
> for i := range a {
> a[i] ^= b[i]
> }
> }
>
> 1) What's the best way to go about this? I don't see an obvious
> (unsafe) way to convert between slice types. Pointers, yes. Slices,
> no.
One approach is something like
*[100000000]uint64)(unsafe.Pointer(&b[0])
and be careful not to read past the end of the original slice.
To get a slice you can use reflect.SliceHeader, set the fields, and
convert a pointer to that to a pointer to a slice. Obviously that is
very unsafe.
> 2) How would I go about ensuring that my byte slices are 8-byte
> aligned? Cache-line aligned?
There is no way to do that at present. You would have to test the
alignment at runtime.
> 3) Can we do this optimization 16-bytes at a time with a 128-bit SSE
> XOR operation? These instructions often depend on 16-byte alignment,
> which is more than the 8-byte alignment that the x86_64
> architecture/calling convention guarantees.
You would have to write assembler code.
Ian
I tried the SliceHeader approach to make a []uint64 out of a []byte.
It was gross but it worked pretty well.
package xor
import (
"reflect"
"unsafe"
)
func xoreq(a, b []byte) {
for i := range a {
a[i] ^= b[i]
}
}
func xoreq64(a, b []uint64) {
for i := range a {
a[i] ^= b[i]
}
}
// touint64 assumes len(x)%8 == 0
func touint64(x []byte) []uint64 {
xx := make([]uint64, 0, 0)
hdrp := (*reflect.SliceHeader)(unsafe.Pointer(&xx))
hdrp.Data = (*reflect.SliceHeader)(unsafe.Pointer(&x)).Data
hdrp.Len = len(x) / 8
hdrp.Cap = len(x) / 8
return xx
}
func XorEq(a, b []byte) {
if len(a) != len(b) {
panic("lengths not equal")
}
if len(a)%8 == 0 {
xoreq64(touint64(a), touint64(b))
} else {
xoreq(a, b)
}
}
A small benchmark shows a huge improvement.
Before:
xor.BenchmarkXor16K 100000 39490 ns/op 414.88 MB/s
After:
xor.BenchmarkXor16K 500000 3631 ns/op 4511.22 MB/s
-Dan
Take a look at raw [0], it has code demonstrating how to resize/retype arbitrary go objects in a sane(-ish) manner.
I originally wrote it to do fast slice type conversions in my VM experiments.
Ellie
[0] http://github.com/feyeleanor/raw
Eleanor McHugh
Games With Brains
http://feyeleanor.tel
----
if _, ok := reality.(Reasonable); !ok { panic(reality) }
// func xoreq(a, b []byte)
TEXT ·xoreq(SB),7,$0
MOVQ a+0(FP), SI // a.data
MOVL a+8(FP), CX // a.len
MOVQ b+16(FP), DI // b.data
aligned:
// If there's less than 16 bytes to process, do it byte-by-byte.
CMPL CX, $16
JL cleanup
MOVQ (SI), M1 // a[i]
MOVQ (DI), M0 // b[i]
PXOR M0, M1
MOVQ M1, (SI) // a[i] ^= b[i]
ADDQ $16, SI
ADDQ $16, DI
SUBQ $16, CX
JMP aligned
cleanup:
// We may have some bytes left to process one at a time.
CMPL CX, $0
JE done
MOVB (DI), BX
XORB BX, (SI)
INCQ SI
INCQ DI
DECQ CX
JMP cleanup
done:
RET
No alignment hacks necessary, it just works. And it's three times faster!
Before:
xor.BenchmarkXor 500000 3209 ns/op 5104.83 MB/s
After:
xor.BenchmarkXor 1000000 1037 ns/op 15786.60 MB/s
Note: 6a seems to have non-standard instruction and register names
(where the standard is the Intel manual). I was looking for XMM0, ...
XMM7. It seems they're called M0, ..., M7 instead. Also, the double
quadword move instructions, MOVDQU and MOVDQA, weren't present.
Instead MOVQ does it all... is that right? My tests pass, it must be.
:)
-Dan
I think you may have inadvertantly written MMX instructions instead of
SSE instructions. I believe registers M0...M7 refer to MMX registers.
The SSE registers are X0...X15. MOVDQA is called MOVO (or MOVOA) and
MOVDQU is called MOVOU. objdump can dump the instructions you've
written in a more familiar dialect, so you can verify that you've
written what you think you've written.
You might find it helpful to reference amd64's bytes.IndexByte
function, which is somewhat similar:
http://code.google.com/p/go/source/browse/src/pkg/bytes/asm_amd64.s
The assembler's based on the Plan 9 assembler, which is somewhat
documented here: http://doc.cat-v.org/plan_9/4th_edition/papers/asm
The best documentation at this point is the assembler and linker code.
Somewhere on this list (or possibly golang-dev) Russ posted about
looking up instructions in the code, but I can't find the post.
- Evan
I am afraid you are correct. I was only operating on 8 bytes at a time, not 16.
I would like to use some of the eight 128-bit data registers, XMM0,
..., XMM7, and the SSE instruction XORPS to XOR two of these registers
together. How would I do that?
-Dan
MOVOU (SI), X0 // a[i]
MOVOU (DI), X1 // b[i]
XORPS X0, X1
MOVOU X1, (SI) // a[i] ^= b[i]
Thanks. The XMM registers are simply X0,...,X7, and the MOVO
instruction was the key to getting things into registers.
-Dan
How does SSE compare to MMX in terms of throughput?
MMX does 8 bytes at a time, while SSE does 16 at a time.
Unfortunately, my first version used 8-byte MMX registers and
instructions but advanced 16 bytes at a time through the data. :) It
didn't work. But for comparison, here are some correct benchmarks:
MMX:
xor.BenchmarkXor 1000000 2039 ns/op 8031.80 MB/s
SSE:
xor.BenchmarkXor 1000000 1028 ns/op 15935.73 MB/s
Of course, there's really no point to using 64-bit MMX registers or
instructions for XOR. The ordinary XOR instruction handles 64-bit
operands just fine, and at the same speed!
-Dan