The short answer is that
http://www.felixcloutier.com/x86/PCMPESTRI.html (which copy/pastes the
Intel manuals) says that "The input length register is EAX/RAX (for
xmm1) or EDX/RDX (for xmm2/m128)". Your code does not set AX and DX
before it issues the PCMPESTRI, so the instruction takes its lengths
from whatever those registers happen to hold.
FWIW, I simplified the repro to this:
func init() {
	Encode(nil, []byte("aaaabbbbbbbbbaaaabbbb"))
}
To debug this, I built the test binary, as per "go help test":
$ go test -c -o a.out
To find the fully qualified name of the matchLenSSE4 function:
$ objdump -d a.out | grep matchLenSSE4
etc
000000000047ee70 <_/tmp/x.matchLenSSE4>:
etc
I could then run it under gdb:
$ gdb a.out
etc
(gdb) b _/tmp/x.matchLenSSE4
Breakpoint 1 at 0x47ee70: file /tmp/x/asm_amd64.s, line 9.
(gdb) r
etc
(gdb) si
Repeat the step-instruction a few times until you get past the
PCMPESTRI instruction:
(gdb) si
25 BYTE $0x66; BYTE $0x0f; BYTE $0x3a
(gdb) si
28 JC match_ended
(gdb) p $ecx
$1 = 1
Huh, CX is 1 instead of 8, which suggests that PCMPESTRI is not being
used properly. That led me to the PCMPESTRI docs, and from there to the
short answer above.
To check that, re-run the binary, step-instruction to just before the
PCMPESTRI, and print the registers:
(gdb) si
25 BYTE $0x66; BYTE $0x0f; BYTE $0x3a
(gdb) info registers
rax 0x5 5
rbx 0x4 4
rcx 0x15 21
rdx 0x1 1
etc
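Ah, min(AX, DX) is 1, which explains why CX becomes 1 instead of 8.
It also explains why the result is not deterministic (e.g. Go 1.4.3 vs
Go 1.5.3): AX and DX hold whatever earlier code left in them, and that
can differ from one compiler version to the next.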
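
To make the min(AX, DX) effect concrete, here is a rough pure-Go model
of what PCMPESTRI does with imm8 = 0x18, as I read the Intel docs. It is
a sketch, not the real instruction. The function names, the window
helper, and the choice of which 16-byte windows to compare are my
assumptions for illustration, not taken from your code.

```go
package main

import "fmt"

// clampLen mirrors how PCMPESTRI reads a length register: it takes the
// absolute value and saturates it at 16 (the byte width of an XMM register).
func clampLen(n int) int {
	if n < 0 {
		n = -n
	}
	if n > 16 {
		n = 16
	}
	return n
}

// pcmpestriEqualEachNeg is a pure-Go model (hypothetical name) of PCMPESTRI
// with imm8 = 0x18: unsigned bytes, "equal each", negative polarity,
// least-significant index. la and lb stand in for AX and DX. The return
// value is what the instruction would leave in CX.
func pcmpestriEqualEachNeg(a, b [16]byte, la, lb int) int {
	la, lb = clampLen(la), clampLen(lb)
	for i := 0; i < 16; i++ {
		aValid, bValid := i < la, i < lb
		var equal bool
		switch {
		case aValid && bValid:
			equal = a[i] == b[i]
		case !aValid && !bValid:
			equal = true // both past their length: forced to "equal"
		default:
			equal = false // only one past its length: forced to "not equal"
		}
		if !equal {
			return i // negative polarity: report the first mismatch
		}
	}
	return 16 // no mismatch anywhere in the register
}

// window copies 16 bytes of s starting at off (hypothetical helper).
func window(s string, off int) [16]byte {
	var w [16]byte
	copy(w[:], s[off:off+16])
	return w
}

func main() {
	src := "aaaabbbbbbbbbaaaabbbb"
	a, b := window(src, 4), window(src, 5) // assumed windows, for illustration

	fmt.Println(pcmpestriEqualEachNeg(a, b, 16, 16)) // lengths set explicitly
	fmt.Println(pcmpestriEqualEachNeg(a, b, 5, 1))   // AX=5, DX=1, as in gdb
}
```

With both lengths set to 16 the model reports 8. With the stale values
from the gdb session (5 and 1) it reports 1, because byte 1 is past the
end of the shorter operand and so counts as a mismatch.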
Separately, it's a comment typo, so it doesn't affect the code, but in
this line:
// 0x18 = _SIDD_UBYTE_OPS (0x8) | _SIDD_CMP_EQUAL_EACH (0x8) |
_SIDD_NEGATIVE_POLARITY (0x10)
_SIDD_UBYTE_OPS should be 0x0, not 0x8.
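
As a quick sanity check of the arithmetic, the three flags can be
written out as Go constants. The values are the ones Intel defines for
the _SIDD_* macros; the Go identifiers are only illustrative.

```go
package main

import "fmt"

// Values follow Intel's _SIDD_* macros; the Go names are illustrative only.
const (
	siddUByteOps         = 0x00 // unsigned byte elements
	siddCmpEqualEach     = 0x08 // compare element i with element i
	siddNegativePolarity = 0x10 // invert the comparison result
)

// imm8 is the immediate byte passed to PCMPESTRI.
const imm8 = siddUByteOps | siddCmpEqualEach | siddNegativePolarity

func main() {
	fmt.Printf("%#x\n", imm8) // prints 0x18
}
```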
Finally, this isn't related to your immediate question, but if you're
going to propose a pull request for github.com/golang/snappy to add
things like this, I'd like to see some additional tests at the very
least, and the performance improvements need to be pretty dramatic to
justify the complexity of using assembly.
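
As one idea of what such tests might look like, here is a sketch that
compares a match-length implementation against a simple pure-Go
reference for every length up to a bound, and every mismatch position
within it. The func(a, b []byte) int signature is an assumption (I am
not relying on the real signature of matchLenSSE4), and matchLenRef and
checkMatchLen are hypothetical names.

```go
package main

import (
	"bytes"
	"fmt"
)

// matchLenRef is a simple reference: the number of leading bytes that
// a and b have in common.
func matchLenRef(a, b []byte) int {
	n := 0
	for n < len(a) && n < len(b) && a[n] == b[n] {
		n++
	}
	return n
}

// checkMatchLen runs impl against matchLenRef on inputs of every length up
// to maxLen, with the first difference placed at every possible position
// (diff == n means the inputs are identical). This covers lengths just
// below, at, and above the 16-byte register width.
func checkMatchLen(impl func(a, b []byte) int, maxLen int) error {
	for n := 0; n <= maxLen; n++ {
		for diff := 0; diff <= n; diff++ {
			a := bytes.Repeat([]byte{'x'}, n)
			b := bytes.Repeat([]byte{'x'}, n)
			if diff < n {
				b[diff] = 'y'
			}
			want := matchLenRef(a, b)
			if got := impl(a, b); got != want {
				return fmt.Errorf("n=%d diff=%d: got %d, want %d", n, diff, got, want)
			}
		}
	}
	return nil
}

func main() {
	// Stand-in: the reference checked against itself. A real test would
	// pass a wrapper around the assembly routine instead.
	fmt.Println(checkMatchLen(matchLenRef, 40))
}
```

A check along these lines should catch the stale-length bug above,
since most of its inputs need an answer other than min(AX, DX).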