Inconsistent Behavior around Assembler Function Call (Help needed)

148 views
Skip to first unread message

Klaus Post

unread,
Feb 11, 2016, 11:53:06 AM2/11/16
to golang-nuts
Hi!

I am working on optimizing Snappy, and while I have good success, I have a problem, that *seem* like a linker error. I have struggled with it for two days, and I cannot find any explanation for it. It behaves very irrational, and small changes makes the problem behave different, with effects that seem unrelated to the changes made. There are no goroutines used, so it is not a concurrency issue.

It requires that you have an SSE 4.2 capable machine to reproduce. I have uploaded the reproduction case here:



Here are the interesting calls in 'encode.go:162'. The first is a call to the assembler function in 'asm_amd64.s':

// Extend the match to be as long as possible.
match := matchLenSSE4(src[t:], src[s:], length)

// Move this above the function above to "fix"
match2 := matchLenSSE4Ref(src[t:], src[s:], length)

The second call is a reference code check.

On my Go 1.5.3 - there will be a mismatch, and the program will fail a subsequent check:

> go test -v
=== RUN   TestSmallCopy
--- FAIL: TestSmallCopy (0.00s)
panic: got 1 != 8 expected [recovered]
        panic: got 1 != 8 expected

On Go 1.4.3 it gives a different result (yes - this is a funny problem):

panic: got 5 != 8 expected [recovered]
        panic: got 5 != 8 expected

Running with "-race" doesn't show any races, but sometimes it also yields a different results. If you move the "match2 := ..." above the first line, it "fixes" the problem on Go 1.5.3, but if you add "-race", it fails.

There is *one* write in the assembler function, which I am pretty sure is on the correct stack position (at least according to go vet). I am pretty sure that I don't leak any state from the assembler, and it has NOSPLIT, which should mean that it should be able to derive the correct stack, which isn't changed by the assembler function.

I have tried to disassemble the generated code, but I have been unable to locate the exact cause of the problem. It *seems* like some bad code generation is going on around the function call, but I know the most likely answer  is something I overlooked, probably in my assembler. I have stared at this problem for about a day without getting any progress, so I hope some of you out there can help me.


/Klaus



Nigel Tao

unread,
Feb 11, 2016, 11:38:23 PM2/11/16
to Klaus Post, golang-nuts
The short answer is that
http://www.felixcloutier.com/x86/PCMPESTRI.html (which copy/pastes the
Intel manuals) says that "The input length register is EAX/RAX (for
xmm1) or EDX/RDX (for xmm2/m128)". You do not set your AX / DX
registers before you issue the PCMPESTRI.

FWIW, I simplified the repro to this:

func init() {
Encode(nil, []byte("aaaabbbbbbbbbaaaabbbb"))
}

To debug this, I built the test binary, as per "go help test":

$ go test -c -o a.out

To find the fully qualified name of the matchLenSSE4 function:

$ objdump -d a.out | grep matchLenSSE4
etc
000000000047ee70 <_/tmp/x.matchLenSSE4>:
etc

I could then run it under gdb:

$ gdb a.out
etc
(gdb) b _/tmp/x.matchLenSSE4
Breakpoint 1 at 0x47ee70: file /tmp/x/asm_amd64.s, line 9.
(gdb) r
etc
(gdb) si

repeat the step-instruction a few times until you get past the
PCMPESTRI instruction

(gdb) si
25 BYTE $0x66; BYTE $0x0f; BYTE $0x3a
(gdb) si
28 JC match_ended
(gdb) p $ecx
$1 = 1

Huh, CX is 1 instead of 8, which suggests that PCMPESTRI is not being
used properly, which led me to the PCMPESTRI docs, which led me to the
short answer above.

To check that, re-run the binary, step-instruction to just before the
PCMPESTRI, and print the registers:

(gdb) si
25 BYTE $0x66; BYTE $0x0f; BYTE $0x3a
(gdb) info registers
rax 0x5 5
rbx 0x4 4
rcx 0x15 21
rdx 0x1 1
etc

Ah, min(AX, DX) is 1, which explains why CX becomes 1 instead of 8,
and why it's not deterministic (e.g. Go 1.4.3 vs Go 1.5.3):

Separately, it's a comment typo, so it doesn't affect the code, but in
this line:
// 0x18 = _SIDD_UBYTE_OPS (0x8) | _SIDD_CMP_EQUAL_EACH (0x8) |
_SIDD_NEGATIVE_POLARITY (0x10)
_SIDD_UBYTE_OPS should be 0x0, not 0x8.

Finally, this isn't related to your immediate question, but if you're
going to propose a pull request for github.com/golang/snappy to add
things like this, I'd like to see some additional tests at the very
least, and the performance improvements need to be pretty dramatic to
accept the complexity of using assembly.

Klaus Post

unread,
Feb 12, 2016, 11:45:28 AM2/12/16
to golang-nuts, klau...@gmail.com
On Friday, 12 February 2016 05:38:23 UTC+1, Nigel Tao wrote:
The short answer is that
http://www.felixcloutier.com/x86/PCMPESTRI.html (which copy/pastes the
Intel manuals) says that "The input length register is EAX/RAX (for
xmm1) or EDX/RDX (for xmm2/m128)".

Ah. There it is! Implicit register use is such a pain on x86. Thanks!


Finally, this isn't related to your immediate question, but if you're
going to propose a pull request for github.com/golang/snappy

I am not planning that. I don't see any particular point in producing bit-exact output with the C-implementation, so I am not working toward that.

I'd like to see some additional tests at the very
least, and the performance improvements need to be pretty dramatic to
accept the complexity of using assembly.

 I am seeing about 25% improvement on typical content, of course mostly on AMD64.

The built-in benchmarks are here: https://github.com/klauspost/compress#snappy-package - but I plan a lot more testing/tweaking going forward, and a multi-goroutine Writer as well.


/Klaus

Nigel Tao

unread,
Feb 12, 2016, 10:52:06 PM2/12/16
to Klaus Post, golang-nuts
On Sat, Feb 13, 2016 at 3:45 AM, Klaus Post <klau...@gmail.com> wrote:
> I am seeing about 25% improvement on typical content, of course mostly on
> AMD64.
>
> The built-in benchmarks are here:
> https://github.com/klauspost/compress#snappy-package - but I plan a lot more
> testing/tweaking going forward, and a multi-goroutine Writer as well.

OK, but FWIW, the benchmark numbers in your README.md are listed as
against upstream Snappy as of Feb 9, and there's been a bunch of
changes since then, including a sizeLimit bug fix to Benchmark_ZFlat3,
and I've just checked in a stack size reduction based on your PR #23.

Klaus Post

unread,
Feb 13, 2016, 3:08:45 AM2/13/16
to golang-nuts, klau...@gmail.com
On Saturday, 13 February 2016 04:52:06 UTC+1, Nigel Tao wrote:

OK, but FWIW, the benchmark numbers in your README.md are listed as
against upstream Snappy as of Feb 9, and there's been a bunch of
changes since then, including a sizeLimit bug fix to Benchmark_ZFlat3,
and I've just checked in a stack size reduction based on your PR #23.

The benchmark numbers were from before merging your changes.

/Klaus 
Reply all
Reply to author
Forward
0 new messages