79 views

Skip to first unread message

Sep 1, 2016, 2:05:00 PM9/1/16

to Ian Lance Taylor, golang-co...@googlegroups.com

Ilya Tocar uploaded a change:

https://go-review.googlesource.com/28392

math: fix sqrt regression on AMD64

1.7 introduced a significant regression compared to 1.6:

SqrtIndirect-4 2.32ns ± 0% 7.86ns ± 0% +238.79% (p=0.000 n=20+18)

This is caused by sqrtsd preserving upper part of destination register.

Which introduces dependency on previous value of X0.

In 1.6 benchmark loop didn't use X0 immediately after call:

callq *%rbx

movsd 0x8(%rsp),%xmm2

movsd 0x20(%rsp),%xmm1

addsd %xmm2,%xmm1

mov 0x18(%rsp),%rax

inc %rax

jmp loop

In 1.7 however xmm0 is used just after call:

callq *%rbx

mov 0x10(%rsp),%rcx

lea 0x1(%rcx),%rax

movsd 0x8(%rsp),%xmm0

movsd 0x18(%rsp),%xmm1

I've verified that this is caused by dependency, by inserting

XORPS X0,X0 in the beginning of math.Sqrt, which puts performance back on

1.6 level.

Splitting SQRTSD mem,reg into:

MOVSD mem,reg

SQRTSD reg,reg

Removes dependency, because MOVSD (load version)

doesn't need to preserve upper part of a register.

And reg,reg operation is solved by renamer in CPU.

As a result of this change regression is gone:

SqrtIndirect-4 7.86ns ± 0% 2.33ns ± 0% -70.36% (p=0.000 n=18+17)

Change-Id: Ic7eebe8866445adff5bc38192fa8d64c9a6b8872

---

M src/math/sqrt_amd64.s

1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/src/math/sqrt_amd64.s b/src/math/sqrt_amd64.s

index f8d825d..d72000f 100644

--- a/src/math/sqrt_amd64.s

+++ b/src/math/sqrt_amd64.s

@@ -5,7 +5,8 @@

#include "textflag.h"

// func Sqrt(x float64) float64

-TEXT ·Sqrt(SB),NOSPLIT,$0

- SQRTSD x+0(FP), X0

- MOVSD X0, ret+8(FP)

+TEXT ·Sqrt(SB), NOSPLIT, $0

+ MOVSD x+0(FP), X0

+ SQRTSD X0, X1

+ MOVSD X1, ret+8(FP)

RET

--

https://go-review.googlesource.com/28392

https://go-review.googlesource.com/28392

math: fix sqrt regression on AMD64

1.7 introduced a significant regression compared to 1.6:

SqrtIndirect-4 2.32ns ± 0% 7.86ns ± 0% +238.79% (p=0.000 n=20+18)

This is caused by sqrtsd preserving upper part of destination register.

Which introduces dependency on previous value of X0.

In 1.6 benchmark loop didn't use X0 immediately after call:

callq *%rbx

movsd 0x8(%rsp),%xmm2

movsd 0x20(%rsp),%xmm1

addsd %xmm2,%xmm1

mov 0x18(%rsp),%rax

inc %rax

jmp loop

In 1.7 however xmm0 is used just after call:

callq *%rbx

mov 0x10(%rsp),%rcx

lea 0x1(%rcx),%rax

movsd 0x8(%rsp),%xmm0

movsd 0x18(%rsp),%xmm1

I've verified that this is caused by dependency, by inserting

XORPS X0,X0 in the beginning of math.Sqrt, which puts performance back on

1.6 level.

Splitting SQRTSD mem,reg into:

MOVSD mem,reg

SQRTSD reg,reg

Removes dependency, because MOVSD (load version)

doesn't need to preserve upper part of a register.

And reg,reg operation is solved by renamer in CPU.

As a result of this change regression is gone:

SqrtIndirect-4 7.86ns ± 0% 2.33ns ± 0% -70.36% (p=0.000 n=18+17)

Change-Id: Ic7eebe8866445adff5bc38192fa8d64c9a6b8872

---

M src/math/sqrt_amd64.s

1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/src/math/sqrt_amd64.s b/src/math/sqrt_amd64.s

index f8d825d..d72000f 100644

--- a/src/math/sqrt_amd64.s

+++ b/src/math/sqrt_amd64.s

@@ -5,7 +5,8 @@

#include "textflag.h"

// func Sqrt(x float64) float64

-TEXT ·Sqrt(SB),NOSPLIT,$0

- SQRTSD x+0(FP), X0

- MOVSD X0, ret+8(FP)

+TEXT ·Sqrt(SB), NOSPLIT, $0

+ MOVSD x+0(FP), X0

+ SQRTSD X0, X1

+ MOVSD X1, ret+8(FP)

RET

--

https://go-review.googlesource.com/28392

Sep 1, 2016, 2:05:16 PM9/1/16

to golang-co...@googlegroups.com

Ilya Tocar has posted comments on this change.

math: fix sqrt regression on AMD64

Patch Set 1: Run-TryBot+1

--

https://go-review.googlesource.com/28392

Gerrit-Reviewer: Ilya Tocar <ilya....@intel.com>

Gerrit-HasComments: No

math: fix sqrt regression on AMD64

--

https://go-review.googlesource.com/28392

Gerrit-Reviewer: Ilya Tocar <ilya....@intel.com>

Gerrit-HasComments: No

Sep 1, 2016, 2:05:22 PM9/1/16

to Ilya Tocar, golang-co...@googlegroups.com

Gobot Gobot has posted comments on this change.

math: fix sqrt regression on AMD64

Patch Set 1:

TryBots beginning. Status page: http://farmer.golang.org/try?commit=24bbc144

math: fix sqrt regression on AMD64

TryBots beginning. Status page: http://farmer.golang.org/try?commit=24bbc144

Sep 1, 2016, 2:20:57 PM9/1/16

to Ilya Tocar, golang-co...@googlegroups.com

Gobot Gobot has posted comments on this change.

math: fix sqrt regression on AMD64

Patch Set 1: TryBot-Result+1
TryBots are happy.

--

https://go-review.googlesource.com/28392

Gerrit-Reviewer: Gobot Gobot <go...@golang.org>

Sep 2, 2016, 11:13:45 AM9/2/16

to Ilya Tocar, Keith Randall, Gobot Gobot, golang-co...@googlegroups.com

Keith Randall has posted comments on this change.

math: fix sqrt regression on AMD64

Patch Set 1:

> TryBots are happy.

I'm not convinced this is necessary.

movsd also only writes the low 64 bits of the register. So that's not the

fix that is happening here. It just happens that you're using X1 as the

output of the sqrt instruction instead of X0. If you redo your patch to do:

MOVSD x+0(FP), X0

SQRTSD X0, X0

MOVSD X0, ret+8(FP)

I think you'll get the same bad behavior. Using XORPS on the destination

register first might be a more robust fix to this.

That said, I think the real problem is the benchmark, not the math.Sqrt

code. The benchmark is testing the throughput of the sqrt instruction, and

the "bug" in 1.7 is that because of partial register nonsense we end up

measuring the latency, not throughput. I think the benchmark should be

changed to measure latency all the time. So something like:

x := 1.0

f := math.Sqrt

for i := 0; i < b.N; i++ {

x = f(x)

}

Global = x

Should do that.

Maybe do x = f(x)+1 to make sure we aren't testing some sqrt(1) special

case.

--

https://go-review.googlesource.com/28392

Gerrit-Reviewer: Gobot Gobot <go...@golang.org>

Gerrit-Reviewer: Ilya Tocar <ilya....@intel.com>

Gerrit-Reviewer: Keith Randall <k...@golang.org>

Gerrit-HasComments: No

math: fix sqrt regression on AMD64

> TryBots are happy.

I'm not convinced this is necessary.

movsd also only writes the low 64 bits of the register. So that's not the

fix that is happening here. It just happens that you're using X1 as the

output of the sqrt instruction instead of X0. If you redo your patch to do:

MOVSD x+0(FP), X0

SQRTSD X0, X0

MOVSD X0, ret+8(FP)

I think you'll get the same bad behavior. Using XORPS on the destination

register first might be a more robust fix to this.

That said, I think the real problem is the benchmark, not the math.Sqrt

code. The benchmark is testing the throughput of the sqrt instruction, and

the "bug" in 1.7 is that because of partial register nonsense we end up

measuring the latency, not throughput. I think the benchmark should be

changed to measure latency all the time. So something like:

x := 1.0

f := math.Sqrt

for i := 0; i < b.N; i++ {

x = f(x)

}

Global = x

Should do that.

Maybe do x = f(x)+1 to make sure we aren't testing some sqrt(1) special

case.

--

https://go-review.googlesource.com/28392

Gerrit-Reviewer: Gobot Gobot <go...@golang.org>

Gerrit-Reviewer: Ilya Tocar <ilya....@intel.com>

Gerrit-HasComments: No

Sep 2, 2016, 11:25:41 AM9/2/16

to Keith Randall, Gobot Gobot, golang-co...@googlegroups.com

Ilya Tocar has posted comments on this change.

math: fix sqrt regression on AMD64

Patch Set 1:

> I'm not convinced this is necessary.

>

> movsd also only writes the low 64 bits of the register.

Nope. Load form of movsd zeros out upper 64 bits.

From Intel® 64 and IA-32 Architectures Software Developer’s Manual:

MOVSD (128-bit Legacy SSE version: MOVSD XMM1, m64)

DEST[63:0] SRC[63:0]

DEST[127:64] 0 // Upper part is not preserved

DEST[VLMAX-1:128] (Unmodified)

> If you redo your patch to do:

> MOVSD x+0(FP), X0

> SQRTSD X0, X0

> MOVSD X0, ret+8(FP)

> I think you'll get the same bad behavior.

I've checked and both variants have the same performance

> That said, I think the real problem is the benchmark, not the

> math.Sqrt code. The benchmark is testing the throughput of the

> sqrt instruction, and the "bug" in 1.7 is that because of partial

> register nonsense we end up measuring the latency, not throughput.

> I think the benchmark should be changed to measure latency all the

> time.

What about other sqrt benches (sqrt,sqrtgo)?

math: fix sqrt regression on AMD64

> I'm not convinced this is necessary.

>

> movsd also only writes the low 64 bits of the register.

From Intel® 64 and IA-32 Architectures Software Developer’s Manual:

MOVSD (128-bit Legacy SSE version: MOVSD XMM1, m64)

DEST[63:0] SRC[63:0]

DEST[127:64] 0 // Upper part is not preserved

DEST[VLMAX-1:128] (Unmodified)

> If you redo your patch to do:

> MOVSD x+0(FP), X0

> SQRTSD X0, X0

> MOVSD X0, ret+8(FP)

> I think you'll get the same bad behavior.

> That said, I think the real problem is the benchmark, not the

> math.Sqrt code. The benchmark is testing the throughput of the

> sqrt instruction, and the "bug" in 1.7 is that because of partial

> register nonsense we end up measuring the latency, not throughput.

> I think the benchmark should be changed to measure latency all the

> time.

Sep 2, 2016, 11:36:41 AM9/2/16

to Ilya Tocar, Keith Randall, Gobot Gobot, golang-co...@googlegroups.com

Keith Randall has posted comments on this change.

math: fix sqrt regression on AMD64

Patch Set 1: Code-Review+2

> > I'm not convinced this is necessary.

> >

> > movsd also only writes the low 64 bits of the register.

> Nope. Load form of movsd zeros out upper 64 bits.

> From Intel® 64 and IA-32 Architectures Software Developer’s Manual:

>

> MOVSD (128-bit Legacy SSE version: MOVSD XMM1, m64)

> DEST[63:0] SRC[63:0]

> DEST[127:64] 0 // Upper part is not preserved

> DEST[VLMAX-1:128] (Unmodified)

>

Ah, sorry, I misread the manual. The reg->reg MOVSD doesn't zero the high

bits, but the mem->reg one does.

So this is a good fix then, sorry. Always good to avoid partial register

stalls.

>

> > If you redo your patch to do:

> > MOVSD x+0(FP), X0

> > SQRTSD X0, X0

> > MOVSD X0, ret+8(FP)

> > I think you'll get the same bad behavior.

> I've checked and both variants have the same performance

>

>

> > That said, I think the real problem is the benchmark, not the

> > math.Sqrt code. The benchmark is testing the throughput of the

> > sqrt instruction, and the "bug" in 1.7 is that because of partial

> > register nonsense we end up measuring the latency, not

> throughput.

> > I think the benchmark should be changed to measure latency all

> the

> > time.

> What about other sqrt benches (sqrt,sqrtgo)?

Yes, we should probably fix all of these benchmarks. For instance,

BenchmarkSqrt doesn't actually benchmark square roots - they all get

constant-folded away. It's benchmarking add.

math: fix sqrt regression on AMD64

> > I'm not convinced this is necessary.

> >

> > movsd also only writes the low 64 bits of the register.

> Nope. Load form of movsd zeros out upper 64 bits.

> From Intel® 64 and IA-32 Architectures Software Developer’s Manual:

>

> MOVSD (128-bit Legacy SSE version: MOVSD XMM1, m64)

> DEST[63:0] SRC[63:0]

> DEST[127:64] 0 // Upper part is not preserved

> DEST[VLMAX-1:128] (Unmodified)

>

bits, but the mem->reg one does.

So this is a good fix then, sorry. Always good to avoid partial register

stalls.

>

> > If you redo your patch to do:

> > MOVSD x+0(FP), X0

> > SQRTSD X0, X0

> > MOVSD X0, ret+8(FP)

> > I think you'll get the same bad behavior.

> I've checked and both variants have the same performance

>

>

> > That said, I think the real problem is the benchmark, not the

> > math.Sqrt code. The benchmark is testing the throughput of the

> > sqrt instruction, and the "bug" in 1.7 is that because of partial

> > register nonsense we end up measuring the latency, not

> throughput.

> > I think the benchmark should be changed to measure latency all

> the

> > time.

> What about other sqrt benches (sqrt,sqrtgo)?

BenchmarkSqrt doesn't actually benchmark square roots - they all get

constant-folded away. It's benchmarking add.

Sep 5, 2016, 10:11:36 AM9/5/16

to Keith Randall, Gobot Gobot, golang-co...@googlegroups.com

Reviewers: Keith Randall, Gobot Gobot

Ilya Tocar uploaded a new patch set:

This also removes old Sqrt benchmarks, in favor of benchmarks measuring

latency.

Only SqrtIndirect is kept, to show impact of this patch.

Change-Id: Ic7eebe8866445adff5bc38192fa8d64c9a6b8872

---

M src/math/all_test.go

M src/math/sqrt_amd64.s

2 files changed, 24 insertions(+), 14 deletions(-)

Ilya Tocar uploaded a new patch set:

latency.

Only SqrtIndirect is kept, to show impact of this patch.

Change-Id: Ic7eebe8866445adff5bc38192fa8d64c9a6b8872

---

M src/math/all_test.go

M src/math/sqrt_amd64.s

2 files changed, 24 insertions(+), 14 deletions(-)

Sep 5, 2016, 10:12:02 AM9/5/16

to Keith Randall, Gobot Gobot, golang-co...@googlegroups.com

Ilya Tocar has posted comments on this change.

math: fix sqrt regression on AMD64

Patch Set 2: Run-TryBot+1

--

https://go-review.googlesource.com/28392

Gerrit-Reviewer: Gobot Gobot <go...@golang.org>

Gerrit-Reviewer: Ilya Tocar <ilya....@intel.com>

Gerrit-Reviewer: Keith Randall <k...@golang.org>

Gerrit-HasComments: No

math: fix sqrt regression on AMD64

--

https://go-review.googlesource.com/28392

Gerrit-Reviewer: Gobot Gobot <go...@golang.org>

Gerrit-Reviewer: Ilya Tocar <ilya....@intel.com>

Gerrit-Reviewer: Keith Randall <k...@golang.org>

Sep 5, 2016, 10:12:22 AM9/5/16

to Ilya Tocar, Keith Randall, golang-co...@googlegroups.com

Gobot Gobot has posted comments on this change.

math: fix sqrt regression on AMD64

Patch Set 2:

TryBots beginning. Status page: http://farmer.golang.org/try?commit=5b30fac5

math: fix sqrt regression on AMD64

TryBots beginning. Status page: http://farmer.golang.org/try?commit=5b30fac5

Sep 5, 2016, 10:18:23 AM9/5/16

to Ilya Tocar, Keith Randall, golang-co...@googlegroups.com

Gobot Gobot has posted comments on this change.

math: fix sqrt regression on AMD64

Patch Set 2:
Build is still in progress...

This change failed on darwin-amd64-10_10:

See

https://storage.googleapis.com/go-build-log/5b30fac5/darwin-amd64-10_10_1186cd15.log

Consult https://build.golang.org/ to see whether it's a new failure. Other

builds still in progress; subsequent failure notices suppressed until final

report.

Sep 5, 2016, 10:23:32 AM9/5/16

to Ilya Tocar, Keith Randall, golang-co...@googlegroups.com

Gobot Gobot has posted comments on this change.

math: fix sqrt regression on AMD64

Patch Set 2: TryBot-Result-1
1 of 18 TryBots failed:

Failed on darwin-amd64-10_10:

https://storage.googleapis.com/go-build-log/5b30fac5/darwin-amd64-10_10_1186cd15.log

Consult https://build.golang.org/ to see whether they are new failures.

Sep 5, 2016, 10:48:25 AM9/5/16

to Ilya Tocar, Keith Randall, Gobot Gobot, golang-co...@googlegroups.com

Keith Randall has posted comments on this change.

math: fix sqrt regression on AMD64

Patch Set 2:

LGTM.

The trybot failure looks unrelated.

math: fix sqrt regression on AMD64

LGTM.

The trybot failure looks unrelated.

Sep 6, 2016, 11:45:09 AM9/6/16

to golang-...@googlegroups.com, Keith Randall, Gobot Gobot, golang-co...@googlegroups.com

Ilya Tocar has submitted this change and it was merged.

Run-TryBot: Ilya Tocar <ilya....@intel.com>

Reviewed-by: Keith Randall <k...@golang.org>

Keith Randall: Looks good to me, approved

Ilya Tocar: Run TryBots

Objections:

Gobot Gobot: TryBots failed

This also removes old Sqrt benchmarks, in favor of benchmarks measuring

latency.

Only SqrtIndirect is kept, to show impact of this patch.

Change-Id: Ic7eebe8866445adff5bc38192fa8d64c9a6b8872

Reviewed-on: https://go-review.googlesource.com/28392
latency.

Only SqrtIndirect is kept, to show impact of this patch.

Change-Id: Ic7eebe8866445adff5bc38192fa8d64c9a6b8872

Run-TryBot: Ilya Tocar <ilya....@intel.com>

Reviewed-by: Keith Randall <k...@golang.org>

---

M src/math/all_test.go

M src/math/sqrt_amd64.s

2 files changed, 24 insertions(+), 14 deletions(-)

Approvals:
M src/math/all_test.go

M src/math/sqrt_amd64.s

2 files changed, 24 insertions(+), 14 deletions(-)

Keith Randall: Looks good to me, approved

Ilya Tocar: Run TryBots

Objections:

Gobot Gobot: TryBots failed

Reply all

Reply to author

Forward

0 new messages

Search

Clear search

Close search

Google apps

Main menu