Wei Xiao has uploaded this change for review.
build: enable frame-pointer for arm64
Having frame-pointer on by default makes Linux's perf and other profilers much
more useful, because it lets them gather a stack trace efficiently on profiling
events. Major changes include:
1. save FP&LR at prologue and restore them at epilogue via STP/LDP instructions
2. change frame size alignment to 16 bytes (original: 8 mod 16)
3. adjust offset and frame size in *.s, analysis and wrapper code
4. adjust DWARF information to support FP
5. add support to FP in goroutine scheduler
6. adjust SSA arm backend to support FP
7. adjust signal handler according to frame layout changes
8. adjust cgo trampoline functions according to frame layout changes
9. improve vet and rules for arm64 (a bug fixing and reduced whitelist)
10. change Darwin and Android runtime for enabling frame-pointer (not test)
Performance impacts on go1 benchmarks:
benchmark old ns/op new ns/op delta
BenchmarkBinaryTree17-8 6803983403 6810476278 +0.10%
BenchmarkFannkuch11-8 6490332236 6883854940 +6.06%
BenchmarkFmtFprintfEmpty-8 102 96.8 -5.10%
BenchmarkFmtFprintfString-8 170 183 +7.65%
BenchmarkFmtFprintfInt-8 206 213 +3.40%
BenchmarkFmtFprintfIntInt-8 287 291 +1.39%
BenchmarkFmtFprintfPrefixedInt-8 372 376 +1.08%
BenchmarkFmtFprintfFloat-8 498 504 +1.20%
BenchmarkFmtManyArgs-8 1355 1360 +0.37%
BenchmarkGobDecode-8 17278849 18028547 +4.34%
BenchmarkGobEncode-8 13585105 15896441 +17.01%
BenchmarkGzip-8 847514555 842141859 -0.63%
BenchmarkGunzip-8 84164307 85339225 +1.40%
BenchmarkHTTPClientServer-8 121795 122040 +0.20%
BenchmarkJSONEncode-8 33164339 34835671 +5.04%
BenchmarkJSONDecode-8 150069088 151143217 +0.72%
BenchmarkMandelbrot200-8 9968340 9958440 -0.10%
BenchmarkGoParse-8 8198923 8204454 +0.07%
BenchmarkRegexpMatchEasy0_32-8 261 235 -9.96%
BenchmarkRegexpMatchEasy0_1K-8 1875 1860 -0.80%
BenchmarkRegexpMatchEasy1_32-8 274 258 -5.84%
BenchmarkRegexpMatchEasy1_1K-8 2186 2170 -0.73%
BenchmarkRegexpMatchMedium_32-8 328 322 -1.83%
BenchmarkRegexpMatchMedium_1K-8 85482 85828 +0.40%
BenchmarkRegexpMatchHard_32-8 4654 4756 +2.19%
BenchmarkRegexpMatchHard_1K-8 136968 139327 +1.72%
BenchmarkRevcomp-8 1345153499 1366431108 +1.58%
BenchmarkTemplate-8 153990828 168649910 +9.52%
BenchmarkTimeParse-8 822 817 -0.61%
BenchmarkTimeFormat-8 800 801 +0.12%
benchmark old MB/s new MB/s speedup
BenchmarkGobDecode-8 44.42 42.57 0.96x
BenchmarkGobEncode-8 56.50 48.28 0.85x
BenchmarkGzip-8 22.90 23.04 1.01x
BenchmarkGunzip-8 230.56 227.38 0.99x
BenchmarkJSONEncode-8 58.51 55.70 0.95x
BenchmarkJSONDecode-8 12.93 12.84 0.99x
BenchmarkGoParse-8 7.06 7.06 1.00x
BenchmarkRegexpMatchEasy0_32-8 122.23 135.87 1.11x
BenchmarkRegexpMatchEasy0_1K-8 545.91 550.48 1.01x
BenchmarkRegexpMatchEasy1_32-8 116.50 123.94 1.06x
BenchmarkRegexpMatchEasy1_1K-8 468.24 471.86 1.01x
BenchmarkRegexpMatchMedium_32-8 3.04 3.10 1.02x
BenchmarkRegexpMatchMedium_1K-8 11.98 11.93 1.00x
BenchmarkRegexpMatchHard_32-8 6.88 6.73 0.98x
BenchmarkRegexpMatchHard_1K-8 7.48 7.35 0.98x
BenchmarkRevcomp-8 188.95 186.01 0.98x
BenchmarkTemplate-8 12.60 11.51 0.91x
Updates #15840
Change-Id: I1bfaca6dba29a63009d7c6ab04ed7a1413d9479e
---
M misc/cgo/test/issue9400/asm_arm64.s
M src/cmd/compile/internal/arm64/ggen.go
M src/cmd/compile/internal/gc/asm_test.go
M src/cmd/compile/internal/gc/pgen.go
M src/cmd/internal/obj/arm64/asm7.go
M src/cmd/internal/obj/arm64/obj7.go
M src/cmd/internal/obj/link.go
M src/cmd/internal/objabi/util.go
M src/cmd/link/internal/ld/dwarf.go
M src/cmd/vet/all/whitelist/arm64.txt
M src/cmd/vet/all/whitelist/darwin_arm64.txt
M src/cmd/vet/asmdecl.go
M src/reflect/asm_arm64.s
M src/runtime/asm_arm64.s
M src/runtime/atomic_arm64.s
M src/runtime/cgo/asm_arm64.s
M src/runtime/cgo/signal_darwin_arm64.s
M src/runtime/cgocall.go
M src/runtime/duff_arm64.s
M src/runtime/internal/atomic/asm_arm64.s
M src/runtime/internal/atomic/atomic_arm64.s
M src/runtime/internal/sys/arch_arm64.go
M src/runtime/memmove_arm64.s
M src/runtime/mgcmark.go
M src/runtime/mkduff.go
M src/runtime/rt0_android_arm64.s
M src/runtime/rt0_darwin_arm64.s
M src/runtime/rt0_linux_arm64.s
M src/runtime/signal_arm64.go
M src/runtime/stack.go
M src/runtime/sys_darwin_arm64.s
M src/runtime/sys_linux_arm64.s
M src/runtime/traceback.go
33 files changed, 468 insertions(+), 428 deletions(-)
diff --git a/misc/cgo/test/issue9400/asm_arm64.s b/misc/cgo/test/issue9400/asm_arm64.s
index 9bb5081..838212f 100644
--- a/misc/cgo/test/issue9400/asm_arm64.s
+++ b/misc/cgo/test/issue9400/asm_arm64.s
@@ -6,7 +6,7 @@
#include "textflag.h"
-TEXT ·RewindAndSetgid(SB),NOSPLIT,$-8-0
+TEXT ·RewindAndSetgid(SB),NOSPLIT,$-16-0
// Save link register
MOVD R30, R9
diff --git a/src/cmd/compile/internal/arm64/ggen.go b/src/cmd/compile/internal/arm64/ggen.go
index f7b3851..b6c4dde 100644
--- a/src/cmd/compile/internal/arm64/ggen.go
+++ b/src/cmd/compile/internal/arm64/ggen.go
@@ -14,10 +14,10 @@
var darwin = objabi.GOOS == "darwin"
func padframe(frame int64) int64 {
- // arm64 requires that the frame size (not counting saved LR)
- // be empty or be 8 mod 16. If not, pad it.
- if frame != 0 && frame%16 != 8 {
- frame += 8
+ // arm64 requires that the frame size (not counting saved FP&LR)
+ // be empty or be 16 bytes aligned. If not, pad it.
+ if frame%16 != 0 {
+ frame += (16-(frame%16))
}
return frame
}
@@ -28,23 +28,23 @@
}
if cnt < int64(4*gc.Widthptr) {
for i := int64(0); i < cnt; i += int64(gc.Widthptr) {
- p = pp.Appendpp(p, arm64.AMOVD, obj.TYPE_REG, arm64.REGZERO, 0, obj.TYPE_MEM, arm64.REGSP, 8+off+i)
+ p = pp.Appendpp(p, arm64.AMOVD, obj.TYPE_REG, arm64.REGZERO, 0, obj.TYPE_MEM, arm64.REGSP, 16+off+i)
}
} else if cnt <= int64(128*gc.Widthptr) && !darwin { // darwin ld64 cannot handle BR26 reloc with non-zero addend
if cnt%(2*int64(gc.Widthptr)) != 0 {
- p = pp.Appendpp(p, arm64.AMOVD, obj.TYPE_REG, arm64.REGZERO, 0, obj.TYPE_MEM, arm64.REGSP, 8+off)
+ p = pp.Appendpp(p, arm64.AMOVD, obj.TYPE_REG, arm64.REGZERO, 0, obj.TYPE_MEM, arm64.REGSP, 16+off)
off += int64(gc.Widthptr)
cnt -= int64(gc.Widthptr)
}
p = pp.Appendpp(p, arm64.AMOVD, obj.TYPE_REG, arm64.REGSP, 0, obj.TYPE_REG, arm64.REGRT1, 0)
- p = pp.Appendpp(p, arm64.AADD, obj.TYPE_CONST, 0, 8+off, obj.TYPE_REG, arm64.REGRT1, 0)
+ p = pp.Appendpp(p, arm64.AADD, obj.TYPE_CONST, 0, 16+off, obj.TYPE_REG, arm64.REGRT1, 0)
p.Reg = arm64.REGRT1
p = pp.Appendpp(p, obj.ADUFFZERO, obj.TYPE_NONE, 0, 0, obj.TYPE_MEM, 0, 0)
p.To.Name = obj.NAME_EXTERN
p.To.Sym = gc.Duffzero
p.To.Offset = 4 * (64 - cnt/(2*int64(gc.Widthptr)))
} else {
- p = pp.Appendpp(p, arm64.AMOVD, obj.TYPE_CONST, 0, 8+off-8, obj.TYPE_REG, arm64.REGTMP, 0)
+ p = pp.Appendpp(p, arm64.AMOVD, obj.TYPE_CONST, 0, 16+off-8, obj.TYPE_REG, arm64.REGTMP, 0)
p = pp.Appendpp(p, arm64.AMOVD, obj.TYPE_REG, arm64.REGSP, 0, obj.TYPE_REG, arm64.REGRT1, 0)
p = pp.Appendpp(p, arm64.AADD, obj.TYPE_REG, arm64.REGTMP, 0, obj.TYPE_REG, arm64.REGRT1, 0)
p.Reg = arm64.REGRT1
diff --git a/src/cmd/compile/internal/gc/asm_test.go b/src/cmd/compile/internal/gc/asm_test.go
index ff04817..f2a6aff 100644
--- a/src/cmd/compile/internal/gc/asm_test.go
+++ b/src/cmd/compile/internal/gc/asm_test.go
@@ -1775,7 +1775,7 @@
return *(&x)
}
`,
- pos: []string{"TEXT\t.*, [$]-8-8"},
+ pos: []string{"TEXT\t.*, [$]-16-8"},
},
{
// check that we don't emit comparisons for constant shift
diff --git a/src/cmd/compile/internal/gc/pgen.go b/src/cmd/compile/internal/gc/pgen.go
index 25bb0ae..b8d0a0d 100644
--- a/src/cmd/compile/internal/gc/pgen.go
+++ b/src/cmd/compile/internal/gc/pgen.go
@@ -378,9 +378,9 @@
abbrev = dwarf.DW_ABRV_AUTO
if Ctxt.FixedFrameSize() == 0 {
offs -= int64(Widthptr)
- }
- if objabi.Framepointer_enabled(objabi.GOOS, objabi.GOARCH) {
- offs -= int64(Widthptr)
+ if objabi.Framepointer_enabled(objabi.GOOS, objabi.GOARCH) {
+ offs -= int64(Widthptr)
+ }
}
case PPARAM, PPARAMOUT:
diff --git a/src/cmd/internal/obj/arm64/asm7.go b/src/cmd/internal/obj/arm64/asm7.go
index ddd4348..cba4b69 100644
--- a/src/cmd/internal/obj/arm64/asm7.go
+++ b/src/cmd/internal/obj/arm64/asm7.go
@@ -592,7 +592,7 @@
ctxt.Diag("arm64 ops not initialized, call arm64.buildop first")
}
- c := ctxt7{ctxt: ctxt, newprog: newprog, cursym: cursym, autosize: int32(p.To.Offset&0xffffffff) + 8}
+ c := ctxt7{ctxt: ctxt, newprog: newprog, cursym: cursym, autosize: int32(p.To.Offset&0xffffffff) + 16}
bflag := 1
pc := int64(0)
@@ -1218,7 +1218,7 @@
// a.Offset is still relative to pseudo-FP.
a.Reg = obj.REG_NONE
}
- c.instoffset = int64(c.autosize) + a.Offset + 8
+ c.instoffset = int64(c.autosize) + a.Offset + 16
return autoclass(c.instoffset)
case obj.NAME_NONE:
@@ -1307,7 +1307,7 @@
// a.Offset is still relative to pseudo-FP.
a.Reg = obj.REG_NONE
}
- c.instoffset = int64(c.autosize) + a.Offset + 8
+ c.instoffset = int64(c.autosize) + a.Offset + 16
goto aconsize
}
return C_GOK
diff --git a/src/cmd/internal/obj/arm64/obj7.go b/src/cmd/internal/obj/arm64/obj7.go
index c435a5a..ba8bef0 100644
--- a/src/cmd/internal/obj/arm64/obj7.go
+++ b/src/cmd/internal/obj/arm64/obj7.go
@@ -524,27 +524,16 @@
if textstksiz < 0 {
c.autosize = 0
} else {
- c.autosize = int32(textstksiz + 8)
+ c.autosize = int32(textstksiz + 16)
}
- if (c.cursym.Func.Text.Mark&LEAF != 0) && c.autosize <= 8 {
+
+ if (c.cursym.Func.Text.Mark&LEAF != 0) && c.autosize <= 16 {
c.autosize = 0
} else if c.autosize&(16-1) != 0 {
- // The frame includes an LR.
- // If the frame size is 8, it's only an LR,
- // so there's no potential for breaking references to
- // local variables by growing the frame size,
- // because there are no local variables.
- // But otherwise, if there is a non-empty locals section,
- // the author of the code is responsible for making sure
- // that the frame size is 8 mod 16.
- if c.autosize == 8 {
- c.autosize += 8
- c.cursym.Func.Locals += 8
- } else {
- c.ctxt.Diag("%v: unaligned frame size %d - must be 8 mod 16 (or 0)", p, c.autosize-8)
- }
+ c.ctxt.Diag("%v: unaligned frame size %d - must 16 bytes aligned", p, c.autosize)
}
- p.To.Offset = int64(c.autosize) - 8
+
+ p.To.Offset = int64(c.autosize - 16)
if c.autosize == 0 && !(c.cursym.Func.Text.Mark&LEAF != 0) {
if c.ctxt.Debugvlog {
c.ctxt.Logf("save suppressed in: %s\n", c.cursym.Func.Text.From.Sym.Name)
@@ -557,8 +546,9 @@
}
aoffset = c.autosize
- if aoffset > 0xF0 {
- aoffset = 0xF0
+ // 504 is the maximum immediate offset for STP/LDP
+ if aoffset > 504 {
+ aoffset = 504
}
if c.cursym.Func.Text.Mark&LEAF != 0 {
c.cursym.Set(obj.AttrLeaf, true)
@@ -571,8 +561,8 @@
// it is a leaf function, so that traceback works.
q = p
if c.autosize > aoffset {
- // Frame size is too large for a MOVD.W instruction.
- // Store link register before decrementing SP, so if a signal comes
+ // Frame size is too large for a STP.W instruction.
+ // Store FP&LR before decrementing SP, so if a signal comes
// during the execution of the function prologue, the traceback
// code will not see a half-updated stack frame.
q = obj.Appendp(q, c.newprog)
@@ -586,9 +576,10 @@
q = obj.Appendp(q, c.newprog)
q.Pos = p.Pos
- q.As = AMOVD
- q.From.Type = obj.TYPE_REG
- q.From.Reg = REGLINK
+ q.As = ASTP
+ q.From.Type = obj.TYPE_REGREG
+ q.From.Reg = REGFP
+ q.From.Offset = REGLINK
q.To.Type = obj.TYPE_MEM
q.To.Reg = REGTMP
@@ -600,34 +591,55 @@
q1.To.Type = obj.TYPE_REG
q1.To.Reg = REGSP
q1.Spadj = c.autosize
+
+ if objabi.Framepointer_enabled(objabi.GOOS, objabi.GOARCH) {
+ q1 = obj.Appendp(q1, c.newprog)
+ q1.Pos = p.Pos
+ q1.As = AMOVD
+ q1.From.Type = obj.TYPE_REG
+ q1.From.Reg = REGSP
+ q1.To.Type = obj.TYPE_REG
+ q1.To.Reg = REGFP
+ }
} else {
- // small frame, update SP and save LR in a single MOVD.W instruction
+ // small frame, update SP and save FP&LR in a single STP.W instruction
q1 = obj.Appendp(q, c.newprog)
- q1.As = AMOVD
+ q1.As = ASTP
q1.Pos = p.Pos
- q1.From.Type = obj.TYPE_REG
- q1.From.Reg = REGLINK
+ q1.From.Type = obj.TYPE_REGREG
+ q1.From.Reg = REGFP
+ q1.From.Offset = REGLINK
q1.To.Type = obj.TYPE_MEM
q1.Scond = C_XPRE
q1.To.Offset = int64(-aoffset)
q1.To.Reg = REGSP
q1.Spadj = aoffset
+
+ if objabi.Framepointer_enabled(objabi.GOOS, objabi.GOARCH) {
+ q1 = obj.Appendp(q1, c.newprog)
+ q1.Pos = p.Pos
+ q1.As = AMOVD
+ q1.From.Type = obj.TYPE_REG
+ q1.From.Reg = REGSP
+ q1.To.Type = obj.TYPE_REG
+ q1.To.Reg = REGFP
+ }
}
if c.cursym.Func.Text.From.Sym.Wrapper() {
// if(g->panic != nil && g->panic->argp == FP) g->panic->argp = bottom-of-frame
//
// MOV g_panic(g), R1
- // CBNZ checkargp
+ // CBNZ R1, checkargp
// end:
// NOP
// ... function body ...
// checkargp:
// MOV panic_argp(R1), R2
- // ADD $(autosize+8), RSP, R3
+ // ADD $(autosize+16), RSP, R3
// CMP R2, R3
// BNE end
- // ADD $8, RSP, R4
+ // ADD $16, RSP, R4
// MOVD R4, panic_argp(R1)
// B end
//
@@ -672,11 +684,11 @@
// CBNZ branches to the MOV above
cbnz.Pcond = mov
- // ADD $(autosize+8), SP, R3
+ // ADD $(autosize+16), SP, R3
q = obj.Appendp(mov, c.newprog)
q.As = AADD
q.From.Type = obj.TYPE_CONST
- q.From.Offset = int64(c.autosize) + 8
+ q.From.Offset = int64(c.autosize) + 16
q.Reg = REGSP
q.To.Type = obj.TYPE_REG
q.To.Reg = REG_R3
@@ -694,11 +706,11 @@
q.To.Type = obj.TYPE_BRANCH
q.Pcond = end
- // ADD $8, SP, R4
+ // ADD $16, SP, R4
q = obj.Appendp(q, c.newprog)
q.As = AADD
q.From.Type = obj.TYPE_CONST
- q.From.Offset = 8
+ q.From.Offset = 16
q.Reg = REGSP
q.To.Type = obj.TYPE_REG
q.To.Reg = REG_R4
@@ -736,22 +748,35 @@
p.To.Type = obj.TYPE_REG
p.To.Reg = REGSP
p.Spadj = -c.autosize
+
+ if objabi.Framepointer_enabled(objabi.GOOS, objabi.GOARCH) {
+ p = obj.Appendp(p, c.newprog)
+ p.As = AMOVD
+ p.From.Type = obj.TYPE_REG
+ p.From.Reg = REGSP
+ p.To.Type = obj.TYPE_REG
+ p.To.Reg = REGFP
+ }
+
}
} else {
- /* want write-back pre-indexed SP+autosize -> SP, loading REGLINK*/
+ /* want write-back pre-indexed SP+autosize -> SP, loading FP&LR*/
aoffset = c.autosize
- if aoffset > 0xF0 {
- aoffset = 0xF0
+ // 504 is the maximum immediate offset for STP/LDP
+ if aoffset > 504 {
+ aoffset = 504
}
- p.As = AMOVD
+ p.As = ALDP
p.From.Type = obj.TYPE_MEM
p.Scond = C_XPOST
p.From.Offset = int64(aoffset)
p.From.Reg = REGSP
- p.To.Type = obj.TYPE_REG
- p.To.Reg = REGLINK
+ p.To.Type = obj.TYPE_REGREG
+ p.To.Reg = REGFP
+ p.To.Offset = REGLINK
p.Spadj = -aoffset
+
if c.autosize > aoffset {
q = newprog()
q.As = AADD
diff --git a/src/cmd/internal/obj/link.go b/src/cmd/internal/obj/link.go
index 68e1b70..1aec391 100644
--- a/src/cmd/internal/obj/link.go
+++ b/src/cmd/internal/obj/link.go
@@ -527,6 +527,9 @@
// PIC code on ppc64le requires 32 bytes of stack, and it's easier to
// just use that much stack always on ppc64x.
return int64(4 * ctxt.Arch.PtrSize)
+ case sys.ARM64:
+ // FP+LR
+ return int64(2 * ctxt.Arch.PtrSize)
default:
return int64(ctxt.Arch.PtrSize)
}
diff --git a/src/cmd/internal/objabi/util.go b/src/cmd/internal/objabi/util.go
index 1da0502..66a60c4 100644
--- a/src/cmd/internal/objabi/util.go
+++ b/src/cmd/internal/objabi/util.go
@@ -54,7 +54,7 @@
}
func Framepointer_enabled(goos, goarch string) bool {
- return framepointer_enabled != 0 && goarch == "amd64" && goos != "nacl"
+ return framepointer_enabled != 0 && (goarch == "amd64" && goos != "nacl" || goarch == "arm64")
}
func addexp(s string) {
diff --git a/src/cmd/link/internal/ld/dwarf.go b/src/cmd/link/internal/ld/dwarf.go
index 6359877..46b7d3d 100644
--- a/src/cmd/link/internal/ld/dwarf.go
+++ b/src/cmd/link/internal/ld/dwarf.go
@@ -1232,7 +1232,12 @@
// after a stack frame has been allocated.
deltaBuf = append(deltaBuf, dwarf.DW_CFA_offset_extended_sf)
deltaBuf = dwarf.AppendUleb128(deltaBuf, uint64(Thearch.Dwarfreglr))
- deltaBuf = dwarf.AppendSleb128(deltaBuf, -int64(pcsp.value)/dataAlignmentFactor)
+ offset := pcsp.value
+ if objabi.GOARCH == "arm64" {
+ // LR is stored above FP
+ offset -= int32(SysArch.PtrSize)
+ }
+ deltaBuf = dwarf.AppendSleb128(deltaBuf, -int64(offset)/dataAlignmentFactor)
} else {
// The return address is restored into the link register
// when a stack frame has been de-allocated.
diff --git a/src/cmd/vet/all/whitelist/arm64.txt b/src/cmd/vet/all/whitelist/arm64.txt
index 8a3c891..24fc6f4 100644
--- a/src/cmd/vet/all/whitelist/arm64.txt
+++ b/src/cmd/vet/all/whitelist/arm64.txt
@@ -2,12 +2,6 @@
runtime/asm_ARCHSUFF.s: [GOARCH] cannot check cross-package assembly function: Compare is in package bytes
-// False positives.
-
-// reflect trampolines intentionally omit arg size. Same for morestack.
-reflect/asm_arm64.s: [arm64] makeFuncStub: use of 16(RSP) points beyond argument frame
-reflect/asm_arm64.s: [arm64] methodValueCall: use of 16(RSP) points beyond argument frame
-
// Intentionally missing declarations.
runtime/asm_arm64.s: [arm64] abort: function abort missing Go declaration
runtime/asm_arm64.s: [arm64] addmoduledata: function addmoduledata missing Go declaration
diff --git a/src/cmd/vet/all/whitelist/darwin_arm64.txt b/src/cmd/vet/all/whitelist/darwin_arm64.txt
index 080a4ca..4a1cd43 100644
--- a/src/cmd/vet/all/whitelist/darwin_arm64.txt
+++ b/src/cmd/vet/all/whitelist/darwin_arm64.txt
@@ -1,14 +1,6 @@
// darwin/arm64-specific vet whitelist. See readme.txt for details.
-runtime/sys_darwin_arm64.s: [arm64] sigtramp: 24(RSP) should be infostyle+8(FP)
-runtime/sys_darwin_arm64.s: [arm64] sigtramp: 24(RSP) should be infostyle+8(FP)
runtime/sys_darwin_arm64.s: [arm64] bsdthread_create: RET without writing to 4-byte ret+24(FP)
runtime/sys_darwin_arm64.s: [arm64] bsdthread_start: function bsdthread_start missing Go declaration
runtime/sys_darwin_arm64.s: [arm64] bsdthread_register: RET without writing to 4-byte ret+0(FP)
-runtime/cgo/signal_darwin_arm64.s: [arm64] panicmem: use of 8(RSP) points beyond argument frame
-runtime/cgo/signal_darwin_arm64.s: [arm64] panicmem: use of 8(RSP) points beyond argument frame
-runtime/cgo/signal_darwin_arm64.s: [arm64] panicmem: use of 16(RSP) points beyond argument frame
-runtime/cgo/signal_darwin_arm64.s: [arm64] panicmem: use of 8(RSP) points beyond argument frame
-runtime/cgo/signal_darwin_arm64.s: [arm64] panicmem: use of 16(RSP) points beyond argument frame
-runtime/cgo/signal_darwin_arm64.s: [arm64] panicmem: use of 16(RSP) points beyond argument frame
runtime/asm_arm64.s: [arm64] sigreturn: function sigreturn missing Go declaration
diff --git a/src/cmd/vet/asmdecl.go b/src/cmd/vet/asmdecl.go
index 7882112..1483f54 100644
--- a/src/cmd/vet/asmdecl.go
+++ b/src/cmd/vet/asmdecl.go
@@ -109,7 +109,7 @@
var (
re = regexp.MustCompile
asmPlusBuild = re(`//\s+\+build\s+([^\n]+)`)
- asmTEXT = re(`\bTEXT\b(.*)·([^\(]+)\(SB\)(?:\s*,\s*([0-9A-Z|+]+))?(?:\s*,\s*\$(-?[0-9]+)(?:-([0-9]+))?)?`)
+ asmTEXT = re(`\bTEXT\b(.*)·([^\(]+)\(SB\)(?:\s*,\s*([0-9A-Z|+\(\)]+))?(?:\s*,\s*\$(-?[0-9]+)(?:-([0-9]+))?)?`)
asmDATA = re(`\b(DATA|GLOBL)\b`)
asmNamedFP = re(`([a-zA-Z0-9_\xFF-\x{10FFFF}]+)(?:\+([0-9]+))\(FP\)`)
asmUnnamedFP = re(`[^+\-0-9](([0-9]+)\(FP\))`)
@@ -253,6 +253,10 @@
if archDef.lr {
// Account for caller's saved LR
localSize += archDef.intSize
+ if arch == "arm64" {
+ // Account for 2 saved FP of caller and callee
+ localSize += 2*archDef.ptrSize
+ }
}
argSize, _ = strconv.Atoi(m[5])
if fn == nil && !strings.Contains(fnName, "<>") {
diff --git a/src/reflect/asm_arm64.s b/src/reflect/asm_arm64.s
index d156370..f180467 100644
--- a/src/reflect/asm_arm64.s
+++ b/src/reflect/asm_arm64.s
@@ -9,11 +9,11 @@
// See the comment on the declaration of makeFuncStub in makefunc.go
// for more details.
// No arg size here, runtime pulls arg map out of the func value.
-TEXT ·makeFuncStub(SB),(NOSPLIT|WRAPPER),$24
+TEXT ·makeFuncStub(SB),(NOSPLIT|WRAPPER),$16
NO_LOCAL_POINTERS
- MOVD R26, 8(RSP)
+ MOVD R26, 16(RSP)
MOVD $argframe+0(FP), R3
- MOVD R3, 16(RSP)
+ MOVD R3, 24(RSP)
BL ·callReflect(SB)
RET
@@ -21,10 +21,10 @@
// See the comment on the declaration of methodValueCall in makefunc.go
// for more details.
// No arg size here; runtime pulls arg map out of the func value.
-TEXT ·methodValueCall(SB),(NOSPLIT|WRAPPER),$24
+TEXT ·methodValueCall(SB),(NOSPLIT|WRAPPER),$16
NO_LOCAL_POINTERS
- MOVD R26, 8(RSP)
+ MOVD R26, 16(RSP)
MOVD $argframe+0(FP), R3
- MOVD R3, 16(RSP)
+ MOVD R3, 24(RSP)
BL ·callMethod(SB)
RET
diff --git a/src/runtime/asm_arm64.s b/src/runtime/asm_arm64.s
index 4a68b4a..d56a1d8 100644
--- a/src/runtime/asm_arm64.s
+++ b/src/runtime/asm_arm64.s
@@ -1,4 +1,4 @@
-// Copyright 2015 The Go Authors. All rights reserved.
+// Copyright 2017 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.
@@ -8,17 +8,16 @@
#include "funcdata.h"
#include "textflag.h"
-TEXT runtime·rt0_go(SB),NOSPLIT,$0
+TEXT runtime·rt0_go(SB),NOSPLIT,$32
// SP = stack; R0 = argc; R1 = argv
- SUB $32, RSP
- MOVW R0, 8(RSP) // argc
- MOVD R1, 16(RSP) // argv
+ MOVW R0, 16(RSP) // argc
+ MOVD R1, 24(RSP) // argv
// create istack out of the given (operating system) stack.
// _cgo_init may update stackguard.
MOVD $runtime·g0(SB), g
- MOVD RSP, R7
+ MOVD RSP, R7
MOVD $(-64*1024)(R7), R0
MOVD R0, g_stackguard0(g)
MOVD R0, g_stackguard1(g)
@@ -27,8 +26,7 @@
// if there is a _cgo_init, call it using the gcc ABI.
MOVD _cgo_init(SB), R12
- CMP $0, R12
- BEQ nocgo
+ CBZ R12, nocgo
MRS_TPIDR_R0 // load TLS base pointer
MOVD R0, R3 // arg 3: TLS base pointer
@@ -40,9 +38,6 @@
MOVD $setg_gcc<>(SB), R1 // arg 1: setg
MOVD g, R0 // arg 0: G
BL (R12)
- MOVD _cgo_init(SB), R12
- CMP $0, R12
- BEQ nocgo
nocgo:
// update stackguard after _cgo_init
@@ -60,25 +55,15 @@
MOVD R0, g_m(g)
BL runtime·check(SB)
-
- MOVW 8(RSP), R0 // copy argc
- MOVW R0, -8(RSP)
- MOVD 16(RSP), R0 // copy argv
- MOVD R0, 0(RSP)
BL runtime·args(SB)
BL runtime·osinit(SB)
BL runtime·schedinit(SB)
// create a new goroutine to start program
- MOVD $runtime·mainPC(SB), R0 // entry
- MOVD RSP, R7
- MOVD.W $0, -8(R7)
- MOVD.W R0, -8(R7)
- MOVD.W $0, -8(R7)
- MOVD.W $0, -8(R7)
- MOVD R7, RSP
+ MOVD $runtime·mainPC(SB), R0 // entry
+ MOVD R0, 24(RSP) // fn
+ MOVD $0, 16(RSP) // siz
BL runtime·newproc(SB)
- ADD $32, RSP
// start this M
BL runtime·mstart(SB)
@@ -90,11 +75,11 @@
DATA runtime·mainPC+0(SB)/8,$runtime·main(SB)
GLOBL runtime·mainPC(SB),RODATA,$8
-TEXT runtime·breakpoint(SB),NOSPLIT,$-8-0
+TEXT runtime·breakpoint(SB),NOSPLIT,$-16-0
BRK
RET
-TEXT runtime·asminit(SB),NOSPLIT,$-8-0
+TEXT runtime·asminit(SB),NOSPLIT,$-16-0
RET
/*
@@ -103,10 +88,11 @@
// void gosave(Gobuf*)
// save state in Gobuf; setjmp
-TEXT runtime·gosave(SB), NOSPLIT, $-8-8
+TEXT runtime·gosave(SB), NOSPLIT, $-16-8
MOVD buf+0(FP), R3
MOVD RSP, R0
MOVD R0, gobuf_sp(R3)
+ MOVD R29, gobuf_bp(R3)
MOVD LR, gobuf_pc(R3)
MOVD g, gobuf_g(R3)
MOVD ZR, gobuf_lr(R3)
@@ -120,7 +106,7 @@
// void gogo(Gobuf*)
// restore state from Gobuf; longjmp
-TEXT runtime·gogo(SB), NOSPLIT, $24-8
+TEXT runtime·gogo(SB), NOSPLIT, $16-8
MOVD buf+0(FP), R5
// If ctxt is not nil, invoke deletion barrier before overwriting.
@@ -128,8 +114,8 @@
CMP $0, R0
BEQ nilctxt
MOVD $gobuf_ctxt(R5), R0
- MOVD R0, 8(RSP)
- MOVD ZR, 16(RSP)
+ MOVD R0, 16(RSP)
+ MOVD ZR, 24(RSP)
BL runtime·writebarrierptr_prewrite(SB)
MOVD buf+0(FP), R5
@@ -140,10 +126,12 @@
MOVD 0(g), R4 // make sure g is not nil
MOVD gobuf_sp(R5), R0
MOVD R0, RSP
+ MOVD gobuf_bp(R5), R29
MOVD gobuf_lr(R5), LR
MOVD gobuf_ret(R5), R0
MOVD gobuf_ctxt(R5), R26
MOVD $0, gobuf_sp(R5)
+ MOVD $0, gobuf_bp(R5)
MOVD $0, gobuf_ret(R5)
MOVD $0, gobuf_lr(R5)
MOVD $0, gobuf_ctxt(R5)
@@ -155,10 +143,11 @@
// Switch to m->g0's stack, call fn(g).
// Fn must never return. It should gogo(&g->sched)
// to keep running g.
-TEXT runtime·mcall(SB), NOSPLIT, $-8-8
+TEXT runtime·mcall(SB), NOSPLIT, $-16-8
// Save caller state in g->sched
MOVD RSP, R0
MOVD R0, (g_sched+gobuf_sp)(g)
+ MOVD R29, (g_sched+gobuf_bp)(g)
MOVD LR, (g_sched+gobuf_pc)(g)
MOVD $0, (g_sched+gobuf_lr)(g)
MOVD g, (g_sched+gobuf_g)(g)
@@ -175,9 +164,9 @@
MOVD 0(R26), R4 // code pointer
MOVD (g_sched+gobuf_sp)(g), R0
MOVD R0, RSP // sp = m->g0->sched.sp
- MOVD R3, -8(RSP)
- MOVD $0, -16(RSP)
- SUB $16, RSP
+ MOVD (g_sched+gobuf_bp)(g), R29
+ MOVD R3, -16(RSP)
+ STP.W (ZR,ZR), -32(RSP)
BL (R4)
B runtime·badmcall2(SB)
@@ -222,6 +211,7 @@
MOVD R6, (g_sched+gobuf_pc)(g)
MOVD RSP, R0
MOVD R0, (g_sched+gobuf_sp)(g)
+ MOVD R29, (g_sched+gobuf_bp)(g)
MOVD $0, (g_sched+gobuf_lr)(g)
MOVD g, (g_sched+gobuf_g)(g)
@@ -233,8 +223,9 @@
SUB $16, R3
AND $~15, R3
MOVD $runtime·mstart(SB), R4
- MOVD R4, 0(R3)
+ STP (ZR, R4), (R3)
MOVD R3, RSP
+ MOVD (g_sched+gobuf_bp)(g), R29
// call target function
MOVD 0(R26), R3 // code pointer
@@ -246,7 +237,9 @@
BL runtime·save_g(SB)
MOVD (g_sched+gobuf_sp)(g), R0
MOVD R0, RSP
+ MOVD (g_sched+gobuf_bp)(g), R29
MOVD $0, (g_sched+gobuf_sp)(g)
+ MOVD $0, (g_sched+gobuf_bp)(g)
RET
noswitch:
@@ -267,7 +260,7 @@
// the top of a stack (for example, morestack calling newstack
// calling the scheduler calling newm calling gc), so we must
// record an argument size. For that purpose, it has no arguments.
-TEXT runtime·morestack(SB),NOSPLIT,$-8-0
+TEXT runtime·morestack(SB),NOSPLIT,$-16-0
// Cannot grow scheduler stack (m->g0).
MOVD g_m(g), R8
MOVD m_g0(R8), R4
@@ -287,6 +280,7 @@
// Set g->sched to context in f
MOVD RSP, R0
MOVD R0, (g_sched+gobuf_sp)(g)
+ MOVD R29, (g_sched+gobuf_bp)(g)
MOVD LR, (g_sched+gobuf_pc)(g)
MOVD R3, (g_sched+gobuf_lr)(g)
// newstack will fill gobuf.ctxt.
@@ -303,15 +297,16 @@
BL runtime·save_g(SB)
MOVD (g_sched+gobuf_sp)(g), R0
MOVD R0, RSP
- MOVD.W $0, -16(RSP) // create a call frame on g0
- MOVD R26, 8(RSP) // ctxt argument
+ MOVD (g_sched+gobuf_bp)(g), R29
+ MOVD.W $0, -32(RSP) // create a call frame on g0
+ MOVD R26, 16(RSP) // ctxt argument
BL runtime·newstack(SB)
// Not reached, but make sure the return PC from the call to newstack
// is still in this function, and not the beginning of the next.
UNDEF
-TEXT runtime·morestack_noctxt(SB),NOSPLIT,$-4-0
+TEXT runtime·morestack_noctxt(SB),NOSPLIT,$-16-0
MOVW $0, R26
B runtime·morestack(SB)
@@ -332,7 +327,7 @@
TEXT reflect·call(SB), NOSPLIT, $0-0
B ·reflectcall(SB)
-TEXT ·reflectcall(SB), NOSPLIT, $-8-32
+TEXT ·reflectcall(SB), NOSPLIT, $-16-32
MOVWU argsize+24(FP), R16
DISPATCH(runtime·call32, 32)
DISPATCH(runtime·call64, 64)
@@ -368,9 +363,9 @@
NO_LOCAL_POINTERS; \
/* copy arguments to stack */ \
MOVD arg+16(FP), R3; \
- MOVWU argsize+24(FP), R4; \
- MOVD RSP, R5; \
- ADD $(8-1), R5; \
+ MOVWU argsize+24(FP), R4; \
+ MOVD RSP, R5; \
+ ADD $(16-1), R5; \
SUB $1, R3; \
ADD R5, R4; \
CMP R5, R4; \
@@ -388,7 +383,7 @@
MOVD arg+16(FP), R3; \
MOVWU n+24(FP), R4; \
MOVWU retoffset+28(FP), R6; \
- ADD $8, RSP, R5; \
+ ADD $16, RSP, R5; \
ADD R6, R5; \
ADD R6, R3; \
SUB R6, R4; \
@@ -399,57 +394,57 @@
// separate function so it can allocate stack space for the arguments
// to reflectcallmove. It does not follow the Go ABI; it expects its
// arguments in registers.
-TEXT callRet<>(SB), NOSPLIT, $40-0
- MOVD R7, 8(RSP)
- MOVD R3, 16(RSP)
- MOVD R5, 24(RSP)
- MOVD R4, 32(RSP)
+TEXT callRet<>(SB), NOSPLIT, $32-0
+ MOVD R7, 16(RSP)
+ MOVD R3, 24(RSP)
+ MOVD R5, 32(RSP)
+ MOVD R4, 40(RSP)
BL runtime·reflectcallmove(SB)
RET
-// These have 8 added to make the overall frame size a multiple of 16,
-// as required by the ABI. (There is another +8 for the saved LR.)
-CALLFN(·call32, 40 )
-CALLFN(·call64, 72 )
-CALLFN(·call128, 136 )
-CALLFN(·call256, 264 )
-CALLFN(·call512, 520 )
-CALLFN(·call1024, 1032 )
-CALLFN(·call2048, 2056 )
-CALLFN(·call4096, 4104 )
-CALLFN(·call8192, 8200 )
-CALLFN(·call16384, 16392 )
-CALLFN(·call32768, 32776 )
-CALLFN(·call65536, 65544 )
-CALLFN(·call131072, 131080 )
-CALLFN(·call262144, 262152 )
-CALLFN(·call524288, 524296 )
-CALLFN(·call1048576, 1048584 )
-CALLFN(·call2097152, 2097160 )
-CALLFN(·call4194304, 4194312 )
-CALLFN(·call8388608, 8388616 )
-CALLFN(·call16777216, 16777224 )
-CALLFN(·call33554432, 33554440 )
-CALLFN(·call67108864, 67108872 )
-CALLFN(·call134217728, 134217736 )
-CALLFN(·call268435456, 268435464 )
-CALLFN(·call536870912, 536870920 )
-CALLFN(·call1073741824, 1073741832 )
+// frame size is a multiple of 16,
+// as required by the ABI. (16 bytes aligned.)
+CALLFN(·call32, 32 )
+CALLFN(·call64, 64 )
+CALLFN(·call128, 128 )
+CALLFN(·call256, 256 )
+CALLFN(·call512, 512 )
+CALLFN(·call1024, 1024 )
+CALLFN(·call2048, 2048 )
+CALLFN(·call4096, 4096 )
+CALLFN(·call8192, 8192 )
+CALLFN(·call16384, 16384 )
+CALLFN(·call32768, 32768 )
+CALLFN(·call65536, 65536 )
+CALLFN(·call131072, 131072 )
+CALLFN(·call262144, 262144 )
+CALLFN(·call524288, 524288 )
+CALLFN(·call1048576, 1048576 )
+CALLFN(·call2097152, 2097152 )
+CALLFN(·call4194304, 4194304 )
+CALLFN(·call8388608, 8388608 )
+CALLFN(·call16777216, 16777216 )
+CALLFN(·call33554432, 33554432 )
+CALLFN(·call67108864, 67108864 )
+CALLFN(·call134217728, 134217728 )
+CALLFN(·call268435456, 268435456 )
+CALLFN(·call536870912, 536870912 )
+CALLFN(·call1073741824, 1073741824 )
// AES hashing not implemented for ARM64, issue #10109.
-TEXT runtime·aeshash(SB),NOSPLIT,$-8-0
+TEXT runtime·aeshash(SB),NOSPLIT,$-16-0
MOVW $0, R0
MOVW (R0), R1
-TEXT runtime·aeshash32(SB),NOSPLIT,$-8-0
+TEXT runtime·aeshash32(SB),NOSPLIT,$-16-0
MOVW $0, R0
MOVW (R0), R1
-TEXT runtime·aeshash64(SB),NOSPLIT,$-8-0
+TEXT runtime·aeshash64(SB),NOSPLIT,$-16-0
MOVW $0, R0
MOVW (R0), R1
-TEXT runtime·aeshashstr(SB),NOSPLIT,$-8-0
+TEXT runtime·aeshashstr(SB),NOSPLIT,$-16-0
MOVW $0, R0
MOVW (R0), R1
-
+
TEXT runtime·procyield(SB),NOSPLIT,$0-0
MOVWU cycles+0(FP), R0
again:
@@ -463,23 +458,24 @@
// 1. grab stored LR for caller
// 2. sub 4 bytes to get back to BL deferreturn
// 3. BR to fn
-TEXT runtime·jmpdefer(SB), NOSPLIT, $-8-16
- MOVD 0(RSP), R0
+TEXT runtime·jmpdefer(SB), NOSPLIT, $-16-16
+ MOVD 8(RSP), R0
SUB $4, R0
MOVD R0, LR
MOVD fv+0(FP), R26
MOVD argp+8(FP), R0
MOVD R0, RSP
- SUB $8, RSP
+ SUB $16, RSP // reserve space for FP&LR
MOVD 0(R26), R3
B (R3)
// Save state of caller into g->sched. Smashes R0.
-TEXT gosave<>(SB),NOSPLIT,$-8
+TEXT gosave<>(SB),NOSPLIT,$-16
MOVD LR, (g_sched+gobuf_pc)(g)
- MOVD RSP, R0
+ MOVD RSP, R0
MOVD R0, (g_sched+gobuf_sp)(g)
+ MOVD R29, (g_sched+gobuf_bp)(g)
MOVD $0, (g_sched+gobuf_lr)(g)
MOVD $0, (g_sched+gobuf_ret)(g)
// Assert ctxt is zero. See func save.
@@ -513,6 +509,7 @@
BL runtime·save_g(SB)
MOVD (g_sched+gobuf_sp)(g), R0
MOVD R0, RSP
+ MOVD (g_sched+gobuf_bp)(g), R29
MOVD R9, R0
// Now on a scheduling stack (a pthread-created stack).
@@ -544,28 +541,27 @@
// cgocallback(void (*fn)(void*), void *frame, uintptr framesize, uintptr ctxt)
// Turn the fn into a Go func (by taking its address) and call
// cgocallback_gofunc.
-TEXT runtime·cgocallback(SB),NOSPLIT,$40-32
+TEXT runtime·cgocallback(SB),NOSPLIT,$32-32
MOVD $fn+0(FP), R0
- MOVD R0, 8(RSP)
- MOVD frame+8(FP), R0
MOVD R0, 16(RSP)
- MOVD framesize+16(FP), R0
+ MOVD frame+8(FP), R0
MOVD R0, 24(RSP)
- MOVD ctxt+24(FP), R0
+ MOVD framesize+16(FP), R0
MOVD R0, 32(RSP)
+ MOVD ctxt+24(FP), R0
+ MOVD R0, 40(RSP)
MOVD $runtime·cgocallback_gofunc(SB), R0
BL (R0)
RET
// cgocallback_gofunc(FuncVal*, void *frame, uintptr framesize, uintptr ctxt)
// See cgocall.go for more details.
-TEXT ·cgocallback_gofunc(SB),NOSPLIT,$24-32
+TEXT ·cgocallback_gofunc(SB),NOSPLIT,$16-32
NO_LOCAL_POINTERS
// Load g from thread-local storage.
MOVB runtime·iscgo(SB), R3
- CMP $0, R3
- BEQ nocgo
+ CBZ R3, nocgo
BL runtime·load_g(SB)
nocgo:
@@ -574,8 +570,7 @@
// In this case, we're running on the thread stack, so there's
// lots of space, but the linker doesn't know. Hide the call from
// the linker analysis by using an indirect call.
- CMP $0, g
- BEQ needm
+ CBZ g, needm
MOVD g_m(g), R8
MOVD R8, savedm-8(SP)
@@ -635,18 +630,22 @@
BL runtime·save_g(SB)
MOVD (g_sched+gobuf_sp)(g), R4 // prepare stack as R4
MOVD (g_sched+gobuf_pc)(g), R5
- MOVD R5, -(24+8)(R4)
+ MOVD R5, -(16+8)(R4) // LR
+ MOVD (g_sched+gobuf_bp)(g), R5
+ MOVD R5, -(16+16)(R4) // FP
MOVD ctxt+24(FP), R0
- MOVD R0, -(16+8)(R4)
- MOVD $-(24+8)(R4), R0 // maintain 16-byte SP alignment
+ MOVD R0, -(16)(R4)
+ MOVD $-(16+16)(R4), R0 // point to the bottom of cur frame
MOVD R0, RSP
BL runtime·cgocallbackg(SB)
// Restore g->sched (== m->curg->sched) from saved values.
- MOVD 0(RSP), R5
+ MOVD 8(RSP), R5
MOVD R5, (g_sched+gobuf_pc)(g)
+ MOVD (RSP), R5
+ MOVD R5, (g_sched+gobuf_bp)(g)
MOVD RSP, R4
- ADD $(24+8), R4, R4
+ ADD $(16+16), R4, R4
MOVD R4, (g_sched+gobuf_sp)(g)
// Switch back to m->g0's stack and restore m->g0->sched.sp.
@@ -674,7 +673,7 @@
// Called from cgo wrappers, this function returns g->m->curg.stack.hi.
// Must obey the gcc calling convention.
-TEXT _cgo_topofstack(SB),NOSPLIT,$24
+TEXT _cgo_topofstack(SB),NOSPLIT,$16
// g (R28) and REGTMP (R27) might be clobbered by load_g. They
// are callee-save in the gcc calling convention, so save them.
MOVD R27, savedR27-8(SP)
@@ -697,24 +696,24 @@
RET
// void setg_gcc(G*); set g called from gcc
-TEXT setg_gcc<>(SB),NOSPLIT,$8
+TEXT setg_gcc<>(SB),NOSPLIT,$16
MOVD R0, g
MOVD R27, savedR27-8(SP)
BL runtime·save_g(SB)
MOVD savedR27-8(SP), R27
RET
-TEXT runtime·getcallerpc(SB),NOSPLIT,$8-16
- MOVD 16(RSP), R0 // LR saved by caller
+TEXT runtime·getcallerpc(SB),NOSPLIT,$-16-16
+ MOVD 8(RSP), R0 // LR saved by caller
MOVD R0, ret+8(FP)
RET
-TEXT runtime·abort(SB),NOSPLIT,$-8-0
+TEXT runtime·abort(SB),NOSPLIT,$-16-0
B (ZR)
UNDEF
// memequal(p, q unsafe.Pointer, size uintptr) bool
-TEXT runtime·memequal(SB),NOSPLIT,$-8-25
+TEXT runtime·memequal(SB),NOSPLIT,$-16-25
MOVD a+0(FP), R1
MOVD b+8(FP), R2
MOVD size+16(FP), R3
@@ -736,17 +735,17 @@
RET
// memequal_varlen(a, b unsafe.Pointer) bool
-TEXT runtime·memequal_varlen(SB),NOSPLIT,$40-17
+TEXT runtime·memequal_varlen(SB),NOSPLIT,$32-17
MOVD a+0(FP), R3
MOVD b+8(FP), R4
CMP R3, R4
BEQ eq
MOVD 8(R26), R5 // compiler stores size at offset 8 in the closure
- MOVD R3, 8(RSP)
- MOVD R4, 16(RSP)
- MOVD R5, 24(RSP)
+ MOVD R3, 16(RSP)
+ MOVD R4, 24(RSP)
+ MOVD R5, 32(RSP)
BL runtime·memequal(SB)
- MOVBU 32(RSP), R3
+ MOVBU 40(RSP), R3
MOVB R3, ret+16(FP)
RET
eq:
@@ -754,20 +753,20 @@
MOVB R3, ret+16(FP)
RET
-TEXT runtime·cmpstring(SB),NOSPLIT,$-4-40
+TEXT runtime·cmpstring(SB),NOSPLIT,$-16-40
MOVD s1_base+0(FP), R2
MOVD s1_len+8(FP), R0
MOVD s2_base+16(FP), R3
MOVD s2_len+24(FP), R1
- ADD $40, RSP, R7
+ MOVD $ret+32(FP), R7
B runtime·cmpbody<>(SB)
-TEXT bytes·Compare(SB),NOSPLIT,$-4-56
+TEXT bytes·Compare(SB),NOSPLIT,$-16-56
MOVD s1+0(FP), R2
MOVD s1+8(FP), R0
MOVD s2+24(FP), R3
MOVD s2+32(FP), R1
- ADD $56, RSP, R7
+ MOVD $ret+48(FP), R7
B runtime·cmpbody<>(SB)
// On entry:
@@ -779,7 +778,7 @@
//
// On exit:
// R4, R5, and R6 are clobbered
-TEXT runtime·cmpbody<>(SB),NOSPLIT,$-4-0
+TEXT runtime·cmpbody<>(SB),NOSPLIT,$-16-0
CMP R2, R3
BEQ samebytes // same starting pointers; compare lengths
CMP R0, R1
@@ -885,7 +884,7 @@
// The top-most function running on a goroutine
// returns to goexit+PCQuantum.
-TEXT runtime·goexit(SB),NOSPLIT,$-8-0
+TEXT runtime·goexit(SB),NOSPLIT,$-16-0
MOVD R0, R0 // NOP
BL runtime·goexit1(SB) // does not return
diff --git a/src/runtime/atomic_arm64.s b/src/runtime/atomic_arm64.s
index 4704aa6..e5e02a5 100644
--- a/src/runtime/atomic_arm64.s
+++ b/src/runtime/atomic_arm64.s
@@ -4,6 +4,6 @@
#include "textflag.h"
-TEXT ·publicationBarrier(SB),NOSPLIT,$-8-0
+TEXT ·publicationBarrier(SB),NOSPLIT,$-16-0
DMB $0xe // DMB ST
RET
diff --git a/src/runtime/cgo/asm_arm64.s b/src/runtime/cgo/asm_arm64.s
index 925e930..2ea6a01 100644
--- a/src/runtime/cgo/asm_arm64.s
+++ b/src/runtime/cgo/asm_arm64.s
@@ -7,38 +7,49 @@
// Called by C code generated by cmd/cgo.
// func crosscall2(fn func(a unsafe.Pointer, n int32, ctxt uintptr), a unsafe.Pointer, n int32, ctxt uintptr)
// Saves C callee-saved registers and calls fn with three arguments.
-TEXT crosscall2(SB),NOSPLIT,$-8
+TEXT crosscall2(SB),NOSPLIT,$-16
/*
* We still need to save all callee save register as before, and then
* push 3 args for fn (R1, R2, R3).
- * Also note that at procedure entry in gc world, 8(RSP) will be the
+ * Also note that at procedure entry in gc world, 16(RSP) will be the
* first arg.
* TODO(minux): use LDP/STP here if it matters.
*/
- SUB $(8*24), RSP
- MOVD R1, (8*1)(RSP)
- MOVD R2, (8*2)(RSP)
- MOVD R3, (8*3)(RSP)
- MOVD R19, (8*4)(RSP)
- MOVD R20, (8*5)(RSP)
- MOVD R21, (8*6)(RSP)
- MOVD R22, (8*7)(RSP)
- MOVD R23, (8*8)(RSP)
- MOVD R24, (8*9)(RSP)
- MOVD R25, (8*10)(RSP)
- MOVD R26, (8*11)(RSP)
- MOVD R27, (8*12)(RSP)
- MOVD g, (8*13)(RSP)
- MOVD R29, (8*14)(RSP)
- MOVD R30, (8*15)(RSP)
- FMOVD F8, (8*16)(RSP)
- FMOVD F9, (8*17)(RSP)
- FMOVD F10, (8*18)(RSP)
- FMOVD F11, (8*19)(RSP)
- FMOVD F12, (8*20)(RSP)
- FMOVD F13, (8*21)(RSP)
- FMOVD F14, (8*22)(RSP)
- FMOVD F15, (8*23)(RSP)
+ SUB $(8*26), RSP
+ // +------------+
+ // | R19 - R28 |
+ // +------------+
+ // | R29, R30 |
+ // +------------+
+ // | F8 - F15 |
+ // +------------+
+ // | R1, R2, R3 |
+ // +------------+
+ // | FP, LR |
+ // +------------+
+ MOVD R1, (8*2)(RSP)
+ MOVD R2, (8*3)(RSP)
+ MOVD R3, (8*4)(RSP)
+ MOVD R19, (8*5)(RSP)
+ MOVD R20, (8*6)(RSP)
+ MOVD R21, (8*7)(RSP)
+ MOVD R22, (8*8)(RSP)
+ MOVD R23, (8*9)(RSP)
+ MOVD R24, (8*10)(RSP)
+ MOVD R25, (8*11)(RSP)
+ MOVD R26, (8*12)(RSP)
+ MOVD R27, (8*13)(RSP)
+ MOVD g, (8*14)(RSP)
+ MOVD R29, (8*15)(RSP)
+ MOVD R30, (8*16)(RSP)
+ FMOVD F8, (8*17)(RSP)
+ FMOVD F9, (8*18)(RSP)
+ FMOVD F10, (8*19)(RSP)
+ FMOVD F11, (8*20)(RSP)
+ FMOVD F12, (8*21)(RSP)
+ FMOVD F13, (8*22)(RSP)
+ FMOVD F14, (8*23)(RSP)
+ FMOVD F15, (8*24)(RSP)
MOVD R0, R19
@@ -46,28 +57,25 @@
BL runtime·load_g(SB)
BL (R19)
- MOVD (8*1)(RSP), R1
- MOVD (8*2)(RSP), R2
- MOVD (8*3)(RSP), R3
- MOVD (8*4)(RSP), R19
- MOVD (8*5)(RSP), R20
- MOVD (8*6)(RSP), R21
- MOVD (8*7)(RSP), R22
- MOVD (8*8)(RSP), R23
- MOVD (8*9)(RSP), R24
- MOVD (8*10)(RSP), R25
- MOVD (8*11)(RSP), R26
- MOVD (8*12)(RSP), R27
- MOVD (8*13)(RSP), g
- MOVD (8*14)(RSP), R29
- MOVD (8*15)(RSP), R30
- FMOVD (8*16)(RSP), F8
- FMOVD (8*17)(RSP), F9
- FMOVD (8*18)(RSP), F10
- FMOVD (8*19)(RSP), F11
- FMOVD (8*20)(RSP), F12
- FMOVD (8*21)(RSP), F13
- FMOVD (8*22)(RSP), F14
- FMOVD (8*23)(RSP), F15
- ADD $(8*24), RSP
+ MOVD (8*5)(RSP), R19
+ MOVD (8*6)(RSP), R20
+ MOVD (8*7)(RSP), R21
+ MOVD (8*8)(RSP), R22
+ MOVD (8*9)(RSP), R23
+ MOVD (8*10)(RSP), R24
+ MOVD (8*11)(RSP), R25
+ MOVD (8*12)(RSP), R26
+ MOVD (8*13)(RSP), R27
+ MOVD (8*14)(RSP), g
+ MOVD (8*15)(RSP), R29
+ MOVD (8*16)(RSP), R30
+ FMOVD (8*17)(RSP), F8
+ FMOVD (8*18)(RSP), F9
+ FMOVD (8*19)(RSP), F10
+ FMOVD (8*20)(RSP), F11
+ FMOVD (8*21)(RSP), F12
+ FMOVD (8*22)(RSP), F13
+ FMOVD (8*23)(RSP), F14
+ FMOVD (8*24)(RSP), F15
+ ADD $(8*26), RSP
RET
diff --git a/src/runtime/cgo/signal_darwin_arm64.s b/src/runtime/cgo/signal_darwin_arm64.s
index 75aefd4..8afe837 100644
--- a/src/runtime/cgo/signal_darwin_arm64.s
+++ b/src/runtime/cgo/signal_darwin_arm64.s
@@ -10,7 +10,7 @@
//
// R1 - LR at moment of fault
// R2 - PC at moment of fault
-TEXT ·panicmem(SB),NOSPLIT,$-8
+TEXT ·panicmem(SB),NOSPLIT,$16
// If in external C code, we need to load the g register.
BL runtime·load_g(SB)
CMP $0, g
@@ -18,9 +18,8 @@
// On a foreign thread.
// TODO(crawshaw): call badsignal
- MOVD.W $0, -16(RSP)
MOVW $139, R1
- MOVW R1, 8(RSP)
+ MOVW R1, 16(RSP)
B runtime·exit(SB)
ongothread:
@@ -37,13 +36,12 @@
// Build a 32-byte stack frame for us for this call.
// Saved LR (none available) is at the bottom,
- // then the PC argument for setsigsegv,
+ // then the PC argument for setsigsegv,
// then a copy of the LR for us to restore.
- MOVD.W $0, -32(RSP)
- MOVD R1, 8(RSP)
+ MOVD R1, 24(RSP)
MOVD R2, 16(RSP)
BL runtime·setsigsegv(SB)
- MOVD 8(RSP), R1
+ MOVD 24(RSP), R1
MOVD 16(RSP), R2
// Build a 16-byte stack frame for the simulated
@@ -51,6 +49,6 @@
// 32-byte stack frame above.
// The saved LR in this frame is the LR at time of fault,
// and the LR on entry to sigpanic is the PC at time of fault.
- MOVD.W R1, 16(RSP)
+ STP.W (ZR, R1), 16(RSP)
MOVD R2, R30
B runtime·sigpanic(SB)
diff --git a/src/runtime/cgocall.go b/src/runtime/cgocall.go
index ce4d707..f11b6b8 100644
--- a/src/runtime/cgocall.go
+++ b/src/runtime/cgocall.go
@@ -255,9 +255,9 @@
// SP and the stack frame and between the stack frame and the arguments.
cb = (*args)(unsafe.Pointer(sp + 4*sys.PtrSize))
case "arm64":
- // On arm64, stack frame is four words and there's a saved LR between
+ // On arm64, stack frame is four words and there's saved FP&LR between
// SP and the stack frame and between the stack frame and the arguments.
- cb = (*args)(unsafe.Pointer(sp + 5*sys.PtrSize))
+ cb = (*args)(unsafe.Pointer(sp + 6*sys.PtrSize))
case "amd64":
// On amd64, stack frame is two words, plus caller PC.
if framepointer_enabled {
diff --git a/src/runtime/duff_arm64.s b/src/runtime/duff_arm64.s
index 21619ff..2f1515b 100644
--- a/src/runtime/duff_arm64.s
+++ b/src/runtime/duff_arm64.s
@@ -4,7 +4,7 @@
#include "textflag.h"
-TEXT runtime·duffzero(SB), NOSPLIT, $-8-0
+TEXT runtime·duffzero(SB), NOSPLIT, $-16-0
STP.P (ZR, ZR), 16(R16)
STP.P (ZR, ZR), 16(R16)
STP.P (ZR, ZR), 16(R16)
diff --git a/src/runtime/internal/atomic/asm_arm64.s b/src/runtime/internal/atomic/asm_arm64.s
index b6af632..dad2df0 100644
--- a/src/runtime/internal/atomic/asm_arm64.s
+++ b/src/runtime/internal/atomic/asm_arm64.s
@@ -29,10 +29,10 @@
TEXT runtime∕internal∕atomic·Casuintptr(SB), NOSPLIT, $0-25
B runtime∕internal∕atomic·Cas64(SB)
-TEXT runtime∕internal∕atomic·Loaduintptr(SB), NOSPLIT, $-8-16
+TEXT runtime∕internal∕atomic·Loaduintptr(SB), NOSPLIT, $-16-16
B runtime∕internal∕atomic·Load64(SB)
-TEXT runtime∕internal∕atomic·Loaduint(SB), NOSPLIT, $-8-16
+TEXT runtime∕internal∕atomic·Loaduint(SB), NOSPLIT, $-16-16
B runtime∕internal∕atomic·Load64(SB)
TEXT runtime∕internal∕atomic·Storeuintptr(SB), NOSPLIT, $0-16
diff --git a/src/runtime/internal/atomic/atomic_arm64.s b/src/runtime/internal/atomic/atomic_arm64.s
index 6c2031c..a188993 100644
--- a/src/runtime/internal/atomic/atomic_arm64.s
+++ b/src/runtime/internal/atomic/atomic_arm64.s
@@ -5,21 +5,21 @@
#include "textflag.h"
// uint32 runtime∕internal∕atomic·Load(uint32 volatile* addr)
-TEXT ·Load(SB),NOSPLIT,$-8-12
+TEXT ·Load(SB),NOSPLIT,$-16-12
MOVD ptr+0(FP), R0
LDARW (R0), R0
MOVW R0, ret+8(FP)
RET
// uint64 runtime∕internal∕atomic·Load64(uint64 volatile* addr)
-TEXT ·Load64(SB),NOSPLIT,$-8-16
+TEXT ·Load64(SB),NOSPLIT,$-16-16
MOVD ptr+0(FP), R0
LDAR (R0), R0
MOVD R0, ret+8(FP)
RET
// void *runtime∕internal∕atomic·Loadp(void *volatile *addr)
-TEXT ·Loadp(SB),NOSPLIT,$-8-16
+TEXT ·Loadp(SB),NOSPLIT,$-16-16
MOVD ptr+0(FP), R0
LDAR (R0), R0
MOVD R0, ret+8(FP)
diff --git a/src/runtime/internal/sys/arch_arm64.go b/src/runtime/internal/sys/arch_arm64.go
index df8fb1e..10bb9ed 100644
--- a/src/runtime/internal/sys/arch_arm64.go
+++ b/src/runtime/internal/sys/arch_arm64.go
@@ -12,7 +12,7 @@
PCQuantum = 4
Int64Align = 8
HugePageSize = 0
- MinFrameSize = 8
+ MinFrameSize = 16
)
type Uintreg uint64
diff --git a/src/runtime/memmove_arm64.s b/src/runtime/memmove_arm64.s
index 00813d4..ae149e5 100644
--- a/src/runtime/memmove_arm64.s
+++ b/src/runtime/memmove_arm64.s
@@ -5,7 +5,7 @@
#include "textflag.h"
// void runtime·memmove(void*, void*, uintptr)
-TEXT runtime·memmove(SB), NOSPLIT, $-8-24
+TEXT runtime·memmove(SB), NOSPLIT, $-16-24
MOVD to+0(FP), R3
MOVD from+8(FP), R4
MOVD n+16(FP), R5
diff --git a/src/runtime/mgcmark.go b/src/runtime/mgcmark.go
index 9029d19..ce6b2a1 100644
--- a/src/runtime/mgcmark.go
+++ b/src/runtime/mgcmark.go
@@ -807,13 +807,7 @@
// Scan local variables if stack frame has been allocated.
size := frame.varp - frame.sp
- var minsize uintptr
- switch sys.ArchFamily {
- case sys.ARM64:
- minsize = sys.SpAlign
- default:
- minsize = sys.MinFrameSize
- }
+ minsize := uintptr(sys.MinFrameSize)
if size > minsize {
stkmap := (*stackmap)(funcdata(f, _FUNCDATA_LocalsPointerMaps))
if stkmap == nil || stkmap.n <= 0 {
diff --git a/src/runtime/mkduff.go b/src/runtime/mkduff.go
index fb7cbc2..187db81 100644
--- a/src/runtime/mkduff.go
+++ b/src/runtime/mkduff.go
@@ -153,7 +153,7 @@
// ZR: always zero
// R16 (aka REGRT1): ptr to memory to be zeroed
// On return, R16 points to the last zeroed dword.
- fmt.Fprintln(w, "TEXT runtime·duffzero(SB), NOSPLIT, $-8-0")
+ fmt.Fprintln(w, "TEXT runtime·duffzero(SB), NOSPLIT, $-16-0")
for i := 0; i < 63; i++ {
fmt.Fprintln(w, "\tSTP.P\t(ZR, ZR), 16(R16)")
}
diff --git a/src/runtime/rt0_android_arm64.s b/src/runtime/rt0_android_arm64.s
index 9378213..c03932e 100644
--- a/src/runtime/rt0_android_arm64.s
+++ b/src/runtime/rt0_android_arm64.s
@@ -4,13 +4,13 @@
#include "textflag.h"
-TEXT _rt0_arm64_android(SB),NOSPLIT,$-8
+TEXT _rt0_arm64_android(SB),NOSPLIT,$-16
MOVD $_rt0_arm64_linux(SB), R4
B (R4)
// When building with -buildmode=c-shared, this symbol is called when the shared
// library is loaded.
-TEXT _rt0_arm64_android_lib(SB),NOSPLIT,$-8
+TEXT _rt0_arm64_android_lib(SB),NOSPLIT,$-16
MOVW $1, R0 // argc
MOVD $_rt0_arm64_android_argv(SB), R1 // **argv
MOVD $_rt0_arm64_linux_lib(SB), R4
diff --git a/src/runtime/rt0_darwin_arm64.s b/src/runtime/rt0_darwin_arm64.s
index f607683..023b49e 100644
--- a/src/runtime/rt0_darwin_arm64.s
+++ b/src/runtime/rt0_darwin_arm64.s
@@ -6,7 +6,7 @@
// No need for _rt0_arm64_darwin as darwin/arm64 only
// supports external linking.
-TEXT _rt0_arm64_darwin(SB),NOSPLIT,$-8
+TEXT _rt0_arm64_darwin(SB),NOSPLIT,$-16
MOVD $42, R0
MOVD $1, R16 // SYS_exit
SVC $0x80
@@ -16,25 +16,25 @@
//
// Note that all currently shipping darwin/arm64 platforms require
// cgo and do not support c-shared.
-TEXT _rt0_arm64_darwin_lib(SB),NOSPLIT,$168
+TEXT _rt0_arm64_darwin_lib(SB),NOSPLIT,$176
// Preserve callee-save registers.
- MOVD R19, 24(RSP)
- MOVD R20, 32(RSP)
- MOVD R21, 40(RSP)
- MOVD R22, 48(RSP)
- MOVD R23, 56(RSP)
- MOVD R24, 64(RSP)
- MOVD R25, 72(RSP)
- MOVD R26, 80(RSP)
- MOVD R27, 88(RSP)
- FMOVD F8, 96(RSP)
- FMOVD F9, 104(RSP)
- FMOVD F10, 112(RSP)
- FMOVD F11, 120(RSP)
- FMOVD F12, 128(RSP)
- FMOVD F13, 136(RSP)
- FMOVD F14, 144(RSP)
- FMOVD F15, 152(RSP)
+ MOVD R19, 32(RSP)
+ MOVD R20, 40(RSP)
+ MOVD R21, 48(RSP)
+ MOVD R22, 56(RSP)
+ MOVD R23, 64(RSP)
+ MOVD R24, 72(RSP)
+ MOVD R25, 80(RSP)
+ MOVD R26, 88(RSP)
+ MOVD R27, 96(RSP)
+ FMOVD F8, 104(RSP)
+ FMOVD F9, 112(RSP)
+ FMOVD F10, 120(RSP)
+ FMOVD F11, 128(RSP)
+ FMOVD F12, 136(RSP)
+ FMOVD F13, 144(RSP)
+ FMOVD F14, 152(RSP)
+ FMOVD F15, 160(RSP)
MOVD R0, _rt0_arm64_darwin_lib_argc<>(SB)
MOVD R1, _rt0_arm64_darwin_lib_argv<>(SB)
@@ -50,23 +50,23 @@
BL (R4)
// Restore callee-save registers.
- MOVD 24(RSP), R19
- MOVD 32(RSP), R20
- MOVD 40(RSP), R21
- MOVD 48(RSP), R22
- MOVD 56(RSP), R23
- MOVD 64(RSP), R24
- MOVD 72(RSP), R25
- MOVD 80(RSP), R26
- MOVD 88(RSP), R27
- FMOVD 96(RSP), F8
- FMOVD 104(RSP), F9
- FMOVD 112(RSP), F10
- FMOVD 120(RSP), F11
- FMOVD 128(RSP), F12
- FMOVD 136(RSP), F13
- FMOVD 144(RSP), F14
- FMOVD 152(RSP), F15
+ MOVD 32(RSP), R19
+ MOVD 40(RSP), R20
+ MOVD 48(RSP), R21
+ MOVD 56(RSP), R22
+ MOVD 64(RSP), R23
+ MOVD 72(RSP), R24
+ MOVD 80(RSP), R25
+ MOVD 88(RSP), R26
+ MOVD 96(RSP), R27
+ FMOVD 104(RSP), F8
+ FMOVD 112(RSP), F9
+ FMOVD 120(RSP), F10
+ FMOVD 128(RSP), F11
+ FMOVD 136(RSP), F12
+ FMOVD 144(RSP), F13
+ FMOVD 152(RSP), F14
+ FMOVD 160(RSP), F15
RET
TEXT _rt0_arm64_darwin_lib_go(SB),NOSPLIT,$0
@@ -80,7 +80,7 @@
DATA _rt0_arm64_darwin_lib_argv<>(SB)/8, $0
GLOBL _rt0_arm64_darwin_lib_argv<>(SB),NOPTR, $8
-TEXT main(SB),NOSPLIT,$-8
+TEXT main(SB),NOSPLIT,$-16
MOVD $runtime·rt0_go(SB), R2
BL (R2)
exit:
diff --git a/src/runtime/rt0_linux_arm64.s b/src/runtime/rt0_linux_arm64.s
index d01d415..a4ea3bb 100644
--- a/src/runtime/rt0_linux_arm64.s
+++ b/src/runtime/rt0_linux_arm64.s
@@ -4,32 +4,32 @@
#include "textflag.h"
-TEXT _rt0_arm64_linux(SB),NOSPLIT,$-8
+TEXT _rt0_arm64_linux(SB),NOSPLIT,$-16
MOVD 0(RSP), R0 // argc
ADD $8, RSP, R1 // argv
BL main(SB)
// When building with -buildmode=c-shared, this symbol is called when the shared
// library is loaded.
-TEXT _rt0_arm64_linux_lib(SB),NOSPLIT,$168
+TEXT _rt0_arm64_linux_lib(SB),NOSPLIT,$176
// Preserve callee-save registers.
- MOVD R19, 24(RSP)
- MOVD R20, 32(RSP)
- MOVD R21, 40(RSP)
- MOVD R22, 48(RSP)
- MOVD R23, 56(RSP)
- MOVD R24, 64(RSP)
- MOVD R25, 72(RSP)
- MOVD R26, 80(RSP)
- MOVD R27, 88(RSP)
- FMOVD F8, 96(RSP)
- FMOVD F9, 104(RSP)
- FMOVD F10, 112(RSP)
- FMOVD F11, 120(RSP)
- FMOVD F12, 128(RSP)
- FMOVD F13, 136(RSP)
- FMOVD F14, 144(RSP)
- FMOVD F15, 152(RSP)
+ MOVD R19, 32(RSP)
+ MOVD R20, 40(RSP)
+ MOVD R21, 48(RSP)
+ MOVD R22, 56(RSP)
+ MOVD R23, 64(RSP)
+ MOVD R24, 72(RSP)
+ MOVD R25, 80(RSP)
+ MOVD R26, 88(RSP)
+ MOVD R27, 96(RSP)
+ FMOVD F8, 104(RSP)
+ FMOVD F9, 112(RSP)
+ FMOVD F10, 120(RSP)
+ FMOVD F11, 128(RSP)
+ FMOVD F12, 136(RSP)
+ FMOVD F13, 144(RSP)
+ FMOVD F14, 152(RSP)
+ FMOVD F15, 160(RSP)
MOVD R0, _rt0_arm64_linux_lib_argc<>(SB)
MOVD R1, _rt0_arm64_linux_lib_argv<>(SB)
@@ -50,30 +50,30 @@
nocgo:
MOVD $0x800000, R0 // stacksize = 8192KB
MOVD $_rt0_arm64_linux_lib_go(SB), R1
- MOVD R0, 8(RSP)
- MOVD R1, 16(RSP)
+ MOVD R0, 16(RSP)
+ MOVD R1, 24(RSP)
MOVD $runtime·newosproc0(SB),R4
BL (R4)
restore:
// Restore callee-save registers.
- MOVD 24(RSP), R19
- MOVD 32(RSP), R20
- MOVD 40(RSP), R21
- MOVD 48(RSP), R22
- MOVD 56(RSP), R23
- MOVD 64(RSP), R24
- MOVD 72(RSP), R25
- MOVD 80(RSP), R26
- MOVD 88(RSP), R27
- FMOVD 96(RSP), F8
- FMOVD 104(RSP), F9
- FMOVD 112(RSP), F10
- FMOVD 120(RSP), F11
- FMOVD 128(RSP), F12
- FMOVD 136(RSP), F13
- FMOVD 144(RSP), F14
- FMOVD 152(RSP), F15
+ MOVD 32(RSP), R19
+ MOVD 40(RSP), R20
+ MOVD 48(RSP), R21
+ MOVD 56(RSP), R22
+ MOVD 64(RSP), R23
+ MOVD 72(RSP), R24
+ MOVD 80(RSP), R25
+ MOVD 88(RSP), R26
+ MOVD 96(RSP), R27
+ FMOVD 104(RSP), F8
+ FMOVD 112(RSP), F9
+ FMOVD 120(RSP), F10
+ FMOVD 128(RSP), F11
+ FMOVD 136(RSP), F12
+ FMOVD 144(RSP), F13
+ FMOVD 152(RSP), F14
+ FMOVD 160(RSP), F15
RET
TEXT _rt0_arm64_linux_lib_go(SB),NOSPLIT,$0
@@ -88,7 +88,7 @@
GLOBL _rt0_arm64_linux_lib_argv<>(SB),NOPTR, $8
-TEXT main(SB),NOSPLIT,$-8
+TEXT main(SB),NOSPLIT,$-16
MOVD $runtime·rt0_go(SB), R2
BL (R2)
exit:
diff --git a/src/runtime/signal_arm64.go b/src/runtime/signal_arm64.go
index 1db0525..254ae9e 100644
--- a/src/runtime/signal_arm64.go
+++ b/src/runtime/signal_arm64.go
@@ -65,6 +65,10 @@
// anyway.
sp := c.sp() - sys.SpAlign // needs only sizeof uint64, but must align the stack
c.set_sp(sp)
+ if sys.GOARCH == "arm64" {
+ *(*uint64)(unsafe.Pointer(uintptr(sp))) = c.r29()
+ sp += sys.RegSize
+ }
*(*uint64)(unsafe.Pointer(uintptr(sp))) = c.lr()
pc := gp.sigpc
diff --git a/src/runtime/stack.go b/src/runtime/stack.go
index 4e60e80..89022b2 100644
--- a/src/runtime/stack.go
+++ b/src/runtime/stack.go
@@ -512,6 +512,21 @@
// +------------------+
// | return address |
// +------------------+ <- frame->sp
+//
+// (arm64)
+// +------------------+
+// | args from caller |
+// +------------------+ <- frame->argp
+// | caller's retaddr |
+// +------------------+
+// | caller's FP (*) | (*) if framepointer_enabled
+// +------------------+ <- frame->varp
+// | locals |
+// +------------------+
+// | args to callee |
+// +------------------+
+// | return address |
+// +------------------+ <- frame->sp
type adjustinfo struct {
old stack
@@ -635,13 +650,7 @@
// Adjust local variables if stack frame has been allocated.
size := frame.varp - frame.sp
- var minsize uintptr
- switch sys.ArchFamily {
- case sys.ARM64:
- minsize = sys.SpAlign
- default:
- minsize = sys.MinFrameSize
- }
+ minsize := uintptr(sys.MinFrameSize)
if size > minsize {
var bv bitvector
stackmap := (*stackmap)(funcdata(f, _FUNCDATA_LocalsPointerMaps))
diff --git a/src/runtime/sys_darwin_arm64.s b/src/runtime/sys_darwin_arm64.s
index 34fb1f3..e705a65 100644
--- a/src/runtime/sys_darwin_arm64.s
+++ b/src/runtime/sys_darwin_arm64.s
@@ -79,7 +79,7 @@
MOVW R0, ret+24(FP)
RET
-TEXT runtime·exit(SB),NOSPLIT,$-8
+TEXT runtime·exit(SB),NOSPLIT,$-16
MOVW code+0(FP), R0
MOVW $SYS_exit, R16
SVC $0x80
@@ -150,8 +150,8 @@
SVC $0x80
RET
-TEXT runtime·walltime(SB),NOSPLIT,$40-12
- MOVD RSP, R0 // timeval
+TEXT runtime·walltime(SB),NOSPLIT,$32-12
+ ADD $16, RSP, R0 // timeval
MOVD R0, R9 // this is how dyld calls gettimeofday
MOVW $0, R1 // zone
MOVD $0, R2 // see issue 16570
@@ -159,8 +159,8 @@
SVC $0x80 // Note: x0 is tv_sec, w1 is tv_usec
CMP $0, R0
BNE inreg
- MOVD 0(RSP), R0
- MOVW 8(RSP), R1
+ MOVD 16(RSP), R0
+ MOVW 24(RSP), R1
inreg:
MOVD R0, sec+0(FP)
MOVW $1000, R3
@@ -168,8 +168,8 @@
MOVW R1, nsec+8(FP)
RET
-TEXT runtime·nanotime(SB),NOSPLIT,$40
- MOVD RSP, R0 // timeval
+TEXT runtime·nanotime(SB),NOSPLIT,$32
+ ADD $16, RSP, R0 // timeval
MOVD R0, R9 // this is how dyld calls gettimeofday
MOVW $0, R1 // zone
MOVD $0, R2 // see issue 16570
@@ -177,8 +177,8 @@
SVC $0x80 // Note: x0 is tv_sec, w1 is tv_usec
CMP $0, R0
BNE inreg
- MOVD 0(RSP), R0
- MOVW 8(RSP), R1
+ MOVD 16(RSP), R0
+ MOVW 24(RSP), R1
inreg:
MOVW $1000000000, R3
MUL R3, R0
@@ -222,14 +222,14 @@
// badsignal will expect R2 at 8(RSP), so we also
// push R1 onto stack. turns out we do need R1
// to do sigreturn.
- MOVD.W R1, -16(RSP)
- MOVD R2, 8(RSP)
+ MOVD.W R1, -32(RSP)
+ MOVD R2, 16(RSP)
MOVD R4, 24(RSP) // save ucontext, badsignal might clobber R4
MOVD $runtime·badsignal(SB), R26
BL (R26)
MOVD 0(RSP), R1 // saved infostype
MOVD 24(RSP), R0 // the ucontext
- ADD $(16+16), RSP
+ ADD $(16+32), RSP
B ret
cont:
@@ -244,14 +244,14 @@
SUB $64, R6
// copy arguments for call to sighandler
- MOVD R2, 8(R6) // signal num
- MOVD R3, 16(R6) // signal info
- MOVD R4, 24(R6) // context
- MOVD g, 32(R6) // old_g
+ MOVD R2, 16(R6) // signal num
+ MOVD R3, 24(R6) // signal info
+ MOVD R4, 32(R6) // context
+ MOVD g, 40(R6) // old_g
// Backup ucontext and infostyle
- MOVD R4, 40(R6)
- MOVD R1, 48(R6)
+ MOVD R4, 48(R6)
+ MOVD R1, 56(R6)
// switch stack and g
MOVD R6, RSP // sigtramp is not re-entrant, so no need to back up RSP.
@@ -260,8 +260,8 @@
BL (R0)
// call sigreturn
- MOVD 40(RSP), R0 // saved ucontext
- MOVD 48(RSP), R1 // saved infostyle
+ MOVD 48(RSP), R0 // saved ucontext
+ MOVD 56(RSP), R1 // saved infostyle
ret:
MOVW $SYS_sigreturn, R16 // sigreturn(ucontext, infostyle)
SVC $0x80
@@ -289,7 +289,7 @@
BL notok<>(SB)
RET
-TEXT runtime·usleep(SB),NOSPLIT,$24
+TEXT runtime·usleep(SB),NOSPLIT,$16
MOVW usec+0(FP), R0
MOVW R0, R1
MOVW $1000000, R2
diff --git a/src/runtime/sys_linux_arm64.s b/src/runtime/sys_linux_arm64.s
index e921f99..3f91cf4 100644
--- a/src/runtime/sys_linux_arm64.s
+++ b/src/runtime/sys_linux_arm64.s
@@ -48,19 +48,19 @@
#define SYS_connect 203
#define SYS_brk 214
-TEXT runtime·exit(SB),NOSPLIT,$-8-4
+TEXT runtime·exit(SB),NOSPLIT,$-16-4
MOVW code+0(FP), R0
MOVD $SYS_exit_group, R8
SVC
RET
-TEXT runtime·exit1(SB),NOSPLIT,$-8-4
+TEXT runtime·exit1(SB),NOSPLIT,$-16-4
MOVW code+0(FP), R0
MOVD $SYS_exit, R8
SVC
RET
-TEXT runtime·open(SB),NOSPLIT,$-8-20
+TEXT runtime·open(SB),NOSPLIT,$-16-20
MOVD $AT_FDCWD, R0
MOVD name+0(FP), R1
MOVW mode+8(FP), R2
@@ -74,7 +74,7 @@
MOVW R0, ret+16(FP)
RET
-TEXT runtime·closefd(SB),NOSPLIT,$-8-12
+TEXT runtime·closefd(SB),NOSPLIT,$-16-12
MOVW fd+0(FP), R0
MOVD $SYS_close, R8
SVC
@@ -85,7 +85,7 @@
MOVW R0, ret+8(FP)
RET
-TEXT runtime·write(SB),NOSPLIT,$-8-28
+TEXT runtime·write(SB),NOSPLIT,$-16-28
MOVD fd+0(FP), R0
MOVD p+8(FP), R1
MOVW n+16(FP), R2
@@ -98,7 +98,7 @@
MOVW R0, ret+24(FP)
RET
-TEXT runtime·read(SB),NOSPLIT,$-8-28
+TEXT runtime·read(SB),NOSPLIT,$-16-28
MOVW fd+0(FP), R0
MOVD p+8(FP), R1
MOVW n+16(FP), R2
@@ -111,7 +111,7 @@
MOVW R0, ret+24(FP)
RET
-TEXT runtime·getrlimit(SB),NOSPLIT,$-8-20
+TEXT runtime·getrlimit(SB),NOSPLIT,$-16-20
MOVW kind+0(FP), R0
MOVD limit+8(FP), R1
MOVD $SYS_getrlimit, R8
@@ -119,24 +119,24 @@
MOVW R0, ret+16(FP)
RET
-TEXT runtime·usleep(SB),NOSPLIT,$24-4
+TEXT runtime·usleep(SB),NOSPLIT,$16-4
MOVWU usec+0(FP), R3
MOVD R3, R5
MOVW $1000000, R4
UDIV R4, R3
- MOVD R3, 8(RSP)
+ MOVD R3, 16(RSP)
MUL R3, R4
SUB R4, R5
MOVW $1000, R4
MUL R4, R5
- MOVD R5, 16(RSP)
+ MOVD R5, 24(RSP)
// pselect6(0, 0, 0, 0, &ts, 0)
MOVD $0, R0
MOVD R0, R1
MOVD R0, R2
MOVD R0, R3
- ADD $8, RSP, R4
+ ADD $16, RSP, R4
MOVD R0, R5
MOVD $SYS_pselect6, R8
SVC
@@ -148,7 +148,7 @@
MOVW R0, ret+0(FP)
RET
-TEXT runtime·raise(SB),NOSPLIT,$-8
+TEXT runtime·raise(SB),NOSPLIT,$-16
MOVD $SYS_gettid, R8
SVC
MOVW R0, R0 // arg 1 tid
@@ -157,7 +157,7 @@
SVC
RET
-TEXT runtime·raiseproc(SB),NOSPLIT,$-8
+TEXT runtime·raiseproc(SB),NOSPLIT,$-16
MOVD $SYS_getpid, R8
SVC
MOVW R0, R0 // arg 1 pid
@@ -166,7 +166,7 @@
SVC
RET
-TEXT runtime·setitimer(SB),NOSPLIT,$-8-24
+TEXT runtime·setitimer(SB),NOSPLIT,$-16-24
MOVW mode+0(FP), R0
MOVD new+8(FP), R1
MOVD old+16(FP), R2
@@ -174,7 +174,7 @@
SVC
RET
-TEXT runtime·mincore(SB),NOSPLIT,$-8-28
+TEXT runtime·mincore(SB),NOSPLIT,$-16-28
MOVD addr+0(FP), R0
MOVD n+8(FP), R1
MOVD dst+16(FP), R2
@@ -184,24 +184,24 @@
RET
// func walltime() (sec int64, nsec int32)
-TEXT runtime·walltime(SB),NOSPLIT,$24-12
+TEXT runtime·walltime(SB),NOSPLIT,$16-12
MOVW $0, R0 // CLOCK_REALTIME
- MOVD RSP, R1
+ ADD $16, RSP, R1
MOVD $SYS_clock_gettime, R8
SVC
- MOVD 0(RSP), R3 // sec
- MOVD 8(RSP), R5 // nsec
+ MOVD 16(RSP), R3 // sec
+ MOVD 24(RSP), R5 // nsec
MOVD R3, sec+0(FP)
MOVW R5, nsec+8(FP)
RET
-TEXT runtime·nanotime(SB),NOSPLIT,$24-8
+TEXT runtime·nanotime(SB),NOSPLIT,$16-8
MOVW $1, R0 // CLOCK_MONOTONIC
- MOVD RSP, R1
+ ADD $16, RSP, R1
MOVD $SYS_clock_gettime, R8
SVC
- MOVD 0(RSP), R3 // sec
- MOVD 8(RSP), R5 // nsec
+ MOVD 16(RSP), R3 // sec
+ MOVD 24(RSP), R5 // nsec
// sec is in R3, nsec in R5
// return nsec in R3
MOVD $1000000000, R4
@@ -210,7 +210,7 @@
MOVD R3, ret+0(FP)
RET
-TEXT runtime·rtsigprocmask(SB),NOSPLIT,$-8-28
+TEXT runtime·rtsigprocmask(SB),NOSPLIT,$-16-28
MOVW how+0(FP), R0
MOVD new+8(FP), R1
MOVD old+16(FP), R2
@@ -224,7 +224,7 @@
done:
RET
-TEXT runtime·rt_sigaction(SB),NOSPLIT,$-8-36
+TEXT runtime·rt_sigaction(SB),NOSPLIT,$-16-36
MOVD sig+0(FP), R0
MOVD new+8(FP), R1
MOVD old+16(FP), R2
@@ -242,18 +242,18 @@
BL (R11)
RET
-TEXT runtime·sigtramp(SB),NOSPLIT,$24
+TEXT runtime·sigtramp(SB),NOSPLIT,$32
// this might be called in external code context,
// where g is not set.
// first save R0, because runtime·load_g will clobber it
- MOVW R0, 8(RSP)
+ MOVW R0, 16(RSP)
MOVBU runtime·iscgo(SB), R0
CMP $0, R0
BEQ 2(PC)
BL runtime·load_g(SB)
- MOVD R1, 16(RSP)
- MOVD R2, 24(RSP)
+ MOVD R1, 24(RSP)
+ MOVD R2, 32(RSP)
MOVD $runtime·sigtrampgo(SB), R0
BL (R0)
RET
@@ -262,7 +262,7 @@
MOVD $runtime·sigtramp(SB), R3
B (R3)
-TEXT runtime·mmap(SB),NOSPLIT,$-8
+TEXT runtime·mmap(SB),NOSPLIT,$-16
MOVD addr+0(FP), R0
MOVD n+8(FP), R1
MOVW prot+16(FP), R2
@@ -278,7 +278,7 @@
MOVD R0, ret+32(FP)
RET
-TEXT runtime·munmap(SB),NOSPLIT,$-8
+TEXT runtime·munmap(SB),NOSPLIT,$-16
MOVD addr+0(FP), R0
MOVD n+8(FP), R1
MOVD $SYS_munmap, R8
@@ -289,7 +289,7 @@
cool:
RET
-TEXT runtime·madvise(SB),NOSPLIT,$-8
+TEXT runtime·madvise(SB),NOSPLIT,$-16
MOVD addr+0(FP), R0
MOVD n+8(FP), R1
MOVW flags+16(FP), R2
@@ -300,7 +300,7 @@
// int64 futex(int32 *uaddr, int32 op, int32 val,
// struct timespec *timeout, int32 *uaddr2, int32 val2);
-TEXT runtime·futex(SB),NOSPLIT,$-8
+TEXT runtime·futex(SB),NOSPLIT,$-16
MOVD addr+0(FP), R0
MOVW op+8(FP), R1
MOVW val+12(FP), R2
@@ -313,7 +313,7 @@
RET
// int64 clone(int32 flags, void *stk, M *mp, G *gp, void (*fn)(void));
-TEXT runtime·clone(SB),NOSPLIT,$-8
+TEXT runtime·clone(SB),NOSPLIT,$-16
MOVW flags+0(FP), R0
MOVD stk+8(FP), R1
@@ -381,7 +381,7 @@
SVC
B again // keep exiting
-TEXT runtime·sigaltstack(SB),NOSPLIT,$-8
+TEXT runtime·sigaltstack(SB),NOSPLIT,$-16
MOVD new+0(FP), R0
MOVD old+8(FP), R1
MOVD $SYS_sigaltstack, R8
@@ -393,12 +393,12 @@
ok:
RET
-TEXT runtime·osyield(SB),NOSPLIT,$-8
+TEXT runtime·osyield(SB),NOSPLIT,$-16
MOVD $SYS_sched_yield, R8
SVC
RET
-TEXT runtime·sched_getaffinity(SB),NOSPLIT,$-8
+TEXT runtime·sched_getaffinity(SB),NOSPLIT,$-16
MOVD pid+0(FP), R0
MOVD len+8(FP), R1
MOVD buf+16(FP), R2
@@ -408,7 +408,7 @@
RET
// int32 runtime·epollcreate(int32 size);
-TEXT runtime·epollcreate(SB),NOSPLIT,$-8
+TEXT runtime·epollcreate(SB),NOSPLIT,$-16
MOVW $0, R0
MOVD $SYS_epoll_create1, R8
SVC
@@ -416,7 +416,7 @@
RET
// int32 runtime·epollcreate1(int32 flags);
-TEXT runtime·epollcreate1(SB),NOSPLIT,$-8
+TEXT runtime·epollcreate1(SB),NOSPLIT,$-16
MOVW flags+0(FP), R0
MOVD $SYS_epoll_create1, R8
SVC
@@ -424,7 +424,7 @@
RET
// func epollctl(epfd, op, fd int32, ev *epollEvent) int
-TEXT runtime·epollctl(SB),NOSPLIT,$-8
+TEXT runtime·epollctl(SB),NOSPLIT,$-16
MOVW epfd+0(FP), R0
MOVW op+4(FP), R1
MOVW fd+8(FP), R2
@@ -435,7 +435,7 @@
RET
// int32 runtime·epollwait(int32 epfd, EpollEvent *ev, int32 nev, int32 timeout);
-TEXT runtime·epollwait(SB),NOSPLIT,$-8
+TEXT runtime·epollwait(SB),NOSPLIT,$-16
MOVW epfd+0(FP), R0
MOVD ev+8(FP), R1
MOVW nev+16(FP), R2
@@ -447,7 +447,7 @@
RET
// void runtime·closeonexec(int32 fd);
-TEXT runtime·closeonexec(SB),NOSPLIT,$-8
+TEXT runtime·closeonexec(SB),NOSPLIT,$-16
MOVW fd+0(FP), R0 // fd
MOVD $2, R1 // F_SETFD
MOVD $1, R2 // FD_CLOEXEC
diff --git a/src/runtime/traceback.go b/src/runtime/traceback.go
index c74d438..b3f1c9f 100644
--- a/src/runtime/traceback.go
+++ b/src/runtime/traceback.go
@@ -266,6 +266,10 @@
if usesLR {
if n == 0 && frame.sp < frame.fp || frame.lr == 0 {
lrPtr = frame.sp
+ if GOARCH == "arm64" {
+ // LR is stored above FP
+ lrPtr += sys.RegSize
+ }
frame.lr = *(*uintptr)(unsafe.Pointer(lrPtr))
}
} else {
@@ -472,12 +476,13 @@
// On link register architectures, sighandler saves the LR on stack
// before faking a call to sigpanic.
if usesLR && waspanic {
- x := *(*uintptr)(unsafe.Pointer(frame.sp))
- frame.sp += sys.MinFrameSize
+ lrPtr := frame.sp
if GOARCH == "arm64" {
- // arm64 needs 16-byte aligned SP, always
- frame.sp += sys.PtrSize
+ // LR is stored above FP
+ lrPtr += sys.RegSize
}
+ x := *(*uintptr)(unsafe.Pointer(lrPtr))
+ frame.sp += sys.MinFrameSize
f = findfunc(frame.pc)
frame.fn = f
if !f.valid() {
To view, visit change 61511. To unsubscribe, or for help writing mail filters, visit settings.
mind taking a look?
Nice work! I ran iostest.sh and androidtest.sh and both darwin/arm64 and android/arm64 pass with this CL applied.
If you like, add
Fixes #10110
Fixes #12100
to the CL description.
Looks ok in general. Have not carefully read the assembly. A few questions:
Patch set 1:Run-TryBot +1
9 comments:
arm64
Patch Set #1, Line 20: a bug fixing and reduced whitelist
This should be a separate CL, if possible.
Patch Set #1, Line 25: benchmark old ns/op new ns/op delta
Could you use benchstat (https://godoc.org/golang.org/x/perf/cmd/benchstat) instead of benchcmp?
Patch Set #1, Line 76: Updates #15840
I think this issue is specific to AMD64. You can create another issue for ARM64.
File src/cmd/compile/internal/arm64/ggen.go:
Patch Set #1, Line 20: (16-(frame%16))
No need for the outer parentheses.
File src/cmd/internal/obj/arm64/obj7.go:
Patch Set #1, Line 753: p = obj.Appendp(p, c.newprog)
I don't understand this instruction. Why let FP = SP here?
Probably we should always do the LDP path as long as frame size is not 0. I think we saved LR and FP whenever frame size is not 0, even for leaf function.
Patch Set #1, Line 1: // Copyright 2017 The Go Authors. All rights reserved.
Undo this. We don't update copyright years. (also in other files)
File src/runtime/cgo/asm_arm64.s:
Patch Set #1, Line 28: // | FP, LR |
FP and LR are the same as R29, R30. Why do it twice?
Patch Set #1, Line 529: // +------------------+ <- frame->sp
Also an FP here, right?
To view, visit change 61511. To unsubscribe, or for help writing mail filters, visit settings.
>
> Fixes #10110
Does this CL make gdb any better?
Build is still in progress...
This change failed on misc-vet-vetall:
See https://storage.googleapis.com/go-build-log/ce225725/misc-vet-vetall_e66a3fe4.log
Consult https://build.golang.org/ to see whether it's a new failure. Other builds still in progress; subsequent failure notices suppressed until final report.
1 of 21 TryBots failed:
Failed on misc-vet-vetall: https://storage.googleapis.com/go-build-log/ce225725/misc-vet-vetall_e66a3fe4.log
Consult https://build.golang.org/ to see whether they are new failures.
Patch set 1:TryBot-Result -1
1 comment:
Patch Set #1, Line 20: a bug fixing and reduced whitelist
This should be a separate CL, if possible.
Done
I have created: CL 62850, Could you please review it firstly?
To view, visit change 61511. To unsubscribe, or for help writing mail filters, visit settings.
Wei Xiao uploaded patch set #2 to this change.
build: support frame-pointer for arm64
Supporting frame-pointer makes Linux's perf and other profilers much more useful
because it lets them gather a stack trace efficiently on profiling events. Major
changes include:
1. save FP&LR (LR above FP) at prologue and restore them at epilogue via STP/LDP instructions
2. change frame size alignment to 16 bytes (original: 8 mod 16)
3. adjust offset and frame size in *.s, analysis and wrapper code
4. adjust DWARF information to support FP
5. add support to FP in goroutine scheduler
6. adjust SSA arm64 backend to support FP
7. adjust signal handler according to frame layout changes
8. adjust cgo trampoline functions according to frame layout changes
9. change Darwin and Android runtime for enabling frame-pointer
Performance impacts on go1 benchmarks:
Enable frame-pointer
name old time/op new time/op delta
BinaryTree17-32 6.52s ± 0% 6.82s ± 0% ~ (p=1.000 n=1+1)
Fannkuch11-32 6.75s ± 0% 6.87s ± 0% ~ (p=1.000 n=1+1)
FmtFprintfEmpty-32 95.7ns ± 0% 95.5ns ± 0% ~ (p=1.000 n=1+1)
FmtFprintfString-32 166ns ± 0% 182ns ± 0% ~ (p=1.000 n=1+1)
FmtFprintfInt-32 199ns ± 0% 206ns ± 0% ~ (p=1.000 n=1+1)
FmtFprintfIntInt-32 275ns ± 0% 278ns ± 0% ~ (p=1.000 n=1+1)
FmtFprintfPrefixedInt-32 349ns ± 0% 366ns ± 0% ~ (p=1.000 n=1+1)
FmtFprintfFloat-32 474ns ± 0% 481ns ± 0% ~ (p=1.000 n=1+1)
FmtManyArgs-32 1.32µs ± 0% 1.34µs ± 0% ~ (p=1.000 n=1+1)
GobDecode-32 16.4ms ± 0% 18.0ms ± 0% ~ (p=1.000 n=1+1)
GobEncode-32 12.5ms ± 0% 14.9ms ± 0% ~ (p=1.000 n=1+1)
Gzip-32 829ms ± 0% 793ms ± 0% ~ (p=1.000 n=1+1)
Gunzip-32 81.3ms ± 0% 83.6ms ± 0% ~ (p=1.000 n=1+1)
HTTPClientServer-32 126µs ± 0% 128µs ± 0% ~ (p=1.000 n=1+1)
JSONEncode-32 32.2ms ± 0% 33.9ms ± 0% ~ (p=1.000 n=1+1)
JSONDecode-32 149ms ± 0% 141ms ± 0% ~ (p=1.000 n=1+1)
Mandelbrot200-32 9.48ms ± 0% 9.48ms ± 0% ~ (p=1.000 n=1+1)
GoParse-32 8.14ms ± 0% 8.15ms ± 0% ~ (p=1.000 n=1+1)
RegexpMatchEasy0_32-32 220ns ± 0% 221ns ± 0% ~ (p=1.000 n=1+1)
RegexpMatchEasy0_1K-32 1.76µs ± 0% 1.77µs ± 0% ~ (p=1.000 n=1+1)
RegexpMatchEasy1_32-32 236ns ± 0% 232ns ± 0% ~ (p=1.000 n=1+1)
RegexpMatchEasy1_1K-32 2.09µs ± 0% 2.05µs ± 0% ~ (p=1.000 n=1+1)
RegexpMatchMedium_32-32 304ns ± 0% 280ns ± 0% ~ (p=1.000 n=1+1)
RegexpMatchMedium_1K-32 85.4µs ± 0% 75.7µs ± 0% ~ (p=1.000 n=1+1)
RegexpMatchHard_32-32 4.13µs ± 0% 4.46µs ± 0% ~ (p=1.000 n=1+1)
RegexpMatchHard_1K-32 122µs ± 0% 122µs ± 0% ~ (p=1.000 n=1+1)
Revcomp-32 2.35s ± 0% 2.42s ± 0% ~ (p=1.000 n=1+1)
Template-32 146ms ± 0% 170ms ± 0% ~ (p=1.000 n=1+1)
TimeParse-32 752ns ± 0% 775ns ± 0% ~ (p=1.000 n=1+1)
TimeFormat-32 770ns ± 0% 803ns ± 0% ~ (p=1.000 n=1+1)
name old speed new speed delta
GobDecode-32 46.9MB/s ± 0% 42.5MB/s ± 0% ~ (p=1.000 n=1+1)
GobEncode-32 61.6MB/s ± 0% 51.6MB/s ± 0% ~ (p=1.000 n=1+1)
Gzip-32 23.4MB/s ± 0% 24.5MB/s ± 0% ~ (p=1.000 n=1+1)
Gunzip-32 239MB/s ± 0% 232MB/s ± 0% ~ (p=1.000 n=1+1)
JSONEncode-32 60.2MB/s ± 0% 57.3MB/s ± 0% ~ (p=1.000 n=1+1)
JSONDecode-32 13.0MB/s ± 0% 13.8MB/s ± 0% ~ (p=1.000 n=1+1)
GoParse-32 7.11MB/s ± 0% 7.11MB/s ± 0% ~ (all equal)
RegexpMatchEasy0_32-32 145MB/s ± 0% 144MB/s ± 0% ~ (p=1.000 n=1+1)
RegexpMatchEasy0_1K-32 582MB/s ± 0% 579MB/s ± 0% ~ (p=1.000 n=1+1)
RegexpMatchEasy1_32-32 135MB/s ± 0% 137MB/s ± 0% ~ (p=1.000 n=1+1)
RegexpMatchEasy1_1K-32 489MB/s ± 0% 499MB/s ± 0% ~ (p=1.000 n=1+1)
RegexpMatchMedium_32-32 3.28MB/s ± 0% 3.57MB/s ± 0% ~ (p=1.000 n=1+1)
RegexpMatchMedium_1K-32 12.0MB/s ± 0% 13.5MB/s ± 0% ~ (p=1.000 n=1+1)
RegexpMatchHard_32-32 7.74MB/s ± 0% 7.18MB/s ± 0% ~ (p=1.000 n=1+1)
RegexpMatchHard_1K-32 8.39MB/s ± 0% 8.41MB/s ± 0% ~ (p=1.000 n=1+1)
Revcomp-32 108MB/s ± 0% 105MB/s ± 0% ~ (p=1.000 n=1+1)
Template-32 13.3MB/s ± 0% 11.4MB/s ± 0% ~ (p=1.000 n=1+1)
Disable frame-pointer (GOEXPERIMENT=noframepointer)
name old time/op new time/op delta
BinaryTree17-32 6.52s ± 0% 6.68s ± 0% ~ (p=1.000 n=1+1)
Fannkuch11-32 6.75s ± 0% 6.77s ± 0% ~ (p=1.000 n=1+1)
FmtFprintfEmpty-32 95.7ns ± 0% 93.7ns ± 0% ~ (p=1.000 n=1+1)
FmtFprintfString-32 166ns ± 0% 169ns ± 0% ~ (p=1.000 n=1+1)
FmtFprintfInt-32 199ns ± 0% 201ns ± 0% ~ (p=1.000 n=1+1)
FmtFprintfIntInt-32 275ns ± 0% 272ns ± 0% ~ (p=1.000 n=1+1)
FmtFprintfPrefixedInt-32 349ns ± 0% 352ns ± 0% ~ (p=1.000 n=1+1)
FmtFprintfFloat-32 474ns ± 0% 483ns ± 0% ~ (p=1.000 n=1+1)
FmtManyArgs-32 1.32µs ± 0% 1.31µs ± 0% ~ (p=1.000 n=1+1)
GobDecode-32 16.4ms ± 0% 17.5ms ± 0% ~ (p=1.000 n=1+1)
GobEncode-32 12.5ms ± 0% 15.5ms ± 0% ~ (p=1.000 n=1+1)
Gzip-32 829ms ± 0% 827ms ± 0% ~ (p=1.000 n=1+1)
Gunzip-32 81.3ms ± 0% 81.6ms ± 0% ~ (p=1.000 n=1+1)
HTTPClientServer-32 126µs ± 0% 125µs ± 0% ~ (p=1.000 n=1+1)
JSONEncode-32 32.2ms ± 0% 33.9ms ± 0% ~ (p=1.000 n=1+1)
JSONDecode-32 149ms ± 0% 152ms ± 0% ~ (p=1.000 n=1+1)
Mandelbrot200-32 9.48ms ± 0% 9.48ms ± 0% ~ (p=1.000 n=1+1)
GoParse-32 8.14ms ± 0% 7.75ms ± 0% ~ (p=1.000 n=1+1)
RegexpMatchEasy0_32-32 220ns ± 0% 222ns ± 0% ~ (p=1.000 n=1+1)
RegexpMatchEasy0_1K-32 1.76µs ± 0% 1.78µs ± 0% ~ (p=1.000 n=1+1)
RegexpMatchEasy1_32-32 236ns ± 0% 233ns ± 0% ~ (p=1.000 n=1+1)
RegexpMatchEasy1_1K-32 2.09µs ± 0% 2.05µs ± 0% ~ (p=1.000 n=1+1)
RegexpMatchMedium_32-32 304ns ± 0% 275ns ± 0% ~ (p=1.000 n=1+1)
RegexpMatchMedium_1K-32 85.4µs ± 0% 76.3µs ± 0% ~ (p=1.000 n=1+1)
RegexpMatchHard_32-32 4.13µs ± 0% 4.12µs ± 0% ~ (p=1.000 n=1+1)
RegexpMatchHard_1K-32 122µs ± 0% 121µs ± 0% ~ (p=1.000 n=1+1)
Revcomp-32 2.35s ± 0% 2.45s ± 0% ~ (p=1.000 n=1+1)
Template-32 146ms ± 0% 156ms ± 0% ~ (p=1.000 n=1+1)
TimeParse-32 752ns ± 0% 761ns ± 0% ~ (p=1.000 n=1+1)
TimeFormat-32 770ns ± 0% 776ns ± 0% ~ (p=1.000 n=1+1)
name old speed new speed delta
GobDecode-32 46.9MB/s ± 0% 44.0MB/s ± 0% ~ (p=1.000 n=1+1)
GobEncode-32 61.6MB/s ± 0% 49.7MB/s ± 0% ~ (p=1.000 n=1+1)
Gzip-32 23.4MB/s ± 0% 23.5MB/s ± 0% ~ (p=1.000 n=1+1)
Gunzip-32 239MB/s ± 0% 238MB/s ± 0% ~ (p=1.000 n=1+1)
JSONEncode-32 60.2MB/s ± 0% 57.2MB/s ± 0% ~ (p=1.000 n=1+1)
JSONDecode-32 13.0MB/s ± 0% 12.8MB/s ± 0% ~ (p=1.000 n=1+1)
GoParse-32 7.11MB/s ± 0% 7.47MB/s ± 0% ~ (p=1.000 n=1+1)
RegexpMatchEasy0_32-32 145MB/s ± 0% 144MB/s ± 0% ~ (p=1.000 n=1+1)
RegexpMatchEasy0_1K-32 582MB/s ± 0% 575MB/s ± 0% ~ (p=1.000 n=1+1)
RegexpMatchEasy1_32-32 135MB/s ± 0% 137MB/s ± 0% ~ (p=1.000 n=1+1)
RegexpMatchEasy1_1K-32 489MB/s ± 0% 500MB/s ± 0% ~ (p=1.000 n=1+1)
RegexpMatchMedium_32-32 3.28MB/s ± 0% 3.63MB/s ± 0% ~ (p=1.000 n=1+1)
RegexpMatchMedium_1K-32 12.0MB/s ± 0% 13.4MB/s ± 0% ~ (p=1.000 n=1+1)
RegexpMatchHard_32-32 7.74MB/s ± 0% 7.78MB/s ± 0% ~ (p=1.000 n=1+1)
RegexpMatchHard_1K-32 8.39MB/s ± 0% 8.47MB/s ± 0% ~ (p=1.000 n=1+1)
Revcomp-32 108MB/s ± 0% 104MB/s ± 0% ~ (p=1.000 n=1+1)
Template-32 13.3MB/s ± 0% 12.4MB/s ± 0% ~ (p=1.000 n=1+1)
Frame-pointer is enabled by default but can be disabled by setting: GOEXPERIMENT=noframepointer.
Fixes #10110
Fixes #12100
Change-Id: I1bfaca6dba29a63009d7c6ab04ed7a1413d9479e
---
M misc/cgo/test/issue9400/asm_arm64.s
M src/cmd/compile/internal/arm64/ggen.go
M src/cmd/compile/internal/arm64/ssa.go
M src/cmd/compile/internal/gc/asm_test.go
M src/cmd/compile/internal/gc/pgen.go
M src/cmd/internal/obj/arm64/asm7.go
M src/cmd/internal/obj/arm64/obj7.go
M src/cmd/internal/obj/link.go
M src/cmd/internal/objabi/util.go
M src/cmd/link/internal/ld/dwarf.go
M src/cmd/vet/all/whitelist/darwin_arm64.txt
M src/cmd/vet/asmdecl.go
M src/reflect/asm_arm64.s
M src/runtime/asm_arm64.s
M src/runtime/atomic_arm64.s
M src/runtime/cgo/asm_arm64.s
M src/runtime/cgo/signal_darwin_arm64.s
M src/runtime/cgocall.go
M src/runtime/duff_arm64.s
M src/runtime/internal/atomic/asm_arm64.s
M src/runtime/internal/atomic/atomic_arm64.s
M src/runtime/internal/sys/arch_arm64.go
M src/runtime/memmove_arm64.s
M src/runtime/mgcmark.go
M src/runtime/mkduff.go
M src/runtime/rt0_android_arm64.s
M src/runtime/rt0_darwin_arm64.s
M src/runtime/rt0_linux_arm64.s
M src/runtime/signal_arm64.go
M src/runtime/stack.go
M src/runtime/sys_darwin_arm64.s
M src/runtime/sys_linux_arm64.s
M src/runtime/traceback.go
33 files changed, 571 insertions(+), 426 deletions(-)
To view, visit change 61511. To unsubscribe, or for help writing mail filters, visit settings.
Patch Set 1: Run-TryBot+1
(9 comments)
Looks ok in general. Have not carefully read the assembly. A few questions:
- What frame layout do tools like Linux perf expect? FP above LR or LR above FP? Good to mention it in a comment and the CL description.
LR is above FP according to "Procedure Call Standard use by the Application Binary Interface (ABI) for the ARM 64-bit architecture"
- FP slot is reserved even when frame pointer is not enabled. Is it possible not to reserve the slot unconditionally? I guess it's probably very hard, because a lot of code would have to conditional on it. But at least we want to understand it better.
Yes, it's very hard since FP&LR are stored at the bottom of frame and there are a lot of hard-coded offsets relative to RSP in run-time assembly. If we don't reserve the slot then all these offsets need to be calculated at run-time which incurs non-trivial overhead.
- Have you tested it when frame pointer is disabled?
Yes, both of them are tested. Actually, X86 will fail with disabled frame pointer since it doesn't reserve the slot.
- On x86 backend, FP is maintained for Duff's device calls (cmd/internal/obj/x86/asm6.go, line ~3827). Should we do the same?
Yes, I have added code to maintain FP for Duff's device calls.
9 comments:
arm64
Done
Patch Set #1, Line 20: for enabling frame-pointer
Done […]
Done
Could you use benchstat (https://godoc.org/golang. […]
Done
I think this issue is specific to AMD64. You can create another issue for ARM64.
Done
File src/cmd/compile/internal/arm64/ggen.go:
Patch Set #1, Line 20: 16 - (frame % 1
No need for the outer parentheses.
Patch Set #1, Line 753: p = obj.Appendp(p, c.newprog)
I don't understand this instruction. Why let FP = SP here? […]
Maintaining FP for accurate profiling. Otherwise, FP won't be correct after returning to caller.
Patch Set #1, Line 1: // Copyright 2015 The Go Authors. All rights reserved.
Undo this. We don't update copyright years. […]
Done
File src/runtime/cgo/asm_arm64.s:
Patch Set #1, Line 28: // | FP, LR |
FP and LR are the same as R29, R30. […]
FP and LR are just place holders so that callee can get correct parameters.
Patch Set #1, Line 529: // +------------------+
Also an FP here, right?
Done
To view, visit change 61511. To unsubscribe, or for help writing mail filters, visit settings.
Patch Set 1:
Nice work! I ran iostest.sh and androidtest.sh and both darwin/arm64 and android/arm64 pass with this CL applied.
If you like, add
Fixes #10110
Fixes #12100to the CL description.
Done.
Wei Xiao uploaded patch set #3 to this change.
build: support frame-pointer for arm64
Supporting frame-pointer makes Linux's perf and other profilers much more useful
because it lets them gather a stack trace efficiently on profiling events. Major
changes include:
1. save FP&LR (LR above FP) at prologue and restore them at epilogue via STP/LDP instructions
2. change frame size alignment to 16 bytes (original: 8 mod 16)
3. adjust offset and frame size in *.s, analysis and wrapper code
4. adjust DWARF information to support FP
5. add support to FP in goroutine scheduler
6. adjust SSA arm64 backend to support FP
7. adjust signal handler according to frame layout changes
8. adjust cgo trampoline functions according to frame layout changes
9. change Darwin and Android runtime for enabling frame-pointer
Performance impacts on go1 benchmarks:
Enable frame-pointer (by default)
name old time/op new time/op delta
BinaryTree17-8 6.53s ± 1% 6.69s ± 2% +2.40% (p=0.008 n=5+5)
Fannkuch11-8 6.34s ± 0% 6.59s ± 0% +3.96% (p=0.008 n=5+5)
FmtFprintfEmpty-8 101ns ± 1% 102ns ± 0% +1.07% (p=0.048 n=5+5)
FmtFprintfString-8 182ns ± 2% 188ns ± 2% +3.18% (p=0.016 n=5+5)
FmtFprintfInt-8 206ns ± 0% 214ns ± 0% +4.08% (p=0.016 n=4+5)
FmtFprintfIntInt-8 273ns ± 0% 289ns ± 0% +5.86% (p=0.016 n=4+5)
FmtFprintfPrefixedInt-8 374ns ± 1% 375ns ± 1% ~ (p=0.524 n=5+5)
FmtFprintfFloat-8 499ns ± 1% 499ns ± 1% ~ (p=0.611 n=5+5)
FmtManyArgs-8 1.37µs ± 2% 1.33µs ± 1% -2.44% (p=0.024 n=5+5)
GobDecode-8 16.6ms ± 1% 18.0ms ± 2% +8.44% (p=0.008 n=5+5)
GobEncode-8 13.6ms ± 1% 15.9ms ± 1% +16.47% (p=0.008 n=5+5)
Gzip-8 844ms ± 0% 793ms ± 0% -6.07% (p=0.008 n=5+5)
Gunzip-8 84.0ms ± 0% 84.7ms ± 0% +0.79% (p=0.008 n=5+5)
HTTPClientServer-8 121µs ± 1% 121µs ± 0% ~ (p=0.690 n=5+5)
JSONEncode-8 33.4ms ± 1% 35.6ms ± 0% +6.61% (p=0.008 n=5+5)
JSONDecode-8 151ms ± 1% 156ms ± 1% +3.84% (p=0.008 n=5+5)
Mandelbrot200-8 10.0ms ± 0% 10.0ms ± 0% +0.08% (p=0.008 n=5+5)
GoParse-8 7.92ms ± 1% 8.00ms ± 1% +1.04% (p=0.008 n=5+5)
RegexpMatchEasy0_32-8 233ns ± 1% 236ns ± 2% ~ (p=0.762 n=5+5)
RegexpMatchEasy0_1K-8 1.86µs ± 0% 1.85µs ± 0% ~ (p=0.317 n=5+5)
RegexpMatchEasy1_32-8 250ns ± 1% 253ns ± 4% ~ (p=0.651 n=5+5)
RegexpMatchEasy1_1K-8 2.13µs ± 0% 2.15µs ± 0% +0.95% (p=0.008 n=5+5)
RegexpMatchMedium_32-8 290ns ± 1% 291ns ± 0% ~ (p=0.381 n=5+5)
RegexpMatchMedium_1K-8 79.0µs ± 0% 79.3µs ± 0% +0.36% (p=0.008 n=5+5)
RegexpMatchHard_32-8 4.38µs ± 2% 4.39µs ± 0% ~ (p=0.690 n=5+5)
RegexpMatchHard_1K-8 127µs ± 0% 130µs ± 0% +2.52% (p=0.016 n=5+4)
Revcomp-8 1.35s ± 1% 1.37s ± 0% +1.19% (p=0.032 n=5+5)
Template-8 160ms ± 1% 170ms ± 3% +6.27% (p=0.008 n=5+5)
TimeParse-8 779ns ± 0% 798ns ± 0% +2.46% (p=0.008 n=5+5)
TimeFormat-8 782ns ± 0% 806ns ± 1% +3.07% (p=0.008 n=5+5)
name old speed new speed delta
GobDecode-8 46.2MB/s ± 1% 42.6MB/s ± 2% -7.76% (p=0.008 n=5+5)
GobEncode-8 56.4MB/s ± 1% 48.4MB/s ± 1% -14.13% (p=0.008 n=5+5)
Gzip-8 23.0MB/s ± 0% 24.5MB/s ± 0% +6.44% (p=0.008 n=5+5)
Gunzip-8 231MB/s ± 0% 229MB/s ± 0% -0.79% (p=0.008 n=5+5)
JSONEncode-8 58.1MB/s ± 1% 54.5MB/s ± 0% -6.20% (p=0.008 n=5+5)
JSONDecode-8 12.9MB/s ± 1% 12.4MB/s ± 1% -3.71% (p=0.008 n=5+5)
GoParse-8 7.31MB/s ± 1% 7.24MB/s ± 0% -1.07% (p=0.008 n=5+5)
RegexpMatchEasy0_32-8 137MB/s ± 1% 136MB/s ± 2% ~ (p=0.841 n=5+5)
RegexpMatchEasy0_1K-8 551MB/s ± 0% 552MB/s ± 0% ~ (p=0.310 n=5+5)
RegexpMatchEasy1_32-8 128MB/s ± 1% 126MB/s ± 4% ~ (p=0.690 n=5+5)
RegexpMatchEasy1_1K-8 480MB/s ± 0% 475MB/s ± 0% -0.94% (p=0.008 n=5+5)
RegexpMatchMedium_32-8 3.44MB/s ± 1% 3.43MB/s ± 0% ~ (p=0.206 n=5+5)
RegexpMatchMedium_1K-8 13.0MB/s ± 0% 12.9MB/s ± 0% ~ (p=0.079 n=4+5)
RegexpMatchHard_32-8 7.30MB/s ± 2% 7.29MB/s ± 0% ~ (p=0.651 n=5+5)
RegexpMatchHard_1K-8 8.05MB/s ± 0% 7.85MB/s ± 0% -2.48% (p=0.016 n=5+4)
Revcomp-8 188MB/s ± 1% 186MB/s ± 0% -1.18% (p=0.032 n=5+5)
Template-8 12.2MB/s ± 1% 11.4MB/s ± 3% -5.89% (p=0.008 n=5+5)
Disable frame-pointer (GOEXPERIMENT=noframepointer)
name old time/op new time/op delta
BinaryTree17-8 6.53s ± 1% 6.60s ± 2% ~ (p=0.095 n=5+5)
Fannkuch11-8 6.34s ± 0% 6.36s ± 1% ~ (p=0.690 n=5+5)
FmtFprintfEmpty-8 101ns ± 1% 109ns ± 0% +8.01% (p=0.016 n=5+4)
FmtFprintfString-8 182ns ± 2% 186ns ± 1% +2.31% (p=0.016 n=5+5)
FmtFprintfInt-8 206ns ± 0% 204ns ± 1% -0.87% (p=0.000 n=4+5)
FmtFprintfIntInt-8 273ns ± 0% 272ns ± 0% -0.37% (p=0.016 n=4+5)
FmtFprintfPrefixedInt-8 374ns ± 1% 372ns ± 1% ~ (p=0.389 n=5+5)
FmtFprintfFloat-8 499ns ± 1% 503ns ± 0% ~ (p=0.095 n=5+5)
FmtManyArgs-8 1.37µs ± 2% 1.31µs ± 2% -3.85% (p=0.008 n=5+5)
GobDecode-8 16.6ms ± 1% 18.0ms ± 1% +8.30% (p=0.008 n=5+5)
GobEncode-8 13.6ms ± 1% 15.3ms ± 1% +12.10% (p=0.008 n=5+5)
Gzip-8 844ms ± 0% 842ms ± 0% -0.22% (p=0.016 n=5+5)
Gunzip-8 84.0ms ± 0% 83.7ms ± 0% -0.35% (p=0.032 n=5+5)
HTTPClientServer-8 121µs ± 1% 121µs ± 1% ~ (p=0.421 n=5+5)
JSONEncode-8 33.4ms ± 1% 34.2ms ± 0% +2.58% (p=0.008 n=5+5)
JSONDecode-8 151ms ± 1% 152ms ± 1% +0.83% (p=0.016 n=5+5)
Mandelbrot200-8 10.0ms ± 0% 10.0ms ± 0% +0.07% (p=0.008 n=5+5)
GoParse-8 7.92ms ± 1% 8.21ms ± 0% +3.62% (p=0.008 n=5+5)
RegexpMatchEasy0_32-8 233ns ± 1% 228ns ± 0% -2.31% (p=0.008 n=5+5)
RegexpMatchEasy0_1K-8 1.86µs ± 0% 1.86µs ± 0% ~ (p=0.730 n=5+5)
RegexpMatchEasy1_32-8 250ns ± 1% 246ns ± 0% -1.52% (p=0.008 n=5+5)
RegexpMatchEasy1_1K-8 2.13µs ± 0% 2.13µs ± 0% ~ (p=0.683 n=5+5)
RegexpMatchMedium_32-8 290ns ± 1% 289ns ± 0% -0.62% (p=0.032 n=5+5)
RegexpMatchMedium_1K-8 79.0µs ± 0% 79.9µs ± 0% +1.20% (p=0.008 n=5+5)
RegexpMatchHard_32-8 4.38µs ± 2% 4.30µs ± 0% ~ (p=0.127 n=5+5)
RegexpMatchHard_1K-8 127µs ± 0% 127µs ± 0% ~ (p=1.000 n=5+5)
Revcomp-8 1.35s ± 1% 1.35s ± 1% ~ (p=0.690 n=5+5)
Template-8 160ms ± 1% 165ms ± 0% +3.38% (p=0.008 n=5+5)
TimeParse-8 779ns ± 0% 787ns ± 1% +0.95% (p=0.024 n=5+5)
TimeFormat-8 782ns ± 0% 804ns ± 1% +2.89% (p=0.008 n=5+5)
name old speed new speed delta
GobDecode-8 46.2MB/s ± 1% 42.6MB/s ± 1% -7.67% (p=0.008 n=5+5)
GobEncode-8 56.4MB/s ± 1% 50.3MB/s ± 1% -10.79% (p=0.008 n=5+5)
Gzip-8 23.0MB/s ± 0% 23.0MB/s ± 0% +0.21% (p=0.024 n=5+5)
Gunzip-8 231MB/s ± 0% 232MB/s ± 0% +0.35% (p=0.032 n=5+5)
JSONEncode-8 58.1MB/s ± 1% 56.7MB/s ± 0% -2.52% (p=0.008 n=5+5)
JSONDecode-8 12.9MB/s ± 1% 12.8MB/s ± 1% -0.82% (p=0.024 n=5+5)
GoParse-8 7.31MB/s ± 1% 7.06MB/s ± 0% -3.53% (p=0.008 n=5+5)
RegexpMatchEasy0_32-8 137MB/s ± 1% 140MB/s ± 0% +2.46% (p=0.008 n=5+5)
RegexpMatchEasy0_1K-8 551MB/s ± 0% 551MB/s ± 0% ~ (p=0.690 n=5+5)
RegexpMatchEasy1_32-8 128MB/s ± 1% 130MB/s ± 0% +1.62% (p=0.008 n=5+5)
RegexpMatchEasy1_1K-8 480MB/s ± 0% 480MB/s ± 0% ~ (p=0.460 n=5+5)
RegexpMatchMedium_32-8 3.44MB/s ± 1% 3.46MB/s ± 0% +0.64% (p=0.016 n=5+4)
RegexpMatchMedium_1K-8 13.0MB/s ± 0% 12.8MB/s ± 0% -1.16% (p=0.029 n=4+4)
RegexpMatchHard_32-8 7.30MB/s ± 2% 7.44MB/s ± 0% ~ (p=0.238 n=5+5)
RegexpMatchHard_1K-8 8.05MB/s ± 0% 8.05MB/s ± 0% ~ (p=0.913 n=5+5)
Revcomp-8 188MB/s ± 1% 188MB/s ± 1% ~ (p=0.571 n=5+5)
Template-8 12.2MB/s ± 1% 11.8MB/s ± 0% -3.27% (p=0.008 n=5+5)
Frame-pointer is enabled by default but can be disabled by setting: GOEXPERIMENT=noframepointer.
Fixes #10110
Fixes #12100
Change-Id: I1bfaca6dba29a63009d7c6ab04ed7a1413d9479e
---
M misc/cgo/test/issue9400/asm_arm64.s
M src/cmd/compile/internal/arm64/ggen.go
M src/cmd/compile/internal/arm64/ssa.go
M src/cmd/compile/internal/gc/asm_test.go
M src/cmd/compile/internal/gc/pgen.go
M src/cmd/internal/obj/arm64/asm7.go
M src/cmd/internal/obj/arm64/obj7.go
M src/cmd/internal/obj/link.go
M src/cmd/internal/objabi/util.go
M src/cmd/link/internal/ld/dwarf.go
M src/cmd/vet/all/whitelist/darwin_arm64.txt
M src/cmd/vet/asmdecl.go
M src/reflect/asm_arm64.s
M src/runtime/asm_arm64.s
M src/runtime/atomic_arm64.s
M src/runtime/cgo/asm_arm64.s
M src/runtime/cgo/signal_darwin_arm64.s
M src/runtime/cgocall.go
M src/runtime/duff_arm64.s
M src/runtime/internal/atomic/asm_arm64.s
M src/runtime/internal/atomic/atomic_arm64.s
M src/runtime/internal/sys/arch_arm64.go
M src/runtime/memmove_arm64.s
M src/runtime/mgcmark.go
M src/runtime/mkduff.go
M src/runtime/rt0_android_arm64.s
M src/runtime/rt0_darwin_arm64.s
M src/runtime/rt0_linux_arm64.s
M src/runtime/signal_arm64.go
M src/runtime/stack.go
M src/runtime/sys_darwin_arm64.s
M src/runtime/sys_linux_arm64.s
M src/runtime/traceback.go
33 files changed, 569 insertions(+), 424 deletions(-)
To view, visit change 61511. To unsubscribe, or for help writing mail filters, visit settings.
Any more comments?
2 comments:
Patch Set #1, Line 753: p = obj.Appendp(p, c.newprog)
Maintaining FP for accurate profiling. Otherwise, FP won't be correct after returning to caller.
Done
File src/runtime/cgo/asm_arm64.s:
Patch Set #1, Line 28: // | FP, LR |
FP and LR are just place holders so that callee can get correct parameters.
Done
To view, visit change 61511. To unsubscribe, or for help writing mail filters, visit settings.
7 comments:
File src/cmd/compile/internal/arm64/ggen.go:
Could we change 16 to gc.Ctxt.FixedFrameSize()? Also below.
File src/cmd/compile/internal/arm64/ssa.go:
Patch Set #3, Line 579: if objabi.Framepointer_enabled(objabi.GOOS, objabi.GOARCH) {
For AMD64 we do this in the assembler backend (https://go.googlesource.com/go/+/510327012bd42aca3deac989e2e109dc71bb4605/src/cmd/internal/obj/x86/asm6.go#4029). Can we also do it there?
File src/cmd/internal/obj/arm64/obj7.go:
Patch Set #1, Line 753: p = obj.Appendp(p, c.newprog)
Done
I still don't understand this. FP should be restored from the saved value on stack, right?
File src/cmd/internal/obj/arm64/obj7.go:
Patch Set #3, Line 595: if objabi.Framepointer_enabled(objabi.GOOS, objabi.GOARCH) {
This should be before the update of SP?
Patch Set #3, Line 618: if objabi.Framepointer_enabled(objabi.GOOS, objabi.GOARCH) {
So is this. This should probably be ADD $framesize, SP, FP ?
Patch Set #3, Line 258: ptrSize
intSize and ptrSize are equal, but other lines use intSize, so let's do the same here.
File src/runtime/cgo/asm_arm64.s:
Patch Set #1, Line 28: // | FP, LR |
Done
I mean, why not save FP and LR here, and remove R29 and R30 above?
To view, visit change 61511. To unsubscribe, or for help writing mail filters, visit settings.
Wei Xiao uploaded patch set #4 to this change.
build: support frame-pointer for arm64
Supporting frame-pointer makes Linux's perf and other profilers much more useful
because it lets them gather a stack trace efficiently on profiling events. Major
changes include:
1. save FP&LR (LR above FP) at prologue and restore them at epilogue via STP/LDP instructions
2. change frame size alignment to 16 bytes (original: 8 mod 16)
3. adjust offset and frame size in *.s, analysis and wrapper code
4. adjust DWARF information to support FP
5. add support to FP in goroutine scheduler
6. adjust SSA arm64 backend to support FP
7. adjust signal handler according to frame layout changes
8. adjust cgo trampoline functions according to frame layout changes
9. change Darwin and Android runtime for enabling frame-pointer
Performance impacts on go1 benchmarks:
Jira: ENTLLT-523
---
M misc/cgo/test/issue9400/asm_arm64.s
M src/cmd/compile/internal/arm64/ggen.go
M src/cmd/compile/internal/gc/asm_test.go
M src/cmd/compile/internal/gc/pgen.go
M src/cmd/internal/obj/arm64/asm7.go
M src/cmd/internal/obj/arm64/obj7.go
M src/cmd/internal/obj/link.go
M src/cmd/internal/objabi/util.go
M src/cmd/link/internal/ld/dwarf.go
M src/cmd/vet/all/whitelist/darwin_arm64.txt
M src/cmd/vet/asmdecl.go
M src/reflect/asm_arm64.s
M src/runtime/asm_arm64.s
M src/runtime/atomic_arm64.s
M src/runtime/cgo/asm_arm64.s
M src/runtime/cgo/signal_darwin_arm64.s
M src/runtime/cgocall.go
M src/runtime/duff_arm64.s
M src/runtime/internal/atomic/asm_arm64.s
M src/runtime/internal/atomic/atomic_arm64.s
M src/runtime/internal/sys/arch_arm64.go
M src/runtime/memmove_arm64.s
M src/runtime/mgcmark.go
M src/runtime/mkduff.go
M src/runtime/rt0_android_arm64.s
M src/runtime/rt0_darwin_arm64.s
M src/runtime/rt0_linux_arm64.s
M src/runtime/signal_arm64.go
M src/runtime/stack.go
M src/runtime/sys_darwin_arm64.s
M src/runtime/sys_linux_arm64.s
M src/runtime/traceback.go
M test/nosplit.go
33 files changed, 576 insertions(+), 413 deletions(-)
To view, visit change 61511. To unsubscribe, or for help writing mail filters, visit settings.
7 comments:
Could we change 16 to gc.Ctxt.FixedFrameSize()? Also below.
Done
File src/cmd/compile/internal/arm64/ssa.go:
Patch Set #3, Line 579: p := s.Prog(obj.ADUFFZERO)
For AMD64 we do this in the assembler backend (https://go.googlesource. […]
Done
File src/cmd/internal/obj/arm64/obj7.go:
Patch Set #1, Line 753: p = obj.Appendp(p, c.newprog)
I still don't understand this. […]
No. For RET of leaf function with autosize > 0, just need to "ADD autosize, RSP, RSP" and then let: FP = RSP (no need to restore FP & LR from stack (with LDP) since LR is unchanged). Although FP & LR are saved on stack at the prologue, it's just for traceback and stack unwinding purpose.
File src/cmd/internal/obj/arm64/obj7.go:
Patch Set #3, Line 595: if objabi.Framepointer_enabled(objabi.GOOS, objabi.GOARCH) {
This should be before the update of SP?
FP always points to the start of current frame (i.e updated RSP)
Patch Set #3, Line 618: if objabi.Framepointer_enabled(objabi.GOOS, objabi.GOARCH) {
So is this. […]
"ADD $framesize, SP, FP" will make FP point to previous frame.
Patch Set #3, Line 258: f.intSi
intSize and ptrSize are equal, but other lines use intSize, so let's do the same here.
Patch Set #1, Line 28: // | FP, LR |
I mean, why not save FP and LR here, and remove R29 and R30 above?
We can't save FP and LR here since the 2 slots will later be used by callees (line 57, 58) to save/restore its FP&LR. Lines: 20-24 are for C ABI and lines: 26-28 are for Go ABI. We can't mix them up.
To view, visit change 61511. To unsubscribe, or for help writing mail filters, visit settings.
2 comments:
Patch Set #3, Line 595: if objabi.Framepointer_enabled(objabi.GOOS, objabi.GOARCH) {
FP always points to the start of current frame (i. […]
This is very confusing. As far as I understand how frame pointer works in general, it should point to the top of the current frame, or the bottom of the caller's frame (whereas the stack pointer points to the bottom of the current frame). This is also how Go AMD64 toolchain uses frame pointer.
Does your change work in completely different way? I think we need more explanation.
File src/runtime/cgo/asm_arm64.s:
Patch Set #1, Line 28: // | FP, LR |
used by callees (line 57, 58)
Why? The Callee should save its FP and LR in its own frame, and it should never change anything in this frame (except the return value slots, but those functions return no value).
To view, visit change 61511. To unsubscribe, or for help writing mail filters, visit settings.
2 comments:
Patch Set #3, Line 595: if objabi.Framepointer_enabled(objabi.GOOS, objabi.GOARCH) {
This is very confusing. […]
The implementation is according to Procedure Call Standard for the ARM 64-bit Architecture (http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055b/IHI0055B_aapcs64.pdf) 5.2.3 The Frame Pointer: The frame record for the innermost frame (belonging
to the most recent routine invocation) shall be pointed to by the Frame Pointer register (FP).
Also, our FP implementation is the same as that of GCC.
File src/runtime/cgo/asm_arm64.s:
Patch Set #1, Line 28: // | FP, LR |
> used by callees (line 57, 58) […]
I think the callee is not a Go function. Instead, its callee is a CGO callback (i.e C function). Since current function won't save and restore FP&LR due to its negative frame size ($-16) we need to save them as part of C API context.
To view, visit change 61511. To unsubscribe, or for help writing mail filters, visit settings.
R=go1.11
Wei Xiao uploaded patch set #5 to this change.
build: support frame-pointer for arm64
Supporting frame-pointer makes Linux's perf and other profilers much more useful
because it lets them gather a stack trace efficiently on profiling events. Major
changes include:
1. save FP&LR (LR above FP) at prologue and restore them at epilogue via STP/LDP instructions
2. change frame size alignment to 16 bytes (original: 8 mod 16)
3. adjust offset and frame size in *.s, analysis and wrapper code
4. adjust DWARF information to support FP
5. add support to FP in goroutine scheduler
6. adjust SSA arm64 backend to support FP
7. adjust signal handler according to frame layout changes
8. adjust cgo trampoline functions according to frame layout changes
9. change Darwin and Android runtime for enabling frame-pointer
Performance impacts on go1 benchmarks:
---
M src/cmd/compile/internal/arm64/ggen.go
M src/cmd/compile/internal/gc/asm_test.go
M src/cmd/compile/internal/gc/pgen.go
M src/cmd/internal/obj/arm64/asm7.go
M src/cmd/internal/obj/arm64/obj7.go
M src/cmd/internal/obj/link.go
M src/cmd/internal/objabi/util.go
M src/cmd/link/internal/ld/dwarf.go
M src/cmd/vet/all/whitelist/darwin_arm64.txt
M src/cmd/vet/asmdecl.go
M src/internal/bytealg/compare_arm64.s
M src/internal/bytealg/equal_arm64.s
M src/reflect/asm_arm64.s
M src/runtime/asm_arm64.s
M src/runtime/cgo/asm_arm64.s
M src/runtime/cgo/signal_darwin_arm64.s
M src/runtime/cgocall.go
M src/runtime/internal/sys/arch_arm64.go
M src/runtime/mgcmark.go
M src/runtime/rt0_darwin_arm64.s
M src/runtime/rt0_linux_arm64.s
M src/runtime/signal_arm64.go
M src/runtime/stack.go
M src/runtime/sys_darwin_arm64.s
M src/runtime/sys_linux_arm64.s
M src/runtime/traceback.go
M test/nosplit.go
M test/retjmp.dir/a.s
28 files changed, 570 insertions(+), 405 deletions(-)
To view, visit change 61511. To unsubscribe, or for help writing mail filters, visit settings.
I would be appreciated if you can proceed reviewing this CL.
Maintaining the CL is really costly since your guys are changing runtime constantly.
Patch Set 5:
I would be appreciated if you can proceed reviewing this CL.
Maintaining the CL is really costly since your guys are changing runtime constantly.
Sorry, there are some things that I still don't quite understand. If the FP and the SP are almost always equal, what is the point to use two registers? Wouldn't it also work that we simply save SP at function entry? Does any tool actually look at the FP register, or it only looks at what is saved on stack?
Also, I tried compiling C program using Clang for darwin/arm64. Apparently it doesn't use the frame pointer this way. There, R29 is pointing to the top of the frame, instead of the bottom. And this is also what is saved on the callee's stack slot.
Patch Set 5:
Patch Set 5:
I would be appreciated if you can proceed reviewing this CL.
Maintaining the CL is really costly since your guys are changing runtime constantly.Sorry, there are some things that I still don't quite understand. If the FP and the SP are almost always equal, what is the point to use two registers? Wouldn't it also work that we simply save SP at function entry? Does any tool actually look at the FP register, or it only looks at what is saved on stack?
Also, I tried compiling C program using Clang for darwin/arm64. Apparently it doesn't use the frame pointer this way. There, R29 is pointing to the top of the frame, instead of the bottom. And this is also what is saved on the callee's stack slot.
AFAIK, perf and gdb can look at the FP register if DWARF info is stripped. But these tools don't look at the SP register to do efficient unwinding since the standard (Procedure Call Standard for the ARM 64-bit Architecture) doesn't say SP also points to the head of frame record chain (I admit that SP is almost equal to FP for Go but It may not be true for other languages). This is why we need to set the dedicated register FP and save an additional pointer (start address of previous frame) in the frame. Enabling FP has at least 2 purposes: 1. make profiling more efficient (just accessing the frame record chain without consulting DWARF which is saved on disk) 2. make debugger backtrace still work even when DWARF is stripped (Released product may strip DWARF).
I don't have darwin/arm64 environment. But for GCC on linux/arm64, R29 points at the start of current frame. One simple example is as below:
$ cat hello.c
int foo1(int a, int b);
int foo2(int a, int b)
{
return foo1(b, a);
}
$ gcc -fno-omit-frame-pointer -S hello.c
$ cat hello.s
.arch armv8-a
.file "hello.c"
.text
.align 2
.global foo2
.type foo2, %function
foo2:
stp x29, x30, [sp, -32]!
add x29, sp, 0
str w0, [x29, 28]
str w1, [x29, 24]
ldr w1, [x29, 28]
ldr w0, [x29, 24]
bl foo1
ldp x29, x30, [sp], 32
ret
.size foo2, .-foo2
.ident "GCC: (Ubuntu/Linaro 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609"
.section .note.GNU-stack,"",@progbits
I think the implementation of both Clang for darwin/arm64 and GCC for linux/arm64 are right since it's unimportant whether FP points to bottom or top of the frame. The only important thing for unwinding based on FP is: "The frame record for the innermost frame (belonging to the most recent routine invocation) shall be pointed to by the Frame Pointer register (FP). The lowest addressed double-word shall point to the previous frame record and the highest addressed double-word shall contain the value passed in LR on entry to the current function." (Procedure Call Standard for the ARM 64-bit Architecture 5.2.3).
> > AFAIK, perf and gdb can look at the FP register if DWARF info is stripped. But these tools don't look at the SP register to do efficient unwinding since the standard (Procedure Call Standard for the ARM 64-bit Architecture) doesn't say SP also points to the head of frame record chain (I admit that SP is almost equal to FP for Go but It may not be true for other languages). This is why we need to set the dedicated register FP and save an additional pointer (start address of previous frame) in the frame. Enabling FP has at least 2 purposes: 1. make profiling more efficient (just accessing the frame record chain without consulting DWARF which is saved on disk) 2. make debugger backtrace still work even when DWARF is stripped (Released product may strip DWARF).
> >
I think the implementation of both Clang for darwin/arm64 and GCC for linux/arm64 are right since it's unimportant whether FP points to bottom or top of the frame. The only important thing for unwinding based on FP is: "The frame record for the innermost frame (belonging to the most recent routine invocation) shall be pointed to by the Frame Pointer register (FP). The lowest addressed double-word shall point to the previous frame record and the highest addressed double-word shall contain the value passed in LR on entry to the current function." (Procedure Call Standard for the ARM 64-bit Architecture 5.2.3).
Yes, I also see that GCC on Linux does want you described. And I agree that both are correct. But I think we would rather not have two ARM64 Go ABIs. Does one of them work on both platforms, which I mean, the tools on both platforms understand? Per https://groups.google.com/d/topic/golang-dev/v1TxCJNemPY/discussion, it seems the use of frame pointer in CL is not understood by darwin tools. What about the other way?
1. make profiling more efficient (just accessing the frame record chain without consulting DWARF which is saved on disk)
I guess you meant profiling with Linux perf or something other than pprof? I believe pprof looks at Go's specific symbol table and PC-SP table. (Although there is probably a proposal to use frame pointers.)
2. make debugger backtrace still work even when DWARF is stripped (Released product may strip DWARF).
For stripped binary many things in the debugger don't work. I'm not so worried on this.
Wei Xiao uploaded patch set #6 to this change.
M src/cmd/compile/internal/gc/pgen.go
M src/cmd/internal/obj/arm64/asm7.go
M src/cmd/internal/obj/arm64/obj7.go
M src/cmd/internal/obj/link.go
M src/cmd/internal/objabi/util.go
M src/cmd/link/internal/ld/dwarf.go
M src/cmd/vet/all/whitelist/darwin_arm64.txt
M src/cmd/vet/asmdecl.go
M src/internal/bytealg/equal_arm64.s
M src/reflect/asm_arm64.s
M src/runtime/asm_arm64.s
M src/runtime/cgo/asm_arm64.s
M src/runtime/cgo/signal_darwin_arm64.s
M src/runtime/cgocall.go
M src/runtime/internal/sys/arch_arm64.go
M src/runtime/mgcmark.go
M src/runtime/rt0_darwin_arm64.s
M src/runtime/rt0_linux_arm64.s
M src/runtime/signal_arm64.go
M src/runtime/stack.go
M src/runtime/sys_darwin_arm64.s
M src/runtime/sys_linux_arm64.s
M src/runtime/traceback.go
M test/codegen/stack.go
M test/nosplit.go
M test/retjmp.dir/a.s
27 files changed, 563 insertions(+), 398 deletions(-)
To view, visit change 61511. To unsubscribe, or for help writing mail filters, visit settings.
Wei Xiao uploaded patch set #7 to this change.
27 files changed, 563 insertions(+), 396 deletions(-)
To view, visit change 61511. To unsubscribe, or for help writing mail filters, visit settings.
Patch set 7:Run-TryBot +1
TryBots beginning. Status page: https://farmer.golang.org/try?commit=83261479
TryBots are happy.
Patch set 7:TryBot-Result +1
Hi Cherry,
For some reason I'm unable to access darwin/arm64 platform so far and It's hard for me to identify the root-cause of the wrong stack trace (https://groups.google.com/d/topic/golang-dev/v1TxCJNemPY/discussion). But the wrong stack trace is more likely related to DWARF of arm64 according to the latest update (https://github.com/golang/go/issues/22716). Besides, in my opinion, most parts of stack trace unwinding shared the same code path between linux and darwin form arm64 and we don't have the wrong stack trace issue in linux/arm64 so the issue in darwin is likely unrelated to FP (frame-pointer).
I'm also experimenting on putting the FP at the top of frame just as you suggested but find that there are some disadvantages of putting FP at the top of frame rather than at the bottom of frame.
1. it requires 1 more instruction and will impact both binary size and performance:
FP at the bottom:
1. STP.w (FP, LR), -OFFSET(SP)
2. MOVD SP, FP
FP at the top:
1. STP.W (FP, LR), -16(SP)
2. MOVD SP, FP
3. SUB $(OFFSET-16), SP
2. We need to complicate trace-back logicals (espacially func: gentraceback). Currently, trace-back divides architure into 2 catogries via var: usesLR. If we put FP at the top like 386/amd64 architecture then arm64 is going to be treated as "!usesLR" (because sys.MinFrameSize=0). But arm64 is essentially architecture with LR register and LR value may not be pushed into stack (e.g leaf function) for optimization. So we need to complicate trace-back logicals which is designed to handle 386/amd64 which always push return address automatically into stack and the changes may incurr some tricky bugs.
I think it's doable for arm64 to put FP at the top of frame just like amd64 has done. But the changes may be big and touch many basic features designed for architecture with LR register. The benefit should be larger on architecture with LR to put FP at the bottom of frame.
Patch Set 7:
Hi Cherry,
For some reason I'm unable to access darwin/arm64 platform so far and It's hard for me to identify the root-cause of the wrong stack trace (https://groups.google.com/d/topic/golang-dev/v1TxCJNemPY/discussion). But the wrong stack trace is more likely related to DWARF of arm64 according to the latest update (https://github.com/golang/go/issues/22716). Besides, in my opinion, most parts of stack trace unwinding shared the same code path between linux and darwin form arm64 and we don't have the wrong stack trace issue in linux/arm64 so the issue in darwin is likely unrelated to FP (frame-pointer).
I'm also experimenting on putting the FP at the top of frame just as you suggested but find that there are some disadvantages of putting FP at the top of frame rather than at the bottom of frame.
1. it requires 1 more instruction and will impact both binary size and performance:
FP at the bottom:
1. STP.w (FP, LR), -OFFSET(SP)
2. MOVD SP, FP
FP at the top:
1. STP.W (FP, LR), -16(SP)
2. MOVD SP, FP
3. SUB $(OFFSET-16), SP2. We need to complicate trace-back logicals (espacially func: gentraceback). Currently, trace-back divides architure into 2 catogries via var: usesLR. If we put FP at the top like 386/amd64 architecture then arm64 is going to be treated as "!usesLR" (because sys.MinFrameSize=0). But arm64 is essentially architecture with LR register and LR value may not be pushed into stack (e.g leaf function) for optimization. So we need to complicate trace-back logicals which is designed to handle 386/amd64 which always push return address automatically into stack and the changes may incurr some tricky bugs.
I think it's doable for arm64 to put FP at the top of frame just like amd64 has done. But the changes may be big and touch many basic features designed for architecture with LR register. The benefit should be larger on architecture with LR to put FP at the bottom of frame.
Thank you for the experiments! I agree that doing what Clang does, i.e. putting the FP at the top of the frame, doesn't play well with the rest of Go runtime, and make things complicated. This is probably not what we should do.
What about letting FP *pointing* to the top of the current frame, but still storing it at the bottom? Specifically,
args
caller's LR
caller's FP
------------------- <-- (FP) points here
locals
...
args for callee
saved LR
saved FP
------------------- <-- (SP) points here
If I understand correctly this is also ABI conforming. This is also consistent with how I (maybe others) understand how frame pointer works.
This can be done with two instructions (for small frames):
STP.W (FP, LR), -framesize(SP)
ADD $framesize, SP, FP
I think this also plays well with Go's traceback, except the need of telling it LR is saved at 8(SP) instead of 0(SP), which you already did.
Does this mechanism work with Darwin or Linux? In particular, for the case in Issue #22716, I guess the "looping" stack traces are resulted from that the FP is pointing to the current frame, so it thinks the function is called by itself. If FP points to the caller frame, I guess it may help.
Patch Set 7:
I just experimented your suggestion of letting FP point to the top of the current frame and find that it causes perf (Linux profiling tool) fail to record the second-to-last frame. I have created a simple test which has a calling chain: main->foo1->foo2 and there are some hotspot loops inside foo2. I use following command to record call-graph based on FP:
$ perf record --call-graph fp ./test_perf
Then display the call-graph with following command:
$ perf report --call-graph --stdio
If FP points to the bottom of the current frame, we get following result:
99.46% 99.46% test_perf test_perf [.] main.foo2
|
---runtime.main
main.main
main.foo1
main.foo2
If FP points to the top of the current frame, we get following result:
99.60% 99.60% test_perf test_perf [.] main.foo2
|
---runtime.goexit
main.main
main.foo2
Obviously, foo1 is lost if FP points to the top of the current frame.
When profiling foo2, perf try to guess its caller address via "*(FP+8)" and its caller frame via "*(FP)". If FP points to the top of foo2 frame then FP also points to the bottom of foo1 frame and perf will unwind to main frame via "*(FP+8)" and "*(FP)", which skips the real caller foo1.
Go's traceback doesn't depend on FP and I'm enabling FP mainly for the better support to external tool such as Linux perf on arm64 and FP has been enabled for amd64 about two years ago. For server side program perform analysis, Linux perf and its flame graph are really helpful and many internet companies mentioned these tools in last week Gopher china conference (I also met your teammate Marcel there).
If you think letting FP point to the top of the current frame can help debug Issue #22716 I can submit my local changes. But FP should point to the bottom of the current frame indeed.
Wei Xiao uploaded patch set #8 to this change.
M src/runtime/cgo/gcc_signal_darwin_armx.c
M src/runtime/cgo/signal_darwin_arm64.s
M src/runtime/cgocall.go
M src/runtime/internal/sys/arch_arm64.go
M src/runtime/mgcmark.go
M src/runtime/rt0_darwin_arm64.s
M src/runtime/rt0_linux_arm64.s
M src/runtime/signal_arm64.go
M src/runtime/stack.go
M src/runtime/sys_darwin_arm64.s
M src/runtime/sys_linux_arm64.s
M src/runtime/traceback.go
M test/codegen/stack.go
M test/nosplit.go
M test/retjmp.dir/a.s
28 files changed, 571 insertions(+), 397 deletions(-)
To view, visit change 61511. To unsubscribe, or for help writing mail filters, visit settings.
Patch set 8:Run-TryBot +1Code-Review +1
TryBots beginning. Status page: https://farmer.golang.org/try?commit=ba72508b
TryBots are happy.
Patch set 8:TryBot-Result +1
Any more comments? We hope that frame-pointer can be enabled for arm64 in 1.11.
It is not surprising that a frame is missing due to FP set to caller's frame. I was hoping perf can look at SP if SP is not equal to FP, but it seems not the case. So I guess we have to go with this weird ABI then. Kind of unfortunate.
Seems this doesn't work on darwin/arm64. So probably we only enable frame pointer on linux/arm64? If we add support to other OSes, like *bsd/arm64 or windows/arm64, I don't know what FP usage those platforms expect, and this may or may not work there. My concern is that I think we really don't want to have multiple frame pointer ABI on ARM64.
How much does it affect user assembly code? I'm hoping we don't break users too much. The FP register is already supposed to be reserved, so this is fine. But the frame sizes and offsets need to change. Anything else? Can we provide an automated tool (go fix) that fixes the majority of user assembly code?
3 comments:
Patch Set #9, Line 22: Performance impacts on go1 benchmarks:
Can you also show the geomean (benchstat -geomean)?
Patch Set #9, Line 133: Fixes #12100
Apparently it doesn't help darwin/arm64. Remove this.
Patch Set #7, Line 5: TEXT ·f(SB), 4, $0-0
It was intentional for this function to have non-zero frame size. Change to 16 if you need alignment.
To view, visit change 61511. To unsubscribe, or for help writing mail filters, visit settings.
Patch Set 9:
Seems this doesn't work on darwin/arm64. So probably we only enable frame pointer on linux/arm64? If we add support to other OSes, like *bsd/arm64 or windows/arm64, I don't know what FP usage those platforms expect, and this may or may not work there. My concern is that I think we really don't want to have multiple frame pointer ABI on ARM64.
With CL 108295 enabling DWARF on darwin/arm64, are there still issues on darwin/arm64 with this CL? Judging by https://github.com/golang/go/issues/22716, Xcode uses DWARF information and not the FP to construct backtraces. Is there another test that relies on maintaining FP that this CL fixes?
Another idea (from discussion with Austin): Does it work if we put the saved FP below where SP is pointing to? I.e.
locals
...
args for callee
saved LR <-- (SP) points here
saved FP <-- (FP) points here
This way, if Linux perf only looks at FP, not SP, it still sees the chain FPs with all frames. At the same time, 0(SP) is still the saved LR, and all the offsets of locals are unchanged. So we can minimize the affect on assembly code. We may be also able to let the assembler transparently round up the frame size to a multiple of 16, so the frame size in the assembly code doesn't need to change, either. When we grow the stack, we need to reserved the extra word below SP, but the assembler and the runtime should be able to handle it.
Does perf work with this layout?
Patch Set 9:
Patch Set 9:
Seems this doesn't work on darwin/arm64. So probably we only enable frame pointer on linux/arm64? If we add support to other OSes, like *bsd/arm64 or windows/arm64, I don't know what FP usage those platforms expect, and this may or may not work there. My concern is that I think we really don't want to have multiple frame pointer ABI on ARM64.
With CL 108295 enabling DWARF on darwin/arm64, are there still issues on darwin/arm64 with this CL? Judging by https://github.com/golang/go/issues/22716, Xcode uses DWARF information and not the FP to construct backtraces. Is there another test that relies on maintaining FP that this CL fixes?
I'm not sure whether this CL will cause problem on darwin/arm64. But apparently it doesn't help either.
Wei Xiao uploaded patch set #10 to this change.
build: support frame-pointer for arm64
Supporting frame-pointer makes Linux's perf and other profilers much more useful
because it lets them gather a stack trace efficiently on profiling events. Major
changes include:
1. save FP&LR (LR above FP) at prologue and restore them at epilogue via STP/LDP instructions
2. change frame size alignment to 16 bytes (original: 8 mod 16)
3. adjust offset and frame size in *.s, analysis and wrapper code
4. adjust DWARF information to support FP
5. add support to FP in goroutine scheduler
6. adjust SSA arm64 backend to support FP
7. adjust signal handler according to frame layout changes
8. adjust cgo trampoline functions according to frame layout changes
9. change Darwin and Android runtime for enabling frame-pointer
Performance impacts on go1 benchmarks:
Enable frame-pointer (by default)
name old time/op new time/op delta
BinaryTree17-8 6.25s ± 1% 6.45s ± 1% +3.09% (p=0.008 n=5+5)
Fannkuch11-8 6.03s ± 0% 5.74s ± 0% -4.78% (p=0.016 n=5+4)
FmtFprintfEmpty-8 91.6ns ± 0% 94.7ns ± 1% +3.30% (p=0.008 n=5+5)
FmtFprintfString-8 175ns ± 2% 177ns ± 2% ~ (p=0.206 n=5+5)
FmtFprintfInt-8 204ns ± 1% 201ns ± 1% ~ (p=0.063 n=5+5)
FmtFprintfIntInt-8 285ns ± 2% 289ns ± 2% ~ (p=0.063 n=5+5)
FmtFprintfPrefixedInt-8 428ns ± 1% 424ns ± 1% ~ (p=0.143 n=5+5)
FmtFprintfFloat-8 479ns ± 2% 496ns ± 3% +3.68% (p=0.016 n=5+5)
FmtManyArgs-8 1.39µs ± 0% 1.35µs ± 1% -2.51% (p=0.008 n=5+5)
GobDecode-8 15.3ms ± 0% 16.7ms ± 1% +9.17% (p=0.008 n=5+5)
GobEncode-8 12.5ms ± 1% 14.0ms ± 2% +12.14% (p=0.008 n=5+5)
Gzip-8 621ms ± 0% 619ms ± 0% -0.43% (p=0.008 n=5+5)
Gunzip-8 73.0ms ± 0% 73.7ms ± 0% +0.86% (p=0.008 n=5+5)
HTTPClientServer-8 116µs ± 1% 115µs ± 1% -1.27% (p=0.032 n=5+5)
JSONEncode-8 31.0ms ± 0% 31.1ms ± 0% ~ (p=0.421 n=5+5)
JSONDecode-8 140ms ± 1% 144ms ± 1% +2.86% (p=0.008 n=5+5)
Mandelbrot200-8 9.67ms ± 0% 9.67ms ± 0% ~ (p=0.841 n=5+5)
GoParse-8 7.58ms ± 0% 7.66ms ± 1% +1.07% (p=0.008 n=5+5)
RegexpMatchEasy0_32-8 180ns ± 8% 195ns ± 6% ~ (p=0.183 n=5+5)
RegexpMatchEasy0_1K-8 673ns ± 1% 677ns ± 2% ~ (p=0.738 n=5+5)
RegexpMatchEasy1_32-8 166ns ±12% 196ns ±11% +18.24% (p=0.016 n=5+5)
RegexpMatchEasy1_1K-8 1.02µs ± 4% 1.06µs ± 2% ~ (p=0.056 n=5+5)
RegexpMatchMedium_32-8 285ns ± 3% 322ns ± 5% +12.82% (p=0.008 n=5+5)
RegexpMatchMedium_1K-8 99.1µs ± 7% 90.6µs ± 6% -8.56% (p=0.032 n=5+5)
RegexpMatchHard_32-8 4.02µs ± 1% 4.45µs ± 7% +10.63% (p=0.008 n=5+5)
RegexpMatchHard_1K-8 121µs ± 6% 122µs ± 0% ~ (p=0.151 n=5+5)
Revcomp-8 1.07s ± 0% 1.07s ± 1% ~ (p=0.222 n=5+5)
Template-8 160ms ± 0% 165ms ± 1% +3.10% (p=0.008 n=5+5)
TimeParse-8 766ns ± 0% 749ns ± 1% -2.31% (p=0.016 n=4+5)
TimeFormat-8 745ns ± 1% 738ns ± 0% -0.86% (p=0.016 n=5+5)
[Geo mean] 116µs 119µs +2.40%
name old speed new speed delta
GobDecode-8 50.0MB/s ± 0% 45.9MB/s ± 1% -8.38% (p=0.008 n=5+5)
GobEncode-8 61.3MB/s ± 1% 54.7MB/s ± 2% -10.83% (p=0.008 n=5+5)
Gzip-8 31.2MB/s ± 0% 31.4MB/s ± 0% +0.44% (p=0.008 n=5+5)
Gunzip-8 266MB/s ± 0% 263MB/s ± 0% -0.85% (p=0.008 n=5+5)
JSONEncode-8 62.6MB/s ± 0% 62.4MB/s ± 0% ~ (p=0.381 n=5+5)
JSONDecode-8 13.9MB/s ± 1% 13.5MB/s ± 1% -2.78% (p=0.008 n=5+5)
GoParse-8 7.64MB/s ± 0% 7.56MB/s ± 1% -1.05% (p=0.008 n=5+5)
RegexpMatchEasy0_32-8 178MB/s ± 8% 164MB/s ± 6% ~ (p=0.222 n=5+5)
RegexpMatchEasy0_1K-8 1.52GB/s ± 1% 1.51GB/s ± 2% ~ (p=0.841 n=5+5)
RegexpMatchEasy1_32-8 193MB/s ±11% 164MB/s ±12% -15.44% (p=0.016 n=5+5)
RegexpMatchEasy1_1K-8 1.01GB/s ± 4% 0.96GB/s ± 2% ~ (p=0.056 n=5+5)
RegexpMatchMedium_32-8 3.50MB/s ± 3% 3.11MB/s ± 5% -11.21% (p=0.008 n=5+5)
RegexpMatchMedium_1K-8 10.4MB/s ± 7% 11.3MB/s ± 6% +9.12% (p=0.032 n=5+5)
RegexpMatchHard_32-8 7.96MB/s ± 1% 7.21MB/s ± 7% -9.33% (p=0.008 n=5+5)
RegexpMatchHard_1K-8 8.49MB/s ± 6% 8.37MB/s ± 0% ~ (p=0.127 n=5+5)
Revcomp-8 238MB/s ± 0% 237MB/s ± 1% ~ (p=0.222 n=5+5)
Template-8 12.1MB/s ± 0% 11.7MB/s ± 1% -2.98% (p=0.008 n=5+5)
[Geo mean] 49.4MB/s 47.4MB/s -4.18%
Disable frame-pointer (GOEXPERIMENT=noframepointer)
name old time/op new time/op delta
BinaryTree17-8 6.25s ± 1% 6.31s ± 2% ~ (p=0.222 n=5+5)
Fannkuch11-8 6.03s ± 0% 6.04s ± 0% +0.20% (p=0.008 n=5+5)
FmtFprintfEmpty-8 91.6ns ± 0% 99.0ns ± 2% +8.01% (p=0.008 n=5+5)
FmtFprintfString-8 175ns ± 2% 176ns ± 0% ~ (p=0.238 n=5+4)
FmtFprintfInt-8 204ns ± 1% 202ns ± 0% ~ (p=0.206 n=5+4)
FmtFprintfIntInt-8 285ns ± 2% 285ns ± 1% ~ (p=0.302 n=5+5)
FmtFprintfPrefixedInt-8 428ns ± 1% 425ns ± 1% ~ (p=0.325 n=5+5)
FmtFprintfFloat-8 479ns ± 2% 476ns ± 0% ~ (p=0.603 n=5+5)
FmtManyArgs-8 1.39µs ± 0% 1.34µs ± 2% -3.39% (p=0.008 n=5+5)
GobDecode-8 15.3ms ± 0% 16.6ms ± 1% +8.50% (p=0.008 n=5+5)
GobEncode-8 12.5ms ± 1% 13.1ms ± 1% +4.50% (p=0.008 n=5+5)
Gzip-8 621ms ± 0% 619ms ± 0% -0.40% (p=0.008 n=5+5)
Gunzip-8 73.0ms ± 0% 73.6ms ± 0% +0.73% (p=0.008 n=5+5)
HTTPClientServer-8 116µs ± 1% 116µs ± 1% ~ (p=0.548 n=5+5)
JSONEncode-8 31.0ms ± 0% 30.7ms ± 0% -1.17% (p=0.008 n=5+5)
JSONDecode-8 140ms ± 1% 145ms ± 1% +3.44% (p=0.008 n=5+5)
Mandelbrot200-8 9.67ms ± 0% 9.66ms ± 0% -0.04% (p=0.032 n=5+5)
GoParse-8 7.58ms ± 0% 7.70ms ± 0% +1.56% (p=0.008 n=5+5)
RegexpMatchEasy0_32-8 180ns ± 8% 175ns ± 4% ~ (p=0.937 n=5+5)
RegexpMatchEasy0_1K-8 673ns ± 1% 659ns ± 2% ~ (p=0.087 n=5+5)
RegexpMatchEasy1_32-8 166ns ±12% 182ns ±17% ~ (p=0.341 n=5+5)
RegexpMatchEasy1_1K-8 1.02µs ± 4% 1.06µs ± 3% ~ (p=0.151 n=5+5)
RegexpMatchMedium_32-8 285ns ± 3% 285ns ± 6% ~ (p=0.611 n=5+5)
RegexpMatchMedium_1K-8 99.1µs ± 7% 75.4µs ± 0% -23.90% (p=0.016 n=5+4)
RegexpMatchHard_32-8 4.02µs ± 1% 4.00µs ± 0% ~ (p=0.103 n=5+5)
RegexpMatchHard_1K-8 121µs ± 6% 123µs ± 0% ~ (p=0.190 n=5+4)
Revcomp-8 1.07s ± 0% 1.08s ± 1% +1.44% (p=0.008 n=5+5)
Template-8 160ms ± 0% 164ms ± 1% +2.16% (p=0.008 n=5+5)
TimeParse-8 766ns ± 0% 766ns ± 1% ~ (p=1.000 n=4+5)
TimeFormat-8 745ns ± 1% 756ns ± 1% +1.56% (p=0.008 n=5+5)
[Geo mean] 116µs 116µs +0.24%
name old speed new speed delta
GobDecode-8 50.0MB/s ± 0% 46.1MB/s ± 1% -7.84% (p=0.008 n=5+5)
GobEncode-8 61.3MB/s ± 1% 58.7MB/s ± 1% -4.31% (p=0.008 n=5+5)
Gzip-8 31.2MB/s ± 0% 31.4MB/s ± 0% +0.41% (p=0.008 n=5+5)
Gunzip-8 266MB/s ± 0% 264MB/s ± 0% -0.73% (p=0.008 n=5+5)
JSONEncode-8 62.6MB/s ± 0% 63.3MB/s ± 0% +1.19% (p=0.008 n=5+5)
JSONDecode-8 13.9MB/s ± 1% 13.4MB/s ± 1% -3.33% (p=0.008 n=5+5)
GoParse-8 7.64MB/s ± 0% 7.52MB/s ± 0% -1.54% (p=0.008 n=5+5)
RegexpMatchEasy0_32-8 178MB/s ± 8% 183MB/s ± 4% ~ (p=1.000 n=5+5)
RegexpMatchEasy0_1K-8 1.52GB/s ± 1% 1.55GB/s ± 2% ~ (p=0.095 n=5+5)
RegexpMatchEasy1_32-8 193MB/s ±11% 179MB/s ±16% ~ (p=0.310 n=5+5)
RegexpMatchEasy1_1K-8 1.01GB/s ± 4% 0.97GB/s ± 3% ~ (p=0.151 n=5+5)
RegexpMatchMedium_32-8 3.50MB/s ± 3% 3.51MB/s ± 6% ~ (p=0.611 n=5+5)
RegexpMatchMedium_1K-8 10.4MB/s ± 7% 13.6MB/s ± 0% +30.98% (p=0.016 n=5+4)
RegexpMatchHard_32-8 7.96MB/s ± 1% 7.99MB/s ± 0% ~ (p=0.071 n=5+5)
RegexpMatchHard_1K-8 8.49MB/s ± 6% 8.35MB/s ± 0% ~ (p=0.143 n=5+4)
Revcomp-8 238MB/s ± 0% 235MB/s ± 1% -1.42% (p=0.008 n=5+5)
Template-8 12.1MB/s ± 0% 11.8MB/s ± 1% -2.12% (p=0.008 n=5+5)
[Geo mean] 49.4MB/s 49.4MB/s -0.08%
Frame-pointer is enabled by default but can be disabled by setting: GOEXPERIMENT=noframepointer.
Fixes #10110
Change-Id: I1bfaca6dba29a63009d7c6ab04ed7a1413d9479e
---
M src/cmd/compile/internal/arm64/ggen.go
M src/cmd/compile/internal/gc/pgen.go
M src/cmd/internal/obj/arm64/asm7.go
M src/cmd/internal/obj/arm64/obj7.go
M src/cmd/internal/obj/link.go
M src/cmd/internal/objabi/util.go
M src/cmd/link/internal/ld/dwarf.go
M src/cmd/vet/all/whitelist/darwin_arm64.txt
M src/cmd/vet/asmdecl.go
M src/internal/bytealg/equal_arm64.s
M src/reflect/asm_arm64.s
M src/runtime/asm_arm64.s
M src/runtime/cgo/asm_arm64.s
M src/runtime/cgo/gcc_signal_darwin_armx.c
M src/runtime/cgo/signal_darwin_arm64.s
M src/runtime/cgocall.go
M src/runtime/internal/sys/arch_arm64.go
M src/runtime/rt0_darwin_arm64.s
M src/runtime/rt0_linux_arm64.s
M src/runtime/signal_arm64.go
M src/runtime/stack.go
M src/runtime/sys_darwin_arm64.s
M src/runtime/sys_linux_arm64.s
M src/runtime/traceback.go
M test/codegen/stack.go
M test/nosplit.go
M test/retjmp.dir/a.s
27 files changed, 602 insertions(+), 397 deletions(-)
To view, visit change 61511. To unsubscribe, or for help writing mail filters, visit settings.
For PS10
1. It's ok to only enable FP on linux/arm64 for me but most changes are OS-independent (espacially asm_arm64.s). I don't know how to enable the changes only for linux while disable the changes for other OS.
2. For any OS running on arm64, it should follow "Procedure Call Standard for the ARM 64-bit Architecture" in which saved FP is below saved LR. I can't think out a scenaro in which current FP usage will cause problem. Our changes don't create new ABI on arm64 and just put the missing FP below current LR. There is only one FP ABI on arm64.
3. I have developed a feature in assembler to automatically fix frame sizes and offsets (relative to RSP) inside user old assembly code in PS10. So enabling FP won't break user old assembly code in most cases (I admit that there are still tricky corner cases needing manual fix).
3 comments:
Patch Set #9, Line 22: Performance impacts on go1 benchmarks:
Can you also show the geomean (benchstat -geomean)?
Done
Apparently it doesn't help darwin/arm64. Remove this.
Done
Patch Set #7, Line 5: TEXT ·f(SB), 4, $16-0
It was intentional for this function to have non-zero frame size. […]
Done
To view, visit change 61511. To unsubscribe, or for help writing mail filters, visit settings.
Wei Xiao uploaded patch set #11 to this change.
build: support frame-pointer for arm64
Supporting frame-pointer makes Linux's perf and other profilers much more useful
because it lets them gather a stack trace efficiently on profiling events. Major
changes include:
1. save FP below where RSP is pointing to (proposed by Cherry and Austin)
2. adjust some specific offsets in runtime assembly and wrapper code
3. add support to FP in goroutine scheduler
4. adjust DWARF information to support FP
5. adjust tests to avoid stackoverflow error
Performance impacts on go1 benchmarks:
Enable frame-pointer (by default)
name old time/op new time/op delta
BinaryTree17-46 5.94s ± 0% 6.00s ± 0% +1.03% (p=0.029 n=4+4)
Fannkuch11-46 2.84s ± 1% 2.77s ± 0% -2.58% (p=0.008 n=5+5)
FmtFprintfEmpty-46 55.0ns ± 1% 58.9ns ± 1% +7.06% (p=0.008 n=5+5)
FmtFprintfString-46 102ns ± 0% 105ns ± 0% +2.94% (p=0.008 n=5+5)
FmtFprintfInt-46 118ns ± 0% 117ns ± 1% -1.19% (p=0.000 n=4+5)
FmtFprintfIntInt-46 181ns ± 0% 182ns ± 1% ~ (p=0.444 n=5+5)
FmtFprintfPrefixedInt-46 215ns ± 1% 214ns ± 0% ~ (p=0.254 n=5+4)
FmtFprintfFloat-46 292ns ± 0% 296ns ± 0% +1.46% (p=0.029 n=4+4)
FmtManyArgs-46 720ns ± 0% 732ns ± 0% +1.72% (p=0.008 n=5+5)
GobDecode-46 9.82ms ± 1% 10.03ms ± 2% +2.10% (p=0.008 n=5+5)
GobEncode-46 8.14ms ± 0% 8.72ms ± 1% +7.14% (p=0.008 n=5+5)
Gzip-46 420ms ± 0% 424ms ± 0% +0.92% (p=0.008 n=5+5)
Gunzip-46 48.2ms ± 0% 48.4ms ± 0% +0.41% (p=0.008 n=5+5)
HTTPClientServer-46 201µs ± 4% 201µs ± 0% ~ (p=0.730 n=5+4)
JSONEncode-46 17.1ms ± 0% 17.7ms ± 1% +3.80% (p=0.008 n=5+5)
JSONDecode-46 88.0ms ± 0% 90.1ms ± 0% +2.42% (p=0.008 n=5+5)
Mandelbrot200-46 5.06ms ± 0% 5.07ms ± 0% ~ (p=0.310 n=5+5)
GoParse-46 5.04ms ± 0% 5.12ms ± 0% +1.53% (p=0.008 n=5+5)
RegexpMatchEasy0_32-46 117ns ± 0% 117ns ± 0% ~ (all equal)
RegexpMatchEasy0_1K-46 332ns ± 0% 329ns ± 0% -0.78% (p=0.008 n=5+5)
RegexpMatchEasy1_32-46 104ns ± 0% 113ns ± 0% +8.65% (p=0.029 n=4+4)
RegexpMatchEasy1_1K-46 563ns ± 0% 569ns ± 0% +1.10% (p=0.008 n=5+5)
RegexpMatchMedium_32-46 167ns ± 2% 177ns ± 1% +5.74% (p=0.008 n=5+5)
RegexpMatchMedium_1K-46 49.5µs ± 0% 53.4µs ± 0% +7.81% (p=0.008 n=5+5)
RegexpMatchHard_32-46 2.56µs ± 1% 2.72µs ± 0% +6.01% (p=0.008 n=5+5)
RegexpMatchHard_1K-46 77.0µs ± 0% 81.8µs ± 0% +6.24% (p=0.016 n=5+4)
Revcomp-46 631ms ± 1% 627ms ± 1% ~ (p=0.095 n=5+5)
Template-46 81.8ms ± 0% 86.3ms ± 0% +5.55% (p=0.008 n=5+5)
TimeParse-46 423ns ± 0% 432ns ± 0% +2.32% (p=0.008 n=5+5)
TimeFormat-46 478ns ± 2% 497ns ± 1% +3.89% (p=0.008 n=5+5)
[Geo mean] 71.6µs 73.3µs +2.45%
name old speed new speed delta
GobDecode-46 78.1MB/s ± 1% 76.6MB/s ± 2% -2.04% (p=0.008 n=5+5)
GobEncode-46 94.3MB/s ± 0% 88.0MB/s ± 1% -6.67% (p=0.008 n=5+5)
Gzip-46 46.2MB/s ± 0% 45.8MB/s ± 0% -0.91% (p=0.008 n=5+5)
Gunzip-46 403MB/s ± 0% 401MB/s ± 0% -0.41% (p=0.008 n=5+5)
JSONEncode-46 114MB/s ± 0% 109MB/s ± 1% -3.66% (p=0.008 n=5+5)
JSONDecode-46 22.0MB/s ± 0% 21.5MB/s ± 0% -2.35% (p=0.008 n=5+5)
GoParse-46 11.5MB/s ± 0% 11.3MB/s ± 0% -1.51% (p=0.008 n=5+5)
RegexpMatchEasy0_32-46 272MB/s ± 0% 272MB/s ± 1% ~ (p=0.190 n=4+5)
RegexpMatchEasy0_1K-46 3.08GB/s ± 0% 3.11GB/s ± 0% +0.77% (p=0.008 n=5+5)
RegexpMatchEasy1_32-46 306MB/s ± 0% 283MB/s ± 0% -7.63% (p=0.029 n=4+4)
RegexpMatchEasy1_1K-46 1.82GB/s ± 0% 1.80GB/s ± 0% -1.07% (p=0.008 n=5+5)
RegexpMatchMedium_32-46 5.99MB/s ± 0% 5.64MB/s ± 1% -5.77% (p=0.016 n=4+5)
RegexpMatchMedium_1K-46 20.7MB/s ± 0% 19.2MB/s ± 0% -7.25% (p=0.008 n=5+5)
RegexpMatchHard_32-46 12.5MB/s ± 1% 11.8MB/s ± 0% -5.66% (p=0.008 n=5+5)
RegexpMatchHard_1K-46 13.3MB/s ± 0% 12.5MB/s ± 1% -6.01% (p=0.008 n=5+5)
Revcomp-46 402MB/s ± 1% 405MB/s ± 1% ~ (p=0.095 n=5+5)
Template-46 23.7MB/s ± 0% 22.5MB/s ± 0% -5.25% (p=0.008 n=5+5)
[Geo mean] 82.2MB/s 79.6MB/s -3.26%
Disable frame-pointer (GOEXPERIMENT=noframepointer)
name old time/op new time/op delta
BinaryTree17-46 5.94s ± 0% 5.96s ± 0% +0.39% (p=0.029 n=4+4)
Fannkuch11-46 2.84s ± 1% 2.79s ± 1% -1.68% (p=0.008 n=5+5)
FmtFprintfEmpty-46 55.0ns ± 1% 55.2ns ± 3% ~ (p=0.794 n=5+5)
FmtFprintfString-46 102ns ± 0% 103ns ± 0% +0.98% (p=0.016 n=5+4)
FmtFprintfInt-46 118ns ± 0% 115ns ± 0% -2.54% (p=0.029 n=4+4)
FmtFprintfIntInt-46 181ns ± 0% 179ns ± 0% -1.10% (p=0.000 n=5+4)
FmtFprintfPrefixedInt-46 215ns ± 1% 213ns ± 0% ~ (p=0.143 n=5+4)
FmtFprintfFloat-46 292ns ± 0% 300ns ± 0% +2.83% (p=0.029 n=4+4)
FmtManyArgs-46 720ns ± 0% 739ns ± 0% +2.64% (p=0.008 n=5+5)
GobDecode-46 9.82ms ± 1% 9.78ms ± 1% ~ (p=0.151 n=5+5)
GobEncode-46 8.14ms ± 0% 8.12ms ± 1% ~ (p=0.690 n=5+5)
Gzip-46 420ms ± 0% 420ms ± 0% ~ (p=0.548 n=5+5)
Gunzip-46 48.2ms ± 0% 48.0ms ± 0% -0.33% (p=0.032 n=5+5)
HTTPClientServer-46 201µs ± 4% 199µs ± 3% ~ (p=0.548 n=5+5)
JSONEncode-46 17.1ms ± 0% 17.2ms ± 0% ~ (p=0.056 n=5+5)
JSONDecode-46 88.0ms ± 0% 88.6ms ± 0% +0.64% (p=0.008 n=5+5)
Mandelbrot200-46 5.06ms ± 0% 5.07ms ± 0% ~ (p=0.548 n=5+5)
GoParse-46 5.04ms ± 0% 5.07ms ± 0% +0.65% (p=0.008 n=5+5)
RegexpMatchEasy0_32-46 117ns ± 0% 112ns ± 4% -4.27% (p=0.016 n=4+5)
RegexpMatchEasy0_1K-46 332ns ± 0% 330ns ± 1% ~ (p=0.095 n=5+5)
RegexpMatchEasy1_32-46 104ns ± 0% 110ns ± 1% +5.29% (p=0.029 n=4+4)
RegexpMatchEasy1_1K-46 563ns ± 0% 567ns ± 2% ~ (p=0.151 n=5+5)
RegexpMatchMedium_32-46 167ns ± 2% 166ns ± 0% ~ (p=0.333 n=5+4)
RegexpMatchMedium_1K-46 49.5µs ± 0% 49.6µs ± 0% ~ (p=0.841 n=5+5)
RegexpMatchHard_32-46 2.56µs ± 1% 2.49µs ± 0% -2.81% (p=0.008 n=5+5)
RegexpMatchHard_1K-46 77.0µs ± 0% 75.8µs ± 0% -1.55% (p=0.008 n=5+5)
Revcomp-46 631ms ± 1% 628ms ± 0% ~ (p=0.095 n=5+5)
Template-46 81.8ms ± 0% 84.3ms ± 1% +3.05% (p=0.008 n=5+5)
TimeParse-46 423ns ± 0% 425ns ± 0% +0.52% (p=0.008 n=5+5)
TimeFormat-46 478ns ± 2% 478ns ± 1% ~ (p=1.000 n=5+5)
[Geo mean] 71.6µs 71.6µs -0.01%
name old speed new speed delta
GobDecode-46 78.1MB/s ± 1% 78.5MB/s ± 1% ~ (p=0.151 n=5+5)
GobEncode-46 94.3MB/s ± 0% 94.5MB/s ± 1% ~ (p=0.690 n=5+5)
Gzip-46 46.2MB/s ± 0% 46.2MB/s ± 0% ~ (p=0.571 n=5+5)
Gunzip-46 403MB/s ± 0% 404MB/s ± 0% +0.33% (p=0.032 n=5+5)
JSONEncode-46 114MB/s ± 0% 113MB/s ± 0% ~ (p=0.056 n=5+5)
JSONDecode-46 22.0MB/s ± 0% 21.9MB/s ± 0% -0.64% (p=0.008 n=5+5)
GoParse-46 11.5MB/s ± 0% 11.4MB/s ± 0% -0.64% (p=0.008 n=5+5)
RegexpMatchEasy0_32-46 272MB/s ± 0% 285MB/s ± 4% +4.74% (p=0.016 n=4+5)
RegexpMatchEasy0_1K-46 3.08GB/s ± 0% 3.10GB/s ± 1% ~ (p=0.151 n=5+5)
RegexpMatchEasy1_32-46 306MB/s ± 0% 290MB/s ± 1% -5.21% (p=0.029 n=4+4)
RegexpMatchEasy1_1K-46 1.82GB/s ± 0% 1.81GB/s ± 2% ~ (p=0.151 n=5+5)
RegexpMatchMedium_32-46 5.99MB/s ± 0% 6.02MB/s ± 1% ~ (p=0.063 n=4+5)
RegexpMatchMedium_1K-46 20.7MB/s ± 0% 20.7MB/s ± 0% ~ (p=0.659 n=5+5)
RegexpMatchHard_32-46 12.5MB/s ± 1% 12.8MB/s ± 0% +2.88% (p=0.008 n=5+5)
RegexpMatchHard_1K-46 13.3MB/s ± 0% 13.5MB/s ± 0% +1.58% (p=0.008 n=5+5)
Revcomp-46 402MB/s ± 1% 405MB/s ± 0% ~ (p=0.095 n=5+5)
Template-46 23.7MB/s ± 0% 23.0MB/s ± 1% -2.95% (p=0.008 n=5+5)
[Geo mean] 82.2MB/s 82.3MB/s +0.04%
Frame-pointer is enabled by default but can be disabled by setting: GOEXPERIMENT=noframepointer.
Fixes #10110
Change-Id: I1bfaca6dba29a63009d7c6ab04ed7a1413d9479e
---
M src/cmd/compile/internal/gc/pgen.go
M src/cmd/internal/obj/arm64/asm7.go
M src/cmd/internal/obj/arm64/obj7.go
M src/cmd/internal/objabi/util.go
M src/runtime/asm_arm64.s
M src/runtime/cgocall.go
M src/runtime/runtime-gdb_test.go
M src/runtime/sys_linux_arm64.s
M src/runtime/traceback.go
M test/nosplit.go
10 files changed, 210 insertions(+), 22 deletions(-)
To view, visit change 61511. To unsubscribe, or for help writing mail filters, visit settings.
Patch Set 9:
Another idea (from discussion with Austin): Does it work if we put the saved FP below where SP is pointing to? I.e.
locals
...
args for callee
saved LR <-- (SP) points here
saved FP <-- (FP) points hereThis way, if Linux perf only looks at FP, not SP, it still sees the chain FPs with all frames. At the same time, 0(SP) is still the saved LR, and all the offsets of locals are unchanged. So we can minimize the affect on assembly code. We may be also able to let the assembler transparently round up the frame size to a multiple of 16, so the frame size in the assembly code doesn't need to change, either. When we grow the stack, we need to reserved the extra word below SP, but the assembler and the runtime should be able to handle it.
Does perf work with this layout?
@Cherry and @Austin
Thanks for the great idea!
I have implemented it in PS11 and perf works with this layout.
Changes to assembly code and runtime logicals are significantly reduced in this solution.
But one sideeffect is that stackover is more prone to occur especially when inline optimisation is turn off and current stack reserved area blow RSP (880 bytes) is insufficient for some case. I have to make changes to some tests in PS11 to avoid stackover error reported by link stack check.
In a summary:
In PS10, stack frame layout is changed (saved FP take the place of saved LR and the latter is saved above the former)
Pro:
1. it makes runtime more clean and removes existing special handling for arm64 (e.g https://go.googlesource.com/go/+/4fe688c6a49e59e852f0bfebbb4cf71366987ce7/src/runtime/stack.go#1166).
2. Stack size won't need to be 8 mod 16 which is really weired since all other 64bit architectures require 16 bytes alignment.
Con:
1. it makes runtime more complex especially trackback in which it has 3 categories of architecture now: no-LR, LR disabling FP and LR enabling FP.
2. it may break user old assembly although the care is rare with the latest automated fixing feature.
In PS11, the bottom of stack frame layout is unchanged (but it steals 16 bytes from the top of next stack frame)
Pro:
1. it minimizes the affect on assembly code
2. it minimizes the changes to the whole toolchains
Con:
1. it consumes more stack space (16 bytes per frame) and is more prone to trigger stack overflow
2. it tends to result in tricky bugs if users don't realize that there are extra 16 bytes for saving FP at the top of each frame.
Both solutions are ok to me as long as FP is enabled.
Any comments?
Thanks! This is great. PS 11 is much better.
Con:
1. it consumes more stack space (16 bytes per frame) and is more prone to trigger stack overflow
It only needs 8 bytes for the frame pointer. Reserving 16 bytes is for alignment, right? And one word is always wasted? I think we can do better here. We can change the compiler to round the frame size from 8 mod 16 to multiple of 16. For user-written assembly function, we can simply round it up if it is 8 mod 16, but also start to accept multiple-of-16 frame size.
2. it tends to result in tricky bugs if users don't realize that there are extra 16 bytes for saving FP at the top of each frame.
I think we can teach the linker stack overflow check that it needs an extra word below SP on ARM64.
2 comments:
File src/runtime/traceback.go:
Patch Set #11, Line 271: sys.GoarchArm64 == 1
GOARCH == "arm64"
Patch Set #11, Line 125: main 104 nosplit; REJECT ppc64 ppc64le amd64
I don't understand this change. How is this different from line 121? Also, this line is intended to test frame size of 120. If this should now be rejected on ARM64 (due to frame pointer), add arm64 to the REJECT list, instead of change the the frame size.
Also below.
To view, visit change 61511. To unsubscribe, or for help writing mail filters, visit settings.
Wei Xiao uploaded patch set #12 to this change.
build: support frame-pointer for arm64
Supporting frame-pointer makes Linux's perf and other profilers much more useful
because it lets them gather a stack trace efficiently on profiling events. Major
changes include:
1. save FP on the word below where RSP is pointing to (proposed by Cherry and Austin)
2. adjust some specific offsets in runtime assembly and wrapper code
3. add support to FP in goroutine scheduler
4. adjust DWARF information to support FP
5. adjust link stack overflow check to take the extra word into account
6. adjust nosplit test cases to adapt to the new stack limit (i.e 880-8 bytes)
M src/cmd/compile/internal/arm64/ggen.go
M src/cmd/compile/internal/gc/pgen.go
M src/cmd/internal/obj/arm64/asm7.go
M src/cmd/internal/obj/arm64/obj7.go
M src/cmd/internal/objabi/util.go
M src/cmd/link/internal/ld/lib.go
M src/runtime/asm_arm64.s
M src/runtime/cgocall.go
M src/runtime/sys_linux_arm64.s
M src/runtime/traceback.go
M test/nosplit.go
11 files changed, 239 insertions(+), 45 deletions(-)
To view, visit change 61511. To unsubscribe, or for help writing mail filters, visit settings.
Patch Set 11:
(2 comments)
Thanks! This is great. PS 11 is much better.
Con:
1. it consumes more stack space (16 bytes per frame) and is more prone to trigger stack overflowIt only needs 8 bytes for the frame pointer. Reserving 16 bytes is for alignment, right? And one word is always wasted? I think we can do better here. We can change the compiler to round the frame size from 8 mod 16 to multiple of 16. For user-written assembly function, we can simply round it up if it is 8 mod 16, but also start to accept multiple-of-16 frame size.
Good idea! I have implemented it and the interesting result is that we don't need to disable inlining opt to avoid stack overflow for test cases (I mean the ones inside runtime/runtime-gdb_test.go) since it accepts multiple-of-16 frame size. The result also indicates that there is little increase in stack space consumption and for most functions (generated by compiler) FP is saved on the word which is originally wasted to round the frame size for 8 mod 16.
2. it tends to result in tricky bugs if users don't realize that there are extra 16 bytes for saving FP at the top of each frame.I think we can teach the linker stack overflow check that it needs an extra word below SP on ARM64.
Yes, I have changed linker stack overflow check to take the extra word into account.
2 comments:
Patch Set #11, Line 271: sys.GoarchArm64 == 1
GOARCH == "arm64"
I write it this way on purpose because "sys.GoarchArm64" is a const and I think "sys.GoarchArm64 == 1" will be evaluated to true at compilation time for arm64 port, which results in better code generation.
Patch Set #11, Line 125: main 120 nosplit; REJECT ppc64 ppc64le amd64 arm64
I don't understand this change. […]
You're right and changing test case is not good.
They should now be rejected on ARM64 due to frame pointer.
To view, visit change 61511. To unsubscribe, or for help writing mail filters, visit settings.
1 comment:
Patch Set #11, Line 271: sys.GoarchArm64 == 1
I write it this way on purpose because "sys.GoarchArm64" is a const and I think "sys. […]
Since runtime.GOARCH is a constant, I believe the GOARCH == "arm64" check also reduces to true at compile time.
To view, visit change 61511. To unsubscribe, or for help writing mail filters, visit settings.
Wei Xiao uploaded patch set #13 to this change.
To view, visit change 61511. To unsubscribe, or for help writing mail filters, visit settings.
1 comment:
File src/runtime/traceback.go:
Patch Set #11, Line 271: GOARCH == "arm64" &&
Since runtime. […]
Yes, You and Cherry are right and I confused runtime.GOARCH with objabi.GOARCH.
I have changed it to GOARCH for consistency with others (although my method has the same effect)
To view, visit change 61511. To unsubscribe, or for help writing mail filters, visit settings.
Hi reviewers,
Is there any other comments?
I wish the CL can be merged by the end of this month because I will switch to other project next month. This CL may be my last contribution to Golang and I will try my best to address your comments.
Patch Set 13:
Hi reviewers,
Is there any other comments?
I wish the CL can be merged by the end of this month because I will switch to other project next month. This CL may be my last contribution to Golang and I will try my best to address your comments.
Thank you very much for your contributions to Go!
I have been reading it in detail in the past week. Mostly good. Still not finish reading obj7.go
I think this is probably a little too late for Go 1.11, given the complexity and the broad influence of this CL. Probably good to check it in at the beginning of Go 1.12 cycle.
9 comments:
File src/cmd/compile/internal/arm64/ggen.go:
Patch Set #13, Line 18: be empty or
0 is also 16 bytes aligned. No need the special case now.
File src/cmd/compile/internal/gc/pgen.go:
if objabi.Framepointer_enabled(objabi.GOOS, objabi.GOARCH) {
offs -= int64(Widthptr)
}
}
if objabi.GOARCH == "arm64" {
// frame top 8 bytes are for FP
I'm not sure I understand this part. If FP is enabled and GOARCH is ARM64, it seems the result is the same before and after. Do you mean to subtract Widthptr regardless of whether FP is enabled on ARM64? Why?
File src/cmd/internal/obj/arm64/asm7.go:
8 or16 now
Same here.
File src/cmd/internal/objabi/util.go:
Patch Set #13, Line 79: return framepointer_enabled != 0 && (goarch == "amd64" && goos != "nacl" || goarch == "arm64")
Do we want to enable frame pointer on darwin? Per previous discussion, it doesn't seem to help. I'm not sure whether it causes problem. Elias?
Do we need to increase the frame size here? It doesn't seem to store anything there. Also, it's hard to understand what these numbers are for.
Patch Set #13, Line 269: // On arm64, stack frame is four words and there's a saved LR between
Update the comment. (or not, if we decided not to do the change in asm_arm64.s)
File src/runtime/traceback.go:
Patch Set #13, Line 271: if GOARCH == "arm64" && frame.sp != frame.fp {
This can be just combined with the AMD64 case below.
if s := string(version); goarch == "amd64" && strings.Contains(s, "X:") && !strings.Contains(s, "framepointer") {
// Skip this test if framepointer is NOT enabled on AMD64
return
}
I assume this also applies to ARM64 now?
We probably could check both frame pointer on and off, with something like changing goarch from amd64 to amd64nofp if frame pointer is off, then we can code both cases into the tests. Not necessarily in this CL.
To view, visit change 61511. To unsubscribe, or for help writing mail filters, visit settings.
Patch set 13:Run-TryBot +1
TryBots beginning. Status page: https://farmer.golang.org/try?commit=250aa77f
TryBots are happy.
Patch set 13:TryBot-Result +1
1 comment:
Patch Set #13, Line 79: return framepointer_enabled != 0 && (goarch == "amd64" && goos != "nacl" || goarch == "arm64")
Do we want to enable frame pointer on darwin? Per previous discussion, it doesn't seem to help. […]
Cherry-picking this CL on top of tip (CL 120495) GOARCH=arm64 iostest.bash runs until the testcarchive test, which hangs. I've confirmed the testcarchive test works on darwin/arm with this CL as well as on darwin/arm64 if I don't apply this CL.
To view, visit change 61511. To unsubscribe, or for help writing mail filters, visit settings.
Wei Xiao uploaded patch set #14 to this change.
11 files changed, 237 insertions(+), 46 deletions(-)
To view, visit change 61511. To unsubscribe, or for help writing mail filters, visit settings.
Thanks for the comments!
9 comments:
0 is also 16 bytes aligned. No need the special case now.
if objabi.Framepointer_enabled(objabi.GOOS, objabi.GOARCH) {
offs -= int64(Widthptr)
}
}
if objabi.GOARCH == "arm64" {
// frame top 8 bytes are for FP
I'm not sure I understand this part. […]
Yes, for implementation simplicity, the stack layout is the same for both disabling and enabling FP. The main challenge is that there are some hard-code offsets inside run-time.
File src/cmd/internal/obj/arm64/asm7.go:
8 or16 now
Done
Same here.
Done
File src/cmd/internal/objabi/util.go:
Patch Set #13, Line 79: return framepointer_enabled != 0 && (goarch == "amd64" && goos != "nacl" || goarch == "arm64" && goos == "linux")
Cherry-picking this CL on top of tip (CL 120495) GOARCH=arm64 iostest. […]
I don't have darwin/arm64 environment but your result surprises me since this CL is almost system-independent. Could you please help collect the call stack dumps triggered by the timeout of running the test case?
You can also comment line 1049 and 1050 of file src/runtime/asm_arm64.s to see if there is any changes to the result.
Do we need to increase the frame size here? It doesn't seem to store anything there. […]
It's needed to align with the stack layout (i.e frame size is indeed 32+16) assumed by assembler. Otherwise there will be problems when doing traceback.
Patch Set #13, Line 269: // On arm64, stack frame is four words and there's a saved LR between
Update the comment. (or not, if we decided not to do the change in asm_arm64. […]
Done
File src/runtime/traceback.go:
This can be just combined with the AMD64 case below.
Done
if s := string(version); goarch == "amd64" && strings.Contains(s, "X:") && !strings.Contains(s, "framepointer") {
// Skip this test if framepointer is NOT enabled on AMD64
return
}
I assume this also applies to ARM64 now? […]
Yes, it's better to do it in another CL.
To view, visit change 61511. To unsubscribe, or for help writing mail filters, visit settings.
Patch set 14:Run-TryBot +1
TryBots beginning. Status page: https://farmer.golang.org/try?commit=a1c058eb
TryBots are happy.
Patch set 14:TryBot-Result +1
IO
Patch set 14:-Run-TryBot
1 comment:
Patch Set #13, Line 79: return framepointer_enabled != 0 && (goarch == "amd64" && goos != "nacl" || goarch == "arm64" && goos == "linux")
I don't have darwin/arm64 environment but your result surprises me since this CL is almost system-in […]
With PS 14, the std tests crash with errors that look like memory corruption. I've pasted a few such traces here: https://pastebin.com/2gm8wCMA.
Commenting line 1049 and 1050 in asm_arm64.s doesn't seem to make a difference (the tests still crash).
To view, visit change 61511. To unsubscribe, or for help writing mail filters, visit settings.
Wei Xiao uploaded patch set #15 to this change.
To view, visit change 61511. To unsubscribe, or for help writing mail filters, visit settings.
@Cherry, I'm considering an improvement to only apply current changes to arm64/linux and keep other ports the same as before with some conditional code. Do you think it's ok to enable FP only for linux? My last work day for arm is next Friday and I still have enough time to complete the improvement and address your comments.
1 comment:
Patch Set #13, Line 79: return framepointer_enabled != 0 && (goarch == "amd64" && goos != "nacl" || goarch == "arm64" && goos == "linux")
With PS 14, the std tests crash with errors that look like memory corruption. […]
Elias, could you please try PS15? Any feedback is appreciated!
To view, visit change 61511. To unsubscribe, or for help writing mail filters, visit settings.
1 comment:
Patch Set #13, Line 79: return framepointer_enabled != 0 && (goarch == "amd64" && goos != "nacl" || goarch == "arm64" && goos == "linux")
Elias, could you please try PS15? Any feedback is appreciated!
I ran the first few std tests and the testcarchive test and they all passed:
$ GOARCH=arm64 ./iostest.bash
Building Go cmd/dist using /Users/elias/go1.10.1.
Building Go toolchain1 using /Users/elias/go1.10.1.
Building Go bootstrap cmd/go (go_bootstrap) using Go toolchain1.
Building Go toolchain2 using go_bootstrap and Go toolchain1.
Building Go toolchain3 using go_bootstrap and Go toolchain2.
Building packages and commands for host, darwin/amd64.
Building packages and commands for target, darwin/arm64.
---
Installed Go for darwin/arm64 in /Users/elias/go-tip
Installed commands in /Users/elias/go-tip/bin
##### Testing packages.
ok archive/tar 43.616s
ok archive/zip 34.258s
ok bufio 18.745s
ok bytes 55.588s
ok compress/bzip2 49.726s
ok compress/flate 46.021s
^C
$ GOARCH=arm64 go tool dist test testcarchive
##### ../misc/cgo/testcarchive
PASS
ALL TESTS PASSED (some were excluded)
To view, visit change 61511. To unsubscribe, or for help writing mail filters, visit settings.
The CL passes all tests for both linux and darwin.
Any other comments?
R=cherryz
Cherry, could you review this? It'd be nice to get this in as it's been open for some time and would be useful for users & parity with amd64.
Patch Set 15:
R=cherryz
Cherry, could you review this? It'd be nice to get this in as it's been open for some time and would be useful for users & parity with amd64.
Yeah, I have been reviewing this. There has been substantial rewrite of this CL after the freeze. I feel this is a bit too invasive for Go 1.11. We can finish review this but check it in when Go 1.12 is open. What do you think?
Patch set 15:Run-TryBot +1
11 comments:
Patch Set #15, Line 15: 4. adjust DWARF information to support FP
I don't see any code doing this now. Did I miss anything? Is it still necessary with FP at -8(SP)?
Patch Set #15, Line 131: Frame-pointer is enabled by default
on Linux
File src/cmd/compile/internal/gc/pgen.go:
if objabi.Framepointer_enabled(objabi.GOOS, objabi.GOARCH) {
offs -= int64(Widthptr)
}
}
if objabi.GOARCH == "arm64" {
// frame top 8 bytes are for FP
Yes, for implementation simplicity, the stack layout is the same for both disabling and enabling FP. […]
Ok. Probably rearrange it as
if Ctxt.FixedFrameSize() == 0 {
offs -= int64(Widthptr)
}
if objabi.Framepointer_enabled(objabi.GOOS, objabi.GOARCH) || objabi.GOARCH == "arm64" {
offs -= int64(Widthptr)
}Also comment that we leave the space for FP on ARM64, even if the frame pointer is disabled.
Patch Set #13, Line 640: if objabi.Framepointer_enabled(objabi.GOOS, objabi.GOARCH) {
Same here.
File src/cmd/internal/obj/arm64/obj7.go:
else {
// FP offsets need an updated p.To.Offset.
p.To.Offset = int64(c.autosize) - 8
}
Now this line runs only when c.autosize == 0? I don't understand how this would work...
Also, the adjustment of c.cursym.Func.Locals is gone. Is it on purpose?
Patch Set #15, Line 638: int64(-8)
Conversion is not needed, also below.
Patch Set #15, Line 868: break
break is not needed. Go doesn't do implicit fall through. Also below.
Patch Set #15, Line 873: -32(SP)
-24(SP) ?
q4.Pos = p.Pos
q4.As = obj.ADUFFCOPY
q4.To = p.To
Probably *q4 = *p, also below.
It's needed to align with the stack layout (i.e frame size is indeed 32+16) assumed by assembler. […]
If we change the declared frame size from 24 to 16, the assembler will make the actual frame size 32, and it should just work without adding 16 here?
if s := string(version); goarch == "amd64" && strings.Contains(s, "X:") && !strings.Contains(s, "framepointer") {
// Skip this test if framepointer is NOT enabled on AMD64
return
}
Yes, it's better to do it in another CL.
Does the test pass if the frame pointer is disabled on ARM64? If not, we would need to disable the test when frame pointer is off, in this CL.
To view, visit change 61511. To unsubscribe, or for help writing mail filters, visit settings.
TryBots beginning. Status page: https://farmer.golang.org/try?commit=28664700
TryBots are happy.
Patch set 15:TryBot-Result +1
Wei Xiao uploaded patch set #16 to this change.
build: support frame-pointer for arm64
Supporting frame-pointer makes Linux's perf and other profilers much more useful
because it lets them gather a stack trace efficiently on profiling events. Major
changes include:
1. save FP on the word below where RSP is pointing to (proposed by Cherry and Austin)
2. adjust some specific offsets in runtime assembly and wrapper code
3. add support to FP in goroutine scheduler
4. adjust link stack overflow check to take the extra word into account
5. adjust nosplit test cases to enable frame sizes which are 16 bytes aligned
Frame-pointer is enabled on Linux by default but can be disabled by setting: GOEXPERIMENT=noframepointer.
Fixes #10110
Change-Id: I1bfaca6dba29a63009d7c6ab04ed7a1413d9479e
---
M src/cmd/compile/internal/arm64/ggen.go
M src/cmd/compile/internal/gc/pgen.go
M src/cmd/internal/obj/arm64/asm7.go
M src/cmd/internal/obj/arm64/obj7.go
M src/cmd/internal/objabi/util.go
M src/cmd/link/internal/ld/lib.go
M src/runtime/asm_arm64.s
M src/runtime/cgocall.go
M src/runtime/sys_linux_arm64.s
M src/runtime/traceback.go
M test/nosplit.go
11 files changed, 229 insertions(+), 46 deletions(-)
To view, visit change 61511. To unsubscribe, or for help writing mail filters, visit settings.
It's ok to check it in when Go 1.12 is open. But please note that I won't be able to help resolve new conflicts and failures after this week.
Patch set 16:Run-TryBot +1
11 comments:
Patch Set #15, Line 15: 4. adjust link stack overflow check to take the extra word into account
I don't see any code doing this now. […]
You're right, we don't need adjustment in link for DWARF anymore.
on Linux
Done
File src/cmd/compile/internal/gc/pgen.go:
}
if objabi.Framepointer_enabled(objabi.GOOS, objabi.GOARCH) || objabi.GOARCH == "arm64" {
// There is a word space for FP on ARM64 even if the frame pointer is disabled
offs -= int64(Widthptr)
}
Ok. Probably rearrange it as […]
Done
Patch Set #13, Line 640: base -= int64(Widthptr)
Same here.
Done
File src/cmd/internal/obj/arm64/obj7.go:
c.ctxt.Diag("%v: unaligned frame size %d - must be 16 aligned", p, c.autosize-8)
}
} else {
Now this line runs only when c.autosize == 0? I don't understand how this would work... […]
1. The value of "p.To.Offset" is used to restore "c.autosize" in function: span7. Since we need to consider more cases (i.e "8 mod 16" and "16 bytes aligned") than before, we add restoring logical in span7 now. We don't need to prepare "exact" value of "p.To.Offset" here and only run the line for NOFRAME which we do no special handling in span7.
2. I elided the adjustment of c.cursym.Func.Locals on purpose because I thought "Locals" are just for debugging and printing purpose. But it's better to update its value as old implementation has done and I have added the adjustment back.
Patch Set #15, Line 638: j.TYPE_ME
Conversion is not needed, also below.
Done
break is not needed. Go doesn't do implicit fall through. Also below.
Done
Patch Set #15, Line 873: -24(SP)
-24(SP) ?
Done
q4.Pos = p.Pos
q4.As = obj.ADUFFCOPY
q4.To = p.To
Probably *q4 = *p, also below.
"*q4 = *p" will overwrite q4.Link and result in dead loop
If we change the declared frame size from 24 to 16, the assembler will make the actual frame size 32 […]
we can't declare the frame size as 16 because this function does need 3 frame words (savedm, savedsp and one parameter (i.e ctxt) for calling func: cgocallbackg)
if s := string(version); goarch == "amd64" && strings.Contains(s, "X:") && !strings.Contains(s, "framepointer") {
// Skip this test if framepointer is NOT enabled on AMD64
return
}
Does the test pass if the frame pointer is disabled on ARM64? If not, we would need to disable the t […]
Yes, tests pass if the frame pointer is disabled on ARM64 since we have the same stack layout for both enabled and disabled frame pointer.
To view, visit change 61511. To unsubscribe, or for help writing mail filters, visit settings.
TryBots beginning. Status page: https://farmer.golang.org/try?commit=15ae33ed
TryBots are happy.
Patch set 16:TryBot-Result +1
Any other comments? I only have 2 days to help address them.
last ping ......
Patch Set 16: Run-TryBot+1
(11 comments)
It's ok to check it in when Go 1.12 is open. But please note that I won't be able to help resolve new conflicts and failures after this week.
Thanks for your contribution!
Could someone on your team take over? I can probably take care of the merge conflicts, if any.
Patch set 16:Run-TryBot +1Code-Review +1
4 comments:
c.ctxt.Diag("%v: unaligned frame size %d - must be 16 aligned", p, c.autosize-8)
}
} else {
1. The value of "p.To.Offset" is used to restore "c.autosize" in function: span7. […]
Is there a reason for this to be done in two places? Isn't it that c.autosize gets added twice? Could we do it just here? Here we also handle two alignment cases. And we can set c.extrasize here if necessary.
q4.Pos = p.Pos
q4.As = obj.ADUFFCOPY
q4.To = p.To
"*q4 = *p" will overwrite q4. […]
Ok.
This comment seems no longer right? Update the comment. Is there anything we need to do in unwindm?
we can't declare the frame size as 16 because this function does need 3 frame words (savedm, savedsp […]
Ok, thanks. I don't think expressions like 24+8+8+16 is much more readable. Maybe just write a single number, like -56(R4).
To view, visit change 61511. To unsubscribe, or for help writing mail filters, visit settings.
Wei Xiao uploaded patch set #17 to this change.
build: support frame-pointer for arm64
Supporting frame-pointer makes Linux's perf and other profilers much more useful
because it lets them gather a stack trace efficiently on profiling events. Major
changes include:
1. save FP on the word below where RSP is pointing to (proposed by Cherry and Austin)
2. adjust some specific offsets in runtime assembly and wrapper code
3. add support to FP in goroutine scheduler
4. adjust link stack overflow check to take the extra word into account
M src/cmd/internal/obj/util.go
M src/cmd/internal/objabi/util.go
M src/cmd/link/internal/ld/lib.go
M src/runtime/asm_arm64.s
M src/runtime/cgocall.go
M src/runtime/sys_linux_arm64.s
M src/runtime/traceback.go
M test/codegen/stack.go
M test/nosplit.go
13 files changed, 232 insertions(+), 56 deletions(-)
To view, visit change 61511. To unsubscribe, or for help writing mail filters, visit settings.
Yes, Zheng will take over the CL.
Patch set 17:Run-TryBot +1
3 comments:
c.autosize += extrasize
c.cursym.Func.Locals += extrasize
Is there a reason for this to be done in two places? Isn't it that c. […]
Done
This comment seems no longer right? Update the comment. […]
It's still right.
The advantage of current solution is its good compatibility.
Ok, thanks. I don't think expressions like 24+8+8+16 is much more readable. […]
Done
To view, visit change 61511. To unsubscribe, or for help writing mail filters, visit settings.
TryBots beginning. Status page: https://farmer.golang.org/try?commit=7fb3b2f1
TryBots are happy.
Patch set 17:TryBot-Result +1
Patch Set 17: Run-TryBot+1
(3 comments)
Yes, Zheng will take over the CL.
Thanks! This is great.
Patch set 17:Code-Review +1
2 comments:
File src/cmd/internal/obj/arm64/asm7.go:
Patch Set #17, Line 777: c := ctxt7{ctxt: ctxt, newprog: newprog, cursym: cursym, autosize: int32(p.To.Offset & 0xffffffff), extrasize: int32(p.To.Offset >> 32)}
At this point you probably can set p.To.Offset back, removing the high 32 bits. Then maybe you don't need to change util.go?
Patch Set #17, Line 1048: -(48)
Remove the parentheses. Also below.
To view, visit change 61511. To unsubscribe, or for help writing mail filters, visit settings.
Zheng Xu uploaded patch set #18 to the change originally created by Wei Xiao.
Jira: ENTLLT-523
---
M src/cmd/compile/internal/arm64/ggen.go
M src/cmd/compile/internal/gc/pgen.go
M src/cmd/internal/obj/arm64/asm7.go
M src/cmd/internal/obj/arm64/obj7.go
M src/cmd/internal/objabi/util.go
M src/cmd/link/internal/ld/lib.go
M src/runtime/asm_arm64.s
M src/runtime/cgocall.go
M src/runtime/msan_arm64.s
M src/runtime/rt0_android_arm64.s
M src/runtime/rt0_darwin_arm64.s
M src/runtime/rt0_linux_arm64.s
M src/runtime/sys_linux_arm64.s
M src/runtime/traceback.go
M test/codegen/stack.go
M test/nosplit.go
16 files changed, 253 insertions(+), 59 deletions(-)
To view, visit change 61511. To unsubscribe, or for help writing mail filters, visit settings.
Zheng Xu uploaded patch set #19 to the change originally created by Wei Xiao.
---
M src/cmd/compile/internal/arm64/ggen.go
M src/cmd/compile/internal/gc/pgen.go
M src/cmd/internal/obj/arm64/asm7.go
M src/cmd/internal/obj/arm64/obj7.go
M src/cmd/internal/objabi/util.go
M src/cmd/link/internal/ld/lib.go
M src/runtime/asm_arm64.s
M src/runtime/cgocall.go
M src/runtime/msan_arm64.s
M src/runtime/rt0_android_arm64.s
M src/runtime/rt0_darwin_arm64.s
M src/runtime/rt0_linux_arm64.s
M src/runtime/sys_linux_arm64.s
M src/runtime/traceback.go
M test/codegen/stack.go
M test/nosplit.go
16 files changed, 253 insertions(+), 59 deletions(-)
To view, visit change 61511. To unsubscribe, or for help writing mail filters, visit settings.
Sorry for replying late. There is a trivial issue with current frame design. We will need special handling when calling to native function directly without going through cgo. I worked around this by reserving additional 16 bytes.
It won't cause problems if the developers won't call native function directly with Go assembly code. But I am not sure if this is allowed or not.
2 comments:
Patch Set #17, Line 777: c := ctxt7{ctxt: ctxt, newprog: newprog, cursym: cursym, autosize: int32(p.To.Offset & 0xffffffff), extrasize: int32(p.To.Offset >> 32)}
At this point you probably can set p.To.Offset back, removing the high 32 bits. […]
Done
Remove the parentheses. Also below.
Done
To view, visit change 61511. To unsubscribe, or for help writing mail filters, visit settings.
Sounds like this is going into Go 1.12 now, not Go 1.11.
Patch Set 19:
(2 comments)
Sorry for replying late. There is a trivial issue with current frame design. We will need special handling when calling to native function directly without going through cgo. I worked around this by reserving additional 16 bytes.
Seems ok.
It won't cause problems if the developers won't call native function directly with Go assembly code. But I am not sure if this is allowed or not.
Calling external function directly from user assembly code is not supported. We don't need to worry about it. It might be possible to carefully code it to get it work for certain special cases, but we never guaranteed, or even tried, to make it work. Note that in the few special cases in the runtime, the external function are all called on special stacks (system stack or signal stack), which is already not the user code supposed to do.
Patch set 19:Run-TryBot +1
1 comment:
File src/runtime/msan_arm64.s:
Patch Set #19, Line 62: ADD $16, RSP
Not needed. The next instruction writes to RSP anyway.
To view, visit change 61511. To unsubscribe, or for help writing mail filters, visit settings.
TryBots beginning. Status page: https://farmer.golang.org/try?commit=a8a14fcf
TryBots are happy.
Patch set 19:TryBot-Result +1
1 comment:
Patch Set #19, Line 62: ADD $16, RSP
Not needed. The next instruction writes to RSP anyway.
Do we need to consider the PMU samples in __msan_xxx() functions? If we tend to ignore this case, we do not need to reserve these bytes to avoid corrupt the previous FP stored on (RSP-8) since the previous FP won't be restored as you mentioned.
To view, visit change 61511. To unsubscribe, or for help writing mail filters, visit settings.
5 comments:
byte
two (for consistency)
File src/runtime/msan_arm64.s:
Patch Set #19, Line 62: ADD $16, RSP
Do we need to consider the PMU samples in __msan_xxx() functions? If we tend to ignore this case, we […]
What does PMU stand for?
This is probably ok. We are on the g0 stack here, and the saved FP is probably not very interesting.
File src/runtime/rt0_android_arm64.s:
Patch Set #19, Line 7: TEXT _rt0_arm64_android(SB),NOSPLIT|NOFRAME,$0
Do the _rt0 stuff matter? These are the program start routines, called by the system. There is probably no saved FP here.
Patch Set #19, Line 10: BL (R4)
Here we are calling a Go assembly function, not external function. Why do we need this? (also below and other _rt0 functions)
Also, I think this probably doesn't work, since _rt0_arm64_linux assumes 0(RSP) is argc. If we adjust RSP, it will go wrong.
To view, visit change 61511. To unsubscribe, or for help writing mail filters, visit settings.
Zheng Xu uploaded patch set #20 to the change originally created by Wei Xiao.
M src/runtime/rt0_darwin_arm64.s
M src/runtime/rt0_linux_arm64.s
M src/runtime/sys_linux_arm64.s
M src/runtime/traceback.go
M test/codegen/stack.go
M test/nosplit.go
14 files changed, 243 insertions(+), 57 deletions(-)
To view, visit change 61511. To unsubscribe, or for help writing mail filters, visit settings.
6 comments:
byte
Done
two (for consistency)
Done
File src/runtime/msan_arm64.s:
What does PMU stand for? […]
PMU is the hardware Performance Measurement Unit. It will increase counter when certain hardware events happen. It will trigger exception when the counter reach the threshold.
When the exception happens, we will record the call stack so that people can analysis the software behavior. We are expecting to use FP to get the call stack which cause lower overhead than using DRAWF information. LR=*(FP+8); FP'=*FP; LR'=*(FP'+8); FP''=*FP'; ...
We cannot get the correct call stack if saved FP is corrupted. But I guess it should be fine. The performance data won't be very interesting when running with memory sanitizer.
File src/runtime/rt0_android_arm64.s:
Patch Set #19, Line 7: TEXT _rt0_arm64_android(SB),NOSPLIT|NOFRAME,$0
Do the _rt0 stuff matter? These are the program start routines, called by the system. […]
Done
Here we are calling a Go assembly function, not external function. […]
Done
Patch Set #19, Line 20: DATA _rt0_arm64_android_argv+0x08(SB)/8,$0 // end argv
Also there is no need to move SP here. It is called from native and there is no Go frame created.
To view, visit change 61511. To unsubscribe, or for help writing mail filters, visit settings.
Alright, let's get it in.
Patch set 20:Run-TryBot +1Code-Review +2
1 comment:
PMU is the hardware Performance Measurement Unit. […]
Thanks for the explanation.
We are on the g0 stack here, but we didn't change the FP register, which still points to the FP slot on the old g stack. The C function will save this FP on its frame. If I'm right, if we get a signal that unwinds the stack here, it will find the saved FP from the C function, then continue unwinding to the g stack where msancall is called, instead of the g0 stack. I think this is fine, and in some sense correct because it is the g stack calling the msan function, not the frame above us on the g0 stack. I may be wrong.
We didn't save anything at -8(SP) on the g0 stack, so nothing to preserve, I think.
To view, visit change 61511. To unsubscribe, or for help writing mail filters, visit settings.
TryBots beginning. Status page: https://farmer.golang.org/try?commit=adcfa7bc
TryBots are happy.
Patch set 20:TryBot-Result +1