gc: optimize JMP to RET instructions

Arseny Samoylov

unread,
Aug 13, 2024, 1:10:58 PM8/13/24
to golang-nuts
Hello community, I recently found that gc generates a lot of JMP-to-RET instructions and that there is no optimization for this pattern. Consider this example:

```
// asm_arm64.s

#include "textflag.h"

TEXT ·jmp_to_ret(SB), NOSPLIT, $0-0
    JMP ret
ret:
    RET
```

This compiles to:

```
TEXT main.jmp_to_ret.abi0(SB) asm_arm64.s
  asm_arm64.s:4         0x77530                 14000001                JMP 1(PC)
  asm_arm64.s:6         0x77534                 d65f03c0                RET
```

Obviously, this can be optimized to a single RET instruction.

So I made a patch that replaces a JMP to a RET with a RET instruction (at the Prog representation level):

```
diff --git a/src/cmd/internal/obj/pass.go b/src/cmd/internal/obj/pass.go
index 066b779539..87f1121641 100644
--- a/src/cmd/internal/obj/pass.go
+++ b/src/cmd/internal/obj/pass.go
@@ -174,8 +174,16 @@ func linkpatch(ctxt *Link, sym *LSym, newprog ProgAlloc) {
                        continue
                }
                p.To.SetTarget(brloop(p.To.Target()))
-               if p.To.Target() != nil && p.To.Type == TYPE_BRANCH {
-                       p.To.Offset = p.To.Target().Pc
+               if p.To.Target() != nil {
+                       if p.As == AJMP && p.To.Target().As == ARET {
+                               p.As = ARET
+                               p.To = p.To.Target().To
+                               continue
+                       }
+
+                       if p.To.Type == TYPE_BRANCH {
+                               p.To.Offset = p.To.Target().Pc
+                       }
                }
        }
 }
```

You can find this patch on my GitHub.


I encountered a few problems:

* Increase in code size, because a RET instruction can translate into multiple machine instructions (ldp, add, and ret on arm64, for example). The .text section of a simple Go program that calls the function above grows by 0x3D0 bytes; the .text section of the go binary itself grows by 0x2570 bytes (almost 10 KB). (These numbers are for arm64 binaries.)

* The optimization happens too late at the Prog representation level, and the example above translates to:

```
TEXT main.jmp_to_ret.abi0(SB) asm_arm64.s
  asm_arm64.s:4         0x77900                 d65f03c0                RET
  asm_arm64.s:6         0x77904                 d65f03c0                RET
```

(no dead-code elimination was performed =( )


So I am looking for ideas. Maybe this optimization should be done on SSA form, with some heuristics to avoid the increase in code size.

I would also like suggestions on which benchmarks to run for this optimization. The bent benchmark suite takes too long =(.
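For a quicker signal than a full bent run, a self-contained microbenchmark around a call-heavy loop might help. This is only a hypothetical sketch (the `countEqual` workload and data shape are my assumptions, not a real benchmark from the toolchain); `testing.Benchmark` runs a closure with an auto-scaled `b.N`, so no `go test` harness is needed to compare a patched vs. unpatched toolchain:

```go
package main

import (
	"fmt"
	"testing"
)

// countEqual is a hypothetical call-heavy workload: each string
// comparison goes through runtime.memequal, whose caller contains a
// JMP-to-RET tail pattern (see the runtime.strequal disassembly below).
func countEqual(haystack []string, needle string) int {
	n := 0
	for _, s := range haystack {
		if s == needle {
			n++
		}
	}
	return n
}

func main() {
	data := make([]string, 1024)
	for i := range data {
		data[i] = fmt.Sprintf("key-%04d-padding-padding", i%7)
	}

	// testing.Benchmark picks b.N automatically; run the same binary
	// built with both toolchains and compare ns/op.
	res := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			countEqual(data, "key-0003-padding-padding")
		}
	})
	fmt.Println(res)
}
```

The numbers will vary by machine, and a difference this small may be lost in noise, so pinning the CPU frequency and running several iterations of each build would be advisable.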


PS: an example of a JMP to RET from the runtime:

```
TEXT runtime.strequal(SB) a/go/src/runtime/alg.go
  alg.go:378            0x12eac                 14000004                JMP 4(PC) // JMP to RET in Prog
  alg.go:378            0x12eb0                 f9400000                MOVD (R0), R0
  alg.go:378            0x12eb4                 f9400021                MOVD (R1), R1
  alg.go:378            0x12eb8                 97fffc72                CALL runtime.memequal(SB)
  alg.go:378            0x12ebc                 a97ffbfd                LDP -8(RSP), (R29, R30)
  alg.go:378            0x12ec0                 9100c3ff                ADD $48, RSP, RSP
  alg.go:378            0x12ec4                 d65f03c0                RET
...
```

Keith Randall

unread,
Aug 13, 2024, 8:59:55 PM8/13/24
to golang-nuts
We generally don't do optimizations like that directly on assembly. In fact, we used to do some like that, but they have been removed.
We want the generated machine code to faithfully mirror the assembly input. People writing assembly have all kinds of reasons for laying out instructions in particular ways (better for various caches, etc.) that we don't want to disrupt.

If the Go compiler is generating such a pattern, we can optimize that. There's some discussion here https://github.com/golang/go/issues/24936 but nothing substantive came of it. It would need benchmarks demonstrating it is worth it, and concerns about debuggability (can you set a breakpoint on each return in the source?) also matter.

> Ps: example of JMP to RET from runtime:

That is a JMP to the LDP instruction, not directly to the RET.

Arseny Samoylov

unread,
Aug 14, 2024, 12:31:55 PM8/14/24
to golang-nuts
Thank you for your answer!

> We generally don't do optimizations like that directly on assembly.
I definitely agree. But this pattern also appears in compiler-generated code.

> and concerns about debuggability (can you set a breakpoint on each return in the source?) also matter
This is an interesting problem that I haven't thought about, thank you!

> That is a JMP to the LDP instruction, not directly to the RET.
Yes, but at the Prog representation level it is a JMP to a RET. I mentioned this when I pointed out the code-size problem (RET translates to multiple instructions).

>  There's some discussion here https://github.com/golang/go/issues/24936
I am grateful for the link to the discussion. In it, you mentioned your abandoned CL, which actually does the opposite of my optimization =).

>  It would need benchmarks demonstrating it is worth it
Can you please suggest some benchmarks? I tried bent, but I would like to test on some others as well.

Thank you in advance!

robert engels

unread,
Aug 14, 2024, 12:40:22 PM8/14/24
to Arseny Samoylov, golang-nuts
Won’t the speculative/parallel execution by most processors make the JMP essentially a no-op?


Arseny Samoylov

unread,
Aug 14, 2024, 12:46:26 PM8/14/24
to golang-nuts
> Won’t the speculative/parallel execution by most processors make the JMP essentially a no-op?
I guess you are right, but this only holds when the JMP destination is already in the instruction buffer. Most of these cases are a JMP to a RET within the same function, so there the optimization would indeed have almost zero effect. But if the RET instruction is far enough away, the optimization could be meaningful.

robert engels

unread,
Aug 14, 2024, 1:21:01 PM8/14/24
to Arseny Samoylov, golang-nuts
My understanding is that optimizations like this are almost never worth it on modern processors: the increased code size works against the branch predictor and speculative execution, whereas with a single shared piece of code there are fewer possibilities, and thus fewer instructions to preload.

Arseny Samoylov

unread,
Aug 15, 2024, 4:00:59 AM8/15/24
to golang-nuts
I guess you are right.
Thank you very much for the discussion!