0000000100086b40 <_main.t1>:
100086b40: fe 0f 1e f8 str x30, [sp, #-32]!
100086b44: fd 83 1f f8 stur x29, [sp, #-8]
100086b48: fd 23 00 d1 sub x29, sp, #8
100086b4c: ff ff 04 a9 stp xzr, xzr, [sp, #72]
100086b50: ff ff 00 a9 stp xzr, xzr, [sp, #8]
100086b54: 00 00 80 d2 mov x0, #0
100086b58: 09 00 00 14 b 0x100086b7c <_main.t1+0x3c>
100086b5c: e1 a3 00 91 add x1, sp, #40
100086b60: 22 78 60 f8 ldr x2, [x1, x0, lsl #3]
100086b64: e3 e3 00 91 add x3, sp, #56
100086b68: 64 78 60 f8 ldr x4, [x3, x0, lsl #3]
100086b6c: 42 00 04 8b add x2, x2, x4
100086b70: e4 23 00 91 add x4, sp, #8
100086b74: 82 78 20 f8 str x2, [x4, x0, lsl #3]
100086b78: 00 04 00 91 add x0, x0, #1
100086b7c: 1f 08 00 f1 cmp x0, #2
100086b80: eb fe ff 54 b.lt 0x100086b5c <_main.t1+0x1c>
100086b84: e0 07 40 f9 ldr x0, [sp, #8]
100086b88: e1 0b 40 f9 ldr x1, [sp, #16]
100086b8c: e0 27 00 f9 str x0, [sp, #72]
100086b90: e1 2b 00 f9 str x1, [sp, #80]
100086b94: ff 83 00 91 add sp, sp, #32
100086b98: fd 23 00 d1 sub x29, sp, #8
100086b9c: c0 03 5f d6 ret
1. The values in x1 and x3 are loop-invariant and could be computed outside the loop.
2. The result could be stored directly to [sp, #72] and [sp, #80]; the temporary copy in [sp, #8] and [sp, #16] is not necessary.
I want to know why the compiler doesn't perform these optimizations, or which command-line options would produce more optimized code.
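
For reference, here is a minimal sketch of the kind of source that would be consistent with the listing above. It is reconstructed from the disassembly, not the original code: the use of Go (guessed from the `_main.t1` symbol and the prologue), the `[2]int64` types, and the name `t1Unrolled` are all assumptions. The second function writes out by hand the two optimizations described in points 1 and 2, so the two forms can be compared in the disassembler.

```go
package main

import "fmt"

// t1 is a hypothetical reconstruction: two 16-byte array arguments
// (read from [sp, #40] and [sp, #56] in the listing) are added element
// by element into a local array, which is then copied to the result slot.
//
//go:noinline
func t1(a, b [2]int64) [2]int64 {
	var r [2]int64
	for i := 0; i < 2; i++ {
		r[i] = a[i] + b[i]
	}
	return r
}

// t1Unrolled is the hand-optimized form: no loop, so no per-iteration
// address computation, and no named temporary array in the source.
// Whether the compiler actually emits better code for it has to be
// checked in the disassembly.
//
//go:noinline
func t1Unrolled(a, b [2]int64) [2]int64 {
	return [2]int64{a[0] + b[0], a[1] + b[1]}
}

func main() {
	a := [2]int64{1, 2}
	b := [2]int64{10, 20}
	fmt.Println(t1(a, b), t1Unrolled(a, b))
}
```

Both functions should return the same result for any input, so any difference in the generated code comes only from how the source is written.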