/* Built against SIMDe (presumably via simde/x86/avx.h with
   SIMDE_ENABLE_NATIVE_ALIASES, since the _mm256_* names are used directly). */
void test(double* dst, const double* src) {
  auto src_val = _mm256_loadu_pd(src);
  _mm256_storeu_pd(dst, src_val);
}
On AArch64 this compiles to the following, with an unnecessary round-trip through the stack:
ldp q0, q1, [x1]
sub sp, sp, #32
stp q0, q1, [sp]
stp q0, q1, [x0]
add sp, sp, 32
ret
Some more complicated functions produce further extraneous operations; e.g.
void test(double* dst, double* a) {
  __m256d b = _mm256_loadu_pd(a);
  __m256d c = _mm256_add_pd(b, b);
  _mm256_storeu_pd(dst, c);
}
This produces:
0000000000000000 <test(double*, double*)>:
0: ad400021 ldp q1, q0, [x1]
4: d10143ff sub sp, sp, #0x50
8: 9100ffe2 add x2, sp, #0x3f
c: 927be842 and x2, x2, #0xffffffffffffffe0
10: 4e61d421 fadd v1.2d, v1.2d, v1.2d
14: 4e60d400 fadd v0.2d, v0.2d, v0.2d
18: ad000041 stp q1, q0, [x2]
1c: ad400440 ldp q0, q1, [x2]
20: ad000400 stp q0, q1, [x0]
24: 910143ff add sp, sp, #0x50
28: d65f03c0 ret
This has an extra load (the ldp back from the stack) as well. The second example also produces poor code on x86_64 without AVX (gcc-11, -O2):
0000000000000000 <_Z4testPdS_>:
0: f3 0f 1e fa endbr64
4: 55 push %rbp
5: 48 89 e5 mov %rsp,%rbp
8: 48 83 e4 e0 and $0xffffffffffffffe0,%rsp
c: 48 83 ec 60 sub $0x60,%rsp
10: 66 0f 10 46 10 movupd 0x10(%rsi),%xmm0
15: 66 0f 10 0e movupd (%rsi),%xmm1
19: 64 48 8b 04 25 28 00 mov %fs:0x28,%rax
20: 00 00
22: 48 89 44 24 58 mov %rax,0x58(%rsp)
27: 31 c0 xor %eax,%eax
29: 66 0f 58 c9 addpd %xmm1,%xmm1
2d: 66 0f 58 c0 addpd %xmm0,%xmm0
31: 0f 11 0f movups %xmm1,(%rdi)
34: 0f 11 47 10 movups %xmm0,0x10(%rdi)
38: 48 8b 44 24 58 mov 0x58(%rsp),%rax
3d: 64 48 2b 04 25 28 00 sub %fs:0x28,%rax
44: 00 00
46: 75 02 jne 4a <_Z4testPdS_+0x4a>
48: c9 leave
49: c3 ret
4a: e8 00 00 00 00 call 4f <_Z4testPdS_+0x4f>
More recent versions of gcc and clang don't have this problem, but upgrading isn't really an option in our build environment.
Currently, simde_mm256_storeu_pd is implemented with simde_memcpy.
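For reference, a simplified sketch of the current shape of that function, assuming the usual SIMDe pattern (the real source also carries wrapper macros such as SIMDE_FUNCTION_ATTRIBUTES, omitted here):

void
simde_mm256_storeu_pd (simde_float64 mem_addr[4], simde__m256d a) {
  #if defined(SIMDE_X86_AVX_NATIVE)
    _mm256_storeu_pd(mem_addr, a);
  #else
    /* Portable fallback: copy the whole 256-bit value byte-wise. */
    simde_memcpy(mem_addr, &a, sizeof(a));
  #endif
}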
A Solution
I can fix this by making simde_mm256_storeu_pd explicitly do two 128-bit stores instead, for example like the following:
void
simde_mm256_storeu_pd (simde_float64 mem_addr[4], simde__m256d a) {
  #if defined(SIMDE_X86_AVX_NATIVE)
    _mm256_storeu_pd(mem_addr, a);
  #elif SIMDE_NATURAL_VECTOR_SIZE_LE(128)
    /* Split the 256-bit store into two 128-bit stores so 128-bit targets
     * (e.g. NEON) don't round-trip the value through the stack. */
    simde__m256d_private a_ = simde__m256d_to_private(a);
    for (size_t i = 0 ; i < (sizeof(a_.m128d) / sizeof(a_.m128d[0])) ; i++) {
      simde_mm_storeu_pd(mem_addr + 2*i, a_.m128d[i]);
    }
  #else
    simde_memcpy(mem_addr, &a, sizeof(a));
  #endif
}
With this change (and a similar change to loadu) I get good assembly output.
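For completeness, the loadu side could be split the same way. A minimal sketch, assuming the same simde__m256d_private / m128d layout used above (exact parameter qualifiers and SIMDe wrapper macros omitted):

simde__m256d
simde_mm256_loadu_pd (const simde_float64 mem_addr[4]) {
  #if defined(SIMDE_X86_AVX_NATIVE)
    return _mm256_loadu_pd(mem_addr);
  #elif SIMDE_NATURAL_VECTOR_SIZE_LE(128)
    /* Assemble the 256-bit result from two 128-bit loads. */
    simde__m256d_private r_;
    for (size_t i = 0 ; i < (sizeof(r_.m128d) / sizeof(r_.m128d[0])) ; i++) {
      r_.m128d[i] = simde_mm_loadu_pd(mem_addr + 2*i);
    }
    return simde__m256d_from_private(r_);
  #else
    simde__m256d r;
    simde_memcpy(&r, mem_addr, sizeof(r));
    return r;
  #endif
}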
But is this the best approach? It would also affect other architectures with a SIMD width of 128 bits or less.
As an alternative, I could make the split conditional on SIMDE_ARM_NEON_A64V8_NATIVE instead, so that only AArch64 NEON takes the two-store path; a sketch of that variant follows.
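Only the preprocessor condition would change relative to the version above; hypothetically:

void
simde_mm256_storeu_pd (simde_float64 mem_addr[4], simde__m256d a) {
  #if defined(SIMDE_X86_AVX_NATIVE)
    _mm256_storeu_pd(mem_addr, a);
  #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE)
    /* Narrower condition: only AArch64 NEON splits into two 128-bit stores. */
    simde__m256d_private a_ = simde__m256d_to_private(a);
    for (size_t i = 0 ; i < (sizeof(a_.m128d) / sizeof(a_.m128d[0])) ; i++) {
      simde_mm_storeu_pd(mem_addr + 2*i, a_.m128d[i]);
    }
  #else
    simde_memcpy(mem_addr, &a, sizeof(a));
  #endif
}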
Which would be better?