/* Built against SIMDe (presumably via simde/x86/avx.h with
   SIMDE_ENABLE_NATIVE_ALIASES, since the _mm256_* names are used directly). */
void test(double* dst, const double* src) {
  auto src_val = _mm256_loadu_pd(src);
  _mm256_storeu_pd(dst, src_val);
}
On AArch64 this compiles to the following, with an unnecessary round-trip through the stack:
ldp q0, q1, [x1]
sub sp, sp, #32
stp q0, q1, [sp]
stp q0, q1, [x0]
add sp, sp, 32
ret
Some more complicated functions produce further extraneous operations; e.g.
void test(double* dst, double* a) {
  __m256d b = _mm256_loadu_pd(a);
  __m256d c = _mm256_add_pd(b, b);
  _mm256_storeu_pd(dst, c);
}
This produces:
0000000000000000 <test(double*, double*)>:
0: ad400021 ldp q1, q0, [x1]
4: d10143ff sub sp, sp, #0x50
8: 9100ffe2 add x2, sp, #0x3f
c: 927be842 and x2, x2, #0xffffffffffffffe0
10: 4e61d421 fadd v1.2d, v1.2d, v1.2d
14: 4e60d400 fadd v0.2d, v0.2d, v0.2d
18: ad000041 stp q1, q0, [x2]
1c: ad400440 ldp q0, q1, [x2]
20: ad000400 stp q0, q1, [x0]
24: 910143ff add sp, sp, #0x50
28: d65f03c0 ret
This has an extra load (the ldp back from the stack) as well. The second example also produces poor code on x86_64 without AVX (gcc-11, -O2):
0000000000000000 <_Z4testPdS_>:
0: f3 0f 1e fa endbr64
4: 55 push %rbp
5: 48 89 e5 mov %rsp,%rbp
8: 48 83 e4 e0 and $0xffffffffffffffe0,%rsp
c: 48 83 ec 60 sub $0x60,%rsp
10: 66 0f 10 46 10 movupd 0x10(%rsi),%xmm0
15: 66 0f 10 0e movupd (%rsi),%xmm1
19: 64 48 8b 04 25 28 00 mov %fs:0x28,%rax
20: 00 00
22: 48 89 44 24 58 mov %rax,0x58(%rsp)
27: 31 c0 xor %eax,%eax
29: 66 0f 58 c9 addpd %xmm1,%xmm1
2d: 66 0f 58 c0 addpd %xmm0,%xmm0
31: 0f 11 0f movups %xmm1,(%rdi)
34: 0f 11 47 10 movups %xmm0,0x10(%rdi)
38: 48 8b 44 24 58 mov 0x58(%rsp),%rax
3d: 64 48 2b 04 25 28 00 sub %fs:0x28,%rax
44: 00 00
46: 75 02 jne 4a <_Z4testPdS_+0x4a>
48: c9 leave
49: c3 ret
4a: e8 00 00 00 00 call 4f <_Z4testPdS_+0x4f>
More recent versions of gcc and clang don't have this problem, but upgrading isn't really an option in our build environment.
Currently, simde_mm256_storeu_pd is implemented with simde_memcpy.
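For reference, a simplified sketch of the current shape of that function, assuming the usual SIMDe pattern (the real source also carries wrapper macros such as SIMDE_FUNCTION_ATTRIBUTES, omitted here):

void
simde_mm256_storeu_pd (simde_float64 mem_addr[4], simde__m256d a) {
  #if defined(SIMDE_X86_AVX_NATIVE)
    _mm256_storeu_pd(mem_addr, a);
  #else
    /* Portable fallback: copy the whole 256-bit value byte-wise. */
    simde_memcpy(mem_addr, &a, sizeof(a));
  #endif
}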
A Solution
I can fix this by making simde_mm256_storeu_pd explicitly do two 128-bit stores instead, for example like the following:
void
simde_mm256_storeu_pd (simde_float64 mem_addr[4], simde__m256d a) {
  #if defined(SIMDE_X86_AVX_NATIVE)
    _mm256_storeu_pd(mem_addr, a);
  #elif SIMDE_NATURAL_VECTOR_SIZE_LE(128)
    /* Split the 256-bit store into two 128-bit stores so 128-bit targets
     * (e.g. NEON) don't round-trip the value through the stack. */
    simde__m256d_private a_ = simde__m256d_to_private(a);
    for (size_t i = 0 ; i < (sizeof(a_.m128d) / sizeof(a_.m128d[0])) ; i++) {
      simde_mm_storeu_pd(mem_addr + 2*i, a_.m128d[i]);
    }
  #else
    simde_memcpy(mem_addr, &a, sizeof(a));
  #endif
}
With this change (and a similar change to loadu) I get good assembly output.
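For completeness, the loadu side could be split the same way. A minimal sketch, assuming the same simde__m256d_private / m128d layout used above (exact parameter qualifiers and SIMDe wrapper macros omitted):

simde__m256d
simde_mm256_loadu_pd (const simde_float64 mem_addr[4]) {
  #if defined(SIMDE_X86_AVX_NATIVE)
    return _mm256_loadu_pd(mem_addr);
  #elif SIMDE_NATURAL_VECTOR_SIZE_LE(128)
    /* Assemble the 256-bit result from two 128-bit loads. */
    simde__m256d_private r_;
    for (size_t i = 0 ; i < (sizeof(r_.m128d) / sizeof(r_.m128d[0])) ; i++) {
      r_.m128d[i] = simde_mm_loadu_pd(mem_addr + 2*i);
    }
    return simde__m256d_from_private(r_);
  #else
    simde__m256d r;
    simde_memcpy(&r, mem_addr, sizeof(r));
    return r;
  #endif
}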
But is this the best approach? It would also affect other architectures with a SIMD width of 128 bits or less.
As an alternative, I could make the split conditional on SIMDE_ARM_NEON_A64V8_NATIVE instead, so that only AArch64 NEON takes the two-store path; a sketch of that variant follows.
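Only the preprocessor condition would change relative to the version above; hypothetically:

void
simde_mm256_storeu_pd (simde_float64 mem_addr[4], simde__m256d a) {
  #if defined(SIMDE_X86_AVX_NATIVE)
    _mm256_storeu_pd(mem_addr, a);
  #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE)
    /* Narrower condition: only AArch64 NEON splits into two 128-bit stores. */
    simde__m256d_private a_ = simde__m256d_to_private(a);
    for (size_t i = 0 ; i < (sizeof(a_.m128d) / sizeof(a_.m128d[0])) ; i++) {
      simde_mm_storeu_pd(mem_addr + 2*i, a_.m128d[i]);
    }
  #else
    simde_memcpy(mem_addr, &a, sizeof(a));
  #endif
}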
Which would be better?