Question about auto-vectorization behavior for OpenMPI OP components across architectures

Marco Vogel

Jan 31, 2025, 4:06:48 AM
to de...@lists.open-mpi.org
Hello,

I implemented a new OP component for OpenMPI targeting the RISC-V vector extension, following the existing implementations for x86 (AVX) and ARM (NEON). During testing, I tried to reproduce results from the paper on the AVX512 OP component, which reported that the default compiler settings did not produce auto-vectorized code for the base operations (https://icl.utk.edu/files/publications/2020/icl-utk-1416-2020.pdf, Chapter 5, Experimental evaluation). However, on my Zen4 machine, I observed no performance difference between the AVX OP component and the base implementation (with --mca op ^avx) when running `MPI_Reduce_local` on a 1MB array.

To investigate, I rebuilt OpenMPI with CFLAGS='-O3 -fno-tree-vectorize', which then confirmed the paper's findings. This behavior is consistent across x86 (AVX), ARM (NEON) and RISC-V (RVV). My question: did I overlook something in my testing or setup? Why wouldn't the compiler in the paper auto-vectorize the base operations, when mine apparently does unless vectorization is explicitly disabled?
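
For reference, my rebuild and benchmark steps were roughly the following (the install prefix and the benchmark binary name are just placeholders for my local MPI_Reduce_local test):

./configure CFLAGS='-O3 -fno-tree-vectorize' --prefix=$HOME/ompi-novec
make -j && make install

mpirun -np 1 ./reduce_local_1mb               # AVX op component enabled (default)
mpirun -np 1 --mca op ^avx ./reduce_local_1mb # base implementation only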

Thank you!

Marco

Gilles Gouaillardet

Jan 31, 2025, 5:48:11 AM
to de...@lists.open-mpi.org

Marco,

The compiler may auto-vectorize when generating code optimised for a given platform.
A distro-provided Open MPI is likely optimised only for "common" architectures (e.g. no AVX512 on x86 - perhaps SSE only - and no SVE on aarch64).

Cheers,

Gilles


Marco Vogel

Jan 31, 2025, 7:08:35 AM
to de...@lists.open-mpi.org

Gilles,

Thank you for your response. I understand that distro-provided OpenMPI binaries are typically built for broad compatibility, often targeting only baseline instruction sets.  
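
For example, one quick way to see which SIMD feature macros a given baseline enables is to dump the compiler's predefined macros (illustrative commands; the second one assumes an AArch64 cross-compiler is installed):

echo | gcc -march=x86-64-v2 -dM -E - | grep -E '__AVX|__SSE4_2'
echo | aarch64-linux-gnu-gcc -march=armv8-a -dM -E - | grep __ARM_FEATURE_SVE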

For x86, this makes sense: if OpenMPI is compiled for a target instruction set like `x86-64-v2` (no AVX), the `configure.m4` script for the AVX component first attempts to compile AVX code directly. If that fails, it retries with the necessary vectorization flags (e.g., `-mavx512f`). If that succeeds, those flags are applied to the component, ensuring that the vectorized functions are still built. At runtime, OpenMPI detects the CPU capabilities (via CPUID) and uses the AVX functions when available, even if vectorization wasn't explicitly enabled by the package maintainers - assuming I correctly understood the build process of the OP components.
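
My mental model of that probe-and-retry pattern, expressed as a simplified configure.m4 sketch (not the actual component code; the variable names are illustrative), is roughly:

dnl sketch: accept AVX-512 as-is, otherwise retry with -mavx512f and remember the flag
AC_COMPILE_IFELSE(
    [AC_LANG_PROGRAM([[#include <immintrin.h>]],
                     [[__m512 v = _mm512_setzero_ps(); (void)v;]])],
    [op_avx512_support=yes], [op_avx512_support=no])
AS_IF([test "$op_avx512_support" = "no"],
      [op_save_CFLAGS=$CFLAGS
       CFLAGS="$CFLAGS -mavx512f"
       AC_COMPILE_IFELSE(
           [AC_LANG_PROGRAM([[#include <immintrin.h>]],
                            [[__m512 v = _mm512_setzero_ps(); (void)v;]])],
           [op_avx512_support=yes
            MCA_BUILD_OP_AVX_FLAGS="-mavx512f"])
       CFLAGS=$op_save_CFLAGS])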

What I find unclear is why the AArch64 component follows a different approach. During configuration, it only checks whether the compiler can compile NEON or SVE intrinsics without additional flags. If it cannot, the corresponding intrinsic functions are omitted entirely. This means that if the distro's compilation settings don't allow NEON or SVE, OpenMPI won't include the optimized functions, and processors with these vector units won't benefit. Conversely, if NEON or SVE is allowed, the base OPs will likely be auto-vectorized anyway, reducing the performance gap between the base and intrinsic implementations.
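
My reading of the AArch64 check, in the same sketch form, is essentially a single compile test with the unmodified flags:

AC_COMPILE_IFELSE(
    [AC_LANG_PROGRAM([[#include <arm_sve.h>]],
                     [[svfloat32_t v = svdup_n_f32(0.0f); (void)v;]])],
    [op_sve_support=yes], [op_sve_support=no])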

Is there a specific reason for this difference in handling SIMD support between x86 and AArch64 in OpenMPI or am I wrong about the configuration process?

Cheers,

Marco

Gilles Gouaillardet

Jan 31, 2025, 7:17:31 AM
to de...@lists.open-mpi.org
Marco,

These are fair points, and I guess George (who initially authored this module, iirc) will soon shed some light on this.

Cheers,

Gilles

George Bosilca

Feb 22, 2025, 4:22:41 PM
to de...@lists.open-mpi.org
Sorry for the late answer. Most of the things above are correct: when building for a specific architecture, the compiler does wonders, give or take a few years. But as Gilles pointed out, we are seeking the best performance across different families of processors, so we helped the compiler a little.

If I understand correctly, it works on x86, but somehow we screwed up the ARM part by not checking different sets of flags. One thing to note is that the paper mentioned here was from 2020 (which means the experiments were certainly done in the 2019 timeframe), when few ARM processors were available and the distros were distributing binaries compiled with a more optimal set of flags. That has certainly changed, which means the configure.m4 for the SVE op needs a well-deserved update to mimic the x86 one and provide a base version, a NEON version, and then maybe a few versions of SVE (depending on the vector length).

Any contribution would be more than welcome; if you provide a patch, I will certainly be happy to review it.

Best,
  George.

Marco Vogel

Feb 24, 2025, 11:11:06 AM
to de...@lists.open-mpi.org

Hi George,

Thank you for your response and clarification. 

I am working on integrating the same flag-checking mechanism used in the AVX component into the AArch64 component. However, I have encountered an issue.
On x86, the GCC compiler provides dedicated command-line switches for SIMD instruction sets, such as -mavx (GCC x86 Options). These options are independent of the -march configuration within the CFLAGS variable, allowing the AVX component to append -mavx without modifying -march.
In contrast, for AArch64, there does not appear to be an equivalent standalone switch for enabling SVE (GCC AArch64 Options). Instead, SVE is enabled by appending +sve directly to the -march parameter, unless it is already implicitly included (e.g., with armv9-a or later).
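
To illustrate the asymmetry (the file names and -march baselines are just examples):

gcc -O3 -mavx512f -c avx_op_functions.c             # x86: SIMD level has its own switch
gcc -O3 -march=armv8.2-a+sve -c sve_op_functions.c  # AArch64: +sve is a -march feature modifier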
To address this, I attempted to modify the -march parameter within CFLAGS as follows:

dnl append +sve to the -march value in CFLAGS, but only if it is not already present
AS_IF([echo "$CFLAGS" | grep -qv -- '\+sve'],
      [modified_cflags="`echo $CFLAGS | sed 's/\(-march=[^ ]*\)/\1+sve/'`"])

While this is not an optimal solution, I wanted to explore how far this approach would take me. For testing, I appended +sve to the end of the -march=armxxx string and verified whether the modified flag combination enabled SVE code compilation. The configuration process completed successfully, but an issue arose during the compilation of OpenMPI.
In Makefile.am, I integrated the new CFLAGS value in the same manner as the AVX component:

liblocal_ops_sve_la_CFLAGS = @MCA_BUILD_OP_SVE_FLAGS@

However, this only adds the contents of @MCA_BUILD_OP_SVE_FLAGS@ (which includes +sve) to the existing CFLAGS instead of replacing it. As a result, the final compile command contains two -march options. Since automake places liblocal_ops_sve_la_CFLAGS before the original CFLAGS on the command line, the compiler honours only the second, unmodified -march value and effectively ignores the appended +sve modifier. My modifications to the build system are available in my branch (https://github.com/vogma/ompi/tree/sve_op_build_update).
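
Concretely, the compile command that automake emits ends up looking roughly like this (heavily abbreviated):

gcc -march=armv8-a+sve ... -O2 -march=armv8-a -c op_sve_functions.c

and since GCC honours the last -march it sees, the +sve modifier from the component flags is silently dropped.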

Given that this experimental workaround is not functioning as intended, and since GCC does not provide dedicated command-line options like -msve or -msve2, it seems unlikely that the AVX approach can be replicated directly for AArch64. I am open to testing any ideas you might have and will gladly submit a pull request to get any working changes reviewed.

Marco
