The initial discussion can be found here. I am reposting it here to make it a formal discussion.
Background
Compilers warn when 512-bit vector types are passed or returned on targets without 512-bit vector support, but they still generate code that uses an altered ABI.
This leads to unexpected run-time failures when objects built for AVX2 are linked with objects built for AVX512. The linker cannot detect the risk in advance, which puts users at high risk whenever they use 512-bit vector types on non-512-bit targets.
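To make the mismatch concrete, here is a minimal sketch in C. It is a simplified model, not compiler code: the names `classify_vector_arg` and `abi_compatible` are hypothetical, and the rule is my assumption of how GCC and Clang currently behave on x86-64 (a vector wider than the enabled vector width falls back to memory). It ignores details such as running out of argument registers.

```c
#include <assert.h>
#include <stdbool.h>

/* Where a vector argument ends up. Hypothetical names for illustration. */
enum arg_loc { ARG_IN_REGISTER, ARG_IN_MEMORY };

/*
 * Simplified model (an assumption, not the real compiler logic):
 * vectors of up to 128 bits always go in a register; a wider vector goes
 * in a register only if the translation unit enables vectors at least
 * that wide, otherwise it is passed in memory.
 */
enum arg_loc classify_vector_arg(unsigned vec_bits, unsigned enabled_bits)
{
    assert(vec_bits > 0);
    if (vec_bits <= 128 || vec_bits <= enabled_bits)
        return ARG_IN_REGISTER;
    return ARG_IN_MEMORY;
}

/* Caller and callee agree only if both classify the argument the same way. */
bool abi_compatible(unsigned vec_bits, unsigned caller_bits, unsigned callee_bits)
{
    return classify_vector_arg(vec_bits, caller_bits) ==
           classify_vector_arg(vec_bits, callee_bits);
}
```

Under this model, `abi_compatible(512, 256, 512)` is false: the 256-bit caller stores the vector to memory while the 512-bit callee reads it from a register, and nothing in the object files tells the linker about it.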
The problem has existed for many years and is not limited to 512-bit vectors. But it becomes more serious for the 512-bit vector ABI on future AVX10 targets [spec, technical paper], because AVX10-256 is the general setting for binaries that can run on both AVX10-256 and AVX10-512. In the future it will be common for binaries compiled with AVX10-256 to be linked with natively built binaries on AVX10-512 targets.
To avoid these potentially undetectable linking catastrophes, we should improve the ABI by unifying it across AVX10-256 and AVX10-512 targets. Here are the proposals to solve it.
Proposals
Proposal 1: Promote the target attribute from AVX10-256 to AVX10-512 for any function that passes or returns vectors of 512 bits or more.
Problem: The binary cannot run on AVX10-256-only targets.
Reason:
When users pass or return a 512-bit vector, they should be aware that the code becomes target dependent. Users should be taught not to do this on 256-bit targets, and that unexpected things will happen if they insist.
Actually, ICC and MSVC have already chosen to promote based on the argument:
https://godbolt.org/z/vcrf9qW5z I think that if the compiler has to choose between two misbehaviors, a wrong result or a crash due to an illegal instruction, the latter is definitely better than the former.
In this way, we can also declare that x86-64-v5 inherits from x86-64-v4 and interoperates with previous versions.
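As a rough sketch of what the promotion rule in Proposal 1 means, the following C model computes the vector width a function effectively requires. The struct `fn_desc` and both function names are hypothetical and only illustrate the trade-off; this is not how any compiler implements it.

```c
#include <assert.h>

/* Hypothetical description of one function as the compiler sees it. */
struct fn_desc {
    unsigned target_bits;     /* vector width enabled for the translation unit */
    unsigned widest_sig_bits; /* widest vector passed to or returned by the function */
};

/*
 * Proposal 1 as a rule: a vector of 512 bits or more in the signature
 * promotes a function built for a narrower target to the 512-bit level.
 */
unsigned effective_bits(struct fn_desc f)
{
    assert(f.target_bits > 0);
    if (f.widest_sig_bits >= 512 && f.target_bits < 512)
        return 512;
    return f.target_bits;
}

/* Returns 1 if the function can execute on hardware with hw_bits-wide vectors. */
int runs_on(struct fn_desc f, unsigned hw_bits)
{
    return effective_bits(f) <= hw_bits;
}
```

For a function built with a 256-bit target that takes a 512-bit argument, `effective_bits` returns 512, so its ABI matches natively built 512-bit code, but `runs_on(f, 256)` is 0. That is the illegal-instruction crash traded for the silent wrong result.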
Proposal 2: Abort compilation when the user tries to pass or return 512-bit vectors.
Reason: This turns a potential run-time crash into a compile-time error.
Proposal 3: Change the ABI so that 512-bit vectors are always passed and returned in memory.
Reason: We expect AVX10-256 to be the universal configuration, and in most scenarios 512-bit vectors won't bring performance improvements. So we can sacrifice a little 512-bit performance to achieve interoperability between AVX10-256 and AVX10-512. In this way, there won't be any run-time issues in the future either.
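A source-level analogue of Proposal 3 is to pass the 512-bit payload by pointer, which users can already do today. The sketch below is only an illustration: `struct v512` and `v512_add` are hypothetical names, and a plain struct stands in for a real 512-bit vector type (it does not carry the 64-byte alignment of one).

```c
#include <assert.h>
#include <stdint.h>

/* 64-byte payload, the same size as one 512-bit vector. Hypothetical type. */
struct v512 {
    int64_t lane[8];
};

/*
 * Both operands and the result travel through memory, so the calling
 * convention does not depend on which vector extensions are enabled
 * in the caller or the callee.
 */
void v512_add(struct v512 *out, const struct v512 *a, const struct v512 *b)
{
    assert(out && a && b);
    for (int i = 0; i < 8; ++i)
        out->lane[i] = a->lane[i] + b->lane[i];
}
```

The cost is the extra store and load around each call, which is the "little 512-bit performance" the proposal is willing to give up.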
Summary
My preference: proposal 1 is better than proposal 2, and proposal 3 is the last choice because the 512-bit ABI on 512-bit targets is already widely used.