Hi all.
We want to make a patch to always emit unaligned vector move instructions on AVX machine with option control. We do this for the following reason:
raise an exception. If the same pointer is passed to the memop instruction, it will work. Thus, the behavior of misalignment depends upon what optimization levels and passes are applied, and small source changes could cause
issues to appear and disappear. It's better for the user to consistently use unaligned load/store to improve the debug experience;
Roman Lebedev is worried that this patch will hide UB. In our opinions, UB doesn't have to mean raise an exception. The example code(https://godbolt.org/z/43bYPraoa) does have UB behavior but it is still valid (and reasonable) to interpret that UB as `go slower',
instead of `raise exception'. Besides, as default we still emit aligned instructions as before, but we provide an option for users with this need.
We have two patches discussing this issue, one of which has been abandoned:
https://reviews.llvm.org/D88396 (abandoned)
https://reviews.llvm.org/D99565 (in review)
Thanks.
Chen Liu.
_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
+1 to what James said. My reaction to the original proposal is a strong -1, and James did a good job of explaining why.
Philip
Hi, James Y Knight.
I'm not sure if you misunderstood this patch. This patch won’t change any alignment information in IR and MI, which means ‘load…align 32’ will always keep the alignment information but select ‘vmovups’ instead of ‘vmovaps’ during ISEL. It can be simply considered that the only thing this patch does is to replace the aligned-move mnemonic with the unaligned-move mnemonic (in fact, we shouldn’t call it replace but emit unaligned). I think there is no impact on optimization or code layout.
After discussion, we think this option more like changing the behavior when process with unaligned memory: raising exception or accepting performance degradation. Maybe the option is more like “no-exception-on-unalginedmem”. We do have some users want this feature. They can accept “run slow” but do not want exception.
Thanks.
Chen Liu.
Hi, James Y Knight.
I'm not sure if you misunderstood this patch. This patch won’t change any alignment information in IR and MI, which means ‘load…align 32’ will always keep the alignment information but select ‘vmovups’ instead of ‘vmovaps’ during ISEL. It can be simply considered that the only thing this patch does is to replace the aligned-move mnemonic with the unaligned-move mnemonic (in fact, we shouldn’t call it replace but emit unaligned). I think there is no impact on optimization or code layout.
Yes, replacing aligned move instruction with unaligned move instruction doesn’t solve all the issue that happens in optimization pipeline, but it doesn’t make things worse. One advantage for unaligned move is that it makes the behavior the same no matter the mov instruction is folded or not. Do you think it is worth to support this feature if compiler can help users avoid changing their complex legacy code?
Thanks
Yuanke
IMO, no. We should encourage sanitizers instead.
From experience, any code base where porting trips across this
probably also has a bunch of other undefined behavior which is
causing less obvious miscompiles, and also need found and fixed.
That's why we have sanitizers.
Philip
I’ve debated whether to chime in, and decided it can’t hurt.
Sony had to do a similar downstream patch for PS4. Our use-case is pretty constrained, though. There’s only one toolchain, there’s only one target chip, so we don’t have any portability considerations to think about. What we do have are games shipping on DVD that can’t be re-released and can’t even necessarily be patched, and a strict backward compatibility requirement. So, if there’s a game out there that didn’t happen to follow all the alignment requirements, and it worked on console version 1.00, it still has to be working on version 100.00. (FTR, we’re currently on about 8.00.)
I don’t think we ever seriously considered upstreaming our patch. The circumstances where it’s really necessary do exist, but are pretty limited.
I don’t think arguments of the form “it’s okay because X Y and Z” are going to be persuasive. “We have this situation in the following circumstances” might help people understand.
--paulr
From: llvm-dev <llvm-dev...@lists.llvm.org> On Behalf Of
Luo, Yuanke via llvm-dev
Sent: Thursday, April 15, 2021 10:07 AM
To: James Y Knight <jykn...@google.com>; Liu, Chen3 <chen...@intel.com>
Cc: llvm...@lists.llvm.org; Maslov, Sergey V <sergey....@intel.com>
Subject: Re: [llvm-dev] [RFC] [X86] Emit unaligned vector moves on avx machine with option control.
Yes, replacing aligned move instruction with unaligned move instruction doesn’t solve all the issue that happens in optimization pipeline, but it doesn’t make things worse. One advantage for unaligned move is that it makes the behavior the same no matter the mov instruction is folded or not. Do you think it is worth to support this feature if compiler can help users avoid changing their complex legacy code?
Thanks
Yuanke
_______________________________________________
We conditioned it on the PS4 target; no option. So, PS4 consistently uses unaligned instructions for (temporal) vector load/store.
We get occasional downstream test failures because of this, which we deal with.
--paulr
What I suspect you actually want here is an option to tell Clang not to infer load/store alignments based on object types or alignment attributes -- instead treating everything as being potentially aligned to 1 unless the allocation is seen (e.g. global/local variables). Clang would still need to use the usual alignment computation for variable definitions and structure layout, but not memory operations. If clang emits "load ... align 1" instructions in LLVM IR, the right thing would then happen in the X86 backend automatically.
| This sounds like the -fmax-type-align flag:
Well, no, at least not for the PS4 case. In our case, the type had an alignment attribute but the caller didn’t make sure the allocated memory was aligned properly. The -fmax-type-align flag explicitly doesn’t do anything in that case, if I’m reading it correctly. (Yes, it’s a bug. Yes, sanitizers or other testing could have found it. No, there is no opportunity to do any of the things that would have fixed it correctly.)
Really what we did was effectively this: Pretend X86 doesn’t have a VMOVAPS opcode. That’s all. Nothing about memory/operand alignment attributes was modified, IR is unchanged. Pretend that one machine opcode is missing. Can’t possibly affect anything about IR optimizations, *maybe* something post-ISel would be different but even that is hard to imagine. (As best I can remember, the only test updates we had to make were to change things like “vmovaps” to “vmov{{u|a}}ps” and done.) It’s like we did s/movaps/movups/g on the assembly output.
I still can’t say I think it should be appropriate to do upstream—no real info yet on Intel’s problem case--but I hope this explains why the bigger hammer (i.e., get Clang involved) doesn’t seem necessary or appropriate.
--paulr
From: llvm-dev <llvm-dev...@lists.llvm.org> On Behalf Of
Reid Kleckner via llvm-dev
Sent: Thursday, April 15, 2021 12:59 PM
To: James Y Knight <jykn...@google.com>
Cc: llvm...@lists.llvm.org; Liu, Chen3 <chen...@intel.com>; Luo, Yuanke <yuank...@intel.com>; Maslov, Sergey V <sergey....@intel.com>
Subject: Re: [llvm-dev] [RFC] [X86] Emit unaligned vector moves on avx machine with option control.
On Wed, Apr 14, 2021 at 11:58 AM James Y Knight via llvm-dev <llvm...@lists.llvm.org> wrote:
_______________________________________________
Reid, I’m not clear why anyone would want to “power down” the alignment-aware optimizations? How does that benefit anyone? For example…
Let’s postulate a target that has only non-trapping load/store instructions; maybe they go faster on aligned addresses, but don’t trap on unaligned addresses. It has been a few decades but I think VAX worked this way.
Would you insist we should power-down the alignment-aware optimizations for this target? Just because the hardware couldn’t require aligned data? I hope not.
The conclusion must be, then, that there is no relationship between the existence of trapping/non-trapping instruction behavior for a given target, and how the frontend and middle-end should behave.
Therefore, we can’t insist on the front-end slapping “align 1” on everything just because the target doesn’t trap a non-aligned load.
Therefore, the choice of trapping/non-trapping instruction behavior in the X86 target specifically, has no necessary relationship to how alignment is thought of in the front-end/middle-end.
HTH,
--paulr
Before we completely float away from adding this to LLVM, consider that MSVC behavior is already like ICC: https://godbolt.org/z/o4eaqGv9v
And GCC folks are saying they could add an option for compatibility.
The Intel’s cases where this is needed is exactly for robustness of interoperability with the already released software.
From: Reid Kleckner <r...@google.com>
Sent: Thursday, April 15, 2021 1:09 PM
To: Robinson, Paul <paul.r...@sony.com>
Cc: James Y Knight <jykn...@google.com>; Liu, Chen3 <chen...@intel.com>; Luo, Yuanke <yuank...@intel.com>; Maslov, Sergey V <sergey....@intel.com>; llvm-dev <llvm...@lists.llvm.org>
Subject: Re: [llvm-dev] [RFC] [X86] Emit unaligned vector moves on avx machine with option control.
Reid, I’m not clear why anyone would want to “power down” the alignment-aware optimizations? How does that benefit anyone? For example…
Let’s postulate a target that has only non-trapping load/store instructions; maybe they go faster on aligned addresses, but don’t trap on unaligned addresses. It has been a few decades but I think VAX worked this way.
Would you insist we should power-down the alignment-aware optimizations for this target? Just because the hardware couldn’t require aligned data? I hope not.
The conclusion must be, then, that there is no relationship between the existence of trapping/non-trapping instruction behavior for a given target, and how the frontend and middle-end should behave.
Therefore, we can’t insist on the front-end slapping “align 1” on everything just because the target doesn’t trap a non-aligned load.
Therefore, the choice of trapping/non-trapping instruction behavior in the X86 target specifically, has no necessary relationship to how alignment is thought of in the front-end/middle-end.
I think we can consider movaps is a limitation for legacy microarchitectures which gets better performance for aligned memory load/store. It does to be feature for new microarchitectures, i.e. movups is faster to execute on new microarchitectures when aligned.
If users depend on exceptions on alignment tricks, they should explicitly use proposed option like “-exception-on-unalginedmem”, which is not only keep to use movaps but also block existing memory folding. Does it make more sense?
Thanks
Pengfei
From: llvm-dev <llvm-dev...@lists.llvm.org> On Behalf Of
James Y Knight via llvm-dev
Sent: Saturday, April 17, 2021 5:35 AM
To: Robinson, Paul <paul.r...@sony.com>
> I still don't know exactly what Intel is facing, but at Sony we
> have games already shipped that CANNOT BE FIXED because they are
> embedded in DVD. It is literally physically impossible to fix the
> buggy software, and we have a moral contract with users that their
> games will continue to run on all future releases of the console.
Are they JIT'ed? If not, i'm not really sure how this change
to the X86 backend would retroactively "fix" already-compiled code.
> I understand your goal is to find and fix bugs in software that is
> still under development and CAN be fixed. I fully endorse that
> goal. However, that is not the situation that Sony has, and likely
> not what Intel has. Your proposal will NOT solve our problem.
>
> HTH,
> --paulr
Roman
Out of curiosity, why do you solve this in the backend rather than
patching your frontend/headers to not specify alignment? To my
knowledge (which isn't great in this area), alignment of pointer
arguments isn't part of the ABI. Wouldn't tweaking the headers and
simply recompiling your system libraries get you the same effect?
p.s. The more you explain about your use case, the less motivating I
find it for upstream. This sounds like a weird situation you've created
for yourselves and should bear the cost of maintaining the mitigation
for. Just as other downstream distributions do for other issues. Just
my 2 cents.
I understand your goal is to find and fix bugs in software that is
still under development and CAN be fixed. I fully endorse that
goal. However, that is not the situation that Sony has, and likely
not what Intel has. Your proposal will NOT solve our problem.
We might still not be fully understanding one another, because this:
so that you can compile code with under-aligned objects, and have it work as the author expected it to
sounds like you’re expecting us to recompile the client code that creates the under-aligned objects. That is literally not possible. If you do understand that part, great, it’s just not obvious to me from how you’re phrasing things.
I (still) don’t know what Intel is facing. For Sony’s problem, we would be much more likely to try to do something specific to the APIs that are being abused, rather than something draconian like eliminating alignment requirements for everyone. But of course we have a solution that works for us, so there’s that much more inertia to overcome.
--paulr
From: James Y Knight <jykn...@google.com>
Sent: Monday, April 19, 2021 2:30 PM
To: Robinson, Paul <paul.r...@sony.com>
I collected the feedback/requirement from Intel customer as below.
Our software runs in an embedded environment and is processing buffers which are unaligned. Sometimes this misalignment is simply because the buffer allocation is beyond the immediate control of our software but it can also be because we are processing blocks of data which are not multiples of the vector size (e.g., 6, 12 or 24). We can’t just fix our buffers to make them aligned. Our code is complicated and we support multiple instruction sets operating using the same algorithms by using templated code. For example:
template<typename DVEC_TYPE>
void doSomething(DVEC_TYPE* data)
{
// Trivial example – reality would be something much more substantial, possibly with loops or other function calls.
*data += 1.0f;
}
Note that we use dvec to help us abstract the ISA, but other similar header-only vector overloading libraries also exist.
We would then instantiate our function above multiple times for each ISA or data type we care about:
template void doSomething<float>(float* data); // Scalar type useful for debugging algorithm and doing basic testing
template void doSomething<F32vec8>(F32vec8* data); // Different AVX widths
template void doSomething<F32vec16>(F32vec16* data);
template void doSomething<I32vec16>(I32vec16* data); // Different element type
The functions are sufficiently large that we don’t want to have to write a different version for each ISA. We know that the incoming data may be mis-aligned and that accessing it directly is UB, so we could modify our code to explicitly handle misalignment. Something like:
template<typename DVEC_TYPE>
void doSomething(DVEC_TYPE* data)
{
DVEC_TYPE t;
loadu(t, data);
t += 1.0f;
storeu(data, t);
}
The code has become more verbose, less readable (maintainable, debuggable, etc), and it no longer works with plain scalar types which don’t have loadu/storeu defined unless we start defining overloaded helper functions. Also, if `data’ pointed at an array, we’d have to throw some pointer arithmetic into the mix, rather than just using plain `data[IDX]’ syntax. We can certainly write code which could cope with the misalignment explicitly but it just ends up becoming messy. Or, we could leverage the hardware to manage this misalignment for us letting the compiler emit the movups instruction, instead of movaps.
Until now we have only been using the Intel Compiler, so we have written our code to use ICC’s unaligned operations and hardware support to make our code cleaner. We are looking at porting our code to LLVM, but LLVM is making this harder than it needs to be.
Thanks
Yuanke
_______________________________________________
How about:
https://godbolt.org/z/vsj9raaqM
> Or, we could leverage the hardware to manage this misalignment for us letting the compiler emit the movups instruction, instead of movaps.
I guess people are intentionally ignoring all mentions that the code
will *still* be miscompiled in other ways.
That's sad.
> Until now we have only been using the Intel Compiler, so we have written our code to use ICC’s unaligned operations and hardware support to make our code cleaner. We are looking at porting our code to LLVM, but LLVM is making this harder than it needs to be.
>
>
>
> Thanks
>
> Yuanke
Roman