We are students from the Indian Institute of Technology (IIT), Hyderabad. We would like to propose the addition of the following pragmas to LLVM, which aid (or possibly increase the scope of) vectorization in LLVM in comparison with other compilers.
ivdep
Nontemporal
[no]vecremainder
[no]mask_readwrite
[un]aligned
Could you please check the following Google document for the semantic description of these pragmas:
https://docs.google.com/document/d/1YjGnyzWFKJvqbpCsZicCUczzU8HlLHkmG9MssUw-R1A/edit?usp=sharing
It would be great if you could review the above document and advise us on how to proceed (either on the semantics or on the relevant code sections in LLVM).
Thank you
Yashas, Happy, Sai Praharsh, and Bhavya
B.Tech 3rd year, IITH.
Hi,
First, as a high-level note, you posted a link to a Google doc, and at the end of the Google doc, you have a list of questions that you'd like answered. In the future, please put the questions directly in the email. For one thing, more people will read your email than will open your Google doc. Second, having the questions in the email should allow a better threading structure to the replies.
> Ivdep: Is clang loop vectorize(assume_safety) equivalent to ivdep? To what extent do the semantics of ivdep need to be modified for Clang to create an equally “useful pragma”? To what extent would it be helpful to have this pragma in Clang?
There is a fundamental problem with the way that ivdep is defined by Intel's current documentation, at least for C/C++. As you note in your Google doc, it essentially says that the optimizer may ignore loop-carried dependencies except for those dependencies it can definitely prove are present. These are not semantics that any other compiler can actually replicate, and they are not equivalent to "vectorize(assume_safety)" (which asserts that no loop-carried dependencies are present). The good news is that, in conversations I've had with Intel, an openness to making these semantics more concrete has been expressed. I think it would be very useful to have ivdep in Clang, but only after we nail down the semantics with Intel in some useful way.
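To make the distinction concrete, here is a minimal sketch of a loop where it matters. The function name scatter_add is hypothetical; the pragma is Clang's existing vectorize(assume_safety) hint, guarded so that other compilers simply skip it. With the hint, the programmer asserts that no two iterations touch the same element of a[], i.e. that ix[] contains no repeated index. Under the Intel wording, by contrast, the compiler would still honor any dependency it can prove.

```c
#include <stddef.h>

/* Hypothetical example: an indirect update whose safety depends on the
 * run-time contents of ix[], which the compiler cannot see. */
void scatter_add(float *a, const int *ix, const float *b, size_t n)
{
    /* Asserts that there are NO loop-carried dependencies, i.e. that
     * ix[] has no repeated entries. If that is false, the vectorized
     * loop may compute something different from the scalar loop. */
#ifdef __clang__
#pragma clang loop vectorize(assume_safety)
#endif
    for (size_t i = 0; i < n; ++i)
        a[ix[i]] += b[i];
}
```

Calling scatter_add with ix = {3, 1, 0, 2} satisfies the assertion; calling it with ix = {0, 1, 1, 2} does not, and the hint then makes the result unreliable.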
> Nontemporal: What kind of analysis can we do in LLVM to find where to use nontemporal accesses? Any help would be greatly appreciated.
If you're asking about the pragma, then what analysis is necessary? In general, you're looking for accesses that won't benefit from caching (e.g., streaming data which is not accessed again).
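As an illustration of the streaming pattern described here, the sketch below writes each output element exactly once and never reads it back. It assumes Clang's __builtin_nontemporal_store builtin when available and falls back to an ordinary store elsewhere; the function name scale_stream is hypothetical, and whether the hint actually produces a cache-bypassing instruction depends on the target.

```c
#include <stddef.h>

#ifndef __has_builtin
#define __has_builtin(x) 0   /* compilers without __has_builtin */
#endif

/* Hypothetical example: dst[] is written once per element and not read
 * again in this loop, so caching the written lines brings no benefit. */
void scale_stream(float *dst, const float *src, float k, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
#if __has_builtin(__builtin_nontemporal_store)
        /* Clang builtin: marks the store as nontemporal (a hint). */
        __builtin_nontemporal_store(src[i] * k, &dst[i]);
#else
        dst[i] = src[i] * k;   /* portable fallback: ordinary store */
#endif
    }
}
```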
> vecremainder/novecremainder: Should the pragma simply call the vectorizer to attempt to vectorize the remainder loop, or should the vectorizer use a different method?
Something like that. There were patches posted at some point to enable tail-loop vectorization. At this point, I imagine that you'd construct a VPlan with the vectorized tail.
> mask_readwrite/nomask_readwrite: Is it a good idea to implement a pragma that will generate mask intrinsics in the IR? What other architectures (besides x86) have support for masked reads/writes?
ARM SVE might also fall into this category.
> Reference: https://llvm.org/devmtg/2015-04/slides/MaskedIntrinsics.pdf
> LLVM has mask intrinsics for targets with AVX, AVX2, AVX-512.
> From the slides: “Most of the targets do not support masked instructions, optimization of instructions with masks is problematic, avoid introducing new masked instructions into LLVM IR”
> aligned/unaligned: Is it worthwhile to have an LLVM-specific pragma rather than depending on OpenMP?
My opinion is that, so long as we have our own vectorization pragma, it should be as fully-featured as people request it to be.
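For context, alignment can already be asserted today without a loop pragma. The sketch below uses the __builtin_assume_aligned builtin, which both GCC and Clang provide; OpenMP offers the same information through the aligned clause of "omp simd". The function name add_aligned and the 32-byte figure are assumptions chosen for illustration, and passing a less-aligned pointer is undefined behavior.

```c
#include <stddef.h>

/* Hypothetical example: promise the optimizer that both pointers are
 * 32-byte aligned, so aligned vector loads and stores may be used. */
void add_aligned(float *dst, const float *src, size_t n)
{
    float *d = __builtin_assume_aligned(dst, 32);
    const float *s = __builtin_assume_aligned(src, 32);
    for (size_t i = 0; i < n; ++i)
        d[i] += s[i];
}
```

A pragma would express the same promise per loop rather than per pointer, which is the trade-off the question is about.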
-Hal
--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
On 8/8/19 2:03 PM, Hal Finkel wrote:
> There is a fundamental problem with the way that ivdep is defined by Intel's current documentation, at least for C/C++. As you note in your Google doc, it essentially says that the optimizer may ignore loop-carried dependencies except for those dependencies it can definitely prove are present. These are not semantics that any other compiler can actually replicate, and they are not equivalent to "vectorize(assume_safety)" (which asserts that no loop-carried dependencies are present). The good news is that, in conversations I've had with Intel, an openness to making these semantics more concrete has been expressed. I think it would be very useful to have ivdep in Clang, but only after we nail down the semantics with Intel in some useful way.
_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Thanks, Scott.
Regarding this:
> It doesn't remove all dependencies, just dependencies that inhibit vectorization.
This matches what Cray's manual says, but I'm also not sure how to interpret this statement. Does that mean that which dependencies are ignored depends on the selected target? I'm a bit worried that the dependencies relevant to vectorization might change over time or depend on the hardware being targeted.
Can you please take a look at the way that Intel's Fortran manual defines ivdep (https://software.intel.com/en-us/fortran-compiler-developer-guide-and-reference-ivdep) and say whether those semantics would also make sense for Cray's implementation?
I believe our consensus view is that the semantics of these kinds of pragmas should be specified such that we could create a sanitizer which checks their dynamic semantic correctness independent of what the optimizer is actually capable of exploiting.
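One way to picture such a check, for a loop of the form a[ix[i]] += 1.0 annotated with an assertion that its iterations are independent, is sketched below. This is not an existing LLVM sanitizer; the function name iterations_independent and its return convention are hypothetical. The point is only that the property being checked is defined by what the program does at run time, not by what any optimizer can exploit.

```c
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical dynamic check for a loop of the form a[ix[i]] += ...
 * `len` is the number of elements in a[].
 * Returns  1 if no element of a[] is touched by more than one iteration,
 *          0 if some index repeats (a loop-carried dependency exists),
 *         -1 on an out-of-range index or allocation failure. */
int iterations_independent(const int *ix, size_t n, size_t len)
{
    unsigned char *seen = calloc(len ? len : 1, 1);
    if (!seen)
        return -1;
    int ok = 1;
    for (size_t i = 0; i < n && ok == 1; ++i) {
        if (ix[i] < 0 || (size_t)ix[i] >= len)
            ok = -1;
        else if (seen[ix[i]])
            ok = 0;              /* two iterations hit the same element */
        else
            seen[ix[i]] = 1;
    }
    free(seen);
    return ok;
}
```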
-Hal
> We'll put together a list of what we do with IVDEP and see if they are all covered under that wording.
Thanks, that will be helpful.
-Hal
> 2 Nontemporal
Is this a hint or a command? If it's a command then this would
implicitly specify the data is aligned on some targets (e.g. Intel X86).
I'm not sure we want to make that implicit assumption as it is very easy
for the programmer to get this wrong.
-David
> There is a fundamental problem with the way that ivdep is defined by Intel's current documentation, at least for C/C++. As you note in your Google doc, it essentially says that the optimizer may ignore loop-carried dependencies except for those dependencies it can definitely prove are present. These are not semantics that any other compiler can actually replicate, and is not equivalent to "vectorize(assume_safety)" (which asserts that no loop-carried dependencies are present). The good news is that, in conversations I've had with Intel, an openness to making these semantics more concrete has been expressed. I think it would be very useful to have ivdep in Clang, but only after we nail down the semantics with Intel in some useful way.

Agreed. I don't see a lot of value in having the compiler override a pragma that is supposed to override the compiler :) Cray's IVDEP really means what the documentation says: Ignore Vector DEPendencies. It doesn't remove all dependencies, just dependencies that inhibit vectorization. It also does not force vectorization. If it's not possible or not profitable to vectorize, then it won't vectorize.
I will add that ivdep is well used by Cray and its users, so I'd like to see it well defined in Clang/llvm.
HAPPY Mahto via llvm-dev <llvm...@lists.llvm.org> writes:
> vecremainder/novecremainder: Should the pragma simply call the vectorizer to attempt to vectorize the remainder loop, or should the vectorizer use a different method?
>
> Something like that. There were patches posted at some point to enable tail-loop vectorization. At this point, I imagine that you'd construct a VPlan with the vectorized tail.
Yep, committed in https://reviews.llvm.org/rL366989 and https://reviews.llvm.org/D65197.
The pragma name is different, but I think it tries to achieve the same thing.
If I understand Intel's documentation correctly, these are different things:
* vectorize.predicate.enable: Do not create an epilogue loop (use masked
  instructions in the main loop instead).
* vecremainder: If there is an epilogue loop, vectorize it as well
  (which will require masked instructions in the epilogue, but not in
  the main loop).
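The two loop shapes can be sketched in scalar C that emulates what the vectorizer would generate. The vector width VF = 4 and both function names are assumptions for illustration; the inner "lane" loops stand in for a single vector instruction.

```c
#include <stddef.h>

enum { VF = 4 };  /* assumed vector width, in elements */

/* Shape 1: predicated main loop, no epilogue. Every block runs all VF
 * lanes; the mask (i + lane < n) disables lanes past the end. */
void add_one_predicated(float *a, size_t n)
{
    for (size_t i = 0; i < n; i += VF)
        for (size_t lane = 0; lane < VF; ++lane)
            if (i + lane < n)          /* per-lane mask */
                a[i + lane] += 1.0f;
}

/* Shape 2: unmasked main loop over full blocks, followed by an epilogue
 * for the remaining n % VF elements. Vectorizing that epilogue at the
 * same width needs masking there, but not in the main loop. */
void add_one_with_epilogue(float *a, size_t n)
{
    size_t i = 0;
    for (; i + VF <= n; i += VF)       /* main loop: full blocks only */
        for (size_t lane = 0; lane < VF; ++lane)
            a[i + lane] += 1.0f;
    for (; i < n; ++i)                 /* epilogue (remainder) loop */
        a[i] += 1.0f;
}
```

Both functions compute the same result; they differ only in where the partial block is handled, which is the distinction between the two pragmas.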
Michael
On Thu, Aug 15, 2019 at 3:06 PM, Terry Greyzck via llvm-dev
<llvm...@lists.llvm.org> wrote:
> * Primarily ivdep allows ambiguous dependencies to be ignored, examples:
> * p[i] = q[j]
> * a[ix[i]] = b[iy[i]]
> * a[ix[i]] += 1.0
"ambiguous dependencies" is very vague. Does it mean the compiler has
to do some analysis to detect non-ambiguous dependencies?
When using "llvm.mem.parallel_loop_access", this would mean the
front-end would have to detect which accesses are non-ambiguous and
not annotate them. However, the annotation is for single accesses, not
dependencies. Both "p[i]" and "q[j]" look non-ambiguous individually,
but the vectorizer would have to add a runtime-check and loop
versioning to ensure that these are not aliasing.
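A hand-written sketch of the runtime check and loop versioning mentioned here follows. The helper ranges_disjoint and the function add_one_versioned are hypothetical names, and the loop body (p[i] = q[i] + 1) is a simplified stand-in for the quoted p[i] = q[j] case. The two branches are textually identical on purpose: only the first may be vectorized, because only there is the absence of aliasing established.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the n-element ranges starting at p and q do not overlap. */
static int ranges_disjoint(const float *p, const float *q, size_t n)
{
    uintptr_t pb = (uintptr_t)p, qb = (uintptr_t)q;
    uintptr_t bytes = (uintptr_t)(n * sizeof(float));
    return pb + bytes <= qb || qb + bytes <= pb;
}

/* Hypothetical versioned loop computing p[i] = q[i] + 1. */
void add_one_versioned(float *p, const float *q, size_t n)
{
    if (ranges_disjoint(p, q, n)) {
        /* Fast version: no aliasing, so iterations are independent and
         * the loop may be vectorized. */
        for (size_t i = 0; i < n; ++i)
            p[i] = q[i] + 1.0f;
    } else {
        /* Fallback version: must keep the original scalar order. */
        for (size_t i = 0; i < n; ++i)
            p[i] = q[i] + 1.0f;
    }
}
```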
> * ivdep still requires automatic detection of reductions, including
> multiple homogeneous reductions on a single variable, examples:
> * x = x + a[i]
> * x = x + a[i]; if ( c[i] > 0.0 ) { x = x + b[i] }
We could omit the "llvm.mem.parallel_loop_access" annotation for the
LoadInst and StoreInst of the reduction variable, assuming detected
reductions are limited to scalar variables. However, mem2reg/sroa
would remove those memory accesses anyway, including their annotation,
requiring the LoopVectorizer to detect that the resulting PHINode is a
reduction. Mem2reg/sroa/LICM would do the same with non-reductions and
with array elements that are promoted to registers for the duration of
the loop, such that the loop would not be vectorizable.
Michael
This is what makes implementing ivdep with Cray's semantics difficult.
To be compatible, we'd need to replicate Cray's cycle breaking.
Failing to detect a reduction that Cray detects means ignoring its
dependency cycle, and therefore miscompiling a loop for which Cray's
vectorizer might have produced correct code (and the other way around).
Unpredictably miscompiling programs is probably not what users would
expect.
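The following scalar emulation illustrates why a missed reduction is a miscompilation rather than a missed optimization. It assumes a vector width of 4 (LANES) and uses hypothetical function names; the first function models a vectorizer that simply ignores the dependency cycle on x, the second models one that recognizes x = x + a[i] as a reduction.

```c
#include <stddef.h>

enum { LANES = 4 };  /* assumed vector width, in elements */

/* Emulates ignoring the dependency cycle on x: all lanes of a block
 * read the same old x, and the last lane's store wins. */
float sum_cycle_ignored(const float *a, size_t n)
{
    float x = 0.0f;
    for (size_t i = 0; i < n; i += LANES) {
        float old_x = x;                       /* one load per block */
        for (size_t lane = 0; lane < LANES && i + lane < n; ++lane)
            x = old_x + a[i + lane];           /* later lanes overwrite */
    }
    return x;
}

/* Emulates a recognized reduction: one partial sum per lane, combined
 * after the loop. */
float sum_as_reduction(const float *a, size_t n)
{
    float part[LANES] = {0.0f};
    for (size_t i = 0; i < n; ++i)
        part[i % LANES] += a[i];
    float x = 0.0f;
    for (size_t lane = 0; lane < LANES; ++lane)
        x += part[lane];
    return x;
}
```

For the input 1, 2, ..., 8 the reduction version returns 36, while the version that ignores the cycle returns 12, since only a[3] and a[7] survive.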
> One thing to remember is that is perfectly valid for the "ivdep" loop
> nest to still be rejected as a vector candidate for any reason, so
> support for an "ivdep" pragma could be implemented in stages if desired.
The vectorizer rejecting any "ivdep" loop that has unbroken dependency
cycles makes the annotation useless. We'd need to have a description
of dependencies that any Cray compiler (including past and future
versions) will ignore (instead of breaking by e.g. reduction
detection) with ivdep such that Clang never miscompiles a loop that a
Cray compiler compiles correctly.