[llvm-dev] [LLVM] (RFC) Addition/Support of new Vectorization Pragmas in LLVM

HAPPY Mahto via llvm-dev

Aug 8, 2019, 12:56:23 PM

to llvm...@lists.llvm.org, BHAVYA BAGLA, MAMIDALA SAI PRAHARSH, HAPPY KUMAR, YASHAS ANDALURI

Hello all,

We are students from the Indian Institute of Technology (IIT), Hyderabad. We would like to propose the addition of the following pragmas to LLVM, which aid (or possibly increase the scope of) vectorization in LLVM (in comparison with other compilers).

ivdep
Nontemporal
[no]vecremainder
[no]mask_readwrite
[un]aligned

Could you please check the following Google document for the semantic description of these pragmas:

https://docs.google.com/document/d/1YjGnyzWFKJvqbpCsZicCUczzU8HlLHkmG9MssUw-R1A/edit?usp=sharing

It would be great if you could review the above document and advise us on how to proceed (either on the semantics or on the relevant code sections in LLVM).

Thank you

Yashas, Happy, Sai Praharsh, and Bhavya

B.Tech 3rd year, IITH.

Finkel, Hal J. via llvm-dev

Aug 8, 2019, 3:03:31 PM

to HAPPY Mahto, llvm...@lists.llvm.org, MAMIDALA SAI PRAHARSH, BHAVYA BAGLA, YASHAS ANDALURI

Hi,

First, as a high-level note, you posted a link to a Google doc, and at the end of the Google doc, you have a list of questions that you'd like answered. In the future, please put the questions directly in the email. For one thing, more people will read your email than will open your Google doc. Second, having the questions in the email should allow a better threading structure to the replies.

Ivdep: Is clang loop vectorize(assume_safety) equivalent to ivdep? To what extent do the semantics of ivdep need to be modified for Clang to create an equally “useful pragma”? To what extent would it be helpful to have this pragma in Clang?
Nontemporal: What kind of analysis can we do in LLVM to find where to use nontemporal accesses? Any help would be greatly appreciated.
vecremainder/novecremainder: Should the pragma simply call the vectorizer to attempt to vectorize the remainder loop, or should the vectorizer use a different method?
mask_readwrite/nomask_readwrite: Is it a good idea to implement a pragma that will generate mask intrinsics in the IR? What other architectures (besides x86) have support for masked reads/writes?

Reference: https://llvm.org/devmtg/2015-04/slides/MaskedIntrinsics.pdf

LLVM has mask intrinsics for targets with AVX, AVX2, AVX-512.

From the slides: “Most of the targets do not support masked instructions, optimization of instructions with masks is problematic, avoid introducing new masked instructions into LLVM IR”

aligned/unaligned: Is it worthwhile to have an LLVM-specific pragma rather than depending on OpenMP?

-Hal

Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

Vectorization Pragmas LLVM:RFC: V2

docs.google.com

Vectorization Pragmas in LLVM: An RFC Yashas Andaluri, Happy Mahto, M Sai Praharsh, Bhavya Bagla IIT Hyderabad Aug 8th, 2019 [Thanks to feedback from Venugopal Raghavan, Shivarama Rao (AMD) and Michael Kruse & Hal Finkel (ANL).] Vectorization Pragmas ivdep vector(nontemporal) vector([no]vecrema...

Finkel, Hal J. via llvm-dev

Aug 8, 2019, 7:52:28 PM

to HAPPY Mahto, llvm...@lists.llvm.org, MAMIDALA SAI PRAHARSH, BHAVYA BAGLA, YASHAS ANDALURI

On 8/8/19 2:03 PM, Hal Finkel wrote:

> Hi,
>
> First, as a high-level note, you posted a link to a Google doc, and at the end of the Google doc, you have a list of questions that you'd like answered. In the future, please put the questions directly in the email. For one thing, more people will read your email than will open your Google doc. Second, having the questions in the email should allow a better threading structure to the replies.
>
> Ivdep: Is clang loop vectorize(assume_safety) equivalent to ivdep? To what extent do the semantics of ivdep need to be modified for Clang to create an equally “useful pragma”? To what extent would it be helpful to have this pragma in Clang?

There is a fundamental problem with the way that ivdep is defined by Intel's current documentation, at least for C/C++. As you note in your Google doc, it essentially says that the optimizer may ignore loop-carried dependencies except for those dependencies it can definitely prove are present. These are not semantics that any other compiler can actually replicate, and they are not equivalent to "vectorize(assume_safety)" (which asserts that no loop-carried dependencies are present). The good news is that, in conversations I've had with Intel, an openness to making these semantics more concrete has been expressed. I think it would be very useful to have ivdep in Clang, but only after we nail down the semantics with Intel in some useful way.
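For readers unfamiliar with the existing Clang pragma, here is a minimal sketch of how "vectorize(assume_safety)" is typically written. The function name scale_copy and its parameters are made up for illustration; the pragma is guarded so that other compilers simply see a plain loop.

```c
#include <stddef.h>

/* Hypothetical kernel: p and q are not declared restrict, so the compiler
 * cannot rule out that a store to p[i] feeds a later load of q[j]. */
void scale_copy(float *p, const float *q, float k, size_t n) {
#if defined(__clang__)
    /* Asserts that the loop has no loop-carried dependencies at all, so no
     * runtime overlap check is needed. If p and q did overlap in a way that
     * creates such a dependency, the vectorized result could be wrong. */
    #pragma clang loop vectorize(assume_safety)
#endif
    for (size_t i = 0; i < n; ++i) {
        p[i] = q[i] * k;
    }
}
```

Note that this is a promise made by the programmer about the loop, not a request to ignore only the dependencies the compiler is unsure about, which is the difference from ivdep discussed above.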

> Nontemporal: What kind of analysis can we do in LLVM to find where to use nontemporal accesses? Any help would be greatly appreciated.

If you're asking about the pragma, then what analysis is necessary? In general, you're looking for accesses that won't benefit from caching (e.g., streaming data which is not accessed again).
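As an illustration of the kind of access meant here, a small sketch of a streaming store loop. It assumes Clang's __builtin_nontemporal_store builtin and falls back to an ordinary store elsewhere; the NT_STORE macro and the scale_stream function are hypothetical names, and whether a nontemporal instruction is actually emitted depends on the target.

```c
#include <stddef.h>

/* Use Clang's nontemporal builtin when available; otherwise fall back to a
 * plain store so the sketch still compiles with other compilers. */
#if defined(__has_builtin)
#  if __has_builtin(__builtin_nontemporal_store)
#    define NT_STORE(ptr, val) __builtin_nontemporal_store((val), (ptr))
#  endif
#endif
#ifndef NT_STORE
#  define NT_STORE(ptr, val) (*(ptr) = (val))
#endif

/* Hypothetical streaming kernel: dst is written once and never read again
 * inside the loop, so keeping its cache lines around is unlikely to help. */
void scale_stream(float *dst, const float *src, float k, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        NT_STORE(&dst[i], src[i] * k);
    }
}
```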

> vecremainder/novecremainder: Should the pragma simply call the vectorizer to attempt to vectorize the remainder loop, or should the vectorizer use a different method?

Something like that. There were patches posted at some point to enable tail-loop vectorization. At this point, I imagine that you'd construct a VPlan with the vectorized tail.

> mask_readwrite/nomask_readwrite: Is it a good idea to implement a pragma that will generate mask intrinsics in the IR? What other architectures (besides x86) have support for masked reads/writes?

ARM SVE might also fall into this category.

> Reference: https://llvm.org/devmtg/2015-04/slides/MaskedIntrinsics.pdf
>
> LLVM has mask intrinsics for targets with AVX, AVX2, AVX-512.
>
> From the slides: “Most of the targets do not support masked instructions, optimization of instructions with masks is problematic, avoid introducing new masked instructions into LLVM IR”
>
> aligned/unaligned: Is it worthwhile to have an LLVM-specific pragma rather than depending on OpenMP?

My opinion is that, so long as we have our own vectorization pragma, it should be as fully-featured as people request it to be.
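For context, alignment can already be asserted today without a pragma. The sketch below assumes the GCC/Clang builtin __builtin_assume_aligned; the function name add_aligned and the 32-byte figure are illustrative only. The OpenMP route the question refers to would instead attach an aligned clause to an "omp simd" directive.

```c
#include <stddef.h>

/* Hypothetical kernel. __builtin_assume_aligned returns its argument and
 * lets the optimizer assume the stated alignment. Passing a pointer that is
 * not actually 32-byte aligned is undefined behavior. */
void add_aligned(float *dst, const float *src, size_t n) {
    float *d = __builtin_assume_aligned(dst, 32);
    const float *s = __builtin_assume_aligned(src, 32);
    for (size_t i = 0; i < n; ++i) {
        d[i] += s[i];
    }
}
```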

-Hal

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

Cameron McInally via llvm-dev

Aug 8, 2019, 8:51:01 PM

to Finkel, Hal J., llvm...@lists.llvm.org, MAMIDALA SAI PRAHARSH, YASHAS ANDALURI, HAPPY Mahto, BHAVYA BAGLA

On Thu, Aug 8, 2019 at 7:52 PM Finkel, Hal J. via llvm-dev <llvm...@lists.llvm.org> wrote:

> There is a fundamental problem with the way that ivdep is defined by Intel's current documentation, at least for C/C++. As you note in your Google doc, it essentially says that the optimizer may ignore loop-carried dependencies except for those dependencies it can definitely prove are present. These are not semantics that any other compiler can actually replicate, and they are not equivalent to "vectorize(assume_safety)" (which asserts that no loop-carried dependencies are present). The good news is that, in conversations I've had with Intel, an openness to making these semantics more concrete has been expressed. I think it would be very useful to have ivdep in Clang, but only after we nail down the semantics with Intel in some useful way.

To be fair, IVDEP most likely originated at Cray. [Or maybe Control Data. The history is fuzzy that far back. I do know it predates ANSI C.]

There's a publicly available copy of the Cray C/C++ manual here:

https://pubs.cray.com/content/S-2179/9.0/cray-classic-c-and-c++-reference-manual/vectorization-directives

Scott Manley from Cray would be a good resource to tap for clarification on the semantics.

_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Scott Manley via llvm-dev

Aug 9, 2019, 11:57:39 AM

to cameron....@nyu.edu, llvm...@lists.llvm.org, BHAVYA BAGLA, HAPPY Mahto, YASHAS ANDALURI, MAMIDALA SAI PRAHARSH

> There is a fundamental problem with the way that ivdep is defined by Intel's current documentation, at least for C/C++. As you note in your Google doc, it essentially says that the optimizer may ignore loop-carried dependencies except for those dependencies it can definitely prove are present. These are not semantics that any other compiler can actually replicate, and they are not equivalent to "vectorize(assume_safety)" (which asserts that no loop-carried dependencies are present). The good news is that, in conversations I've had with Intel, an openness to making these semantics more concrete has been expressed. I think it would be very useful to have ivdep in Clang, but only after we nail down the semantics with Intel in some useful way.

Agreed. I don't see a lot of value in having the compiler override a pragma that is supposed to override the compiler :) Cray's IVDEP really means what the documentation says: Ignore Vector DEPendencies. It doesn't remove all dependencies, just dependencies that inhibit vectorization. It also does not force vectorization. If it's not possible or not profitable to vectorize, then it won't vectorize.

I will add that ivdep is well used by Cray and its users, so I'd like to see it well defined in Clang/llvm.

Finkel, Hal J. via llvm-dev

Aug 9, 2019, 1:12:53 PM

to Scott Manley, cameron....@nyu.edu, llvm...@lists.llvm.org, BHAVYA BAGLA, MAMIDALA SAI PRAHARSH, HAPPY Mahto, YASHAS ANDALURI

Thanks, Scott.

Regarding this:

> It doesn't remove all dependencies, just dependencies that inhibit vectorization.

This matches what Cray's manual says, but I'm also not sure how to interpret this statement. Does that mean that the dependencies ignored are dependent on the selected target? I'm a bit worried that the dependencies interesting for vectorization might change over time or depend on the hardware being targeted.

Can you please take a look at the way that Intel's Fortran manual defines ivdep (https://software.intel.com/en-us/fortran-compiler-developer-guide-and-reference-ivdep) and say whether those semantics would also make sense for Cray's implementation?

I believe our consensus view is that the semantics of these kinds of pragmas should be specified such that we could create a sanitizer which checks their dynamic semantic correctness independent of what the optimizer is actually capable of exploiting.
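To illustrate what "checking dynamic semantic correctness" could mean, here is a purely hypothetical sketch of a runtime check for one narrow case: a loop of the form a[ix[i]] = ... that has been annotated as free of loop-carried dependencies. The function name and interface are invented for this example and do not correspond to any existing sanitizer.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical check in the spirit of a sanitizer: returns true if no two
 * iterations of `a[ix[i]] = ...` touch the same element of a[], i.e. the
 * loop really has no loop-carried dependence through a[]. a_len is the
 * number of elements in a[]. */
bool indices_are_disjoint(const int *ix, size_t n, size_t a_len) {
    unsigned char *seen = calloc(a_len ? a_len : 1, 1);
    if (!seen) return false;              /* be conservative on failure */
    bool ok = true;
    for (size_t i = 0; i < n && ok; ++i) {
        size_t k = (size_t)ix[i];
        if (ix[i] < 0 || k >= a_len || seen[k]) ok = false;
        else seen[k] = 1;
    }
    free(seen);
    return ok;
}
```

The point of the sketch is that the property being checked depends only on the program's data, not on what any particular optimizer can prove or exploit.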

-Hal

Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

Scott Manley via llvm-dev

Aug 9, 2019, 2:26:49 PM

to Finkel, Hal J., llvm...@lists.llvm.org, BHAVYA BAGLA, HAPPY Mahto, YASHAS ANDALURI, MAMIDALA SAI PRAHARSH

> This matches what Cray's manual says, but I'm also not sure how to interpret this statement. Does that mean that the dependencies ignored are dependent on the selected target? I'm a bit worried that the dependencies interesting for vectorization might change over time or depend on the hardware being targeted.

No, we don't consider the target with regards to ivdep -- but I'll admit I don't know what hardware might do in the future :)

Perhaps we could look at a classic vector dependency issue, what Cray calls a vector update (I believe Intel refers to it as a histogram), as an example: a[idx[i]] = a[idx[i]] + b[i]. Some targets can vectorize this, so it isn't technically a dependency issue for those targets, but ivdep can still play a role here. Without ivdep, you can still safely vectorize this on Skylake, but it requires a particular sequence of instructions to resolve properly. With ivdep, we can simply generate a gather/scatter. I imagine other vector dependency issues might benefit from a similar user-driven choice on hardware that could possibly "resolve" some of the dependency problems.
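To make the difference concrete, here is a small scalar emulation of the two strategies. It is only a sketch: VF is an arbitrary illustrative vector width, the function names are made up, and the "gather/scatter" version models one vector's worth of lanes with ordinary loops rather than real vector instructions.

```c
#include <stddef.h>

#define VF 4  /* illustrative vector width, not tied to any real target */

/* Scalar reference: repeated indices accumulate correctly. */
void update_scalar(float *a, const int *idx, const float *b, size_t n) {
    for (size_t i = 0; i < n; ++i)
        a[idx[i]] = a[idx[i]] + b[i];
}

/* Emulates a plain gather / add / scatter per VF-wide chunk, which is what
 * ignoring the dependence amounts to. n is assumed to be a multiple of VF. */
void update_gather_scatter(float *a, const int *idx, const float *b, size_t n) {
    for (size_t i = 0; i < n; i += VF) {
        float lane[VF];
        for (int l = 0; l < VF; ++l) lane[l] = a[idx[i + l]];   /* gather  */
        for (int l = 0; l < VF; ++l) lane[l] += b[i + l];       /* add     */
        for (int l = 0; l < VF; ++l) a[idx[i + l]] = lane[l];   /* scatter */
    }
}
```

With idx = {0, 0, 1, 1} and b = {1, 2, 3, 4} on a zeroed array, the scalar loop yields a[0] = 3 and a[1] = 7, while the emulated gather/scatter yields a[0] = 2 and a[1] = 4, because lanes that share an index all read the old value. When the indices within a chunk are distinct, the two versions agree, which is the situation the user asserts with ivdep.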

> Can you please take a look at the way that Intel's Fortran manual defines ivdep (https://software.intel.com/en-us/fortran-compiler-developer-guide-and-reference-ivdep) and say whether those semantics would also make sense for Cray's implementation?

Their semantics certainly cover at least part of Cray's ivdep. I did try a few examples that vectorize with Cray's ivdep using icc and wasn't sure if some of their decisions were due to or in spite of ivdep, so I need to dig into that more. We'll put together a list of what we do with IVDEP and see if they are all covered under that wording.

Cheers,

Scott

Finkel, Hal J. via llvm-dev

Aug 9, 2019, 2:36:15 PM

to Scott Manley, llvm...@lists.llvm.org, BHAVYA BAGLA, HAPPY Mahto, YASHAS ANDALURI, MAMIDALA SAI PRAHARSH

> We'll put together a list of what we do with IVDEP and see if they are all covered under that wording.

Thanks, that will be helpful.

-Hal

Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

IVDEP | Intel® Fortran Compiler 19.0

software.intel.com

The IVDEP directive is an assertion to the compiler's optimizer about the order of memory references inside a DO loop. IVDEP:LOOP implies no loop-carried dependencies.

David Greene via llvm-dev

Aug 9, 2019, 2:50:59 PM

to HAPPY Mahto via llvm-dev, MAMIDALA SAI PRAHARSH, YASHAS ANDALURI, HAPPY Mahto, BHAVYA BAGLA

HAPPY Mahto via llvm-dev <llvm...@lists.llvm.org> writes:

> 2 Nontemporal

Is this a hint or a command? If it's a command then this would
implicitly specify the data is aligned on some targets (e.g. Intel X86).
I'm not sure we want to make that implicit assumption as it is very easy
for the programmer to get this wrong.

-David

Jeff Hammond via llvm-dev

Aug 9, 2019, 5:09:16 PM

to Scott Manley, llvm...@lists.llvm.org, BHAVYA BAGLA, HAPPY Mahto, YASHAS ANDALURI, MAMIDALA SAI PRAHARSH

On Fri, Aug 9, 2019 at 8:57 AM Scott Manley via llvm-dev <llvm...@lists.llvm.org> wrote:

> > There is a fundamental problem with the way that ivdep is defined by Intel's current documentation, at least for C/C++. As you note in your Google doc, it essentially says that the optimizer may ignore loop-carried dependencies except for those dependencies it can definitely prove are present. These are not semantics that any other compiler can actually replicate, and they are not equivalent to "vectorize(assume_safety)" (which asserts that no loop-carried dependencies are present). The good news is that, in conversations I've had with Intel, an openness to making these semantics more concrete has been expressed. I think it would be very useful to have ivdep in Clang, but only after we nail down the semantics with Intel in some useful way.
>
> Agreed. I don't see a lot of value in having the compiler override a pragma that is supposed to override the compiler :) Cray's IVDEP really means what the documentation says: Ignore Vector DEPendencies. It doesn't remove all dependencies, just dependencies that inhibit vectorization. It also does not force vectorization. If it's not possible or not profitable to vectorize, then it won't vectorize.

+1

This one is particularly useful because some compilers implement "omp-simd" as "ignore the cost model and vectorize unconditionally", so it is really useful in C/C++ code to be able to provide a weaker statement to the compiler. I disagree with the strong interpretation of the OpenMP standard but am not willing to quit my job over it ;-)

> I will add that ivdep is well used by Cray and its users, so I'd like to see it well defined in Clang/llvm.

51K references on GitHub (https://github.com/search?q=pragma+ivdep&type=Code) suggest it is widely used beyond the Cray compiler.

Jeff

--

Jeff Hammond
jeff.s...@gmail.com
http://jeffhammond.github.io/

Jeff Hammond via llvm-dev

Aug 9, 2019, 5:31:43 PM

to David Greene, HAPPY Mahto via llvm-dev, BHAVYA BAGLA, MAMIDALA SAI PRAHARSH, HAPPY Mahto, YASHAS ANDALURI

On Fri, Aug 9, 2019 at 11:51 AM David Greene via llvm-dev <llvm...@lists.llvm.org> wrote:

> HAPPY Mahto via llvm-dev <llvm...@lists.llvm.org> writes:
>
> > 2 Nontemporal
>
> Is this a hint or a command? If it's a command then this would
> implicitly specify the data is aligned on some targets (e.g. Intel X86).
> I'm not sure we want to make that implicit assumption as it is very easy
> for the programmer to get this wrong.

I think it has to be a hint. If it is a command, what is its meaning on non-x86 processors where write-through and write-back are controlled in different ways (or are just uncontrollable)?

For example, some PPC processors set cache write-back/write-through at the page level (https://www.nxp.com/docs/en/data-sheet/MPC603.pdf). Would the command implementation have to try to set the page properties to do as the user directed?

There are also cases where the compiler may know that the user is often wrong about the utility of non-temporal memory access and ignoring it is an effective optimization. This is potentially relevant to profile-guided optimization.

Jeff

Sjoerd Meijer via llvm-dev

Aug 13, 2019, 12:59:31 PM

to Finkel, Hal J., cameron....@nyu.edu, llvm...@lists.llvm.org, BHAVYA BAGLA, MAMIDALA SAI PRAHARSH, HAPPY Mahto, YASHAS ANDALURI

> > vecremainder/novecremainder: Should the pragma simply call the vectorizer to attempt to vectorize the remainder loop, or should the vectorizer use a different method?
>
> Something like that. There were patches posted at some point to enable tail-loop vectorization. At this point, I imagine that you'd construct a VPlan with the vectorized tail.

Yep, committed in https://reviews.llvm.org/rL366989 and https://reviews.llvm.org/D65197.

The pragma name is different, but I think it tries to achieve the same thing.

Michael Kruse via llvm-dev

Aug 13, 2019, 2:30:06 PM

to Sjoerd Meijer, llvm...@lists.llvm.org, BHAVYA BAGLA, HAPPY Mahto, YASHAS ANDALURI, MAMIDALA SAI PRAHARSH

On Tue, Aug 13, 2019 at 11:59 AM Sjoerd Meijer via llvm-dev
<llvm...@lists.llvm.org> wrote:

>
> vecremainder/novecremainder: Should the pragma simply call the vectorizer to attempt to vectorize the remainder loop, or should the vectorizer use a different method?
>
> >
>
> > Something like that. There were patches posted at some point to enable tail-loop vectorization. At this point, I imagine that you'd construct a VPlan with the vectorized tail.
>
>
> Yep, committed in https://reviews.llvm.org/rL366989 and https://reviews.llvm.org/D65197.
>
> The pragma name is different, but I think it tries to achieve the same thing.

If I understand Intel's documentation correctly, these are different things:

vectorize.predicate.enable: Do not create an epilogue loop (use masked
instructions in the main loop instead)
vecremainder: If there is an epilogue loop, vectorize it as well
(which will require masked instructions in the epilogue, but not in
the main loop)
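The two loop shapes can be sketched in plain C as follows. This is a scalar emulation under stated assumptions: VF is an arbitrary illustrative vector width, the inner loop over lanes stands in for one vector operation, and the function names are invented for the example.

```c
#include <stddef.h>

#define VF 4  /* illustrative vector width */

/* Main "vector" loop plus epilogue: full VF-wide chunks run in the first
 * loop; the remaining (fewer than VF) iterations run in a separate
 * remainder loop. A vecremainder-style hint concerns how that second loop
 * is compiled. */
void add_one_with_epilogue(int *a, size_t n) {
    size_t i = 0;
    for (; i + VF <= n; i += VF)
        for (int l = 0; l < VF; ++l)      /* one vector operation */
            a[i + l] += 1;
    for (; i < n; ++i)                    /* remainder (epilogue) loop */
        a[i] += 1;
}

/* Predicated (tail-folded) form: no epilogue; every chunk is masked so
 * that lanes past n stay inactive. */
void add_one_predicated(int *a, size_t n) {
    for (size_t i = 0; i < n; i += VF)
        for (int l = 0; l < VF; ++l)
            if (i + l < n)                /* per-lane mask */
                a[i + l] += 1;
}
```

Both functions compute the same result; they differ only in where the partial chunk is handled, which is the distinction drawn above.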

Michael

Sjoerd Meijer via llvm-dev

Aug 13, 2019, 2:44:39 PM

to Michael Kruse, llvm...@lists.llvm.org, BHAVYA BAGLA, HAPPY Mahto, YASHAS ANDALURI, MAMIDALA SAI PRAHARSH

Ah yes, not exactly the same things, thanks for clarifying.

Terry Greyzck via llvm-dev

Aug 15, 2019, 4:05:29 PM

to llvm...@lists.llvm.org

The ivdep pragma is designed to do exactly what the name states - ignore
vector dependencies. Cray Research first implemented this in 1978 in
their CFT compiler, and has supported it since.

This pragma is typically used by application developers who want
vectorized code when the compiler cannot automatically determine safety;
it is not equivalent to the OpenMP SIMD pragma in that the compiler is
still expected to automatically detect such things as reductions.

The Cray implementation accepts an optional 'safevl=<const>' clause,
however this is not commonly used and not really needed for new
implementations.

Characteristics of ivdep:

* ivdep can be applied to any loop in a nest, but only affects the
    immediately following loop
      * Only trims dependencies at that loop nest level
      * If the user annotates more than one loop in a loop nest, only
        the ivdep on the innermost loop or loops is honored

* ivdep can be used on for (), while {}, and do { } while loops

* Primarily ivdep allows ambiguous dependencies to be ignored, examples:
      * p[i] = q[j]
      * a[ix[i]] = b[iy[i]]
      * a[ix[i]] += 1.0

* ivdep still requires automatic detection of reductions, including
    multiple homogeneous reductions on a single variable, examples:
      * x = x + a[i]
      * x = x + a[i]; if ( c[i] > 0.0 ) { x = x + b[i] }

* ivdep implies loop control variables, loop termination expressions,
    and primary induction increment expressions do not alias anything
    else in the loop nest body

* For Fortran, ivdep allows the following for array syntax
      * Assumption that there is no overlap between the target and right
          hand side in assignments
      * Assumption there is no overlap between the target and the
          address expression for the target

Things that ivdep will not do:

* ivdep on a loop nest will not diagnose or reject loops with
    'provable' dependencies

* ivdep on a loop nest will not restructure loops as necessary for
    correctness
      * will not reorder statements for dependence resolution
      * will not peel for dependence resolution
      * will not perform loop distribution to remove recurrences

* ivdep does not force vector code to be generated; the compiler can
    decide to not vectorize the loop nest for any reason it sees fit,
    including but not limited to:
      * Expectations the scalar version will be faster than the vector
      * Non-vectorizable function calls in the loop nest
      * Vector operations that lack hardware support
      * Extremely unstructured control flow

--
Terry Greyzck | Cray Inc.
Sr. Principal Engineer, Compiler Optimization

Doerfert, Johannes via llvm-dev

Aug 19, 2019, 11:31:03 AM

to Terry Greyzck, llvm...@lists.llvm.org

Hi Terry,

I'm curious.

> * Primarily ivdep allows ambiguous dependencies to be ignored, examples:
>       * p[i] = q[j]
>       * a[ix[i]] = b[iy[i]]
>       * a[ix[i]] += 1.0
>
> * ivdep still requires automatic detection of reductions, including
>     multiple homogeneous reductions on a single variable, examples:
>       * x = x + a[i]
>       * x = x + a[i]; if ( c[i] > 0.0 ) { x = x + b[i] }

How do you define the difference between

a[ix[i]] += 1.0

and
x += 1.0
as you require reduction detection for the latter but seem to ignore the
(histogram) reduction dependences for the former.

Thanks,
Johannes

Michael Kruse via llvm-dev

Aug 19, 2019, 3:33:51 PM

to Terry Greyzck, llvm...@lists.llvm.org

I think some of the semantics could be implemented using the
"llvm.mem.parallel_loop_access" annotation we already have, modulo the
difficulties mentioned below.

On Thu, Aug 15, 2019 at 3:06 PM Terry Greyzck via llvm-dev
<llvm...@lists.llvm.org> wrote:

> * Primarily ivdep allows ambiguous dependencies to be ignored, examples:
> * p[i] = q[j]
> * a[ix[i]] = b[iy[i]]
> * a[ix[i]] += 1.0

"ambiguous dependencies" is very vague. Does it mean the compiler has
to do some analysis to detect non-ambiguous dependencies?

When using "llvm.mem.parallel_loop_access", this would mean the
front-end would have to detect which accesses are non-ambiguous and
not annotate them. However, the annotation is for single accesses, not
dependencies. Both "p[i]" and "q[j]" look non-ambiguous individually,
but the vectorizer would have to add a runtime-check and loop
versioning to ensure that these are not aliasing.

> * ivdep still requires automatic detection of reductions, including
> multiple homogeneous reductions on a single variable, examples:
> * x = x + a[i]
> * x = x + a[i]; if ( c[i] > 0.0 ) { x = x + b[i] }

We could leave away the "llvm.mem.parallel_loop_access" for the
LoadInst and StoreInst of the reduction variable, assuming detected
reductions are limited over scalar variables. However, mem2reg/sroa
would remove those memory accesses anyway, including their annotation,
requiring the LoopVectorizer to detect that the resulting PHINode is a
reduction. Mem2reg/sroa/LICM would also do so with non-reductions, and
array elements that are promoted to registers during the execution of
the loop, such that the loop would not be vectorizable.

Michael

Terry Greyzck via llvm-dev

Aug 20, 2019, 3:23:41 PM

to Doerfert, Johannes, LLVM Dev list

The ivdep pragma is intended to assist automatic vectorization -
basically, automatic vectorization behaves as it normally does, but if
it gets into a spot where it finds a potential dependence, it continues
on rather than rejecting the loop.

Reductions are part of cycle-breaking; one possible way to identify
potential reduction objects is that their address is provably invariant
with respect to the vectorized loop. Some examples (assume 'i' is the
loop primary induction):

x = x + b[i]

   &x is invariant with respect to the 'i' vector loop, and can be a
reduction candidate

a[0] = a[0] + b[i]
   &a[0] is invariant with respect to the 'i' vector loop, and can be a
reduction candidate

a[ix[i]] = a[ix[i]] + b[i]
   &a[ix[i]] varies with respect to the 'i' vector loop, and is not a
reduction candidate
* I am ignoring the end case here where all values of ix[i] are
identical
      * Without an ivdep, this would be considered a histogram (or
what Cray used to call 'vector update'), due to possible repeated values
in array 'ix'
      * With an ivdep, this becomes a gather, load and add followed by
a scatter
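The invariant-address case can be sketched as follows. This is one possible way a vectorizer might break the cycle for a[0] = a[0] + b[i], written as a scalar emulation: VF is an arbitrary illustrative width, the function name is made up, and the sketch assumes b does not overlap a[0] and that n is a multiple of VF.

```c
#include <stddef.h>

#define VF 4  /* illustrative vector width */

/* a[0] = a[0] + b[i]: &a[0] does not change with i, so the update can be
 * treated as a sum reduction. Keep VF partial sums (one per lane), then
 * combine them once after the loop. */
void reduce_into_a0(int *a, const int *b, size_t n) {
    int partial[VF] = {0};
    for (size_t i = 0; i < n; i += VF)
        for (int l = 0; l < VF; ++l)      /* one vector add */
            partial[l] += b[i + l];
    int total = 0;
    for (int l = 0; l < VF; ++l)          /* horizontal reduction */
        total += partial[l];
    a[0] += total;
}
```

By contrast, for a[ix[i]] the address varies per lane, so there is no single accumulator to privatize; that is the case where ivdep turns the update into the gather, add, and scatter sequence.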

When outer loop vectorization is considered, identifying vector
reductions becomes somewhat more complicated, and the simple invariant
address test is not always sufficient. Examples on request, though that
is really a different (and possibly lengthy) discussion.

--
Terry Greyzck | Cray Inc.
Sr. Principal Engineer, Compiler Optimization

Doerfert, Johannes via llvm-dev

Aug 20, 2019, 3:57:13 PM

to Terry Greyzck, LLVM Dev list

I realize this is probably out of scope, but I wanted to put this out
there in case people consider adding ivdep.

The ignored case is what I was asking about. Basically, I was wondering
how you explain when reduction dependences are needed to be resolved by
the vectorizer and when you ignore potential reduction dependences.

While I can see the appeal towards users, I dislike something like
"their address provably invariant". Provably invariant can change
depending on the surrounding, the analyses, and the transformations
applied. Let's say ix is, after some transformations, known to be a
uniform array, reduction handling becomes required, but if the
uniformity is not exposed before the vectorizer, the potential
dependences will be ignored, right? I mean, if it worked for the user
and then something unrelated screwed up replacement of the ix[i] values
with constants (in the user code or in the compiler), then it is hard to
debug.

Do you see what I mean? Please let me know if I misunderstand how Cray
handles ivdep though.

Cheers,
Johannes

> On 8/19/2019 10:30 AM, Doerfert, Johannes wrote:
> > Hi Terry,
> >
> > I'm curious.
> >
> >> * Primarily ivdep allows ambiguous dependencies to be ignored, examples:
> >>       * p[i] = q[j]
> >>       * a[ix[i]] = b[iy[i]]
> >>       * a[ix[i]] += 1.0
> >>
> >> * ivdep still requires automatic detection of reductions, including
> >>     multiple homogeneous reductions on a single variable, examples:
> >>       * x = x + a[i]
> >>       * x = x + a[i]; if ( c[i] > 0.0 ) { x = x + b[i] }
> > How do you define the difference between
> > a[ix[i]] += 1.0
> > and
> > x += 1.0
> > as you require reduction detection for the latter but seem to ignore the
> > (histogram) reduction dependences for the former.
> >
> > Thanks,
> > Johannes
>

--

Johannes Doerfert
Researcher

Argonne National Laboratory
Lemont, IL 60439, USA

jdoe...@anl.gov

Terry Greyzck via llvm-dev

Aug 26, 2019, 3:17:25 PM

to Michael Kruse, llvm...@lists.llvm.org

My intent was to present the expected behavior for an "ivdep" pragma
implementation, rather than diving into the implementation details -
that seems like it should be another thread.

That said, trying to predict in the front end what edges will eventually
cause difficulties with automatic vectorization does seem problematic.
Generally "ivdep" is an assist to automatic vectorization; for older
Cray compilers that basically means the front end does nothing but pass
along the "ivdep" property, and dependency analysis for vectorization
uses that property directly.

One thing to remember is that it is perfectly valid for the "ivdep" loop
nest to still be rejected as a vector candidate for any reason, so
support for an "ivdep" pragma could be implemented in stages if desired.

Terry

Michael Kruse via llvm-dev

Aug 26, 2019, 4:20:13 PM

to Terry Greyzck, llvm...@lists.llvm.org

On Mon, Aug 26, 2019 at 2:16 PM Terry Greyzck <t...@cray.com> wrote:
> That said, trying to predict in the front end what edges will eventually
> cause difficulties with automatic vectorization does seem problematic.
> Generally "ivdep" is an assist to automatic vectorization; for older
> Cray compilers that basically means the front end does nothing but pass
> along the "ivdep" property, and dependency analysis for vectorization
> uses that property directly.

This is what makes implementing ivdep with Cray's semantics difficult.
To be compatible, we'd need to replicate Cray's cycle breaking.
Missing a detected reduction means ignoring its dependency cycle and
therefore a miscompilation where Cray's vectorizer might have produced
correct code (and the other way around). Unpredictably miscompiling
programs is probably not what users would expect.

> One thing to remember is that it is perfectly valid for the "ivdep" loop
> nest to still be rejected as a vector candidate for any reason, so
> support for an "ivdep" pragma could be implemented in stages if desired.

The vectorizer rejecting any "ivdep" loop that has unbroken dependency
cycles makes the annotation useless. We'd need to have a description
of dependencies that any Cray compiler (including past and future
versions) will ignore (instead of breaking by e.g. reduction
detection) with ivdep such that Clang never miscompiles a loop that a
Cray compiler compiles correctly.
