Hal Finkel (ANL) and Xinmin Tian (Intel)
This is a proposal for adding an experimental infrastructure to support
annotating regions in LLVM IR, making use of intrinsics and metadata, and
a generic analysis to allow transformations to easily make use of these
annotated regions. This infrastructure is flexible enough to support
representation of directives for parallelization, vectorization, and
offloading of both loops and more-general code regions. Under this scheme,
the conceptual distance between source-level directives and the region
annotations need not be significant, making the incremental cost of
supporting new directives and modifiers often small. It is not, however,
specific to those use cases.
Problem Statement
=================
There has been a series of discussions on LLVM IR extensions for
representing region and loop annotations for parallelism, and other
user-guided transformations, among both industrial and academic members of
the LLVM community. Increasing the quality of our OpenMP implementation is
an important motivating use case, but certainly not the only one. For OpenMP
in particular, we've discussed having an IR representation for years.
Presently, all OpenMP pragmas are transformed directly into runtime-library
calls in Clang, and outlining (i.e. extracting parallel regions into their
own functions to be invoked by the runtime library) is done in Clang as
well. Our implementation does not further optimize OpenMP constructs, and a
lot of thought has been put into how we might improve this. For some
optimizations, such as redundant barrier removal, we could use a
TargetLibraryInfo-like mechanism to recognize frontend-generated runtime
calls and proceed from there. Cases where we lose pointer-aliasing
information, information on loop bounds, etc. we could handle better by
improving our inter-procedural-analysis capabilities. We should do that
regardless. However, there are important cases where the underlying scheme
we want to use to lower the various parallelism constructs, especially when
targeting accelerators, changes depending on what is in the parallel region.
In important cases where we can see everything (i.e. there aren't arbitrary
external calls), code generation should proceed in a way that is very
different from the general case. To have a sensible implementation, this
must be done after inlining. When using LTO, this should be done during the
link-time phase. As a result, we must move away from our purely
front-end-based lowering scheme. The question is what to do instead, and how
to do it in a way that is generally useful to the entire community.
Designs previously discussed can be classified into four categories:
(a) Add a large number of new kinds of LLVM metadata, and use them to
     annotate each necessary instruction for parallelism, data attributes,
     etc.
(b) Add several new LLVM instructions such as, for parallelism, fork, spawn,
     join, barrier, etc.
(c) Add a large number of LLVM intrinsics for directives and clauses, each
     intrinsic representing a directive or a clause.
(d) Add a small number of LLVM intrinsics for region or loop annotations,
     represent the directive/clause names using metadata and the remaining
     information using arguments.
Here we're proposing (d), and below is a brief pros and cons analysis based
on these discussions and our own experiences of supporting region/loop
annotations in LLVM-based compilers. The table below shows a short summary
of our analysis.

Various commercial compilers (e.g. from Intel, IBM, Cray, PGI), and GCC
[1,2], have IR-level representations for parallelism constructs. Based on
experience from these previous developments, we'd like a solution for LLVM
that maximizes optimization enablement while minimizing the maintenance
costs and complexity increase experienced by the community as a whole.

Representing the desired information in the LLVM IR is just the first step.
The challenge is to maintain the desired semantics without blocking useful
optimizations. With options (c) and (d), dependencies can be preserved
mainly based on the use/def chain of the arguments of each intrinsic, and a
manageable set of LLVM analyses and transformations can be made aware of
certain kinds of annotations in order to enable specific optimizations. In
this regard, options (c) and (d) are close with respect to maintenance
efforts. However, based on our experiences, option (d) is preferable because
it is easier to extend to support new directives and clauses in the future
without the need to add new intrinsics as required by option (c).
Table 1. Pros/cons summary of LLVM IR experimental extension options
--------+------------------------+-----------------------------------------------
Options | Pros                   | Cons
--------+------------------------+-----------------------------------------------
(a)     | No need to add new     | LLVM passes do not always maintain metadata.
        | instructions or        | Need to educate many passes (if not all) to
        | new intrinsics.        | understand and handle them.
--------+------------------------+-----------------------------------------------
(b)     | Parallelism becomes    | Huge effort for extending all LLVM passes and
        | a first-class citizen. | code generation to support new instructions.
        |                        | A large set of information still needs to be
        |                        | represented using other means.
--------+------------------------+-----------------------------------------------
(c)     | Less impact on the     | A large number of intrinsics must be added.
        | existing LLVM passes.  | Some of the optimizations need to be
        | Fewer requirements     | educated to understand them.
        | for passes to          |
        | maintain metadata.     |
--------+------------------------+-----------------------------------------------
(d)     | Minimal impact on      | Some of the optimizations need to be
        | existing LLVM          | educated to understand them.
        | optimization passes.   | No requirements for all passes to maintain
        | Directive and clause   | a large set of metadata with values.
        | names use metadata     |
        | strings.               |
--------+------------------------+-----------------------------------------------
Regarding (a), LLVM already uses metadata for certain loop information (e.g.
annotations directing loop transformations and assertions about loop-carried
dependencies), but there is no natural or consistent way to extend this
scheme to represent necessary data-movement or region information.
New Intrinsics for Region and Value Annotations
==============================================
The following new (experimental) intrinsics are proposed which allow:
a) Annotating a code region marked with directives / pragmas,
b) Annotating values associated with the region (or loops), that is, those
    values associated with directives / pragmas.
c) Providing information on LLVM IR transformations needed for the annotated
    code regions (or loops).
These can be used both by frontends and also by transformation passes (e.g.
automated parallelization). The names used here are similar to those used by
our internal prototype, but obviously we expect a community bikeshed
discussion.
def int_experimental_directive : Intrinsic<[], [llvm_metadata_ty],
                                           [IntrArgMemOnly],
                                           "llvm.experimental.directive">;
def int_experimental_dir_qual : Intrinsic<[], [llvm_metadata_ty],
                                          [IntrArgMemOnly],
                                          "llvm.experimental.dir.qual">;
def int_experimental_dir_qual_opnd : Intrinsic<[],
                                         [llvm_metadata_ty, llvm_any_ty],
                                         [IntrArgMemOnly],
                                         "llvm.experimental.dir.qual.opnd">;
def int_experimental_dir_qual_opndlist : Intrinsic<[],
                                         [llvm_metadata_ty, llvm_vararg_ty],
                                         [IntrArgMemOnly],
                                         "llvm.experimental.dir.qual.opndlist">;
Note that calls to these intrinsics might need to be annotated with the
convergent attribute when they represent fork/join operations, barriers, and
similar.
Usage Examples
==============
This section shows a few examples using these experimental intrinsics.
LLVM developers who will use these intrinsics can define their own MDString
values. All details of using these intrinsics to represent OpenMP 4.5
constructs are described in [1][3].
Example I: An OpenMP combined construct
#pragma omp target teams distribute parallel for simd
   loop
LLVM IR
-------
call void @llvm.experimental.directive(metadata !0)
call void @llvm.experimental.directive(metadata !1)
call void @llvm.experimental.directive(metadata !2)
   loop
call void @llvm.experimental.directive(metadata !6)
call void @llvm.experimental.directive(metadata !5)
call void @llvm.experimental.directive(metadata !4)
!0 = metadata !{metadata !DIR.OMP.TARGET}
!1 = metadata !{metadata !DIR.OMP.TEAMS}
!2 = metadata !{metadata !DIR.OMP.DISTRIBUTE.PARLOOP.SIMD}
!6 = metadata !{metadata !DIR.OMP.END.DISTRIBUTE.PARLOOP.SIMD}
!5 = metadata !{metadata !DIR.OMP.END.TEAMS}
!4 = metadata !{metadata !DIR.OMP.END.TARGET}
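A consumer of these annotations has to pair each begin directive with its matching end directive before it can reason about a region. The following is a minimal Python sketch of that pairing, not the W-Region implementation: it assumes directives are visible as plain strings in program order and that an end directive is named by inserting ".END" after the "DIR.OMP" prefix, as in the metadata strings above. The function name `build_regions` and the dictionary layout are hypothetical.

```python
def build_regions(directives):
    """Pair begin/end directive names into a nested region tree.

    Assumption: an end directive is the begin name with ".END." inserted,
    e.g. DIR.OMP.TEAMS / DIR.OMP.END.TEAMS, as in Example I.
    """
    root = {"name": "<function>", "children": []}
    stack = [root]
    for name in directives:
        if ".END." in name:
            begin = name.replace(".END.", ".", 1)
            # The innermost open region must be the one being closed.
            if len(stack) == 1 or stack[-1]["name"] != begin:
                raise ValueError(f"unmatched end directive: {name}")
            stack.pop()
        else:
            region = {"name": name, "children": []}
            stack[-1]["children"].append(region)
            stack.append(region)
    if len(stack) != 1:
        raise ValueError(f"unclosed region: {stack[-1]['name']}")
    return root


def nesting(region):
    """Return the chain of region names along the first-child path."""
    names = []
    while region["children"]:
        region = region["children"][0]
        names.append(region["name"])
    return names


example_one = [
    "DIR.OMP.TARGET",
    "DIR.OMP.TEAMS",
    "DIR.OMP.DISTRIBUTE.PARLOOP.SIMD",
    "DIR.OMP.END.DISTRIBUTE.PARLOOP.SIMD",
    "DIR.OMP.END.TEAMS",
    "DIR.OMP.END.TARGET",
]
tree = build_regions(example_one)
```

For the combined construct of Example I this yields three regions nested target, teams, distribute-parallel-loop-simd; a stream whose end directives are out of order is rejected.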
Example II: Assume x,y,z are int variables, and s is a non-POD variable.
             Then, lastprivate(x,y,s,z) is represented as:
LLVM IR
-------
call void @llvm.experimental.dir.qual.opndlist(
                 metadata !1, %x, %y, metadata !2, %s, %ctor, %dtor, %z)
!1 = metadata !{metadata !QUAL.OMP.LASTPRIVATE}
!2 = metadata !{metadata !QUAL.OPND.NONPOD}
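To illustrate how a pass might decode such an operand list, here is a small Python sketch. It assumes, based only on the example above, that a QUAL.OPND.NONPOD marker is followed by exactly three operands (the variable, its constructor, and its destructor) and that every other operand is a plain variable. `ClauseItem` and `parse_opndlist` are hypothetical names, and operands are modeled as strings rather than LLVM values.

```python
from dataclasses import dataclass
from typing import Optional

# Stand-in for the metadata marker operand (an assumption of this sketch).
NONPOD = "QUAL.OPND.NONPOD"


@dataclass
class ClauseItem:
    var: str
    ctor: Optional[str] = None  # set only for non-POD variables
    dtor: Optional[str] = None


def parse_opndlist(operands):
    """Group a flat clause operand list into one item per variable."""
    items = []
    i = 0
    while i < len(operands):
        if operands[i] == NONPOD:
            # Marker must be followed by variable, constructor, destructor.
            if len(operands) - i < 4:
                raise ValueError("NONPOD marker needs var, ctor, dtor")
            var, ctor, dtor = operands[i + 1:i + 4]
            items.append(ClauseItem(var, ctor, dtor))
            i += 4
        else:
            items.append(ClauseItem(operands[i]))
            i += 1
    return items


items = parse_opndlist(["%x", "%y", NONPOD, "%s", "%ctor", "%dtor", "%z"])
```

The result has one entry each for x, y, s, and z, with only the entry for s carrying a constructor and destructor.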
Example III: A prefetch pragma example
// issue vprefetch1 for xp with a distance of 20 vectorized iterations ahead
// issue vprefetch0 for yp with a distance of 10 vectorized iterations ahead
#pragma prefetch x:1:20 y:0:10
for (i=0; i<2*N; i++) { xp[i*m + j] = -1; yp[i*n +j] = -2; }
LLVM IR
-------
call void @llvm.experimental.directive(metadata !0)
call void @llvm.experimental.dir.qual.opndlist(metadata !1, %xp, 1, 20,
                                                metadata !1, %yp, 0, 10)
   loop
call void @llvm.experimental.directive(metadata !3)
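As a rough illustration of where the operand triples come from, this Python sketch splits the pragma text into (variable, hint, distance) entries. The field order is inferred from the comments in the example (hint 1 with distance 20 for the first variable, hint 0 with distance 10 for the second); the function name `parse_prefetch` is hypothetical and a real frontend would of course work on tokens, not strings.

```python
def parse_prefetch(pragma):
    """Split '#pragma prefetch v:hint:distance ...' into typed triples."""
    prefix = "#pragma prefetch"
    if not pragma.startswith(prefix):
        raise ValueError("not a prefetch pragma")
    hints = []
    for field in pragma[len(prefix):].split():
        var, hint, distance = field.split(":")
        hints.append((var, int(hint), int(distance)))
    return hints


def to_operands(hints):
    """Flatten the triples into the order used by the qualifier call."""
    operands = []
    for var, hint, distance in hints:
        operands.extend([var, hint, distance])
    return operands


hints = parse_prefetch("#pragma prefetch x:1:20 y:0:10")
operands = to_operands(hints)
```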
References
==========
[1] LLVM Framework and IR extensions for Parallelization, SIMD Vectorization
     and Offloading Support. SC'2016 LLVM-HPC3 Workshop. (Xinmin Tian et al.)
     Salt Lake City, Utah.
[2] Extending LoopVectorizer towards supporting OpenMP4.5 SIMD and outer loop
     auto-vectorization. (Hideki Saito, et al.) LLVM Developers' Meeting 2016,
     San Jose.
[3] Intrinsics, Metadata, and Attributes: The Story continues! (Hal Finkel)
     LLVM Developers' Meeting, 2016. San Jose
[4] LLVM Intrinsic Function and Metadata String Interface for Directive (or
     Pragmas) Representation. Specification Draft v0.9, Intel Corporation,
     2016.
Acknowledgements
================
We would like to thank Chandler Carruth (Google), Johannes Doerfert (Saarland
Univ.), Yaoqing Gao (Huawei), Michael Wong (Codeplay), Ettore Tiotto,
Carlo Bertolli, Bardia Mahjour (IBM), and all other LLVM-HPC IR Extensions WG
members for their constructive feedback on the LLVM framework and IR
extension proposal.
Proposed Implementation
=======================
Two sets of patches that support these experimental intrinsics and
demonstrate their usage are ready for community review.
a) Clang patches that support core OpenMP pragmas using this approach.
b) W-Region framework patches: CFG restructuring to form single-entry-
    single-exit work regions (W-Regions) based on annotations, demand-driven
    intrinsic parsing, WRegionInfo collection and analysis passes, and
    dump functions for WRegionInfo.
On top of this functionality, we will provide the transformation patches for
core OpenMP constructs (e.g. start with "#pragma omp parallel for" loop for
lowering and outlining, and "#pragma omp simd" to hook it up with
LoopVectorize.cpp). We have internal implementations for many constructs
now. We will break this functionality up to create a series of patches for
community review.
-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
David, one quick question: is there a way to preserve and associate a set of “properties, value info/attr” with a given region using a token?
Thanks,
Xinmin
From: Hongbin Zheng [mailto:ethe...@gmail.com]
Sent: Wednesday, January 11, 2017 3:09 PM
To: Tian, Xinmin <xinmi...@intel.com>
Cc: David Majnemer <david.m...@gmail.com>; Hal Finkel <hfi...@anl.gov>; llvm...@lists.llvm.org
Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations
We are experimenting with something similar on SESE regions. We introduce one intrinsic to produce a token and another to consume the token. These two intrinsics mark the region, and we attach the extra information as an operand bundle (OpBundle) on the intrinsic that produces the token.
Thanks
Hongbin
Interesting, this is similar to what we have.
One more question: are the items highlighted in yellow represented as LLVM Values? In other words, does the LLVM optimizer update them? E.g., if %m is renamed %m.1 in the loop, is the “m” in the token @..... updated as well? In the RFC, “m” is an argument of the intrinsic call, so all use-def information is visible to the optimizer, which updates it during optimization just like regular function arguments. I am trying to understand whether there is any difference between the token scheme and the intrinsic scheme in this regard.
And “map” and “firstprivate” … are represented as MDString, right? Thanks.
+1, tokens are the current True Way to create single-entry multi-exit regions. Your example for an annotated loop would look like:
%region = call token @llvm.openmp.regionstart(metadata ...) ; whatever parameters you need here
   loop
call void @llvm.openmp.regionend(token %region)
If you use tokens, I would recommend proposal (c), where you introduce new intrinsics for every new kind of region, instead of adding one overly generic set of region intrinsics.
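One practical difference of the token form is that a region's end is tied to its start by a use of the token value rather than by matching names. The Python sketch below models that idea only; it does not use any LLVM API. Instructions are represented as tuples, and the tuple layout, the `pair_regions_by_token` name, and the "parallel" kind string are all assumptions made for illustration.

```python
def pair_regions_by_token(instructions):
    """Pair region start/end markers through the token they share.

    Each start is ("regionstart", token, kind); each end is
    ("regionend", token). Anything else is treated as ordinary code.
    Returns (kind, start_index, end_index) tuples in closing order.
    """
    open_regions = {}
    regions = []
    for index, inst in enumerate(instructions):
        if inst[0] == "regionstart":
            _, token, kind = inst
            open_regions[token] = (kind, index)
        elif inst[0] == "regionend":
            _, token = inst
            if token not in open_regions:
                raise ValueError(f"regionend uses unknown token {token}")
            kind, start = open_regions.pop(token)
            regions.append((kind, start, index))
    if open_regions:
        raise ValueError(f"unclosed regions: {sorted(open_regions)}")
    return regions


# Two regions of the same kind: the token, not the kind, identifies them.
stream = [
    ("regionstart", "%r0", "parallel"),
    ("regionstart", "%r1", "parallel"),
    ("op", "loop"),
    ("regionend", "%r1"),
    ("regionend", "%r0"),
]
regions = pair_regions_by_token(stream)
```

Because the pairing follows the token, two nested regions of the same kind stay distinguishable without any naming convention for end markers.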
I think they are not MDStrings, but "bundle tags" that are managed by LLVMContextImpl::getOrInsertBundleTag.
Got it. Thanks.
On Jan 11, 2017, at 3:51 PM, Reid Kleckner via llvm-dev <llvm...@lists.llvm.org> wrote:
> If you use tokens, I would recommend proposal (c), where you introduce new intrinsics for every new kind of region, instead of adding one overly generic set of region intrinsics.
Can you elaborate why? I’m curious.
On Jan 12, 2017, at 5:02 PM, Hal Finkel <hfi...@anl.gov> wrote:
On 01/12/2017 06:20 PM, Reid Kleckner via llvm-dev wrote:
On Wed, Jan 11, 2017 at 8:13 PM, Mehdi Amini <mehdi...@apple.com> wrote:
> Can you elaborate why? I’m curious.
The con of proposal c was that many passes would need to learn about many region intrinsics. With tokens, you only need to teach all passes about tokens, which they should already know about because WinEH and other things use them.
With tokens, we can add as many region-introducing intrinsics as makes sense without any additional cost to the middle end. We don't need to make one omnibus region intrinsic set that describes every parallel loop annotation scheme supported by LLVM. Instead we would factor things according to other software design considerations.
I think that, unless we allow frontends to add their own intrinsics without recompiling LLVM, this severely restricts the usefulness of this feature.
Thank you all, David, Hongbin, Reid, Mehdi, Daniel, and Vikram, for your review and constructive feedback on this RFC. We will update our Clang FE patch to use the tokens and tags suggested by David, Hongbin, et al., instead of metadata and function arguments, for the IR-annotation intrinsic calls, and see how well this preserves all the information needed by our LLVM middle-end / back-end transformations. With the token and tag approach, the changes needed in our W-Region framework are relatively small as well.
Vikram, many of the points you made below are well taken. Hal and I had a long discussion at SC'16 on how to build a practical infrastructure for people to experiment with and to study the pros and cons of IR extensions for expressing parallelism, optimizing parallel code, and many other uses of directive/pragma information. Personally, I would agree that the eventual solution could likely be a combination of (b) and (d) when we go with a parallel IR, once the community has agreed on which constructs are common enough to be represented as LLVM instructions.
Having said that, this RFC serves as the first step. The intrinsics we proposed are language neutral, but the "tag" or "metadata" is specific to language constructs/directives/pragmas. We are expecting more feedback and discussion on this work. Thank you all again.
From: llvm-dev [mailto:llvm-dev...@lists.llvm.org] On Behalf Of David Majnemer via llvm-dev
Sent: Wednesday, January 11, 2017 2:18 PM
To: Hal Finkel <hfi...@anl.gov>
Cc: llvm-dev <llvm...@lists.llvm.org>
Something isn’t clear to me: how do you preserve the validity of the region annotations, given that regular passes don’t know about the attached semantics?
For example, what if a region marks a loop as parallel from an OpenMP pragma, but a strength reduction transformation introduces a loop-carried dependency and thus invalidates the “parallel” semantics?
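A small Python sketch can make the strength-reduction concern concrete. It is only an illustration: executing iterations in a different order stands in for parallel execution, and the loop body (storing `i * c` into `a[i]`) is a made-up example rather than anything from the RFC.

```python
def run_original(n, c, order):
    """Each iteration computes its value from i alone: no carried state."""
    a = [0] * n
    for i in order:
        a[i] = i * c
    return a


def run_strength_reduced(n, c, order):
    """The multiply is replaced by a running sum carried across iterations."""
    a = [0] * n
    t = 0
    for i in order:
        a[i] = t  # correct only if iterations run in increasing i
        t += c
    return a


n, c = 4, 3
forward = list(range(n))
backward = list(reversed(forward))

serial_original = run_original(n, c, forward)
serial_reduced = run_strength_reduced(n, c, forward)
reordered_original = run_original(n, c, backward)
reordered_reduced = run_strength_reduced(n, c, backward)
```

In serial order both versions agree, so the transformation is valid for a sequential loop. Once iterations may run in another order, only the original form still produces the same array, which is exactly why a pass that is unaware of the parallel annotation can invalidate it.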
Another issue is how much these intrinsics act as a “barrier” for regular optimizations. For example, what prevents reordering a loop such that it is executed *before* the intrinsic that marks the beginning of the region?
I feel I missed a piece (but maybe I should start with the provided references?) :)
— 
Mehdi
> On Jan 13, 2017, at 9:00 AM, Tian, Xinmin <xinmi...@intel.com> wrote:
> 
> Mehdi, thanks for good questions.
> 
>>>>>> Something isn’t clear to me about how do you preserve the validity of the region annotations since regular passes don’t know about the attached semantic?
> 
> There are some small changes we have to make in some optimizations to make sure the optimizations do not invalidate the attached annotation semantics.
I fear that this does not seem to play well with the original claim of the RFC about a “minimal impact" on existing passes.
Especially since Hal mentioned “the motivation here is to support frontends inserting custom region annotations”, it is not clear that we wouldn’t have to teach passes to treat the intrinsics as optimization barriers by default (which kind of defeats the whole point of this), and then teach passes about the semantics of each kind of region.
It may be possible to abstract some properties about region, à la TTI, with hooks that the passes would query. But that seems like something that’d need a lot of scrutiny before being able to evaluate the viability of the design.
> 1) provide hand-shaking / query utils for optimization to know the region is parallel loop, 2) setup a proper optimization phase ordering. In our product compiler ICC, we used both approaches.     
> 
>>>>>> For example, if a region is marking a loop as parallel from an OpenMP pragma, but a strength reduction transformation introduces a loop-carried dependency and thus invalidate the “parallel” semantic?
> 
> Yes, there is a list of such cases, e.g. forward substitution, strength reduction, global constant propagation. Here is another example: under serial semantics you can do constant propagation, but under parallel semantics you can't. All these issues are considered
> 
> Int x = 100;
> 
> parallel num_threads(4) 
> {
>    ....
>     atomic { 
>           x = x + 600
>     } 
> }
> 
> These issues already exist when you do IPO across OpenCL or CUDA kernel functions, or across outlined functions from ClangFE.
Right but fortunately there are only a few passes to teach about IPO, and we already have generic mechanism to inhibit IPO, which is not the case with peephole or other function passes.
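The constant-propagation example quoted above can be reproduced with a few lines of Python. This is only a sketch of the semantics: Python threads and a lock stand in for the parallel region and the atomic update, and `propagate_constant=True` models what a serial-minded optimizer would do if it replaced the read of `x` inside the region with the constant 100.

```python
import threading


def run(num_threads, propagate_constant):
    """Run the 'parallel num_threads(N) { atomic { x = x + 600 } }' example."""
    x = 100
    lock = threading.Lock()  # stands in for the atomic construct

    def body():
        nonlocal x
        with lock:
            if propagate_constant:
                x = 100 + 600  # read of x replaced by its initial value
            else:
                x = x + 600    # read sees updates from other threads

    threads = [threading.Thread(target=body) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return x


correct = run(4, propagate_constant=False)
miscompiled = run(4, propagate_constant=True)
```

With four threads the unoptimized region accumulates four updates, while the "propagated" version overwrites `x` with the same value each time, so the two results differ.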
> 
>>>>>> Another issue is how much are these intrinsics acting as “barrier” for regular optimizations? For example what prevents reordering a loop such that it is executed *before* the intrinsic that mark the beginning of the region?
> 
> ClangFE will need to set the "convergent" attribute for the intrinsic calls (at the call site) based on the language construct semantics.
Convergent does not prevent reordering AFAIK:
convergent call llvm.region.begin("parallel.omp.for")
for (I : 0->N)
  a[I] = b[I] + c[I];
convergent call llvm.region.end("parallel.omp.for")
Can become:
for (I : 0->N)
  a[I] = b[I] + c[I];
convergent call llvm.region.begin("parallel.omp.for")
convergent call llvm.region.end("parallel.omp.for")
— 
Mehdi
Thanks,
—Mehdi
Mehdi, thanks for good questions.
>>>>>Something isn’t clear to me about how do you preserve the validity of the region annotations since regular passes don’t know about the attached semantic?
There are some small changes we have to make in some optimizations to make sure the optimizations do not invalidate the attached annotation semantics: 1) provide hand-shaking / query utils so that an optimization knows the region is a parallel loop, 2) set up a proper optimization phase ordering. In our product compiler ICC, we used both approaches.
Yonghong,
Hal and I have been very careful about this RFC, given Hal's long experience with the community. As you can see, we positioned it as “experimental”, so that we can add an infrastructure for people to use, improve, and extend over time.
For your API calls and SPMD divergence, our current implementation does not cover them. As the first step, we focused on language constructs like parallel for, simd, and cilk_for. We are open to any suggestions and proposals to cover your use cases.
Thanks,
Xinmin
From: Yonghong Yan [mailto:y...@oakland.edu]
Sent: Friday, January 13, 2017 9:55 AM
To: Tian, Xinmin <xinmi...@intel.com>
Cc: David Majnemer <david.m...@gmail.com>; Hal Finkel <hfi...@anl.gov>; llvm...@lists.llvm.org; llvm-dev...@lists.llvm.org
Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations
I have been following the discussion, and this is something we have been looking for for years. I am glad that you have a patch that we can at least use. I do, however, have several comments/requests based on our experience:
1. The idea of creating a set of representations for language-neutral parallel constructs and then allowing extensions for language-specific representations and passes sounds very good, and I think it is definitely worth trying as a next step. We did a survey of multiple threading programming interfaces (https://www.hpcwire.com/2015/03/02/a-comparison-of-heterogeneous-and-manycore-programming-models) and it is obvious that parallel interfaces (even including inter-node models such as PGAS, APGAS) share some common mechanisms for representing parallelism, data/affinity, synchronization and mutual exclusion.
2. There are API calls or typical statements that are meant for parallelism, but the language-based IR extensions are not able to represent them. E.g. pthread_create/join (or other runtime calls such as C++ thread/async, etc.) are fork/join parallelism. The frontend can be enhanced to recognize those calls and create/append PIR info to those calls. It would, however, be nicer if we had a meta approach, e.g. providing a file that tells the frontend that a parallel IR should be used for specific API calls.
3. SPMD divergence such as the following, or an if statement inside a vector loop body:
if (omp_get_thread_num() == 4) { /* same for UPC or MPI internode/PGAS as we use this to different computation for each thread/proc*/
} else {
}
Does your patch have support for that? Basically, I can imagine we need to add some metadata/tokens/tags to the branching IR.
Thank you!
Yonghong Yan
Assistant Professor
Department of Computer Science and Engineering
School of Engineering and Computer Science
Oakland University
Office: EC 534
Phone: 248-370-4087
Email: y...@oakland.edu
>>>>If you are assuming these intrinsics will only be used to implement a specific set of annotations, with specific semantics, i'm probably with Reid on the "please use specific constructs" bandwagon.
I wouldn’t disagree on this part if these intrinsics end up with usages for a specific set of annotations.
From: Daniel Berlin [mailto:dbe...@dberlin.org]
Sent: Friday, January 13, 2017 10:01 AM
To: Tian, Xinmin <xinmi...@intel.com>
Cc: Mehdi Amini <mehdi...@apple.com>; Hal Finkel <hfi...@anl.gov>; llvm...@lists.llvm.org
Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations
We obviously need to work out the details here, but one motivation is to 
allow the same facility to both represent concepts common to many 
programming models as well as programming-model-specific concepts. Also, 
I'd like to be able to transition from programming-model-specific 
representations (where I imagine most things will start) toward 
abstracted concepts. The goal is to retain programming-model-specific 
semantics while allowing the creation of transformations and analysis 
which deal with abstract concepts. One way we might accomplish this is 
by using both like this:
  1. A frontend generates region annotations. A frontend like Clang
     will generate (mostly) programming-model-specific region annotations.
     Frontends for other languages might directly use the abstract concepts
     for their region annotations.
  2. During optimization, a transformation pass analyzes
     programming-model-specific region annotations and, if legal, transforms
     them into abstract-concept annotations. It might rewrite:
       !"omp.barrier" -> !"llvm.parallel.barrier", !"openmp"
     such that the barrier is now a general concept that transformations
     might understand (and, for example, eliminate redundant barriers). It is
     tagged with !"openmp" so that in the end, should it survive, the concept
     will be lowered using OpenMP.
  3. During optimization, transformations optimize abstract-concept
     annotations (e.g. eliminate redundant barriers, fuse parallel regions,
     etc.)
  4. Later in the pipeline, programming-model-specific code lowers
     annotations for each programming model into concrete IR (e.g. runtime
     function calls, etc.). For abstract concepts without a specific
     programming-model tag, some default programming model is selected.
The programming-model-specific to abstract-concept translation in (2) 
can sometimes be done on a syntactic basis alone (we already do this, in 
fact, for atomics), but sometimes will require analysis that can be done 
only after inlining/IPA (to make sure, for example, that the parallel 
region does not contain certain classes of runtime-library calls). Plus, 
this allows the translation logic to be shared easily by different 
frontends.
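To make the four steps a little more concrete, here is a toy Python sketch of the flow: model-specific annotation, abstraction, optimization, and lowering. Everything in it is an assumption made for illustration. Annotations are plain strings, "redundant" is simplified to mean two barriers with nothing between them, and the lowered names such as "openmp.runtime.barrier" are placeholders, not real runtime entry points.

```python
BARRIER = "llvm.parallel.barrier"

# Step 2: model-specific name -> (abstract concept, programming-model tag).
TO_ABSTRACT = {"omp.barrier": (BARRIER, "openmp")}


def to_abstract(annotations):
    """Rewrite known model-specific annotations into tagged abstract ones."""
    return [TO_ABSTRACT.get(a, (a, None)) for a in annotations]


def drop_redundant_barriers(annotations):
    """Step 3: a barrier directly following another barrier is dropped."""
    kept = []
    for name, model in annotations:
        if name == BARRIER and kept and kept[-1][0] == BARRIER:
            continue
        kept.append((name, model))
    return kept


def lower(annotations, default_model="openmp"):
    """Step 4: lower each abstract barrier using its tag, or the default."""
    lowered = []
    for name, model in annotations:
        if name == BARRIER:
            lowered.append(f"{model or default_model}.runtime.barrier")
        else:
            lowered.append(name)
    return lowered


frontend_output = ["omp.barrier", "omp.barrier", "work", "omp.barrier"]
result = lower(drop_redundant_barriers(to_abstract(frontend_output)))
```

The point of the tag is visible in the last step: the optimization in the middle never needs to know about OpenMP, yet the surviving barriers are still lowered with the programming model they came from.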
Thoughts?
-Hal
    
Hi. Regarding the token approach, I've read some documentation (review D11861, EH in llvm, and Reid and David's presentation) but couldn't answer the following question.
Does the intrinsic or the instruction returning a token type object act as a code motion barrier? In other words, does it prevent other operations from being reordered with it?
If the answer is no, then does it mean the intrinsic has to be marked with the convergent property to achieve the code motion barrier effect?
If yes, can someone explain how that is achieved in llvm?
Greatly appreciated. Thanks.
Wael
On Jan 17, 2017, at 4:36 PM, Hal Finkel via llvm-dev <llvm...@lists.llvm.org> wrote:
On 01/17/2017 05:36 PM, Wael Yehia via llvm-dev wrote:
> Hi. Regarding the token approach, I've read some documentation (review D11861, EH in llvm, and Reid and David's presentation) but couldn't answer the following question.
> Does the intrinsic or the instruction returning a token type object act as a code motion barrier? In other words, does it prevent other operations from being reordered with it?
> If the answer is no, then does it mean the intrinsic has to be marked with the convergent property to achieve the code motion barrier effect?
To clarify, convergent does not form a code-motion barrier. It prevents transformations from adding additional control dependencies. I suspect that the token type does the same (because tokens can't appear in PHIs).
    
What if the only user of the token is in a branch? It seems to me that you’d be allowed to sink the token producer there, but not if it is marked convergent.
On Jan 17, 2017, at 7:01 PM, Hal Finkel <hfi...@anl.gov> wrote:
On 01/17/2017 06:59 PM, Mehdi Amini wrote:
Can that happen if you start out with a "well formed" region?
I agree with this, but I’m also wondering if it needs to be first class in the IR?
For example we know our alias analysis is very basic, and C/C++ have a higher constraint thanks to their type system, but we didn’t inject this higher level information that helps the optimizer as first class IR constructs.
I wonder if the same wouldn’t apply to an openmp reduction clause for instance, where you could use the “basic” IR construct and the analysis would use a metadata emitted by the frontend instead of trying to infer the reduction.
Just a thought, I haven’t given much time to studying other constructs and how they map to the IR :)
> For example, automatic array privatization in parallel loops is a very hard problem (automatic scalar privatization is easier, but even that is interprocedural).  Reduction recognition is doable for common cases, but there are hard cases here as well.  These are all standard features of parallel programs, not specific to OpenMP (e.g., C++17 parallel template operators are likely to produce these as well).
> 
> If you support all these capabilities in the IR, a *lot* more than 6000 LOC (Tapir’s statistic; I’m sorry I don’t recall the number for PIR) would probably have to be modified in LLVM.
> 
> 
>>> [...]
>>> (b) Add several new LLVM instructions such as, for parallelism, fork, spawn,
>>>   join, barrier, etc.
>>> [...]
>> 
>> For me fork and spawn are serving the same purpose, most new schemes suggested three new instructions in total.
> 
> A reasonable question is whether to use (#b) first-class instructions for some features, *in combination with* (#d) — i.e., region markers + metadata — or to use #d exclusively.  There are almost certainly far too many essential features in parallel programs to capture them all as new instructions. I don’t see a need to answer this question on Day 1.  Instead, we can begin with regions and metadata annotations, and then “promote” a few features to first-class instructions if the benefit is justified.
> 
> Does that make sense?
Now that I read this, I wonder if it isn’t close to what I tried to express above :)
— 
Mehdi
> On Jan 19, 2017, at 11:36 AM, Adve, Vikram Sadanand via llvm-dev <llvm...@lists.llvm.org> wrote:
>
> Hi Johannes,
>
>> I am especially curious where you get your data from. Tapir [0] (and to
>> some degree PIR [1]) have shown that, counterintuitively, only a few changes
>> to LLVM passes are needed. Tapir was recently used in an MIT class with a
>> lot of students and it seemed to work well with only minimal changes
>> to analysis and especially transformation passes.
>
> TAPIR is an elegant, small extension and, in particular, I think the idea of asymmetric parallel tasks and control flow is a clever way to express parallelism with serial semantics, as in Cilk. Encoding the control flow extensions as explicit instructions is orthogonal to that, though arguably more elegant than using region tags + metadata.
>
> However, Cilk is a tiny language compared with the full complexity of other languages, like OpenMP. To take just one example, TAPIR cannot express the ORDERED construct of OpenMP. A more serious concern, IMO, is that TAPIR (like Cilk) requires serial semantics, whereas there are many parallel languages, OpenMP included, that do not obey that restriction. Third, OpenMP has *numerous* clauses, e.g., REDUCTION or PRIVATE, that are needed because without that, you’d be dependent on fundamentally hard compiler analyses to extract the same information for satisfactory parallel performance; realistic applications cannot depend on the success of such analyses.
My bias is to use both (b) and (d), since they have complementary
strengths.  We should use (b) for expressing concepts that can't be
semantically modeled as a call or invoke (this branch takes both its
successors), and (d) for expressing things that can be (this call may
never return), and annotation-like things (this region (denoted by
def-use of a token) is a reduction).
I don't grok OpenMP, but perhaps we can come up with one or two
"generalized control flow"-type instructions that can be used to model
the non-call/invoke like semantics we'd like LLVM to know about, and
model the rest with (d)?
-- Sanjoy
On Thu, Jan 19, 2017 at 8:28 PM, Hal Finkel via llvm-dev
Yes, that's exactly what we have been looking at recently here, but the region tags seem to make it possible to express the control flow as well, so I think we could start with regions+metadata, as Hal and Xinmin proposed, and then figure out what needs to be first class instructions.
--Vikram Adve
On Thu, Jan 19, 2017 at 9:27 PM, Adve, Vikram Sadanand
<va...@illinois.edu> wrote:
> Hi Sanjoy,
>
> Yes, that's exactly what we have been looking at recently here, but
> the region tags seem to make it possible to express the control flow
> as well, so I think we could start with regions+metadata, as Hal and
I'm not yet convinced that region tags are sufficient to model exotic
control flow.
(I don't know OpenMP so this is a copy-pasted-edited example)
Say we have:
void main() {
  #pragma omp parallel num_threads(4)
  {
    int i = omp_get_thread_num();
    int val;
    compute_something_into_val(&val, i);
    a[i] = val;
  }
}
I presume the (eventual) intended lowering is something like this (if
the intended lowering is different than this, and avoids the issue I'm
trying to highlight then my point is moot):
void main() {
  tok = llvm.experimental.intrinsic_a();
  int i = omp_get_thread_num();
  i32* val = alloca i32
  compute_something_into_val(val, i);
  a[i] = *val;
  llvm.experimental.intrinsic_b(tok);
}
However, LLVM is free to hoist the alloca to the entry block:
void main() {
  i32* val = alloca i32
  tok = llvm.experimental.intrinsic_a();
  int i = omp_get_thread_num();
  compute_something_into_val(val, i);
  a[i] = *val;
  llvm.experimental.intrinsic_b(tok);
}
and now you have a race between the four parallel forks.
The problem here is that nothing in the IR expresses that we have four
copies of the region running "at the same time".  In fact, such a
control flow is alien to LLVM today.
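The race described above can be made concrete with a small deterministic model. This is a toy sketch in Python, not LLVM code: the four "threads" are simulated under one fixed schedule (every thread computes into val, then every thread stores to a[i]), and compute_something and run_region are hypothetical names introduced only for illustration. A single shared slot stands in for the alloca hoisted above the region.

```python
# Toy model (not LLVM): four logical threads execute the region body in two
# steps under a fixed schedule, so the outcome is reproducible.
NUM_THREADS = 4


def compute_something(i):
    # Hypothetical stand-in for compute_something_into_val().
    return 10 * (i + 1)


def run_region(shared_val_slot):
    """Run every thread's compute step, then every thread's store step.

    shared_val_slot=True models the alloca hoisted above the region (one slot
    for all threads); False models one private slot per thread.
    """
    a = [None] * NUM_THREADS
    slots = [None] if shared_val_slot else [None] * NUM_THREADS

    def slot_of(i):
        return 0 if shared_val_slot else i

    # Step 1 for every thread: write the computed value into "val".
    for i in range(NUM_THREADS):
        slots[slot_of(i)] = compute_something(i)
    # Step 2 for every thread: copy "val" into a[i].
    for i in range(NUM_THREADS):
        a[i] = slots[slot_of(i)]
    return a


private_result = run_region(shared_val_slot=False)
hoisted_result = run_region(shared_val_slot=True)
print(private_result)  # [10, 20, 30, 40]
print(hoisted_result)  # [40, 40, 40, 40]
```

With private slots each thread sees its own value; with the single hoisted slot, every thread reads whatever was written last under this schedule. A real run would be nondeterministic rather than consistently wrong.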
For instance, another evil optimization may turn:
int main() {
  int a[4];
  #pragma omp parallel num_threads(4)
  {
    int i = omp_get_thread_num();
    int val = compute_something_into_val(i);
    a[i] = val;
  }
  return a[0] + a[1];
}
to
int main() {
  int a[4];
  #pragma omp parallel num_threads(4)
  {
    int i = omp_get_thread_num();
    int val = compute_something_into_val(i);
    a[i] = val;
  }
  return undef;
}
since a[i] = val could have initialized at most one element in a.
Now you could say that the llvm.experimental.intrinsic_a and
llvm.experimental.intrinsic_b intrinsics are magic, and even such
"obvious" optimizations are not allowed to happen across them; but then
calls to these intrinsics are pretty fundamentally different from
"normal" calls, and are probably best modeled as new instructions.
You're going to have to do the same kind of auditing of passes either
way, and the only extra cost of a new instruction is the extra bitcode
reading / writing code.
I hope I made sense.
-- Sanjoy
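The second example in the message above (folding the return value to undef) can be sketched the same way. This is a toy Python model, not LLVM: UNDEF, run, and read_sum are hypothetical names, and treating "any undef operand makes the sum undef" is a simplification of LLVM's actual undef rules. The point is only that an optimizer reasoning with serial semantics sees the region body execute once, for a single value of i.

```python
# Toy model (not LLVM) of why "return a[0] + a[1]" looks foldable to undef
# when the region is assumed to execute once.
NUM_THREADS = 4
UNDEF = object()  # marker for an uninitialized array element


def compute_something(i):
    # Hypothetical stand-in for compute_something_into_val().
    return i + 1


def run(thread_ids):
    """Execute the region body once per id in thread_ids."""
    a = [UNDEF] * NUM_THREADS
    for i in thread_ids:
        a[i] = compute_something(i)
    return a


def read_sum(a):
    # Simplification: an undef operand makes the whole sum undef.
    if a[0] is UNDEF or a[1] is UNDEF:
        return UNDEF
    return a[0] + a[1]


# Serial reading: the body runs once, for whichever single thread id.
serial_results = [read_sum(run([i])) for i in range(NUM_THREADS)]
# Parallel reading: the body runs once per thread.
parallel_result = read_sum(run(range(NUM_THREADS)))

print(all(r is UNDEF for r in serial_results))  # True
print(parallel_result)                          # 3
```

Under the serial reading every possible execution leaves a[0] or a[1] uninitialized, so the fold looks valid. Under the parallel reading both elements are written and the sum is well defined.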
On 01/13/2017 12:11 PM, Mehdi Amini wrote:
    
On Jan 13, 2017, at 9:41 AM, Hal Finkel <hfi...@anl.gov> wrote:
On 01/13/2017 12:29 AM, Mehdi Amini wrote:
On Jan 12, 2017, at 5:02 PM, Hal Finkel <hfi...@anl.gov> wrote:
On 01/12/2017 06:20 PM, Reid Kleckner via llvm-dev wrote:
On Wed, Jan 11, 2017 at 8:13 PM, Mehdi Amini <mehdi...@apple.com> wrote:
Can you elaborate why? I’m curious.
The con of proposal c was that many passes would need to learn about many region intrinsics. With tokens, you only need to teach all passes about tokens, which they should already know about because WinEH and other things use them.
With tokens, we can add as many region-introducing intrinsics as makes sense without any additional cost to the middle end. We don't need to make one omnibus region intrinsic set that describes every parallel loop annotation scheme supported by LLVM. Instead we would factor things according to other software design considerations.
I think that, unless we allow frontends to add their own intrinsics without recompiling LLVM, this severely restricts the usefulness of this feature.
I’m not convinced that “building a frontend without recompiling LLVM while injecting custom passes” is a strong compelling use-case, i.e. can you explain why requiring such use-case/frontends to rebuild LLVM is so limiting?
I don't understand your viewpoint. Many frontends either compose their own pass pipelines or use the existing extension-point mechanism. Some frontends, Chapel for example, can insert code using custom address spaces and then insert passes later to turn accesses using pointers to those address spaces into runtime calls. This is the kind of design we'd like to support, without forcing frontends to use custom versions of LLVM, but with annotated regions instead of just with address spaces.
I think we’re talking about two different things here: you mentioned originally “without recompiling LLVM”, which I don’t see as a major blocker, while you’re now clarifying, I think, that you’re more concerned about putting a requirement on a *custom* LLVM, as in “it wouldn’t work with the source from a vanilla upstream LLVM”, which I agree is a different story.
That said, it extends the point from the other email (in parallel) about the semantics of the intrinsics: while your solution allows these frontends to reuse the intrinsics, it means that upstream optimizations have to treat such intrinsics as optimization barriers because their semantics are unknown.
—Mehdi
On Jan 20, 2017, at 6:59 AM, Hal Finkel <hfi...@anl.gov> wrote:
On 01/13/2017 12:11 PM, Mehdi Amini wrote:
I see no reason why this needs to be true (at least so long as you're willing to accept a certain amount of "as if" parallelism).
Moreover, if it is true, then we'll lose the benefits of, for example, being able to hoist scalar loads out of parallel loops. We might need to include dependencies on "inaccessible memory", to cover natural runtime dependencies by default (this can be refined with custom AA logic), but that is not a complete code-motion barrier. Memory being explicitly managed will end up as arguments to the region intrinsics, so we'll automatically get more-fine-grained information.
I’m not sure how you imagine going around the optimization barrier that goes with “this intrinsic has an unknown semantic that can impact the control flow of the program implicitly”, unless it acts as a “hint” only (but I don’t believe it is the direction?).—Mehdi
On Jan 20, 2017, at 10:45 AM, Yonghong Yan <yan...@gmail.com> wrote:
On Fri, Jan 20, 2017 at 12:52 PM, Mehdi Amini via llvm-dev <llvm...@lists.llvm.org> wrote:
On Jan 20, 2017, at 6:59 AM, Hal Finkel <hfi...@anl.gov> wrote:
On 01/13/2017 12:11 PM, Mehdi Amini wrote:
I see no reason why this needs to be true (at least so long as you're willing to accept a certain amount of "as if" parallelism).
Sorry, I didn’t quite get that?
Moreover, if it is true, then we'll lose the benefits of, for example, being able to hoist scalar loads out of parallel loops. We might need to include dependencies on "inaccessible memory", to cover natural runtime dependencies by default (this can be refined with custom AA logic), but that is not a complete code-motion barrier. Memory being explicitly managed will end up as arguments to the region intrinsics, so we'll automatically get more-fine-grained information.
Sanjoy gave an example of the kind of optimization that can break the semantic: http://lists.llvm.org/pipermail/llvm-dev/2017-January/109302.html ; I haven’t yet seen an explanation about how this is addressed?
If you were asking how this is addressed in the current clang/openmp, the code in the whole parallel region is outlined into a new function by the frontend, and the parallel fork-join is transformed into a runtime call (kmpc_fork_call) that takes as input a pointer to the outlined function, so procedure-based optimizations would not perform the optimizations Sanjoy listed.
Here we come to the interesting part: hoisting "i32* I = alloca 32" above the intrinsics requires updating the intrinsic's "QUAL.PRIVATE" information.
This means that the optimizer has to be aware of it; am I missing the magic here?
I understand that an OpenMP-specific optimization can do it; the question is how an OpenMP-agnostic optimization is supposed to behave in the face of llvm.experimental.intrinsic_a.
— 
Mehdi
Yonghong, in our implementation (not open sourced), we don’t do outlining in the front end. See my previous reply to Mehdi’s email.
Xinmin
From: Yonghong Yan [mailto:y...@oakland.edu]
Sent: Friday, January 20, 2017 11:18 AM
To: Mehdi Amini
Cc: Tian, Xinmin; llvm...@lists.llvm.org; llvm-dev...@lists.llvm.org; Adve, Vikram Sadanand
Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations
Xinmin,
outlining turns a parallel program into a sequential one from the compiler's perspective, and that is why most of the parallel-ignorant passes would hurt. In your IR description for Sanjoy's example, does the current approach of outlining affect how the IR should be enhanced for parallelism?
For that specific example (or other SPMD analyses and optimizations) and what is implemented in clang, I am not sure whether we are going to change the frontend so as not to outline the parallel region, or allow certain optimizations, such as hoisting that alloca, to be performed in clang, which I believe is not desired. Or we could annotate the outlined function together with intrinsic_a so that hoisting can be performed, in which case intrinsic_a would look like this:
tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(), "QUAL.PRIVATE"(i32* %val, i32 %i), "QUAL.NUM_THREADS"(i32 4) plus info for OUTLINED_call.
Mehdi,
I think I am asking the same question as you asked me.
Yonghong
I'm going to club together some responses.
I agree that outlining function sub-bodies and passing in the function
pointers to said outlined bodies to OpenMP helpers lets us correctly
implement the semantics we need.  However, unless I severely
misunderstood the thread, I thought the key idea was to move *away*
from that representation and towards a representation that _allows_
optimization?
My problem with representing parallel regions with
intrinsic-denoted-regions is that we're lying to the optimizer about
what the code actually does.  Calls, even to intrinsics, can "at
worst" do some combination of the following:
 - Write to and read from arbitrary memory
 - Have UB (but we're allowed to pretend that they don't)
 - Throw an exception
 - Never return, either by infinite looping or by calling exit(0)
 - Have memory synchronization operations, like fences, atomic loads,
   stores etc.
 - Have side effects like IO, volatile writes
If an intrinsic's behavior can be explained by some subset of the
above, then you should not need to edit any pass to preserve
_correctness_ -- all optimization passes (today) conservatively assume
that calls that they don't understand have all of the behaviors
outlined above.
However, if to preserve *correctness* you have to edit optimization
passes and teach them that certain intrinsic calls have behavior
*outside* the set mentioned above then the instruction really does not
have "call semantics".  `call @llvm.experimental.region_begin()` is
really a fundamentally new instruction masquerading as an intrinsic,
and it is probably better to call a spade a spade and represent it as
a new instruction.
The setting for the examples I gave was not that "here is a case we
need to get right".  The setting was that "here is a *symptom* that
shows that we've lied to the optimizer".  We can go ahead and fix all
the symptoms by adding bailouts to the respective passes, but that
does not make us immune to passes that we don't know about
e.g. downstream passes, passes that will be added later.  It also puts
us in a weird spot around semantics of call instructions.
-- Sanjoy
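The argument in the message above amounts to a subset check: an intrinsic has "call semantics" only if everything it can do is already in the worst-case list that passes assume for an unknown call. A minimal Python sketch of that check follows. The behavior names, and the extra behavior "replicates_following_code" used to model a parallel region entry, are labels invented here for illustration and are not LLVM attributes.

```python
# Toy restatement of the "worst case for an unknown call" list as a set.
# All names below are illustrative labels, not LLVM attributes.
WORST_CASE_CALL = frozenset({
    "reads_writes_arbitrary_memory",
    "has_undefined_behavior",
    "throws",
    "never_returns",
    "synchronizes_memory",
    "has_side_effects",
})


def has_call_semantics(behaviors):
    """True if every behavior is one that passes already assume of a call."""
    return frozenset(behaviors) <= WORST_CASE_CALL


# An intrinsic that only touches memory and synchronizes fits the model.
fence_like = {"reads_writes_arbitrary_memory", "synchronizes_memory"}

# A region entry after which the following code runs on several threads at
# once does something outside the list (hypothetical behavior name).
region_begin_like = {
    "reads_writes_arbitrary_memory",
    "synchronizes_memory",
    "replicates_following_code",
}

print(has_call_semantics(fence_like))         # True
print(has_call_semantics(region_begin_like))  # False
```

When the check fails, conservative handling of unknown calls is no longer enough to preserve correctness, which is the case Sanjoy argues is better modeled as a new instruction.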
On Fri, Jan 20, 2017 at 11:22 AM, Tian, Xinmin via llvm-dev
Ok, but this looks like a “workaround" for your specific use-case, I don’t see how it can scale as a model-agnostic and general-purpose region semantic.
The fact that you needed this pre-step in the first place seems to indicate to me that it confirms what multiple people expressed in this thread, for example what Daniel wrote here: http://lists.llvm.org/pipermail/llvm-dev/2017-January/108997.html
— 
Mehdi
I’m not sure if we’re talking about the same thing here: my understanding at this point is that the design trade-off you’re making is gaining “simplicity” by sacrificing “correctness”.
Requiring the IR to stay in what you’re calling a “canonical” form in your answer to Sanjoy in order to not miscompile a program is not an approach that seems compatible with how we deal with the IR usually.
> Regardless of whether it is a new instruction or an intrinsic with a token/tag, it will consist of a model-agnostic part and a model-specific part. The package comes with a framework for parsing and using these intrinsics. See the reply I had for Sanjoy's email.
The answer to Sanjoy is not really helpful to clarify anything to me. At this point I still don’t understand how this is supposed to be correct in general.
It would be helpful to have a LangRef patch that describes the semantic associated to your region intrinsics. Then we may be able to process some examples through the formalized description.
— 
Mehdi
The first thing that comes to my mind is inlining, which can put the IR in a form that breaks the invariant you tried to enforce with your "prepare-phase" (for example by hoisting an alloca).
Remember that the prepare-phase is invoked in the FE or right after the FE, when inlining is not happening; that is why we don't call it a "pass". Chandler made a good point about this case a long time back.
Hoisting an alloca is totally OK. A new alloca is generated during outlining later on for anything marked as "private" (so long as the "private" information is saved in the tag). I thought we talked about this in an earlier email.
In this case, the inliner is taught to add all local variables to the tag of the enclosing parallel region, if there is an enclosing parallel region.
In our icc implementation, it is even simpler: as we have a routine-level symbol table, the inliner adds the "private" attribute to those local variables without checking the enclosing scope, and the parallelizer does check and use it.
Xinmin
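The scheme described in the message above (privatize at outlining time whatever the region's tag lists as private, and have the inliner extend that tag) can be sketched as a toy Python model. This is not Intel's or LLVM's implementation: run_outlined, the variable name "tmp", and the fixed two-step schedule are assumptions made for illustration. "tmp" plays the role of a callee local that inlining has exposed inside the region.

```python
# Toy model (not LLVM): outlining gives each thread a fresh cell for every
# name in the region's private tag; any other name lives in one shared cell,
# as a hoisted alloca would.
NUM_THREADS = 4


def run_outlined(private_tag):
    """Run the region on NUM_THREADS logical threads after 'outlining'."""
    shared = {"tmp": None}
    per_thread = [{name: None for name in private_tag}
                  for _ in range(NUM_THREADS)]
    out = [None] * NUM_THREADS

    def cell(tid, name):
        # Private names resolve to the thread's own storage.
        return per_thread[tid] if name in private_tag else shared

    # Fixed schedule: every thread writes tmp, then every thread reads it.
    for tid in range(NUM_THREADS):
        cell(tid, "tmp")["tmp"] = 10 * (tid + 1)
    for tid in range(NUM_THREADS):
        out[tid] = cell(tid, "tmp")["tmp"]
    return out


tag_updated_by_inliner = run_outlined({"tmp"})
tag_left_untouched = run_outlined(set())
print(tag_updated_by_inliner)  # [10, 20, 30, 40]
print(tag_left_untouched)      # [40, 40, 40, 40]
```

In this model correctness depends on the pass that introduced "tmp" also updating the tag, which is the cost Mehdi questions in the replies that follow.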
From: mehdi...@apple.com [mailto:mehdi...@apple.com]
Sent: Tuesday, January 31, 2017 9:03 PM
To: Tian, Xinmin <xinmi...@intel.com>
Cc: Sanjoy Das <san...@playingwithpointers.com>; Adve, Vikram Sadanand <va...@illinois.edu>; llvm...@lists.llvm.org; llvm-dev...@lists.llvm.org
Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations
On Jan 31, 2017, at 7:53 PM, Tian, Xinmin <xinmi...@intel.com> wrote:
In this case, the inliner is taught to add all local variables to the tag of the enclosing parallel region, if there is an enclosing parallel region.
So isn’t it a good example that shows that your intrinsic *cannot* be opaque and that IR passes need to be modified to handle not only the IR-region intrinsic but also the specific semantic of the tag?
[XT] I thought we said a number of times that there are small changes to be made. I quoted a ballpark number, 2000 LOC vs. 6000 LOC, w.r.t. changes in one of the earlier emails.
It seems to me that this contradicts the claim that the “tag” specific semantic does not need to be handled by the optimizer and that the intrinsic can integrate seamlessly in LLVM, which invalidates the approach (of a generic intrinsic) entirely IMO.
Maybe you never intended to claim this, but this is a hidden cost in the original RFC, and I suspect this cost has to be carefully evaluated. At this point I’m not sure it is worth discussing anything further without seeing a proper LangRef update.
[XT] All we said is to minimize cost when it is possible. The intrinsic functions are a generic mechanism for representing a directive and region, such as prefetch, unroll, omp, .... Each instance of them will have its own semantics, which will be described in follow-up RFCs.
In our icc implementation, it is even simpler: as we have a routine-level symbol table, the inliner adds the "private" attribute to those local variables without checking the enclosing scope, and the parallelizer does check and use it.
Again, you’re trying to address a specific case, while I’m just trying to identify a generic class of problem that your proposal fails to address explicitly.
[XT] It looks like there is a misunderstanding of the proposal. The proposal is to build up an experimental framework with the proposed interface to evaluate the set of changes (or cost) needed for the use cases we know of. Hal and I carefully positioned this RFC; all the intrinsics are even named as experimental. So far, the cost of handling the cases identified is within the range of our expectations.
On Jan 31, 2017, at 10:59 PM, Tian, Xinmin <xinmi...@intel.com> wrote:
This didn’t mean that the changes were meant specifically for OpenMP. My understanding was that this proposal is for a generic "IR-level Region Annotations” mechanism, and that’s what the changes were for. Now it ends up being “let’s support OpenMP semantic without adding openmp in the intrinsic names”.
At this point I don’t see any advantage in having a “generic intrinsic" that has an opaque tag since all the semantic is in the tag anyway. I’d have to see what is really “generic” in the handling of it...
Reid identified this very early in the thread (he is a lot more perspicacious than I am) here: http://lists.llvm.org/pipermail/llvm-dev/2017-January/108914.html
On Feb 1, 2017, at 9:34 AM, Hal Finkel <hfi...@anl.gov> wrote:
On 02/01/2017 01:29 AM, Mehdi Amini via llvm-dev wrote:
> This didn’t mean that the changes were meant specifically for OpenMP. My understanding was that this proposal is for a generic "IR-level Region Annotations” mechanism, and that’s what the changes were for. Now it ends up being “let’s support OpenMP semantic without adding openmp in the intrinsic names”.
The point here is to abstract the properties about which other passes might need to know by using a set of generic intrinsics.
The fact that you can't hoist allocas past one of these intrinsics is nowhere close to saying that the individual optimization passes need to know anything about OpenMP, parallelism, etc. Regardless of how many LOC are in Intel's prototype, we're obviously aiming for minimal impact on the current upstream infrastructure.
At this point I don’t see any advantage in having a “generic intrinsic” that has an opaque tag, since all the semantics are in the tag anyway. I’d have to see what is really “generic” in the handling of it...
This is completely opposite to the point. The semantics relevant to the rest of the optimization pipeline should be in the intrinsics themselves. I've yet to see anything to suggest that we can't do that.
Reid identified this very early in the thread (he is a lot more perspicacious than I am) here: http://lists.llvm.org/pipermail/llvm-dev/2017-January/108914.html
There are multiple levels here:
a) Semantics relevant to the rest of the pipeline
b) Semantics relevant to parallelism-specific optimizations (e.g. redundant barrier removal)
c) Semantics relevant to specific programming model / extension (OpenMP, OpenACC, C++ parallel algorithms, whatever)
We'd like to separate these three levels, and I believe the proposed scheme allows us to do that. Obviously, this assumes that we can indeed have a small set of intrinsics that satisfy the needs of (a). Furthermore, if we're going to use intrinsics, we need to decide whether all of the relevant semantics are reasonable to encode in intrinsics (e.g. is it reasonable to have an intrinsic past which you can't hoist an alloca, or would that need to be an instruction, etc.).
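One way to picture the separation between levels (a) and (c) is the C++ sketch below. It is not LLVM code: `RegionMarker`, `blocksAllocaHoisting`, `canHoistAllocaPast`, `isOpenMPParallel`, and the tag string "omp.parallel" are all hypothetical names made up for illustration. The only point is that a generic pass reads a property carried by the marker itself and never parses the tag, while the model-specific lowering is the one place that does.

```cpp
#include <string>

// Hypothetical model of a region-marker intrinsic (not an LLVM type).
struct RegionMarker {
  // Level (a): a generic property that any optimization pass may query.
  bool blocksAllocaHoisting;
  // Level (c): opaque payload, interpreted only by model-specific lowering.
  std::string tag;
};

// A generic pass consults only the level (a) property and deliberately
// never inspects the tag.
inline bool canHoistAllocaPast(const RegionMarker &M) {
  return !M.blocksAllocaHoisting;
}

// Model-specific lowering is the only code that interprets the tag.
// The tag spelling here is a placeholder, not a proposed name.
inline bool isOpenMPParallel(const RegionMarker &M) {
  return M.tag == "omp.parallel";
}
```

Under these assumptions, two markers with different tags but the same `blocksAllocaHoisting` value are indistinguishable to the generic pass, which is the property level (a) is meant to have.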
+1 to Hal’s point.
    
Remember that the prepare-phase is invoked in the FE or right after the FE, when inlining has not happened yet; that is why we don't call it a "pass". Chandler made a good point about this case a long time back.
What I was describing is the inlining in the optimizer pipeline.
Hoisting alloca is totally ok. A new alloca is generated during outlining later on for anything marked as "private" (so long as the "private" information is saved in the tag). I thought we talked about this in an earlier email.
Can you describe how (and at which point) you get the private for "var" added to the tag?
On Tue, Jan 31, 2017 at 5:26 PM, Tian, Xinmin <xinmi...@intel.com> wrote:
> [XT] Back from Biz trips, trying to catch up with the discussion.
>
>>>>>I agree that outlining function sub-bodies and passing in the function pointers to said outlined bodies to OpenMP helpers lets us correctly implement the semantics we need.  However, unless I severely misunderstood the thread, I thought the key idea was to move *away* from that representation and towards a representation that _allows_ optimization?
>
> [XT]: Your understanding is correct.  But the IR-level region annotation RFC is not just for OpenMP. OpenMP is one of the use cases.
>
>>>>>My problem with representing parallel regions with intrinsic-denoted-regions is that we're lying to the optimizer about what the code actually does.  Calls, even to intrinsics, can "at worst" do some combination of the following:
>
> - Write to and read from arbitrary memory
>  - Have UB (but we're allowed to pretend that they don't)
>  - Throw an exception
>  - Never return, either by infinite looping or by calling exit(0)
>  - Have memory synchronization operations, like fences, atomic loads,
>    stores etc.
>  - Have side effects like IO, volatile writes
>
> [XT] Based on Google and Xilinx's suggestion, the IR-level region
> annotation can use tokens and tags with intrinsic functions to model
> regions and memory dependencies (use/def).  The items in the above list
> are handled based on language rules. E.g. the OpenMP rules say that, in
> a parallel region, throwing an exception is allowed, but it must be
> caught within the region, i.e. no control-flow edge is allowed to cross
> the region boundary.  "exit" is the one exception that is allowed, as it
> terminates the program.  Our solution is to have the FE and/or one
> central place in the ME deal with language specifics.
I should have been clearer: I am not talking about user code here.
I'm trying to say that intrinsic calls are, first and foremost, calls;
and their behavior should be describable within the framework above.
This precludes using intrinsics to represent magic control flow.
For instance, if we go back to one of the problematic transformations
from earlier in the mail:
int main() {
  int a[4];
  #pragma omp parallel num_threads(4)
  {
    int i = omp_get_thread_num();
    int val = compute_something_into_val(i);
    a[i] = val;
  }
  return a[0] + a[1];
}
to
int main() {
  int a[4];
  #pragma omp parallel num_threads(4)
  {
    int i = omp_get_thread_num();
    int val = compute_something_into_val(i);
    a[i] = val;
  }
  return undef;
}
or some variant of this.  Say the parallel region is demarcated by a
pair of intrinsics.  Said intrinsics could have any subset of the
behaviors listed above, but the transform would still be valid; and
so to rule out this transform you would need these demarcating (or
region-creating) intrinsics to have some exotic property not in that
list.  That's the bit I'm worried about.  IOW I'm worried about
changes like this:
  if (!isa<TerminatorInst>(I))
    V.push_back(I);
  else
    return false;
to
  if (!isa<TerminatorInst>(I) && !llvm::isTerminatorLikeIntrinsic(I))
    V.push_back(I);
  else
    return false;
If you can define the semantics of the intrinsics you want to add as a
subset of the behaviors specified above then I have no problems.
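To make the "subset" condition concrete, here is a small C++ sketch. Nothing in it is an LLVM API: the `Effect` flags simply mirror the list of worst-case call behaviors quoted above, and `MagicControlFlow`, `CallEffects`, and `fitsCallSemantics` are hypothetical names used only for illustration.

```cpp
#include <cstdint>

// Behaviors an ordinary call may have "at worst", mirroring the list above.
enum Effect : std::uint32_t {
  ReadWriteMemory   = 1u << 0, // read from and write to arbitrary memory
  UndefinedBehavior = 1u << 1, // have UB
  Throws            = 1u << 2, // throw an exception
  NeverReturns      = 1u << 3, // infinite loop or exit(0)
  Synchronizes      = 1u << 4, // fences, atomic loads and stores
  SideEffects       = 1u << 5, // IO, volatile writes
  // Hypothetical stand-in for any exotic property outside the list.
  MagicControlFlow  = 1u << 6,
};

// Everything a call is allowed to do under the framework above.
constexpr std::uint32_t CallEffects = ReadWriteMemory | UndefinedBehavior |
                                      Throws | NeverReturns | Synchronizes |
                                      SideEffects;

// True if the given effect set is a subset of what a call may do.
constexpr bool fitsCallSemantics(std::uint32_t Effects) {
  return (Effects & ~CallEffects) == 0;
}
```

In this model, an intrinsic described as `ReadWriteMemory | Synchronizes` passes the check, while one that also needs `MagicControlFlow` does not, and that second case is the one being objected to.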
The other option is to define all or a subset of intrinsic calls as
*not* `CallInst`s but as fundamentally different "things" in the LLVM
Instruction hierarchy; at which point we can give them whatever
semantics we want.  What I don't want is a `CallInst` doing something
a `CallInst` ought not do.
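A rough C++ sketch of that second option, using a toy instruction model rather than LLVM's real class hierarchy. `Kind`, `Inst`, and `collectUntilTerminator` are hypothetical, and the sketch assumes a pass that enumerates the instruction kinds it understands. Because the region marker is its own kind and not a call, the pass has to state explicitly what it does with it, and here it conservatively gives up.

```cpp
#include <vector>

// Toy instruction kinds (hypothetical, not LLVM's hierarchy).
enum class Kind { Arithmetic, Call, RegionMarker, Terminator };

struct Inst {
  Kind K;
};

// Appends instructions from Block to V, in the spirit of the snippet
// above, and returns false as soon as it meets something it must not
// collect. RegionMarker is a distinct kind, so it is never mistaken for
// an ordinary Call with call semantics.
inline bool collectUntilTerminator(const std::vector<Inst> &Block,
                                   std::vector<Inst> &V) {
  for (const Inst &I : Block) {
    switch (I.K) {
    case Kind::Arithmetic:
    case Kind::Call:
      V.push_back(I); // ordinary instruction: safe to collect
      break;
    case Kind::Terminator:
    case Kind::RegionMarker:
      return false;   // bail out, as the original code does for terminators
    }
  }
  return true;
}
```

The switch has no `default` case on purpose: with typical compiler warnings enabled, adding a new kind to the enum flags every switch that has not yet decided how to handle it.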
>>>>>However, if to preserve *correctness* you have to edit optimization passes and teach them that certain intrinsic calls have behavior
> *outside* the set mentioned above then the instruction really does not have "call semantics".  `call @llvm.experimental.region_begin()` is really a fundamentally new instruction masquerading as an intrinsic, and it is probably better to call a spade a spade and represent it as a new instruction.
>
> [XT] Yes and no. Yes w.r.t. region scope annotation; no in that it is
> more than one new instruction, it is more like a sequence of
> instructions. Assume we have a "fork" instruction: omp fork and cilk
> fork/spawn semantics differ in terms of stack frame
> allocation and ABI.  When we introduce a new instruction, its exact
> semantics need to be defined and can't be altered. Thus, we proposed to
Can that problem be solved by (this is both a technical and cultural
shift) allowing "experimental" instructions, whose semantics can be
evolved over time with no guarantee for backwards compatibility?
> start with experimental_intrinsics, an approach that is proven to
> work. We can always convert the intrinsics with token/tag to
> instructions once we have enough solid cases / justification for the
> model-agnostic part of the conversion.
I don't have any problem with that in principle, and I'd
wholeheartedly agree if the intrinsics had call semantics.
>>>>>>The setting for the examples I gave was not that "here is a case we need to get right".  The setting was that "here is a *symptom* that shows that we've lied to the optimizer".  We can go ahead and fix all the symptoms by adding bailouts to the respective passes, but that does not make us immune to passes that we don't know about e.g. downstream passes, passes that will be added later.  It also puts us in a weird spot around semantics of call instructions.
>
> [XT] I would say, it is a design trade-off between having a central
It *is* a design trade-off but I think you're discounting how much
we're trading-off on one side of the equation.
> place to deal with specifics and making drastic changes from
> day one.  Our process is to have a central place to get everything
> working, then turn off the support for some "symptoms" in this
> central place one by one to trigger downstream failures and fix them.  I
Again, I have no issue with this process; but I'm speculating that the
nature of the fixes will add more technical debt to LLVM than we're
comfortable with absorbing.
(Assuming by "central place" you mean the frontend and by downstream
fixes you meant fixes to LLVM passes?)
> think our ultimate goal is more or less the same, just taking a different
> approach to get there.  The central place / prepare-phase for getting
> IR to a "canonical" form will help to address the issue of "downstream
> passes, passes that will be added later.  It also puts us in a weird
> spot around semantics of call instructions." that you mentioned.
>
> Thanks for all questions, discussions and feedback.
Thank you too! :)
-- Sanjoy