[llvm-dev] [RFC] IR-level Region Annotations


Hal Finkel via llvm-dev

Jan 11, 2017, 5:02:59 PM
to llvm-dev
A Proposal for adding an experimental IR-level region-annotation
infrastructure
=============================================================================

Hal Finkel (ANL) and Xinmin Tian (Intel)

This is a proposal for adding an experimental infrastructure to support
annotating regions in LLVM IR, making use of intrinsics and metadata, and
a generic analysis to allow transformations to easily make use of these
annotated regions. This infrastructure is flexible enough to support
representation of directives for parallelization, vectorization, and
offloading of both loops and more-general code regions. Under this scheme,
the conceptual distance between source-level directives and the region
annotations need not be significant, making the incremental cost of
supporting new directives and modifiers often small. It is not, however,
specific to those use cases.

Problem Statement
=================
There have been a series of discussions on LLVM IR extensions for
representing region and loop annotations for parallelism, and other
user-guided transformations, among both industrial and academic members
of the LLVM community. Increasing the quality of our OpenMP
implementation is an important motivating use case, but certainly not
the only one. For OpenMP in particular, we've discussed having an IR
representation for years. Presently, all OpenMP pragmas are transformed
directly into runtime-library calls in Clang, and outlining (i.e.
extracting parallel regions into their own functions to be invoked by
the runtime library) is done in Clang as well. Our implementation does
not further optimize OpenMP constructs, and a lot of thought has been
put into how we might improve this. For some optimizations, such as
redundant barrier removal, we could use a TargetLibraryInfo-like
mechanism to recognize frontend-generated runtime calls and proceed from
there. Cases where we lose pointer-aliasing information, information on
loop bounds, etc., we could address by improving our
inter-procedural-analysis capabilities. We should do that regardless.
However, there are important cases where the underlying scheme we want
to use to lower the various parallelism constructs, especially when
targeting accelerators, changes depending on what is in the parallel
region. In important cases where we can see everything (i.e. there
aren't arbitrary external calls), code generation should proceed in a
way that is very different from the general case. To have a sensible
implementation, this must be done after inlining. When using LTO, this
should be done during the link-time phase. As a result, we must move
away from our purely-frontend-based lowering scheme. The question is
what to do instead, and how to do it in a way that is generally useful
to the entire community.

Designs previously discussed can be classified into four categories:

(a) Add a large number of new kinds of LLVM metadata, and use them to
    annotate each necessary instruction for parallelism, data
    attributes, etc.
(b) Add several new LLVM instructions such as, for parallelism, fork,
    spawn, join, barrier, etc.
(c) Add a large number of LLVM intrinsics for directives and clauses,
    each intrinsic representing a directive or a clause.
(d) Add a small number of LLVM intrinsics for region or loop
    annotations, represent the directive/clause names using metadata and
    the remaining information using arguments.

Here we're proposing (d), and below is a brief pros-and-cons analysis
based on these discussions and our own experiences of supporting
region/loop annotations in LLVM-based compilers. The table below shows a
short summary of our analysis.

Various commercial compilers (e.g. from Intel, IBM, Cray, PGI), and GCC
[1,2], have IR-level representations for parallelism constructs. Based
on experience from these previous developments, we'd like a solution for
LLVM that maximizes optimization enablement while minimizing the
maintenance costs and complexity increase experienced by the community
as a whole.

Representing the desired information in the LLVM IR is just the first
step. The challenge is to maintain the desired semantics without
blocking useful optimizations. With options (c) and (d), dependencies
can be preserved mainly based on the use/def chains of the arguments of
each intrinsic, and a manageable set of LLVM analyses and
transformations can be made aware of certain kinds of annotations in
order to enable specific optimizations. In this regard, options (c) and
(d) are close with respect to maintenance efforts. However, based on our
experiences, option (d) is preferable because it is easier to extend to
support new directives and clauses in the future without the need to add
new intrinsics as required by option (c).

Table 1. Pros/cons summary of LLVM IR experimental extension options

--------+----------------------+----------------------------------------------
Options | Pros                 | Cons
--------+----------------------+----------------------------------------------
(a)     | No need to add new   | LLVM passes do not always maintain metadata.
        | instructions or new  | Need to educate many passes (if not all) to
        | intrinsics.          | understand and handle them.
--------+----------------------+----------------------------------------------
(b)     | Parallelism becomes  | Huge effort to extend all LLVM passes and
        | a first-class        | code generation to support new instructions.
        | citizen.             | A large set of information still needs to be
        |                      | represented using other means.
--------+----------------------+----------------------------------------------
(c)     | Less impact on       | A large number of intrinsics must be added.
        | existing LLVM        | Some optimizations need to be educated to
        | passes. Fewer        | understand them.
        | requirements for     |
        | passes to maintain   |
        | metadata.            |
--------+----------------------+----------------------------------------------
(d)     | Minimal impact on    | Some optimizations need to be educated to
        | existing LLVM        | understand them.
        | optimization passes. |
        | No requirement for   |
        | all passes to        |
        | maintain a large set |
        | of metadata with     |
        | values; directive    |
        | and clause names use |
        | metadata strings.    |
--------+----------------------+----------------------------------------------


Regarding (a), LLVM already uses metadata for certain loop information
(e.g. annotations directing loop transformations and assertions about
loop-carried dependencies), but there is no natural or consistent way to
extend this scheme to represent necessary data-movement or region
information.
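For reference, a minimal sketch of that existing loop-metadata mechanism, in
current IR syntax (llvm.loop.vectorize.enable is one of the standard loop
hints):

  ; loop latch; the !llvm.loop attachment carries the loop hints
  br i1 %exitcond, label %exit, label %body, !llvm.loop !0

  !0 = distinct !{!0, !1}
  !1 = !{!"llvm.loop.vectorize.enable", i1 true}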


New Intrinsics for Region and Value Annotations
===============================================
The following new (experimental) intrinsics are proposed, which allow:

a) annotating a code region marked with directives/pragmas;
b) annotating values associated with the region (or loops), that is,
   those values associated with directives/pragmas; and
c) providing information on LLVM IR transformations needed for the
   annotated code regions (or loops).

These can be used both by frontends and also by transformation passes
(e.g. automated parallelization). The names used here are similar to
those used by our internal prototype, but obviously we expect a
community bikeshed discussion.

def int_experimental_directive : Intrinsic<[], [llvm_metadata_ty],
                                           [IntrArgMemOnly],
                                           "llvm.experimental.directive">;

def int_experimental_dir_qual : Intrinsic<[], [llvm_metadata_ty],
                                          [IntrArgMemOnly],
                                          "llvm.experimental.dir.qual">;

def int_experimental_dir_qual_opnd : Intrinsic<[],
                                          [llvm_metadata_ty, llvm_any_ty],
                                          [IntrArgMemOnly],
                                          "llvm.experimental.dir.qual.opnd">;

def int_experimental_dir_qual_opndlist : Intrinsic<[],
                                          [llvm_metadata_ty, llvm_vararg_ty],
                                          [IntrArgMemOnly],
                                          "llvm.experimental.dir.qual.opndlist">;

Note that calls to these intrinsics might need to be annotated with the
convergent attribute when they represent fork/join operations, barriers,
and similar.
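For example, a barrier-like directive could be emitted as follows (a minimal
sketch in the proposal's notation; the !DIR.OMP.BARRIER string is
illustrative rather than taken from the specification draft):

  call void @llvm.experimental.directive(metadata !0) #0

  !0 = metadata !{metadata !DIR.OMP.BARRIER}
  attributes #0 = { convergent }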

Usage Examples
==============

This section shows a few examples using these experimental intrinsics.
LLVM developers who use these intrinsics can define their own MDString
values. All details of using these intrinsics to represent OpenMP 4.5
constructs are described in [1][3].


Example I: An OpenMP combined construct

#pragma omp target teams distribute parallel for simd
loop

LLVM IR
-------
call void @llvm.experimental.directive(metadata !0)
call void @llvm.experimental.directive(metadata !1)
call void @llvm.experimental.directive(metadata !2)
call void @llvm.experimental.directive(metadata !3)
loop
call void @llvm.experimental.directive(metadata !6)
call void @llvm.experimental.directive(metadata !5)
call void @llvm.experimental.directive(metadata !4)

!0 = metadata !{metadata !DIR.OMP.TARGET}
!1 = metadata !{metadata !DIR.OMP.TEAMS}
!2 = metadata !{metadata !DIR.OMP.DISTRIBUTE.PARLOOP.SIMD}

!6 = metadata !{metadata !DIR.OMP.END.DISTRIBUTE.PARLOOP.SIMD}
!5 = metadata !{metadata !DIR.OMP.END.TEAMS}
!4 = metadata !{metadata !DIR.OMP.END.TARGET}

Example II: Assume x, y, and z are int variables, and s is a non-POD
variable. Then lastprivate(x,y,s,z) is represented as:

LLVM IR
-------
call void @llvm.experimental.dir.qual.opndlist(
        metadata !1, %x, %y, metadata !2, %s, %ctor, %dtor, %z)

!1 = metadata !{metadata !QUAL.OMP.LASTPRIVATE}
!2 = metadata !{metadata !QUAL.OPND.NONPOD}

Example III: A prefetch pragma example

// issue vprefetch1 for xp with a distance of 20 vectorized iterations ahead
// issue vprefetch0 for yp with a distance of 10 vectorized iterations ahead
#pragma prefetch x:1:20 y:0:10
for (i=0; i<2*N; i++) { xp[i*m + j] = -1; yp[i*n +j] = -2; }

LLVM IR
-------
call void @llvm.experimental.directive(metadata !0)
call void @llvm.experimental.dir.qual.opndlist(metadata !1, %xp, 1, 20,
                                               metadata !1, %yp, 0, 10)
loop
call void @llvm.experimental.directive(metadata !3)
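The metadata definitions are omitted above; following the naming pattern of
the earlier examples, they would look something like the following (these
exact strings are illustrative guesses, not taken from [4]):

  !0 = metadata !{metadata !DIR.PREFETCH}
  !1 = metadata !{metadata !QUAL.PREFETCH.HINT}
  !3 = metadata !{metadata !DIR.END.PREFETCH}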

References
==========

[1] LLVM Framework and IR Extensions for Parallelization, SIMD
    Vectorization and Offloading Support. Xinmin Tian et al. SC'16
    LLVM-HPC3 Workshop, Salt Lake City, Utah, 2016.

[2] Extending LoopVectorizer Towards Supporting OpenMP 4.5 SIMD and
    Outer Loop Auto-Vectorization. Hideki Saito et al. LLVM Developers'
    Meeting 2016, San Jose.

[3] Intrinsics, Metadata, and Attributes: The Story Continues! Hal
    Finkel. LLVM Developers' Meeting 2016, San Jose.

[4] LLVM Intrinsic Function and Metadata String Interface for Directive
    (or Pragma) Representation. Specification Draft v0.9, Intel
    Corporation, 2016.


Acknowledgements
================
We would like to thank Chandler Carruth (Google), Johannes Doerfert
(Saarland Univ.), Yaoqing Gao (Huawei), Michael Wong (Codeplay), Ettore
Tiotto, Carlo Bertolli, Bardia Mahjour (IBM), and all other LLVM-HPC IR
Extensions WG members for their constructive feedback on the LLVM
framework and IR extension proposal.

Proposed Implementation
=======================

Two sets of patches supporting these experimental intrinsics and
demonstrating their usage are ready for community review:

a) Clang patches that support core OpenMP pragmas using this approach.
b) W-Region framework patches: CFG restructuring to form single-entry,
   single-exit work regions (W-Regions) based on annotations;
   demand-driven intrinsic parsing; WRegionInfo collection and analysis
   passes; and dump functions for WRegionInfo.

On top of this functionality, we will provide the transformation patches
for core OpenMP constructs (e.g. starting with the "#pragma omp parallel
for" loop for lowering and outlining, and "#pragma omp simd" to hook it
up with LoopVectorize.cpp). We have internal implementations for many
constructs now. We will break this functionality up to create a series
of patches for community review.

--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory


David Majnemer via llvm-dev

Jan 11, 2017, 5:18:45 PM
to Hal Finkel, llvm-dev
FWIW, we needed to maintain single-entry multiple-exit regions for WinEH, and we accomplished it via a different mechanism.

We had an instruction which produces a value of token type (http://llvm.org/docs/LangRef.html#token-type), which let us establish the region, and another instruction to exit the region by consuming it. The dominance rules allowed us to avoid situations where the compiler might trash the regions in weird ways, and made sure that regions would be left unharmed.

AFAIK, a similar approach using token could work here. I think it would reduce the amount of stuff you'd need LLVM to maintain.
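In rough IR terms, the pattern is (the intrinsic names below are
placeholders for whatever region pair would be defined, not existing LLVM
intrinsics):

  %rgn = call token @llvm.experimental.region.entry()
  ; ... region body; all uses of %rgn must be dominated by its definition ...
  call void @llvm.experimental.region.exit(token %rgn)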

Tian, Xinmin via llvm-dev

Jan 11, 2017, 5:53:14 PM
to David Majnemer, Hal Finkel, llvm...@lists.llvm.org

David, one quick question: is there a way to preserve and associate a set of “properties, value info/attrs” with a given region using a token?

 

Thanks,

Xinmin  

Hongbin Zheng via llvm-dev

Jan 11, 2017, 6:09:24 PM
to Tian, Xinmin, llvm...@lists.llvm.org
We are experimenting with a similar thing on SESE regions. We introduce an intrinsic to produce a token and another to consume the token. These two intrinsics mark the region, and we annotate extra information as an operand bundle (OpBundle) of the intrinsic that produces the token.

Thanks
Hongbin

Tian, Xinmin via llvm-dev

Jan 11, 2017, 6:19:58 PM
to Hongbin Zheng, llvm...@lists.llvm.org

Would you send us the LLVM IR for the example below using tokens and operand bundles, so we can understand better? Thanks.

 

#pragma omp target teams distribute parallel for simd shared(xp, yp) linear(i) firstprivate(m, n) map(m, n)
for (i=0; i<2*N; i++) { xp[i*m + j] = -1; yp[i*n +j] = -2; }

#pragma prefetch x:1:20 y:0:10
for (i=0; i<2*N; i++) { xp[i*m + j] = -1; yp[i*n +j] = -2; }




Hongbin Zheng via llvm-dev

Jan 11, 2017, 6:29:30 PM
to Tian, Xinmin, llvm...@lists.llvm.org
I am not an OpenMP expert, so some of the annotations may be wrong:

// CHECK: [[ENTRY:%[a-zA-Z0-9\.]+]] = tail call token @llvm.directive.scope.entry()
//   [ "target teams distribute"(), "parallel for"(), "simd"(),
//     "shared"(i32 *xp, i32 *yp), "linear_iv"(),
//     "firstprivate"(i32 m, i32 n), "map"(m, n) ]
// notice that I use "linear_iv" for the linear induction variable; you may want to fix this
#pragma omp target teams distribute parallel for simd shared(xp, yp) linear(i) firstprivate(m, n) map(m, n)
for (i=0; i<2*N; i++) { xp[i*m + j] = -1; yp[i*n +j] = -2; }

// CHECK: tail call void @llvm.directive.scope.exit(token [[ENTRY]])


// CHECK: [[ENTRY:%[a-zA-Z0-9\.]+]] = tail call token @llvm.directive.scope.entry() [ "prefetch"(i32 *xp, i64 1, i64 20, i32 *yp, i64 0, i64 10) ]
#pragma prefetch x:1:20 y:0:10
for (i=0; i<2*N; i++) { xp[i*m + j] = -1; yp[i*n +j] = -2; }
// CHECK: tail call void @llvm.directive.scope.exit(token [[ENTRY]])

Tian, Xinmin via llvm-dev

Jan 11, 2017, 6:47:53 PM
to Hongbin Zheng, llvm...@lists.llvm.org

Interesting, this is similar to what we have.

 

One more question: the values in the operand bundles below (e.g. m and n), are they represented as LLVM Values? In other words, does the LLVM optimizer update them? E.g. if %m is renamed %m.1 in the loop, is the "m" in the token call updated as well? In the RFC, the "m" is an argument of the intrinsic call, so all use-def info is visible to the optimizer, and the optimizer updates it during optimization like regular function arguments. I am trying to understand if there is any difference between the token scheme and the intrinsic scheme in this regard.

 

tail call token @llvm.directive.scope.entry() [ "target teams distribute"(),
    "parallel for"(), "simd"(), "shared"(i32 *xp, i32 *yp), "linear_iv"(),
    "firstprivate"(i32 m, i32 n), "map"(m, n) ]

 

 

Reid Kleckner via llvm-dev

Jan 11, 2017, 6:51:11 PM
to David Majnemer, llvm-dev
+1, tokens are the current True Way to create single-entry multi-exit regions. Your example for an annotated loop would look like:

  %region = call token @llvm.openmp.regionstart(metadata ...) ; whatever parameters you need here
  loop
  call void @llvm.openmp.regionend(token %region)

If you use tokens, I would recommend proposal (c), where you introduce new intrinsics for every new kind of region, instead of adding one overly generic set of region intrinsics.

We already have a way to form regions with real barriers, and it's tokens.

Hongbin Zheng via llvm-dev

Jan 11, 2017, 6:57:44 PM
to Tian, Xinmin, llvm...@lists.llvm.org
Yes, those are LLVM SSA values. "map" (m, n)  should be "map" (i32 m, i32 n) .

Thanks
Hongbin

Tian, Xinmin via llvm-dev

Jan 11, 2017, 7:01:46 PM
to Hongbin Zheng, llvm...@lists.llvm.org

And “map” and “firstprivate” etc. are represented as MDStrings, right? Thanks.

Hongbin Zheng via llvm-dev

Jan 11, 2017, 7:07:55 PM
to Tian, Xinmin, llvm...@lists.llvm.org
I think they are not MDStrings, but "bundle tags" managed by LLVMContextImpl::getOrInsertBundleTag.

Hongbin Zheng via llvm-dev

Jan 11, 2017, 7:09:44 PM
to Reid Kleckner, llvm-dev
On Wed, Jan 11, 2017 at 3:51 PM, Reid Kleckner via llvm-dev <llvm...@lists.llvm.org> wrote:
> If you use tokens, I would recommend proposal (c), where you introduce new intrinsics for every new kind of region, instead of adding one overly generic set of region intrinsics.

Maybe we can come up with several categories of regions, and create a new intrinsic for each category, instead of creating a new intrinsic for every *kind*.

Thanks
Hongbin

Hongbin Zheng via llvm-dev

Jan 11, 2017, 7:10:52 PM
to Tian, Xinmin, llvm...@lists.llvm.org
On Wed, Jan 11, 2017 at 4:07 PM, Hongbin Zheng <ethe...@gmail.com> wrote:
> I think they are not MDStrings, but "bundle tags" managed by LLVMContextImpl::getOrInsertBundleTag.

I just treat them as something like the string returned by Inst->getName().

Tian, Xinmin via llvm-dev

Jan 11, 2017, 7:32:17 PM
to Hongbin Zheng, llvm...@lists.llvm.org

Got it. Thanks.

Daniel Berlin via llvm-dev

Jan 11, 2017, 8:15:16 PM
to Hal Finkel, llvm-dev
def int_experimental_directive : Intrinsic<[], [llvm_metadata_ty],
                                           [IntrArgMemOnly],
                                           "llvm.experimental.directive">;

def int_experimental_dir_qual : Intrinsic<[], [llvm_metadata_ty],
                                          [IntrArgMemOnly],
                                          "llvm.experimental.dir.qual">;

def int_experimental_dir_qual_opnd : Intrinsic<[],
                                          [llvm_metadata_ty, llvm_any_ty],
                                          [IntrArgMemOnly],
                                          "llvm.experimental.dir.qual.opnd">;

def int_experimental_dir_qual_opndlist : Intrinsic<[],
                                          [llvm_metadata_ty, llvm_vararg_ty],
                                          [IntrArgMemOnly],
                                          "llvm.experimental.dir.qual.opndlist">;


I'll bite.

What does argmemonly mean when the operands are metadata? :)

If the rest is an attempt to keep the intrinsic from being floated or removed, I'm strongly against extending a mechanism we already know to have a significant effect on optimization (fake memory dependence) to do this. Particularly for something so major.

Mehdi Amini via llvm-dev

Jan 11, 2017, 11:13:34 PM
to Reid Kleckner, llvm-dev
On Jan 11, 2017, at 3:51 PM, Reid Kleckner via llvm-dev <llvm...@lists.llvm.org> wrote:
> If you use tokens, I would recommend proposal (c), where you introduce new intrinsics for every new kind of region, instead of adding one overly generic set of region intrinsics.

Can you elaborate why? I’m curious.

Thanks,

— 
Mehdi

Reid Kleckner via llvm-dev

Jan 12, 2017, 7:20:38 PM
to Mehdi Amini, llvm-dev
On Wed, Jan 11, 2017 at 8:13 PM, Mehdi Amini <mehdi...@apple.com> wrote:
> Can you elaborate why? I’m curious.

The con of proposal (c) was that many passes would need to learn about many region intrinsics. With tokens, you only need to teach all passes about tokens, which they should already know about because WinEH and other things use them.

With tokens, we can add as many region-introducing intrinsics as makes sense without any additional cost to the middle end. We don't need to make one omnibus region intrinsic set that describes every parallel loop annotation scheme supported by LLVM. Instead we would factor things according to other software design considerations.

Hal Finkel via llvm-dev

Jan 12, 2017, 8:02:20 PM
to Reid Kleckner, Mehdi Amini, llvm-dev
I think that, unless we allow frontends to add their own intrinsics without recompiling LLVM, this severely restricts the usefulness of this feature. The motivation here is to support frontends inserting custom region annotations which are handled by custom passes the frontends inject into the (end of the) pipeline.

 -Hal

Adve, Vikram Sadanand via llvm-dev

Jan 13, 2017, 1:13:59 AM
to llvm-dev, llvm-dev...@lists.llvm.org
> (d) Add a small number of LLVM intrinsics for region or loop annotations,
> represent the directive/clause names using metadata and the remaining
> information using arguments.
>
> Here we're proposing (d),

I think this would serve the goal of communicating source-level directives and annotations down to LLVM passes and back-ends, while deferring inlining and allowing optimizations and code-generation for parallel code to happen more effectively at the IR level. Essentially, you’re tunneling language-specific information through the front-end, incorporating the new information in the IR using minimally invasive mechanisms, but the contents (captured in the metadata strings) are entirely language-specific, both in detailed syntax and semantics.

However, as you also said, a parallel IR needs to be able to support a wide range of parallel languages, besides OpenMP. Two near-term examples I can think of are C++17 and Cilk (or Cilk++). I can imagine adding support for other existing languages like Halide and perhaps the PGAS languages like UPC. Many of the *parallel* optimizations and code-generation tasks are likely to be common for these different languages (e.g., optimizations of reductions, parallel loop fusion, or GPU code generation).

Fundamentally, I think we should keep most of the parallel IR extensions as language-neutral as possible. There are a number of common parallel constructs that are common to many, perhaps most, of these parallel languages: parallel loops, barriers, associative and non-associative reductions, loop scheduling specifications, affinity specifications, etc. It seems to me that we could define a core set of parallel constructs that capture *most* of the common requirements of these languages. How this is encoded is really a separate issue, e.g., perhaps(*) we could use approach (d). We’d essentially be doing what you’re proposing for OpenMP, but with more language-neutral structure and semantics for the parallel constructs. The corresponding language features would be directly lowered to these constructs. The missed features could use the per-language encodings you’re describing, but with a good design they should be relatively few such features. The overall design complexity and the impact on the existing passes should be no greater, and potentially less, than what you’ve proposed.


(*) Or possibly some combination of (b) and (d). In particular, after thinking about this quite a bit recently, I’m not convinced that flexible parallel control flow, e.g., task creation and joins, are best encoded as intrinsics. Adding a very small number of first-class instructions for them may achieve this goal more effectively. But I’d be happy to be convinced otherwise. In any case, that’s a detail we can discuss separately — it doesn’t change my main point that we should keep most of the parallel IR as language-neutral as possible.


-—Vikram

// Vikram S. Adve
// Professor, Department of Computer Science
// University of Illinois at Urbana-Champaign
// va...@illinois.edu
// http://llvm.org




Mehdi Amini via llvm-dev

Jan 13, 2017, 1:27:28 AM
to Reid Kleckner, llvm-dev
Thanks for explaining!

— 
Mehdi

Mehdi Amini via llvm-dev

Jan 13, 2017, 1:29:41 AM
to Hal Finkel, llvm-dev
On Jan 12, 2017, at 5:02 PM, Hal Finkel <hfi...@anl.gov> wrote:

> I think that, unless we allow frontends to add their own intrinsics without recompiling LLVM, this severely restricts the usefulness of this feature. The motivation here is to support frontends inserting custom region annotations which are handled by custom passes the frontends inject into the (end of the) pipeline.

I’m not convinced that “building a frontend without recompiling LLVM while injecting custom passes” is a strong compelling use case, i.e. can you explain why requiring such frontends to rebuild LLVM is so limiting?

Thanks,

— 
Mehdi

Tian, Xinmin via llvm-dev

Jan 13, 2017, 1:59:10 AM
to David Majnemer, Hal Finkel, llvm...@lists.llvm.org, llvm-dev...@lists.llvm.org

Thank you all, David, Hongbin, Reid, Mehdi, Daniel, and Vikram, for your review and constructive feedback on this RFC. We will update our Clang FE patch to use the tokens and tags suggested by David, Hongbin, et al., instead of using metadata and function arguments for the IR-annotation intrinsic calls, to see how well that preserves all necessary information for our LLVM middle-end / back-end transformations. Going with the token-and-tag approach, the changes needed in our W-Region framework are relatively small as well.

Vikram, many points you made below are well taken. Hal and I had a long discussion at SC'16 on how to build a practical infrastructure for people to experiment with and study all the pros and cons of IR extensions for expressing parallelism, optimizing parallel code, and many other uses of directive/pragma information. Personally, I would agree that, eventually, the solution could be a combination of (b) and (d) when we go with a parallel IR, after the community finally agrees on what is most common to be represented as LLVM instructions.

Having said that, this RFC serves as the first step. The intrinsics we proposed are language-neutral, but the "tag" or "metadata" are specific to language constructs/directives/pragmas. We are expecting more and more feedback and discussion on this work. Thank you all again.

 

 


Mehdi Amini via llvm-dev

Jan 13, 2017, 2:06:40 AM
to Hal Finkel, llvm-dev


Something isn’t clear to me: how do you preserve the validity of the region annotations when regular passes don’t know about the attached semantics?

For example, suppose a region marks a loop as parallel from an OpenMP pragma, but a strength-reduction transformation introduces a loop-carried dependency and thus invalidates the “parallel” semantics?

Another issue is how much these intrinsics act as “barriers” for regular optimizations. For example, what prevents reordering a loop such that it is executed *before* the intrinsic that marks the beginning of the region?

I feel I missed a piece (but maybe I should start with the provided references?) :)


Mehdi

Tian, Xinmin via llvm-dev

Jan 13, 2017, 12:00:48 PM
to Mehdi Amini, Hal Finkel, llvm...@lists.llvm.org
Mehdi, thanks for the good questions.

>>>>>Something isn’t clear to me: how do you preserve the validity of the region annotations when regular passes don’t know about the attached semantics?

There are some small changes we have to make in some optimizations to make sure they do not violate the attached annotation semantics: 1) provide hand-shaking / query utils for optimizations to know that a region is a parallel loop; 2) set up a proper optimization phase ordering. In our product compiler ICC, we used both approaches.

>>>>>For example, suppose a region marks a loop as parallel from an OpenMP pragma, but a strength-reduction transformation introduces a loop-carried dependency and thus invalidates the “parallel” semantics?

Yes, there is a list of such cases, e.g. forward substitution, strength reduction, global constant propagation. Here is another example: under serial semantics you can do constant propagation, but under parallel semantics we can't. Under serial semantics the atomic update below executes once, so an optimizer could fold x to 700 after the block; under parallel semantics each of the four threads performs the update, so that folding is invalid. All these issues are considered.

int x = 100;

parallel num_threads(4)
{
  ....
  atomic {
    x = x + 600
  }
}

These issues exist already when you do IPO optimization across OpenCL or CUDA kernel functions, or functions outlined by the Clang FE.

>>>>>Another issue is how much these intrinsics act as “barriers” for regular optimizations? For example, what prevents reordering a loop such that it is executed *before* the intrinsic that marks the beginning of the region?

The Clang FE will need to set the "convergent" attribute on the intrinsic calls (call sites) based on the language construct semantics.

Thanks,
Xinmin



Mehdi Amini via llvm-dev

Jan 13, 2017, 12:16:55 PM
to Tian, Xinmin, llvm...@lists.llvm.org
Hi,

> On Jan 13, 2017, at 9:00 AM, Tian, Xinmin <xinmi...@intel.com> wrote:
>
> Mehdi, thanks for the good questions.
>
>>>>>> Something isn’t clear to me: how do you preserve the validity of the region annotations when regular passes don’t know about the attached semantics?
>
> There are some small changes we have to make in some optimizations to make sure they do not violate the attached annotation semantics.

I fear that this does not seem to play well with the original claim of the RFC about a “minimal impact” on existing passes.
Especially since Hal mentioned that “the motivation here is to support frontends inserting custom region annotations”, it is not clear whether we wouldn’t have to teach passes to treat the intrinsics as optimization barriers by default (which kind of defeats the whole point of this), and then teach passes about the semantics of each kind of region.
It may be possible to abstract some properties about regions, à la TTI, with hooks that the passes would query. But that seems like something that’d need a lot of scrutiny before being able to evaluate the viability of the design.


> 1) provide hand-shaking / query utils for optimizations to know that a region is a parallel loop; 2) set up a proper optimization phase ordering. In our product compiler ICC, we used both approaches.
>
>>>>>> For example, suppose a region marks a loop as parallel from an OpenMP pragma, but a strength-reduction transformation introduces a loop-carried dependency and thus invalidates the “parallel” semantics?
>
> Yes, there is a list of such cases, e.g. forward substitution, strength reduction, global constant propagation. Here is another example: under serial semantics you can do constant propagation, but under parallel semantics we can't. All these issues are considered.
>
> int x = 100;
>
> parallel num_threads(4)
> {
>   ....
>   atomic {
>     x = x + 600
>   }
> }
>
> These issues exist already when you do IPO optimization across OpenCL or CUDA kernel functions, or functions outlined by the Clang FE.

Right, but fortunately there are only a few passes to teach about IPO, and we already have a generic mechanism to inhibit IPO, which is not the case with peephole or other function passes.


>>>>>> Another issue is how much these intrinsics act as “barriers” for regular optimizations? For example, what prevents reordering a loop such that it is executed *before* the intrinsic that marks the beginning of the region?
>
> The Clang FE will need to set the "convergent" attribute on the intrinsic calls (call sites) based on the language construct semantics.

Convergent does not prevent reordering AFAIK:

convergent call llvm.region.begin("parallel.omp.for")
for (I : 0->N)
  a[I] = b[I] + c[I];
convergent call llvm.region.end("parallel.omp.for")

Can become:

for (I : 0->N)
  a[I] = b[I] + c[I];
convergent call llvm.region.begin("parallel.omp.for")
convergent call llvm.region.end("parallel.omp.for")


Mehdi

Hal Finkel via llvm-dev

Jan 13, 2017, 12:41:50 PM
to Mehdi Amini, llvm-dev
I don't understand your viewpoint. Many frontends either compose their own pass pipelines or use the existing extension-point mechanism. Some frontends, Chapel for example, can insert code using custom address spaces and then insert passes later to turn accesses using pointers to those address spaces into runtime calls. This is the kind of design we'd like to support, without forcing frontends to use custom versions of LLVM, but with annotated regions instead of just with address spaces.

 -Hal



Tian, Xinmin via llvm-dev

Jan 13, 2017, 12:45:17 PM
to mehdi...@apple.com, llvm...@lists.llvm.org
The "minimal" is the most desired goal. But, <1000 LOC changes is better than 5000 LOC changes, right?

For the RFC, the intrinsic comes with use-def info for a, b, c and memory read/write. So far, no reordering happens in your example with -O2 and -O3 without any changes in other passes; I am not saying we won't need to make any changes in the future.

convergent call llvm.region.begin("parallel.omp.for") (a, b, c) // either use arguments or a token with tags
for (I : 0->N)
  a[I] = b[I] + c[I];
convergent call llvm.region.end("parallel.omp.for")

A question back to you:

x[i] = ...
convergent call @llvm.barrier(..)

Did you see reordering happen that moves the barrier before x[i] = ...?

>>>>Right, but fortunately there are only a few passes to teach about IPO, and we already have a generic mechanism to inhibit IPO, which is not the case with peephole or other function passes.

Agreed, we will need changes in peephole or function passes. But it is reasonably manageable. In our product compiler, it is ~500 LOC overall.



Yonghong Yan via llvm-dev

Jan 13, 2017, 12:57:10 PM
to Tian, Xinmin, llvm...@lists.llvm.org, llvm-dev...@lists.llvm.org
I have been following the discussion, and this is something we have been looking for for years. I am glad that you have the patch that we can at least use. I do, however, have several comments/requests based on our experience:

1. The idea of creating a set of representations for language-neutral parallel constructs, which then allows extension with language-specific representations and passes, sounds very good, and it is definitely worth a try as a next step, I think. We did a survey of multiple threading programming interfaces (https://www.hpcwire.com/2015/03/02/a-comparison-of-heterogeneous-and-manycore-programming-models) and it is obvious that parallel interfaces (even including inter-node models such as PGAS and APGAS) share some common mechanisms for representing parallelism, data/affinity, synchronization, and mutual exclusion.

2. There are API calls or typical statements that are meant for parallelism but that the language-based IR extensions are not able to represent. E.g. pthread_create/join (or other runtime calls such as C++ thread/async, etc.) are fork/join parallelism. The frontend can be enhanced to recognize those calls and create/append PIR info for those calls. It would be nicer, however, if we had a meta approach, e.g. providing a file that tells the frontend that a parallel IR should be used for specific API calls.

3. SPMD divergence such as the following, or an if statement inside a vector loop body:

   if (omp_get_thread_num() == 4) { /* same for UPC or MPI inter-node/PGAS, as we use this to do different computation for each thread/proc */

   } else {

   }

Does your patch have support for that? Basically I can imagine it needs to add some metadata/tokens/tags to the branching IR.

Thank you!

Yonghong Yan
Assistant Professor
Department of Computer Science and Engineering
School of Engineering and Computer Science
Oakland University
Office: EC 534
Phone: 248-370-4087
Email: y...@oakland.edu

Daniel Berlin via llvm-dev

Jan 13, 2017, 1:01:32 PM
to Tian, Xinmin, llvm...@lists.llvm.org
On Fri, Jan 13, 2017 at 9:00 AM, Tian, Xinmin via llvm-dev <llvm...@lists.llvm.org> wrote:
> Mehdi, thanks for the good questions.
>
>>>>>> Something isn’t clear to me: how do you preserve the validity of the region annotations when regular passes don’t know about the attached semantics?
>
> There are some small changes we have to make in some optimizations to make sure they do not violate the attached annotation semantics: 1) provide hand-shaking / query utils for optimizations to know that a region is a parallel loop; 2) set up a proper optimization phase ordering. In our product compiler ICC, we used both approaches.


But this is very different from what you said earlier, because it's not minimal impact.

Also, what you've proposed are very generic annotations, and what you are talking about here is a very specific set of them, and their effects.

If you are assuming these intrinsics will only be used to implement a specific set of annotations, with specific semantics, I'm probably with Reid on the "please use specific constructs" bandwagon.

Otherwise, the regions could be and do anything, and the handshaking/querying has to be done everywhere, for everything, because you don't actually know what they are implementing.
Maybe my region annotation is for "don't PRE things when they have exactly 73 predecessors" regions.


Mehdi Amini via llvm-dev

Jan 13, 2017, 1:12:00 PM
to Hal Finkel, llvm-dev
I think we’re talking about two different things here: you mentioned originally “without recompiling LLVM”, which I don’t see as a major blocker, while now you’re clarifying, I think, that you’re more concerned about putting a requirement on a *custom* LLVM, as in “it wouldn’t work with the sources of a vanilla upstream LLVM”, which I agree is a different story.

That said, it extends the point from the other email (in parallel) about the semantics of the intrinsics: while your solution allows these frontends to reuse the intrinsics, it means that upstream optimizations have to treat such intrinsics as optimization barriers because their semantics are unknown.


— 
Mehdi

Tian, Xinmin via llvm-dev

Jan 13, 2017, 1:20:31 PM
to Yonghong Yan, llvm...@lists.llvm.org, llvm-dev...@lists.llvm.org

Yonghong,

 

Hal and I have been very careful about this RFC, given the long experience Hal has had with the community; as you can see, we positioned it as “experimental”, so we can add an infrastructure for people to use, improve, and extend over time.

 

For your API calls and SPMD divergence, our current implementation does not cover them; as the first step, we focused on language constructs like parallel for, simd, and cilk_for. We are open to any suggestions and proposals to cover your usage cases.

 

Thanks,

Xinmin

 


Tian, Xinmin via llvm-dev

Jan 13, 2017, 1:25:24 PM
to Daniel Berlin, llvm...@lists.llvm.org

>>>>If you are assuming these intrinsics will only be used to implement a specific set of annotations, with specific semantics, i'm probably with Reid on the "please use specific constructs" bandwagon.

 

I wouldn’t disagree with that part if these intrinsics end up being used for a specific set of annotations.

 


Hal Finkel via llvm-dev

Jan 17, 2017, 9:10:01 AM
to Adve, Vikram Sadanand, llvm-dev

On 01/13/2017 12:13 AM, Adve, Vikram Sadanand via llvm-dev wrote:
>> (d) Add a small number of LLVM intrinsics for region or loop annotations,
>> represent the directive/clause names using metadata and the remaining
>> information using arguments.
>>
>> Here we're proposing (d),
> I think this would serve the goal of communicating source-level directives and annotations down to LLVM passes and back-ends, while deferring inlining and allowing optimizations and code-generation for parallel code to happen more effectively at the IR level. Essentially, you’re tunneling language-specific information through the front-end, incorporating the new information in the IR using minimally invasive mechanisms, but the contents (captured in the metadata strings) are entirely language-specific, both in detailed syntax and semantics.
>
> However, as you also said, a parallel IR needs to be able to support a wide range of parallel languages, besides OpenMP. Two near-term examples I can think of are C++17 and Cilk (or Cilk++). I can imagine adding support for other existing languages like Halide and perhaps the PGAS languages like UPC. Many of the *parallel* optimizations and code-generation tasks are likely to be common for these different languages (e.g., optimizations of reductions, parallel loop fusion, or GPU code generation).
>
> Fundamentally, I think we should keep most of the parallel IR extensions as language-neutral as possible.

We obviously need to work out the details here, but one motivation is to
allow the same facility to both represent concepts common to many
programming models as well as programming-model-specific concepts. Also,
I'd like to be able to transition from programming-model-specific
representations (where I imagine most things will start) toward
abstracted concepts. The goal is to retain programming-model-specific
semantics while allowing the creation of transformations and analysis
which deal with abstract concepts. One way we might accomplish this is
by using both like this:

1. A frontend generates region annotations. A frontend like Clang
will generate (mostly) programming-model-specific region annotations.
Frontends for other languages might directly use the abstract concepts
for their region annotations.

2. During optimization, a transformation pass analyzes
programming-model-specific region annotations and, if legal, transforms
them into abstract-concept annotations. It might:

!"omp.barrier" -> !"llvm.parallel.barrier", !"openmp"

Such that the barrier is now a general concept that transformations
might understand (and, for example, eliminate redundant barriers). It is
tagged with !"openmp" so that in the end, should it survive, the concept
will be lowered using OpenMP.

3. During optimization, transformations optimize abstract-concept
annotations (i.e. eliminate redundant barriers, fuse parallel regions, etc.)

4. Later in the pipeline, programming-model specific code lowers
annotations for each programming model into concrete IR (i.e. runtime
function calls, etc.). For abstract concepts without a specific
programming-model tag, some default programming model is selected.

The programming-model-specific to abstract-concept translation in (2)
can sometimes be done on a syntactic basis alone (we already do this, in
fact, for atomics), but sometimes will require analysis that can be done
only after inlining/IPA (to make sure, for example, that the parallel
region does not contain certain classes of runtime-library calls). Plus,
this allows the translation logic to be shared easily by different
frontends.
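To make (2) concrete in the RFC's notation (the exact encoding of the
tag is, of course, open for discussion):

  ; before: programming-model-specific annotation
  call void @llvm.experimental.directive(metadata !0)
  !0 = metadata !{metadata !"omp.barrier"}

  ; after (2): abstract concept tagged with its originating model
  call void @llvm.experimental.directive(metadata !1)
  !1 = metadata !{metadata !"llvm.parallel.barrier", metadata !"openmp"}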

Thoughts?

-Hal

Wael Yehia via llvm-dev

Jan 17, 2017, 6:44:43 PM
to llvm...@lists.llvm.org
Hi. Regarding the token approach, I've read some documentation (review D11861, EH in LLVM, and Reid and David's presentation) but couldn't answer the following question.
Does the intrinsic or the instruction returning a token-type object act as a code-motion barrier? In other words, does it prevent other operations from being reordered with it?
If the answer is no, does that mean the intrinsic has to be marked with the convergent property to achieve the code-motion-barrier effect?
If yes, can someone explain how that is achieved in LLVM?
Greatly appreciated. Thanks.

Wael

Hal Finkel via llvm-dev

Jan 17, 2017, 7:36:52 PM
to Wael Yehia, llvm...@lists.llvm.org


On 01/17/2017 05:36 PM, Wael Yehia via llvm-dev wrote:
> Does the intrinsic or the instruction returning a token-type object act as a code-motion barrier? In other words, does it prevent other operations from being reordered with it? If the answer is no, does that mean the intrinsic has to be marked with the convergent property to achieve the code-motion-barrier effect?

To clarify, convergent does not form a code-motion barrier. It prevents transformations from adding additional control dependencies. I suspect that the token type does the same (because tokens can't appear in PHIs).

Actual code motion barriers are generally achieved by adding unknown memory dependencies.
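For instance, a minimal sketch (the marker function here is hypothetical;
the point is that a call with no memory attributes is conservatively assumed
to read and write any memory):

  declare void @region.marker() ; no readnone/readonly/argmemonly

  define void @f(i32* %p) {
    ; the store cannot be moved across the call, because the call may
    ; read or write %p as far as the optimizer knows
    call void @region.marker()
    store i32 1, i32* %p
    ret void
  }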

 -Hal


Mehdi Amini via llvm-dev

Jan 17, 2017, 7:59:41 PM
to Hal Finkel, llvm...@lists.llvm.org
On Jan 17, 2017, at 4:36 PM, Hal Finkel via llvm-dev <llvm...@lists.llvm.org> wrote:


> To clarify, convergent does not form a code-motion barrier. It prevents transformations from adding additional control dependencies. I suspect that the token type does the same (because tokens can't appear in PHIs).

What if the only user of the token is in a branch? It seems to me that you’d be allowed to sink the token producer there, but not if it is marked convergent.

— 
Mehdi

Hal Finkel via llvm-dev

Jan 17, 2017, 10:01:59 PM
to Mehdi Amini, llvm...@lists.llvm.org


On 01/17/2017 06:59 PM, Mehdi Amini wrote:

> What if the only user of the token is in a branch? It seems to me that you’d be allowed to sink the token producer there, but not if it is marked convergent.

Can that happen if you start out with a "well formed" region?

 -Hal

Adve, Vikram Sadanand via llvm-dev

Jan 18, 2017, 12:55:40 AM
to Hal Finkel, llvm-dev
> On Jan 17, 2017, at 8:09 AM, Hal Finkel <hfi...@anl.gov> wrote:
>
>
> On 01/13/2017 12:13 AM, Adve, Vikram Sadanand via llvm-dev wrote:
>>> (d) Add a small number of LLVM intrinsics for region or loop annotations,
>>> represent the directive/clause names using metadata and the remaining
>>> information using arguments.
>>> Here we're proposing (d),
>> I think this would serve the goal of communicating source-level directives and annotations down to LLVM passes and back-ends, while deferring inlining and allowing optimizations and code-generation for parallel code to happen more effectively at the IR level. Essentially, you’re tunneling language-specific information through the front-end, incorporating the new information in the IR using minimally invasive mechanisms, but the contents (captured in the metadata strings) are entirely language-specific, both in detailed syntax and semantics.
>>
>> However, as you also said, a parallel IR needs to be able to support a wide range of parallel languages, besides OpenMP. Two near-term examples I can think of are C++17 and Cilk (or Cilk++). I can imagine adding support for other existing languages like Halide and perhaps the PGAS languages like UPC. Many of the *parallel* optimizations and code-generation tasks are likely to be common for these different languages (e.g., optimizations of reductions, parallel loop fusion, or GPU code generation).
>>
>> Fundamentally, I think we should keep most of the parallel IR extensions as language-neutral as possible.
>
> We obviously need to work out the details here, but one motivation is to allow the same facility to both represent concepts common to many programming models as well as programming-model-specific concepts.

Yes, I agree. There will inevitably be programming-model-specific features that are not supported by any generic abstraction. The hope is that they will be few, especially for the "smaller" languages.


> Also, I'd like to be able to transition from programming-model-specific representations (where I imagine most things will start) toward abstracted concepts. The goal is to retain programming-model-specific semantics while allowing the creation of transformations and analysis which deal with abstract concepts. One way we might accomplish this is by using both like this:
>
> 1. A frontend generates region annotations. A frontend like Clang will generate (mostly) programming-model-specific region annotations. Frontends for other languages might directly use the abstract concepts for their region annotations.
>
> 2. During optimization, a transformation pass analyzes programming-model-specific region annotations and, if legal, transforms them into abstract-concept annotations. It might:
>
> !"omp.barrier" -> !"llvm.parallel.barrier", !"openmp"
>
> Such that the barrier is now a general concept that transformations might understand (and, for example, eliminate redundant barriers). It is tagged with !"openmp" do that in the end, should it survive, the concept will be lowered using OpenMP.


Yes, this is exactly what I have in mind too. We can discuss the details (what particular front-ends should generate directly; what back-end components can be shared even when doing programming-model-specific code generation), but this flow has many advantages.

Some specific goals I’d like to see are:
+ Have as many optimizations and back-end components as possible be driven by the programming-model-agnostic information and shared among multiple languages, i.e., minimize the need for passes that use the programming-model-specific information.
+ Allow concepts in different languages to be mixed and matched, to maximize performance (e.g., a work-stealing scheduler used with an OMP parallel loop; a static schedule used with a Cilk_for parallel loop; a SIMD directive and hints used with a Cilk_for loop; etc.).
+ In the same vein, allow optimization and code generation passes to leverage available features of run-time systems and target hardware, to maximize performance in similar ways.
+ (This is not a separate goal, but rather a strategy to enable the previous two goals.) Use the annotations to decouple front-ends and upstream auto-parallelization passes from optimizations and code generation, so that the optimizations and code generation phases "don’t care" what source language(s) or other mechanisms were used to parallelize code.
+ Allow a flexible parallel run-time system that can span multiple hardware targets, e.g., a pipeline that runs some pipeline stages on a shared memory multicore host and some on one or more GPUs.

I didn't explicitly spell out other goals like the ones in your original email, especially ensuring that standard optimization passes (const. prop., redundancy elim., strength reduction, etc.) continue to be as effective as possible while minimizing the need to rewrite them to respect parallel semantics. E.g., avoiding outlining in the front-end is likely to be an important requirement.



>
> 3. During optimization, transformations optimize abstract-concept annotations (i.e. eliminate redundant barriers, fuse parallel regions, etc.)
>
> 4. Later in the pipeline, programming-model specific code lowers annotations for each programming model into concrete IR (i.e. runtime function calls, etc.). For abstract concepts without a specific programming-model tag, some default programming model is selected.

For code with programming-model specific tags, it may still be possible to map into a more general run-time. See examples above.


> The programming-model-specific to abstract-concept translation in (2) can sometimes be done on a syntactic basis alone (we already do this, in fact, for atomics), but sometimes will require analysis that can be done only after inlining/IPA (to make sure, for example, that the parallel region does not contain certain classes of runtime-library calls). Plus, this allows the translation logic to be shared easily by different frontends.
>
> Thoughts?

I generally agree. My main additional point (perhaps also what you had in mind) is that we should aim to maximize flexibility in the opts. and code-gen passes, while minimizing the dependence on programming-model-specific semantics.

Mehdi Amini via llvm-dev

Jan 18, 2017, 1:08:28 AM
to Hal Finkel, llvm...@lists.llvm.org
On Jan 17, 2017, at 7:01 PM, Hal Finkel <hfi...@anl.gov> wrote:


On 01/17/2017 06:59 PM, Mehdi Amini wrote:

On Jan 17, 2017, at 4:36 PM, Hal Finkel via llvm-dev <llvm...@lists.llvm.org> wrote:


On 01/17/2017 05:36 PM, Wael Yehia via llvm-dev wrote:
Hi. Regarding the token approach, I've read some documentation (review D11861, EH in llvm, and Reid and David's presentation) but couldn't answer the following question.
Does the intrinsic or the instruction returning a token type object act as a code motion barrier? In other words, does it prevent other operations from being reordered with it?
If the answer is no, then does it mean the intrinsic has to be marked with the convergent property to achieve the code motion barrier effect?

To clarify, convergent does not form a code-motion barrier. It prevents transformations from adding additional control dependencies. I suspect that the token type does the same (because tokens can't appear in PHIs).

What if the only user of the token is in a branch? It seems to me that you’d be allowed to sink the token producer there, but not if it is marked convergent.

Can that happen if you start out with a "well formed" region?

LangRef seems to allow it; there might not be any users of tokens that would actually do that, though?
I haven't given it much thought; I just did a quick check of your suspected equivalence on this aspect :)

Tian, Xinmin via llvm-dev

Jan 18, 2017, 1:41:02 AM
to Adve, Vikram Sadanand, Hal Finkel, llvm...@lists.llvm.org
Some specific goals I’d like to see are:
+ Have as many optimizations and back-end components as possible be driven by the programming-model-agnostic information and shared among multiple languages, i.e., minimize the need for passes that use the programming-model-specific information.
+ Allow concepts in different languages be mixed and matched, to maximize performance, e.g., a work-stealing scheduler used with an OMP parallel loop; a static schedule be used with a Cilk_for parallel loop; an SIMD directive and hints used with a Cilk_for loop; etc.).
+ In the same vein, allow optimization and code generation passes to leverage available features of run-time systems and target hardware, to maximize performance in similar ways.

>>>>Yes, these goals align with major features we have supported in the Intel compilers and with the LLVM IR and compiler development we are targeting.

+ Allow a flexible parallel run-time system that can span multiple hardware targets, e.g., a pipeline that runs some pipeline stages on a shared memory multicore host and some on one or more GPUs.

>>>>At this point, that sounds like a very good research topic / project.

Xinmin

-----Original Message-----
From: llvm-dev [mailto:llvm-dev...@lists.llvm.org] On Behalf Of Adve, Vikram Sadanand via llvm-dev
Sent: Tuesday, January 17, 2017 9:56 PM
To: Hal Finkel <hfi...@anl.gov>
Cc: llvm-dev <llvm...@lists.llvm.org>
>>> --------+----------------------+-----------------------------------------------
>>> Options | Pros                 | Cons
>>> --------+----------------------+-----------------------------------------------
>>> (a)     | No need to add new   | LLVM passes do not always maintain metadata.
>>>         | instructions or      | Need to educate many passes (if not all) to
>>>         | new intrinsics       | understand and handle them.
>>> --------+----------------------+-----------------------------------------------
>>> (b)     | Parallelism becomes  | Huge effort for extending all LLVM passes and
>>>         | first class citizen  | code generation to support new instructions.
>>>         |                      | A large set of information still needs to be
>>>         |                      | represented using other means.
>>> --------+----------------------+-----------------------------------------------
>>> (c)     | Less impact on the   | A large number of intrinsics must be added.
>>>         | existing LLVM passes.| Some of the optimizations need to be
>>>         | Fewer requirements   | educated to understand them.
>>>         | for passes to        |
>>>         | maintain metadata.   |
>>> --------+----------------------+-----------------------------------------------
>>> (d)     | Minimal impact on    | Some of the optimizations need to be
>>>         | existing LLVM        | educated to understand them.
>>>         | optimization passes. | No requirements for all passes to maintain a
>>>         | Directive and clause | large set of metadata with values.
>>>         | names use metadata   |
>>>         | strings.             |
>>> --------+----------------------+-----------------------------------------------

Johannes Doerfert via llvm-dev

Jan 19, 2017, 10:33:09 AM
to Hal Finkel, llvm-dev, Simon Moll
Hi Hal, Hi Xinmin,

First, let me thank you for pushing in this direction; it means more
people are interested in some kind of change here.

While "our" RFC will be sent out next week, I want to comment on a specific
point of this one right now:

> [...]
> (b) Add several new LLVM instructions such as, for parallelism, fork, spawn,
> join, barrier, etc.
> [...]

For me, fork and spawn serve the same purpose; most new schemes suggest three new instructions in total.
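
(For reference, Tapir's three are detach, reattach, and sync; roughly, and eliding details of the exact terminator spelling:)

entry:
  detach label %body, label %cont   ; fork: run %body as a task, fall through to %cont
body:
  call void @work()
  reattach label %cont              ; end of the spawned task
cont:
  sync label %exit                  ; join: wait for tasks detached in this region
exit:
  ret void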

> Options | Pros | Cons
> [...]
> (b) | Parallelism becomes | Huge effort for extending all LLVM passes and
> | first class citizen | code generation to support new instructions.
> | | A large set of information still needs to be
> | | represented using other means.
> [...]

I am especially curious where you get your data from. Tapir [0] (and to
some degree PIR [1]) have shown that, counterintuitively, only a few changes
to LLVM passes are needed. Tapir was recently used in an MIT class with a
lot of students and it seemed to work well with only minimal changes
to analysis and especially transformation passes.

Also, the "code generation" issues you mention do, in my opinion, apply to
_all_ proposed schemes, as all have to be lowered to either sequential
code or parallel runtime-library calls eventually. The sequentialization
for our new Tapir/PIR hybrid takes less than 50 lines, and generating
parallel runtime calls will probably be similar in all schemes anyway.

Regarding the last point, "A large set of information still needs to be
represented using other means", I am curious why this is a bad thing. I
think IR should be simple: each instruction/intrinsic etc. should have
clear and minimal semantics. Also, I think we already have a lot of
simple constructs in the IR to express high-level information properly,
e.g., 'atomicrmw' instructions for high-level reductions. While we currently
lack analysis passes to extract information from such low-level
representations, these are certainly possible [2,3]. I would argue that such
analyses are a better way to do things than placing "high-level
intrinsics" in the IR to mark things like reductions.
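
For instance, a front-end could encode the body of a parallel reduction loop with plain IR like this (sketch):

  ; inside the parallel loop body: sum += f(i)
  %v = call i32 @f(i32 %i)
  %old = atomicrmw add i32* %sum, i32 %v seq_cst

A later analysis can recognize the atomicrmw over %sum as a reduction and, e.g., rewrite it to use privatized partial sums.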

Cheers,
Johannes, on behalf of the Tapir and PIR team

[0] https://cpc2016.infor.uva.es/wp-content/uploads/2016/06/CPC2016_paper_12.pdf
[1] http://compilers.cs.uni-saarland.de/people/doerfert/parallelcfg.pdf
[2] Section 3 in https://arxiv.org/pdf/1505.07716
[3] Section 3.2 and 3.3 in https://www.st.cs.uni-saarland.de/publications/files/streit-taco-2015.pdf
> --------+----------------------+-----------------------------------------------
> Options | Pros                 | Cons
> --------+----------------------+-----------------------------------------------
> (a)     | No need to add new   | LLVM passes do not always maintain metadata.
>         | instructions or      | Need to educate many passes (if not all) to
>         | new intrinsics       | understand and handle them.
> --------+----------------------+-----------------------------------------------
> (b)     | Parallelism becomes  | Huge effort for extending all LLVM passes and
>         | first class citizen  | code generation to support new instructions.
>         |                      | A large set of information still needs to be
>         |                      | represented using other means.
> --------+----------------------+-----------------------------------------------
> (c)     | Less impact on the   | A large number of intrinsics must be added.
>         | existing LLVM passes.| Some of the optimizations need to be
>         | Fewer requirements   | educated to understand them.
>         | for passes to        |
>         | maintain metadata.   |
> --------+----------------------+-----------------------------------------------
> (d)     | Minimal impact on    | Some of the optimizations need to be
>         | existing LLVM        | educated to understand them.
>         | optimization passes. | No requirements for all passes to maintain a
>         | Directive and clause | large set of metadata with values.
>         | names use metadata   |
>         | strings.             |
> --------+----------------------+-----------------------------------------------
Johannes Doerfert
Researcher / PhD Student

Compiler Design Lab (Prof. Hack)
Saarland Informatics Campus, Germany
Building E1.3, Room 4.31

Tel. +49 (0)681 302-57521 : doer...@cs.uni-saarland.de
Fax. +49 (0)681 302-3065 : http://www.cdl.uni-saarland.de/people/doerfert

Adve, Vikram Sadanand via llvm-dev

Jan 19, 2017, 2:37:09 PM
to llvm-dev, llvm-dev...@lists.llvm.org
Hi Johannes,

> I am especially curious where you get your data from. Tapir [0] (and to
> some degree PIR [1]) have shown that, counterintuitively, only a few changes
> to LLVM passes are needed. Tapir was recently used in an MIT class with a
> lot of students and it seemed to work well with only minimal changes
> to analysis and especially transformation passes.

TAPIR is an elegant, small extension and, in particular, I think the idea of asymmetric parallel tasks and control flow is a clever way to express parallelism with serial semantics, as in Cilk. Encoding the control flow extensions as explicit instructions is orthogonal to that, though arguably more elegant than using region tags + metadata.

However, Cilk is a tiny language compared with the full complexity of other languages, like OpenMP. To take just one example, TAPIR cannot express the ORDERED construct of OpenMP. A more serious concern, IMO, is that TAPIR (like Cilk) requires serial semantics, whereas there are many parallel languages, OpenMP included, that do not obey that restriction. Third, OpenMP has *numerous* clauses, e.g., REDUCTION or PRIVATE, that are needed because without that, you’d be dependent on fundamentally hard compiler analyses to extract the same information for satisfactory parallel performance; realistic applications cannot depend on the success of such analyses. For example, automatic array privatization in parallel loops is a very hard problem (automatic scalar privatization is easier, but even that is interprocedural). Reduction recognition is doable for common cases, but there are hard cases here as well. These are all standard features of parallel programs, not specific to OpenMP (e.g., C++17 parallel template operators are likely to produce these as well).
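
For concreteness, this is the kind of clause I mean (standard OpenMP); the directive states outright what a compiler would otherwise have to prove:

  double sum = 0.0, tmp;
  #pragma omp parallel for reduction(+:sum) private(tmp)
  for (int i = 0; i < n; i++) {
    tmp = f(a[i]);
    sum += tmp;
  }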

If you support all these capabilities in the IR, a *lot* more than 6000 LOC (Tapir’s statistic; I’m sorry I don’t recall the number for PIR) would probably have to be modified in LLVM.


>> [...]
>> (b) Add several new LLVM instructions such as, for parallelism, fork, spawn,
>> join, barrier, etc.
>> [...]
>
> For me, fork and spawn serve the same purpose; most new schemes suggest three new instructions in total.

A reasonable question is whether to use (#b) first-class instructions for some features, *in combination with* (#d) — i.e., region markers + metadata — or to use #d exclusively. There are almost certainly far too many essential features in parallel programs to capture them all as new instructions. I don’t see a need to answer this question on Day 1. Instead, we can begin with regions and metadata annotations, and then “promote” a few features to first-class instructions if the benefit is justified.

Does that make sense?


> Also I think we already have a lot of
> simple constructs in the IR to express high-level information properly,
> e.g. 'atomicrmw' instructions for high-level reduction. While we currently
> lack analysis passes to extract information from such low-level
> representations, these are certainly possible [2,3]. I would argue that such
> analyses are a better way to do things than placing "high-level intrinsics" in the IR to mark things like reductions.

IMO, this is too fragile for ensuring performance for realistic applications. While there is a lot of work on reduction recognition in the literature going back to the 1980s, there will always be cases where these algorithms miss a reduction pattern and you get terrible parallel performance. Second, I’m not convinced that you can use specific operations like “atomicrmw” to find candidate loop nests in the first place that *might* be reductions, because the atomic op may in fact be in a library operation linked in as binary code. Third, this greatly increases the complexity of a “first working prototype” for a language like OpenMP because you can’t exclude REDUCTION clauses. Fourth, I believe array privatization is even harder; more generally, you cannot rely on heroic compiler optimizations to take the place of explicit language directives.

I may be showing my age, but I co-wrote a full HPF-to-MPI source-to-source compiler in the late 1990s at Rice University with some quite sophisticated optimizations [1,2], and we and others in the HPF compiler community — both product and research teams — encountered many cases where compiler analyses were inadequate. There is a credible view that HPF failed despite very high hopes and significant industry and research investment because it relied too much on heroic compiler technology.

[1] http://vadve.web.engr.illinois.edu/Papers/set_framework.pldi98.ps (PLDI 1998)
[2] http://vadve.web.engr.illinois.edu/Papers/SC98.NASBenchmarksStudy/dhpfsc98.pdf (SC 1998)

--Vikram

// Vikram S. Adve
// Professor, Department of Computer Science
// University of Illinois at Urbana-Champaign
// va...@illinois.edu
// http://llvm.org



> On Jan 19, 2017, at 11:20 AM, via llvm-dev <llvm...@lists.llvm.org> wrote:
>
> Date: Thu, 19 Jan 2017 16:30:51 +0100
> From: Johannes Doerfert via llvm-dev <llvm...@lists.llvm.org>
> To: Hal Finkel <hfi...@anl.gov>
> Cc: llvm-dev <llvm...@lists.llvm.org>, Simon Moll
> <mo...@cs.uni-saarland.de>
> Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations

Mehdi Amini via llvm-dev

Jan 19, 2017, 2:46:12 PM
to Adve, Vikram Sadanand, llvm-dev, llvm-dev...@lists.llvm.org

> On Jan 19, 2017, at 11:36 AM, Adve, Vikram Sadanand via llvm-dev <llvm...@lists.llvm.org> wrote:
>
> Hi Johannes,
>
>> I am especially curious where you get your data from. Tapir [0] (and to
>> some degree PIR [1]) have shown that, counterintuitively, only a few changes
>> to LLVM passes are needed. Tapir was recently used in an MIT class with a
>> lot of students and it seemed to work well with only minimal changes
>> to analysis and especially transformation passes.
>
> TAPIR is an elegant, small extension and, in particular, I think the idea of asymmetric parallel tasks and control flow is a clever way to express parallelism with serial semantics, as in Cilk. Encoding the control flow extensions as explicit instructions is orthogonal to that, though arguably more elegant than using region tags + metadata.
>
> However, Cilk is a tiny language compared with the full complexity of other languages, like OpenMP. To take just one example, TAPIR cannot express the ORDERED construct of OpenMP. A more serious concern, IMO, is that TAPIR (like Cilk) requires serial semantics, whereas there are many parallel languages, OpenMP included, that do not obey that restriction. Third, OpenMP has *numerous* clauses, e.g., REDUCTION or PRIVATE, that are needed because without that, you’d be dependent on fundamentally hard compiler analyses to extract the same information for satisfactory parallel performance; realistic applications cannot depend on the success of such analyses.

I agree with this, but I’m also wondering if it needs to be first class in the IR?
For example we know our alias analysis is very basic, and C/C++ have a higher constraint thanks to their type system, but we didn’t inject this higher level information that helps the optimizer as first class IR constructs.
I wonder if the same wouldn’t apply to an openmp reduction clause for instance, where you could use the “basic” IR construct and the analysis would use a metadata emitted by the frontend instead of trying to infer the reduction.
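
Concretely, something like this (the metadata name and schema are purely hypothetical):

  %old = atomicrmw add i32* %sum, i32 %v seq_cst, !reduction !0
  ...
  !0 = !{!"add"}   ; frontend-asserted: this is a '+' reduction over %sum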

Just a thought; I haven't given much time to studying other constructs and how they map to the IR :)


> For example, automatic array privatization in parallel loops is a very hard problem (automatic scalar privatization is easier, but even that is interprocedural). Reduction recognition is doable for common cases, but there are hard cases here as well. These are all standard features of parallel programs, not specific to OpenMP (e.g., C++17 parallel template operators are likely to produce these as well).
>
> If you support all these capabilities in the IR, a *lot* more than 6000 LOC (Tapir’s statistic; I’m sorry I don’t recall the number for PIR) would probably have to be modified in LLVM.
>
>
>>> [...]
>>> (b) Add several new LLVM instructions such as, for parallelism, fork, spawn,
>>> join, barrier, etc.
>>> [...]
>>
>> For me, fork and spawn serve the same purpose; most new schemes suggest three new instructions in total.
>
> A reasonable question is whether to use (#b) first-class instructions for some features, *in combination with* (#d) — i.e., region markers + metadata — or to use #d exclusively. There are almost certainly far too many essential features in parallel programs to capture them all as new instructions. I don’t see a need to answer this question on Day 1. Instead, we can begin with regions and metadata annotations, and then “promote” a few features to first-class instructions if the benefit is justified.
>
> Does that make sense?

Now that I read this, I wonder if it isn’t close to what I tried to express above :)


Mehdi

Daniel Berlin via llvm-dev

Jan 19, 2017, 3:04:40 PM
to Mehdi Amini, llvm-dev, llvm-dev...@lists.llvm.org, Adve, Vikram Sadanand
On Thu, Jan 19, 2017 at 11:46 AM, Mehdi Amini via llvm-dev <llvm...@lists.llvm.org> wrote:

> On Jan 19, 2017, at 11:36 AM, Adve, Vikram Sadanand via llvm-dev <llvm...@lists.llvm.org> wrote:
>
> Hi Johannes,
>
>> I am especially curious where you get your data from. Tapir [0] (and to
>> some degree PIR [1]) have shown that, counterintuitively, only a few changes
>> to LLVM passes are needed. Tapir was recently used in an MIT class with a
>> lot of students and it seemed to work well with only minimal changes
>> to analysis and especially transformation passes.
>
> TAPIR is an elegant, small extension and, in particular, I think the idea of asymmetric parallel tasks and control flow is a clever way to express parallelism with serial semantics, as in Cilk.  Encoding the control flow extensions as explicit instructions is orthogonal to that, though arguably more elegant than using region tags + metadata.
>
> However, Cilk is a tiny language compared with the full complexity of other languages, like OpenMP.  To take just one example, TAPIR cannot express the ORDERED construct of OpenMP.  A more serious concern, IMO, is that TAPIR (like Cilk) requires serial semantics, whereas there are many parallel languages, OpenMP included, that do not obey that restriction.  Third, OpenMP has *numerous* clauses, e.g., REDUCTION or PRIVATE,  that are needed because without that, you’d be dependent on fundamentally hard compiler analyses to extract the same information for satisfactory parallel performance; realistic applications cannot depend on the success of such analyses.

I agree with this, but I’m also wondering if it needs to be first class in the IR?
For example we know our alias analysis is very basic, and C/C++ have a higher constraint thanks to their type system, but we didn’t inject this higher level information that helps the optimizer as first class IR constructs.

FWIW, while I agree with the general point, I wouldn't use this example.
Because we pretty much still suffer to this day because of it (both in AA, and devirt, and ...) :)
We can't always even tell fields apart.


Adve, Vikram Sadanand via llvm-dev

Jan 19, 2017, 3:37:55 PM
to Mehdi Amini, llvm-dev, llvm-dev...@lists.llvm.org
> On Jan 19, 2017, at 1:46 PM, Mehdi Amini <mehdi...@apple.com> wrote:
>
>>
>> On Jan 19, 2017, at 11:36 AM, Adve, Vikram Sadanand via llvm-dev <llvm...@lists.llvm.org> wrote:
>>
<snip>
>> Third, OpenMP has *numerous* clauses, e.g., REDUCTION or PRIVATE, that are needed because without that, you’d be dependent on fundamentally hard compiler analyses to extract the same information for satisfactory parallel performance; realistic applications cannot depend on the success of such analyses.
>
> I agree with this, but I’m also wondering if it needs to be first class in the IR?
> For example we know our alias analysis is very basic, and C/C++ have a higher constraint thanks to their type system, but we didn’t inject this higher level information that helps the optimizer as first class IR constructs.
> I wonder if the same wouldn’t apply to an openmp reduction clause for instance, where you could use the “basic” IR construct and the analysis would use a metadata emitted by the frontend instead of trying to infer the reduction.


As Danny said (next message), lack of strong AA does limit some of our optimizations today. Even beyond that, however, parallel performance is a different beast from scalar optimizations. If you don’t devirtualize a subset of indirect calls, you may take a few % hit. If you fail to recognize a reduction and run the loop serially instead of in parallel, you could easily take a 10x or greater slowdown on that loop, leading to a substantial slowdown overall, depending on the importance of the reduction.

Put another way, individual parallel optimizations (whether done manually or by a compiler) have far larger performance effects, generally speaking, than individual scalar optimizations.


>
> Just a thought; I haven't given much time to studying other constructs and how they map to the IR :)
>
>>>> [...]
>>>> (b) Add several new LLVM instructions such as, for parallelism, fork, spawn,
>>>> join, barrier, etc.
>>>> [...]
>>>
>>> For me, fork and spawn serve the same purpose; most new schemes suggest three new instructions in total.
>>
>> A reasonable question is whether to use (#b) first-class instructions for some features, *in combination with* (#d) — i.e., region markers + metadata — or to use #d exclusively. There are almost certainly far too many essential features in parallel programs to capture them all as new instructions. I don’t see a need to answer this question on Day 1. Instead, we can begin with regions and metadata annotations, and then “promote” a few features to first-class instructions if the benefit is justified.
>>
>> Does that make sense?
>
> Now that I read this, I wonder if it isn’t close to what I tried to express above :)


:-).

Mehdi Amini via llvm-dev

Jan 19, 2017, 4:12:28 PM
to Daniel Berlin, llvm-dev, llvm-dev...@lists.llvm.org, Adve, Vikram Sadanand
Is it inherent to the infrastructure, i.e., using metadata instead of first-class IR constructs, or is it just a "quality of implementation" issue?

— 
Mehdi



Daniel Berlin via llvm-dev

Jan 19, 2017, 4:32:39 PM
to Mehdi Amini, llvm-dev, llvm-dev...@lists.llvm.org, Adve, Vikram Sadanand
Not to derail this conversation:

IMHO, at some point there is no real difference :)

Because otherwise, everything is a QOI issue.

I.e., if it's super tricky to get metadata that works well and works right, doesn't get lost, etc., and that's inherent to using metadata, then to me that is not a QOI issue.

So could it be done with metadata? Probably.
But at the same time, if it had been done with more first-class constructs, it would have happened years ago and been much lower cost.

So it's all tradeoffs.



Mehdi Amini via llvm-dev

Jan 19, 2017, 4:36:39 PM
to Daniel Berlin, llvm-dev, llvm-dev...@lists.llvm.org, Adve, Vikram Sadanand
This is what I meant by “inherent to the infrastructure”, thanks for clarifying.

— 
Mehdi


Hal Finkel via llvm-dev

Jan 19, 2017, 11:29:14 PM
to Mehdi Amini, Daniel Berlin, llvm-dev, llvm-dev...@lists.llvm.org, Adve, Vikram Sadanand
To clarify, we were proposing metadata that is used as arguments to the region-annotation intrinsics. This metadata has the nice property that it does not get dropped (so it is just being used as a way of encoding whatever data structures are necessary without predefining a syntactic schema).

 -Hal


— 
Mehdi





Sanjoy Das via llvm-dev

Jan 20, 2017, 12:03:08 AM
to Hal Finkel, llvm-dev, llvm-dev...@lists.llvm.org, Adve, Vikram Sadanand
Hi,

My bias is to use both (b) and (d), since they have complementary
strengths. We should use (b) for expressing concepts that can't be
semantically modeled as a call or invoke (this branch takes both its
successors), and (d) for expressing things that can be (this call may
never return), and annotation-like things (this region (denoted by the
def-use of a token) is a reduction).
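
For the annotation-like case, I'm thinking of something like this (hypothetical spelling; the point is only that the region is delimited by the token's def and use):

  %r = call token @llvm.experimental.region.start(metadata !"reduction")
  ; ... code updating %sum ...
  call void @llvm.experimental.region.end(token %r)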

I don't grok OpenMP, but perhaps we can come up with one or two
"generalized control flow"-type instructions that can be used to model
the non-call/invoke-like semantics we'd like LLVM to know about, and
model the rest with (d)?

-- Sanjoy

On Thu, Jan 19, 2017 at 8:28 PM, Hal Finkel via llvm-dev

Adve, Vikram Sadanand via llvm-dev

Jan 20, 2017, 12:27:21 AM
to Sanjoy Das, llvm-dev, llvm-dev...@lists.llvm.org
Hi Sanjoy,

Yes, that's exactly what we have been looking at recently here, but the region tags seem to make it possible to express the control flow as well, so I think we could start with regions+metadata, as Hal and Xinmin proposed, and then figure out what needs to be first-class instructions.

--Vikram Adve

Sanjoy Das via llvm-dev

Jan 20, 2017, 1:13:24 AM
to Adve, Vikram Sadanand, llvm-dev, llvm-dev...@lists.llvm.org
Hi Vikram,

On Thu, Jan 19, 2017 at 9:27 PM, Adve, Vikram Sadanand
<va...@illinois.edu> wrote:
> Hi Sanjoy,
>
> Yes, that's exactly what we have been looking at recently here, but
> the region tags seem to make it possible to express the control flow
> as well, so I think we could start with regions+metadata, as Hal and

I'm not yet convinced that region tags are sufficient to model exotic
control flow.

(I don't know OpenMP so this is a copy-pasted-edited example)

Say we have:

void main() {
  #pragma omp parallel num_threads(4)
  {
    int i = omp_get_thread_num();
    int val;
    compute_something_into_val(&val, i);
    a[i] = val;
  }
}

I presume the (eventual) intended lowering is something like this (if
the intended lowering is different from this and avoids the issue I'm
trying to highlight, then my point is moot):

void main() {
  tok = llvm.experimental.intrinsic_a();

  int i = omp_get_thread_num();
  i32* val = alloca i32
  compute_something_into_val(val, i);
  a[i] = val;

  llvm.experimental.intrinsic_b(tok);
}

However, LLVM is free to hoist the alloca to the entry block:

void main() {
  i32* val = alloca i32
  tok = llvm.experimental.intrinsic_a();

  int i = omp_get_thread_num();
  compute_something_into_val(val, i);
  a[i] = val;

  llvm.experimental.intrinsic_b(tok);
}

and now you have a race between the four parallel forks.

The problem here is that nothing in the IR expresses that we have four
copies of the region running "at the same time". In fact, such a
control flow is alien to LLVM today.

For instance, another evil optimization may turn:

int main() {
  int a[4];
  #pragma omp parallel num_threads(4)
  {
    int i = omp_get_thread_num();
    int val = compute_something_into_val(i);
    a[i] = val;
  }

  return a[0] + a[1];
}

to

int main() {
  int a[4];
  #pragma omp parallel num_threads(4)
  {
    int i = omp_get_thread_num();
    int val = compute_something_into_val(i);
    a[i] = val;
  }

  return undef;
}

since a[i] = val could have initialized at most one element in a.

Now you could say that the llvm.experimental.intrinsic_a and
llvm.experimental.intrinsic_b intrinsics are magic, and even such
"obvious" optimizations are not allowed to happen across them; but then
calls to these intrinsics are pretty fundamentally different from
"normal" calls, and are probably best modeled as new instructions.
You're going to have to do the same kind of auditing of passes either
way, and the only extra cost of a new instruction is the extra bitcode
reading/writing code.

I hope I made sense.

-- Sanjoy

Johannes Doerfert via llvm-dev

Jan 20, 2017, 7:46:22 AM
to Adve, Vikram Sadanand, llvm-dev, llvm-dev...@lists.llvm.org
Hi Vikram,

Thanks for the elaborate answer. I will try to be brief with my comment.

On 01/19, Adve, Vikram Sadanand via llvm-dev wrote:
> Hi Johannes,
>
> > I am especially curious where you get your data from. Tapir [0] (and to
> > some degree PIR [1]) have shown that, counterintuitively, only a few changes
> > to LLVM passes are needed. Tapir was recently used in an MIT class with a
> > lot of students and it seemed to work well with only minimal changes
> > to analysis and especially transformation passes.
>
> TAPIR is an elegant, small extension and, in particular, I think the
> idea of asymmetric parallel tasks and control flow is a clever way to
> express parallelism with serial semantics, as in Cilk. Encoding the
> control flow extensions as explicit instructions is orthogonal to
> that, though arguably more elegant than using region tags + metadata.

I agree that any form of first class parallelism seems more elegant than
region tags + metadata. Choosing a clever structure for the parallel
constructs can additionally save a lot of work wrt. existing passes,
e.g., the dominance relationship will prevent most illegal
transformations in our proposed fork-join scheme.

> However, Cilk is a tiny language compared with the full complexity of
> other languages, like OpenMP. To take just one example, TAPIR cannot
> express the ORDERED construct of OpenMP.

As with reductions, ORDERED is something we might want to express with
low-level constructs we already have, e.g., atomics. Alternatively, the
fork instructions can be extended with "attributes" to express more
elaborate schemes, e.g., the "lockstep" annotation we introduced for the
PIR white paper.

> A more serious concern, IMO, is that TAPIR (like Cilk) requires serial
> semantics, whereas there are many parallel languages, OpenMP included,
> that do not obey that restriction.

Serial semantics are good to have but do not need to be enforced. In PIR
we used a "force" attribute to force parallel semantics and disable
potential serialization, though OpenMP does not necessarily enforce
parallelism.

> Third, OpenMP has *numerous* clauses, e.g., REDUCTION or PRIVATE,
> that are needed because without that, you’d be dependent on
> fundamentally hard compiler analyses to extract the same information
> for satisfactory parallel performance; realistic applications cannot
> depend on the success of such analyses. For example, automatic array
> privatization in parallel loops is a very hard problem (automatic
> scalar privatization is easier, but even that is interprocedural).
> Reduction recognition is doable for common cases, but there are hard
> cases here as well. These are all standard features of parallel
> programs, not specific to OpenMP (e.g., C++17 parallel template
> operators are likely to produce these as well).

In the end it boils down to how much of this information we want/need to
encode as special cases (intrinsics/metadata) in the IR and what
information should be analyzed from constructs built with common IR.
Additionally, one has to consider which is easier to deal with and
optimize in the existing passes. In the PIR white paper we showed how
OpenMP (FIRST)PRIVATE clauses can be expressed using only well-placed
alloca instructions. From there, analysis is not as hard as automatic
parallelization.
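
Roughly, in the spirit of that encoding (pseudo-IR; the fork/join spelling is illustrative):

  %shared = alloca i32          ; outside the region: one instance, shared
  fork ...
    %priv = alloca i32          ; inside the region: one instance per task -> PRIVATE
    %init = load i32, i32* %shared
    store i32 %init, i32* %priv ; copy-in of the incoming value -> FIRSTPRIVATE
  join ...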

> If you support all these capabilities in the IR, a *lot* more than
> 6000 LOC (Tapir’s statistic; I’m sorry I don’t recall the number for
> PIR) would probably have to be modified in LLVM.

All in all, TAPIR was implemented in 5k LOC. However, >1k of that is
explicit parallel optimization, 1k adds the new instructions (thus
mechanical), 2k lowers the parallelism (thus needed for basically all
schemes), and the rest is needed to make it work with existing analysis
and transformation passes. To this end I do not see how any other
scheme would require much less.

> >> [...] (b) Add several new LLVM instructions such as, for
> >> parallelism, fork, spawn, join, barrier, etc. [...]
> >
> > For me, fork and spawn serve the same purpose; most new schemes
> > suggest three new instructions in total.
>
> A reasonable question is whether to use (#b) first-class instructions
> for some features, *in combination with* (#d) — i.e., region markers +
> metadata — or to use #d exclusively. There are almost certainly far
> too many essential features in parallel programs to capture them all
> as new instructions. I don’t see a need to answer this question on Day
> 1. Instead, we can begin with regions and metadata annotations, and
> then “promote” a few features to first-class instructions if the
> benefit is justified.
>
> Does that make sense?

Given parallelism as a first-class citizen we can still add new features
in various ways, including intrinsics or metadata. However, if we walk
down a path it is hard to change the direction completely, especially
since region annotations and fork-join parallel regions do serve
overlapping needs.

> > Also I think we already have a lot of simple constructs in the IR to
> > express high-level information properly, e.g. 'atomicrmw'
> > instructions for high-level reduction. While we currently lack
> > analysis passes to extract information from such low-level
> > representations, these are certainly possible [2,3]. I would argue
> > that such analyses are a better way to do things than placing
> > "high-level intrinsics" in the IR to mark things like reductions.
>
> IMO, this is too fragile for ensuring performance for realistic
> applications. While there is a lot of work on reduction recognition in
> the literature going back to the 1980s, there will always be cases
> where these algorithms miss a reduction pattern and you get terrible
> parallel performance.

This might be true, but it is also a problem for current optimizations,
e.g., loop vectorization. Missed optimization opportunities will not go
away one way or another.

> Second, I’m not convinced that you can use specific operations like
> “atomicrmw” to find candidate loop nests in the first place that
> *might* be reductions, because the atomic op may in fact be in a
> library operation linked in as binary code.

I do not get the library-operation comment, as atomicrmw is an instruction.
Also, the front-end would place the atomicrmw instructions in a parallel
loop, so there is no finding of candidate loop nests involved, right? (I
have the feeling you are mixing automatic parallelization with
representation of existing parallelism here.)

> Third, this greatly increases the complexity of a “first working
> prototype” for a language like OpenMP because you can’t exclude
> REDUCTION clauses. Fourth, I believe array privatization is even
> harder; more generally, you cannot rely on heroic compiler
> optimizations to take the place of explicit language directives.

As above, it is not about automatic parallelization of a sequential
program but about expressing parallel constructs and reasoning about them
afterwards. The latter is a different and often much easier problem.

> I may be showing my age, but I co-wrote a full HPF-to-MPI
> source-to-source compiler in the late 1990s at Rice University with
> some quite sophisticated optimizations [1,2], and we and others in the
> HPF compiler community — both product and research teams — encountered
> many cases where compiler analyses were inadequate. There is a
> credible view that HPF failed despite very high hopes and significant
> industry and research investment because it relied too much on heroic
> compiler technology.
>
> [1] http://vadve.web.engr.illinois.edu/Papers/set_framework.pldi98.ps
> (PLDI 1998) [2]
> http://vadve.web.engr.illinois.edu/Papers/SC98.NASBenchmarksStudy/dhpfsc98.pdf
> (SC 1998)
>
> -—Vikram
>
> // Vikram S. Adve // Professor, Department of Computer Science //
> University of Illinois at Urbana-Champaign // va...@illinois.edu //
> http://llvm.org

As a final remark I want to stress that a lot of the problems or
complications you mention do not go away with region annotations and/or
intrinsics/metadata. We always have to extract high-level knowledge from
low-level constructs; the question is only what these constructs will
look like. I try to argue that simple low-level constructs are better
suited for the job, as they are easier to understand and transform with
existing passes. Nevertheless, there will always be cases we cannot
handle well in IR and which should be optimized beforehand, similar to
the optimizations functional languages perform before reaching LLVM IR.

Cheers,
Johannes

Johannes Doerfert via llvm-dev

Jan 20, 2017, 7:54:56 AM
to Adve, Vikram Sadanand, llvm-dev, llvm-dev...@lists.llvm.org
On 01/19, Adve, Vikram Sadanand via llvm-dev wrote:
> > On Jan 19, 2017, at 1:46 PM, Mehdi Amini <mehdi...@apple.com> wrote:
> >
> >>
> >> On Jan 19, 2017, at 11:36 AM, Adve, Vikram Sadanand via llvm-dev <llvm...@lists.llvm.org> wrote:
> >>
> <snip>
> >> Third, OpenMP has *numerous* clauses, e.g., REDUCTION or PRIVATE, that are needed because without that, you’d be dependent on fundamentally hard compiler analyses to extract the same information for satisfactory parallel performance; realistic applications cannot depend on the success of such analyses.
> >
> > I agree with this, but I’m also wondering if it needs to be first class in the IR?
> > For example we know our alias analysis is very basic, and C/C++ have a higher constraint thanks to their type system, but we didn’t inject this higher level information that helps the optimizer as first class IR constructs.
> > I wonder if the same wouldn’t apply to an openmp reduction clause for instance, where you could use the “basic” IR construct and the analysis would use a metadata emitted by the frontend instead of trying to infer the reduction.
>
>
> As Danny said (next message), lack of strong AA does limit some of our
> optimizations today. Even beyond that, however, parallel performance
> is a different beast from scalar optimizations. If you don’t
> devirtualize a subset of indirect calls, you may take a few % hit. If
> you fail to recognize a reduction and run the loop serially instead
> of in parallel, you could easily take a 10x or greater slowdown on
> that loop, leading to a substantial slowdown overall, depending on the
> importance of the reduction.

This is not about automatic detection of parallelism and/or reductions.
The front-end would emit parallel code and encode the reduction one way
or another. As long as the encoding is semantically correct we would
always get a correctly running parallel result. The question here is:
if the reduction is recognized in the parallel lowering step, we can
use the reduction support of the parallel runtime library; if not, we
would "have to" implement it the way the front-end encoded it, e.g.,
with atomicrmw operations.

> Put another way, individual parallel optimizations (whether done
> manually or by a compiler) have far larger performance effects,
> generally speaking, than individual scalar optimizations.

As these proposals are all about parallel representation and not yet
optimization I think it is hard to argue that one is inherently better
suited to optimize already represented parallel regions than the other.

Johannes Doerfert via llvm-dev

Jan 20, 2017, 7:58:11 AM
to Daniel Berlin, llvm-dev, llvm-dev...@lists.llvm.org, Adve, Vikram Sadanand
Right. And while I think the metadata Hal proposed is different (see the
answer by Hal), prior experience (llvm.parallel.loop) taught us that it
takes a lot of effort to keep arbitrarily attached metadata up to date
and working well.




Johannes Doerfert via llvm-dev

Jan 20, 2017, 8:04:25 AM
to Sanjoy Das, llvm-dev, llvm-dev...@lists.llvm.org, Adve, Vikram Sadanand
Hi Sanjoy,

thanks for the example! It is pretty similar to what we have in our PIR
white paper (see page 10 in [0]). We think such cases nicely showcase
that we should use actual control flow (in combination with SSA and
dominance) to prevent current optimizations from breaking parallel
invariants, even without them knowing about parallelism.

Cheers,
Johannes

[0] http://compilers.cs.uni-saarland.de/people/doerfert/parallelcfg.pdf

Johannes Doerfert via llvm-dev

Jan 20, 2017, 8:34:50 AM
to Daniel Berlin, llvm-dev
On 01/11, Daniel Berlin via llvm-dev wrote:
> >
> > def int_experimental_directive : Intrinsic<[], [llvm_metadata_ty],
> >                                            [IntrArgMemOnly],
> >                                            "llvm.experimental.directive">;
> >
> > def int_experimental_dir_qual : Intrinsic<[], [llvm_metadata_ty],
> >                                           [IntrArgMemOnly],
> >                                           "llvm.experimental.dir.qual">;
> >
> > def int_experimental_dir_qual_opnd : Intrinsic<[],
> >                                          [llvm_metadata_ty, llvm_any_ty],
> >                                          [IntrArgMemOnly],
> >                                          "llvm.experimental.dir.qual.opnd">;
> >
> > def int_experimental_dir_qual_opndlist : Intrinsic<[],
> >                                          [llvm_metadata_ty, llvm_vararg_ty],
> >                                          [IntrArgMemOnly],
> >                                          "llvm.experimental.dir.qual.opndlist">;
> >
> >
> I'll bite.
>
> What does argmemonly mean when the operands are metadata?
> :)
>
> If the rest is an attempt to keep the intrinsic from being floated or
> removed, I'm strongly against extending a way we already know to have
> a significant effect on optimization (fake memory dependence) to do this.
> Particularly for something so major.

I guess that any kind of extension that pretends to have some sort
of side effect will have a significant effect on optimizations. The
tricky part is to find the right representation (and side effects) that
implicitly keeps the semantics/invariants of parallel code preserved
while allowing as many transformations as possible.

[The following paragraph might be a bit abstract. If it is unclear
please tell me and I will add a code example.]

In the example by Sanjoy [0] we saw that "parallel region markers" need
to be a barrier for alloca movement, though we might want some
transformations to "move" them nevertheless, e.g., to aggregate allocas
executed in parallel outside the parallel region as a means of
communication. To make a transformation like this happen, but prevent
the movement Sanjoy described at the same time, we probably have to
educate some passes on the semantics of "parallel region markers".
Alternatively, (my hope is that) if we use known concepts (mainly
dominance) to encode parts of the parallel invariants, such optimizations
should come at a much lower cost.

Cheers,
Johannes

[0] http://lists.llvm.org/pipermail/llvm-dev/2017-January/109302.html

Yonghong Yan via llvm-dev

Jan 20, 2017, 9:50:00 AM
to Johannes Doerfert, llvm-dev
Johannes and Sanjoy, 

Thank you for the info, illustration and examples. 

In Sanjoy's example, the current situation is that the whole parallel region is outlined as a new function by the frontend, and intrinsic_a includes the kmp_fork_call, which takes as input a pointer to that function for the runtime to execute in parallel. Thus it prevents those exotic or evil optimizations from happening. The same holds for OpenMP task and cilk_spawn (cilk_spawn specifically requires a new function because of the way it handles the frame and stack in the runtime).
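
Roughly what happens today for Sanjoy's example (a sketch; names and signatures are simplified):

  void main() {
    __kmpc_fork_call(&loc, /*nargs*/ 1, &outlined_region, a);
  }

  // front-end-outlined body, invoked by the runtime in each thread:
  void outlined_region(int *gtid, int *btid, int *a) {
    int i = omp_get_thread_num();
    int val;
    compute_something_into_val(&val, i);
    a[i] = val;
  }

Because the body sits in its own function, the alloca for val cannot be hoisted out of the region by function-local passes.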

If we do not let the compiler outline in the front-end (or we enable inter-procedural analysis/optimization) and instead have a parallel-region marker (region metadata, instructions, or something else), the region could be handled in one of two ways: 1) treat it as if it were a sequential loop with 4 iterations, so that most passes for sequential code do proper optimization; this limits us to parallel-ignorant optimization, which is the current situation regardless; or 2) mark the region as a real parallel region and develop parallel-aware passes that are independent of the parallel-ignorant ones. In any approach to making the PIR, I would feel safer if we can separate parallel-ignorant passes from parallel-aware ones, i.e., introduce few or even no changes to the current passes.

Yonghong Yan
Assistant Professor
Department of Computer Science and Engineering
School of Engineering and Computer Science
Oakland University
Office: EC 534
Phone: 248-370-4087
Email: y...@oakland.edu

Hal Finkel via llvm-dev

Jan 20, 2017, 9:59:10 AM
to Mehdi Amini, llvm-dev

On 01/13/2017 12:11 PM, Mehdi Amini wrote:


On Jan 13, 2017, at 9:41 AM, Hal Finkel <hfi...@anl.gov> wrote:


On 01/13/2017 12:29 AM, Mehdi Amini wrote:

On Jan 12, 2017, at 5:02 PM, Hal Finkel <hfi...@anl.gov> wrote:

On 01/12/2017 06:20 PM, Reid Kleckner via llvm-dev wrote:

On Wed, Jan 11, 2017 at 8:13 PM, Mehdi Amini <mehdi...@apple.com> wrote:
Can you elaborate why? I’m curious.

The con of proposal c was that many passes would need to learn about many region intrinsics. With tokens, you only need to teach all passes about tokens, which they should already know about because WinEH and other things use them.

With tokens, we can add as many region-introducing intrinsics as makes sense without any additional cost to the middle end. We don't need to make one omnibus region intrinsic set that describes every parallel loop annotation scheme supported by LLVM. Instead we would factor things according to other software design considerations.

I think that, unless we allow frontends to add their own intrinsics without recompiling LLVM, this severely restricts the usefulness of this feature.

I’m not convinced that “building a frontend without recompiling LLVM while injecting custom passes” is a strong compelling use-case, i.e. can you explain why requiring such use-case/frontends to rebuild LLVM is so limiting?

I don't understand your viewpoint. Many frontends either compose their own pass pipelines or use the existing extension-point mechanism. Some frontends, Chapel for example, can insert code using custom address spaces and then insert passes later to turn accesses using pointers to those address spaces into runtime calls. This is the kind of design we'd like to support, without forcing frontends to use custom versions of LLVM, but with annotated regions instead of just with address spaces.

I think we're talking about two different things here: you mentioned originally "without recompiling LLVM", which I don't see as a major blocker, while you're now clarifying, I think, that you're more concerned about putting a requirement on a *custom* LLVM, as in "it wouldn't work with the source from a vanilla upstream LLVM", which I agree is a different story.

That said, it extends the point from the other email (in parallel) about the semantics of the intrinsics: while your solution allows these frontends to reuse the intrinsics, it means that upstream optimizations have to treat such intrinsics as optimization barriers because their semantics are unknown.

I see no reason why this needs to be true (at least so long as you're willing to accept a certain amount of "as if" parallelism). Moreover, if it is true, then we'll lose the benefits of, for example, being able to hoist scalar loads out of parallel loops. We might need to include dependencies on "inaccessible memory" to cover natural runtime dependencies by default (this can be refined with custom AA logic), but that is not a complete code-motion barrier. Memory being explicitly managed will end up as arguments to the region intrinsics, so we'll automatically get more fine-grained information.

 -Hal



— 
Mehdi

Johannes Doerfert via llvm-dev

Jan 20, 2017, 12:43:39 PM
to Yonghong Yan, llvm-dev
On 01/20, Yonghong Yan wrote:
> Johannes and Sanjoy,
>
> Thank you for the info, illustration and examples.

If you are interested, there are some more in the white paper of PIR:
http://compilers.cs.uni-saarland.de/people/doerfert/parallelcfg.pdf

> In Sanjoy's example, the current situation is that the whole parallel
> region is outlined as a new function by the frontend, and intrinsic_a
> includes the kmp_fork_call which takes input a pointer of that function to
> the runtime for parallel execution. Thus it prevented those exotic or evil
> optimization happening. The same thing for OpenMP task and cilk_spawn
> (cilk_spawn specifically requires a new function because of the way it
> handles the frame and stack in the runtime)
>
> If we do not let the compiler outline in the front-end (or we enable
> inter-procedural analysis/optimization) and instead have a
> parallel-region marker (region metadata, instructions, or something
> else), the region could be handled in one of two ways: 1) treat it as
> if it were a sequential loop with 4 iterations, so that most passes
> for sequential code do proper optimization; this limits us to
> parallel-ignorant optimization, which is the current situation
> regardless; or 2) mark the region as a real parallel region and
> develop parallel-aware passes that are independent of the
> parallel-ignorant ones. In any approach to making the PIR, I would
> feel safer if we can separate parallel-ignorant passes from
> parallel-aware ones, i.e., introduce few or even no changes to the
> current passes.

I agree, especially with your last sentence. Optimizations that
explicitly target parallel regions/loops/... should be developed and
maintained separately, with different heuristics, etc. At the same time I
don't think it is too far-fetched to expect some existing passes to work
even across the sequential-to-parallel boundary with little to no
modification, iff that boundary is designed "in a certain way".

Mehdi Amini via llvm-dev

Jan 20, 2017, 12:52:12 PM
to Hal Finkel, llvm-dev
On Jan 20, 2017, at 6:59 AM, Hal Finkel <hfi...@anl.gov> wrote:

On 01/13/2017 12:11 PM, Mehdi Amini wrote:


On Jan 13, 2017, at 9:41 AM, Hal Finkel <hfi...@anl.gov> wrote:


On 01/13/2017 12:29 AM, Mehdi Amini wrote:

On Jan 12, 2017, at 5:02 PM, Hal Finkel <hfi...@anl.gov> wrote:

On 01/12/2017 06:20 PM, Reid Kleckner via llvm-dev wrote:

On Wed, Jan 11, 2017 at 8:13 PM, Mehdi Amini <mehdi...@apple.com> wrote:
Can you elaborate why? I’m curious.

The con of proposal c was that many passes would need to learn about many region intrinsics. With tokens, you only need to teach all passes about tokens, which they should already know about because WinEH and other things use them.

With tokens, we can add as many region-introducing intrinsics as makes sense without any additional cost to the middle end. We don't need to make one omnibus region intrinsic set that describes every parallel loop annotation scheme supported by LLVM. Instead we would factor things according to other software design considerations.

I think that, unless we allow frontends to add their own intrinsics without recompiling LLVM, this severely restricts the usefulness of this feature.

I’m not convinced that “building a frontend without recompiling LLVM while injecting custom passes” is a strong compelling use-case, i.e. can you explain why requiring such use-case/frontends to rebuild LLVM is so limiting?

I don't understand your viewpoint. Many frontends either compose their own pass pipelines or use the existing extension-point mechanism. Some frontends, Chapel for example, can insert code using custom address spaces and then insert passes later to turn accesses using pointers to those address spaces into runtime calls. This is the kind of design we'd like to support, without forcing frontends to use custom versions of LLVM, but with annotated regions instead of just with address spaces.

I think we're talking about two different things here: you mentioned originally "without recompiling LLVM", which I don't see as a major blocker, while you're now clarifying, I think, that you're more concerned about putting a requirement on a *custom* LLVM, as in "it wouldn't work with the source from a vanilla upstream LLVM", which I agree is a different story.

That said, it extends the point from the other email (in parallel) about the semantics of the intrinsics: while your solution allows these frontends to reuse the intrinsics, it means that upstream optimizations have to treat such intrinsics as optimization barriers because their semantics are unknown.

I see no reason why this needs to be true (at least so long as you're willing to accept a certain amount of "as if" parallelism).

Sorry, I didn’t quite get that?

Moreover, if it is true, then we'll lose the benefits of, for example, being able to hoist scalar loads out of parallel loops. We might need to include dependencies on "inaccessible memory" to cover natural runtime dependencies by default (this can be refined with custom AA logic), but that is not a complete code-motion barrier. Memory being explicitly managed will end up as arguments to the region intrinsics, so we'll automatically get more fine-grained information.

Sanjoy gave an example of the kind of optimization that can break the semantic: http://lists.llvm.org/pipermail/llvm-dev/2017-January/109302.html ; I haven’t yet seen an explanation about how this is addressed?

I’m not sure how you imagine going around the optimization barrier that goes with “this intrinsic has an unknown semantic that can impact the control flow of the program implicitly”, unless it acts as a “hint” only (but I don’t believe it is the direction?).

— 
Mehdi

Tian, Xinmin via llvm-dev

Jan 20, 2017, 1:44:56 PM
to Sanjoy Das, Adve, Vikram Sadanand, llvm...@lists.llvm.org, llvm-dev...@lists.llvm.org
Sanjoy, the IR would look something like the snippet below. It is OK to hoist the alloca instruction outside the region. There are some small changes in the optimizer to understand the region-annotation intrinsic.

void main() {
i32* val = alloca i32
tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(), "QUAL.PRIVATE"(i32* val), "QUAL.NUM_THREADS"(i32 4)]

int i = omp_get_thread_num();
compute_something_into_val(val, i);
a[i] = val;

llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
}

With the above representation, we can do privatization and outlining as below:

void main() {
i32* val = alloca i32
i32* %i = alloca i32
tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(), "QUAL.PRIVATE"(i32* %val, i32 %i), "QUAL.NUM_THREADS"(i32 4)]

%i = omp_get_thread_num();
compute_something_into_val(%val, %i);
a[i] = %val;

llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
}

1. Create %priv_val = alloca i32 and %priv_i = alloca i32 in the region, and replace all uses of %val with %priv_val (and %i with %priv_i) in the region.
2. Perform outlining.

Caller code:
....
omp_push_num_threads(4)
omp_fork_call(.... outlined_par_region ....) ....

Callee code:
outlined_par_region {
i32* priv_val = alloca i32
i32* priv_i = ....

ret
}
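For reference, here is a self-contained version of that outlined form in LLVM IR (a sketch only; the @rt_* names are stand-ins for the real runtime entry points, e.g. __kmpc_fork_call, whose actual signatures carry more bookkeeping arguments):

declare void @rt_push_num_threads(i32)
declare void @rt_fork(void ()*)

; Anything marked "private" in the tag becomes a fresh alloca in the
; outlined body.
define internal void @outlined_par_region() {
entry:
  %priv_val = alloca i32
  %priv_i = alloca i32
  ret void
}

define i32 @main() {
entry:
  call void @rt_push_num_threads(i32 4)
  call void @rt_fork(void ()* @outlined_par_region)
  ret i32 0
}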

For OpenMP, we do support it at -O0, -O1, -O2 and -O3. We had to make sure it runs correctly w/ and w/o optimizations and advanced analysis, so we need to preserve all source information for the back end (BE).
You can take a look at our LLVM-HPC paper for some details. There is still a ton of work to be done. Thanks.

Xinmin

-----Original Message-----
From: llvm-dev [mailto:llvm-dev...@lists.llvm.org] On Behalf Of Sanjoy Das via llvm-dev
Sent: Thursday, January 19, 2017 10:13 PM
To: Adve, Vikram Sadanand <va...@illinois.edu>
Cc: llvm-dev <llvm...@lists.llvm.org>; llvm-dev...@lists.llvm.org
Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations

Yonghong Yan via llvm-dev

Jan 20, 2017, 1:45:24 PM
to Mehdi Amini, llvm-dev
If you were asking how this is addressed in the current clang/openmp: the code of the whole parallel region is outlined into a new function by the frontend, and the parallel fork-join is transformed into a runtime call (kmpc_fork_call) that takes as input a pointer to the outlined function. So procedure-based optimizations would not perform the optimizations Sanjoy listed.

Yonghong


Tian, Xinmin via llvm-dev

Jan 20, 2017, 1:45:51 PM
to Sanjoy Das, llvm...@lists.llvm.org, llvm-dev...@lists.llvm.org, Adve, Vikram Sadanand


In the case, "val" is shared per OpenMP language rule. There is no privatization needed. %val is on the stack of master, to share %val among all threads, &val is passed to the outlined function.

void main() {
int val;
#pragma omp parallel num_threads(4)
{
// Really bad naming, won't pass code review. :)
compute_something_into_val(&val, omp_get_thread_num());
}
}

The IR would be:

void main() {
i32* val = alloca i32
tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(), "QUAL.SHARED"(i32* %val), "QUAL.NUM_THREADS"(i32 4)]
%1 = omp_get_thread_num();
compute_something_into_val(%val, %1);
llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
}
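For the shared case, the outlined form passes the address through; here is a minimal self-contained sketch in LLVM IR (again, @rt_fork is an illustrative stand-in for the actual runtime entry point, with a simplified signature):

declare void @rt_fork(void (i32*)*, i32*)
declare i32 @omp_get_thread_num()
declare void @compute_something_into_val(i32*, i32)

; %val stays shared: every worker receives the same address.
define internal void @outlined_shared(i32* %val) {
entry:
  %tid = call i32 @omp_get_thread_num()
  call void @compute_something_into_val(i32* %val, i32 %tid)
  ret void
}

define i32 @main() {
entry:
  %val = alloca i32
  call void @rt_fork(void (i32*)* @outlined_shared, i32* %val)
  ret i32 0
}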

Xinmin

-----Original Message-----
From: Sanjoy Das [mailto:san...@playingwithpointers.com]
Sent: Thursday, January 19, 2017 11:40 PM
To: Tian, Xinmin <xinmi...@intel.com>
Cc: Adve, Vikram Sadanand <va...@illinois.edu>; llvm-dev...@lists.llvm.org
Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations

On Thu, Jan 19, 2017 at 11:27 PM, Sanjoy Das <san...@playingwithpointers.com> wrote:
> Hi Xinmin,
>
> On Thu, Jan 19, 2017 at 11:20 PM, Tian, Xinmin <xinmi...@intel.com> wrote:
>> Sanjoy, the IR would look something like the snippet below. It is OK to hoist the alloca instruction outside the region. There are some small changes in the optimizer to understand the region-annotation intrinsic.
>>
>> void main() {
>> i32* val = alloca i32
>> tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(),
>> "QUAL.PRIVATE"(i32* val), "QUAL.NUM_THREADS"(i32 4)]
>>
>> int i = omp_get_thread_num();
>> compute_something_into_val(val, i);
>> a[i] = val;
>>
>> llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
>> }
>>
>> With the above representation, we can do privatization and outlining
>> as below:
>>
>> void main() {
>> i32* val = alloca i32
>> i32* %i = alloca i32
>> tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(),
>> "QUAL.PRIVATE"(i32* %val, i32 %i), "QUAL.NUM_THREADS"(i32 4)]
>>
>> %i = omp_get_thread_num();
>> compute_something_into_val(%val, %i);
>> a[i] = %val;
>>
>> llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
>> }
>>
>> 1. Create %priv_val = alloca i32 and %priv_i = alloca i32 in the region, and replace all uses of %val with %priv_val (and %i with %priv_i) in the region.
>> 2. Perform outlining.
>
> But then what if compute_something_into_val is
>
> void compute_something_into_val(i32* ptr, i32 idx) {
> static i32* cookie = null;
> lock_mutex();
> if (cookie == null)
> cookie = ptr
> else
> assert(cookie == ptr);
> unlock_mutex();
> // don't write to ptr, so there is no race
> }
>
> In other words, how do you differentiate between the hoisted-alloca
> situation arising due to a hoist vs. arising because that's what the
> programmer intended (and you're required to pass in the same address
> to each call into compute_something_into_val)?

Just to be 100% clear, the source program in the latter case would have to be:

void main() {
int val;
#pragma omp parallel num_threads(4)
{
// Really bad naming, won't pass code review. :)
compute_something_into_val(&val, omp_get_thread_num());
}
}

-- Sanjoy

Mehdi Amini via llvm-dev

Jan 20, 2017, 1:48:24 PM
to Yonghong Yan, llvm-dev

On Jan 20, 2017, at 10:45 AM, Yonghong Yan <yan...@gmail.com> wrote:



On Fri, Jan 20, 2017 at 12:52 PM, Mehdi Amini via llvm-dev <llvm...@lists.llvm.org> wrote:

Sanjoy gave an example of the kind of optimization that can break the semantic: http://lists.llvm.org/pipermail/llvm-dev/2017-January/109302.html ; I haven’t yet seen an explanation about how this is addressed?
If you were asking how this is addressed in the current clang/openmp: the code of the whole parallel region is outlined into a new function by the frontend, and the parallel fork-join is transformed into a runtime call (kmpc_fork_call) that takes as input a pointer to the outlined function. So procedure-based optimizations would not perform the optimizations Sanjoy listed.


Right, but the question is rather how it’ll work with this proposal.

— 
Mehdi

Mehdi Amini via llvm-dev

Jan 20, 2017, 1:54:13 PM
to Tian, Xinmin, llvm...@lists.llvm.org, llvm-dev...@lists.llvm.org, Adve, Vikram Sadanand

> On Jan 20, 2017, at 10:44 AM, Tian, Xinmin via llvm-dev <llvm...@lists.llvm.org> wrote:
>
> Sanjoy, the IR would look something like the snippet below. It is OK to hoist the alloca instruction outside the region. There are some small changes in the optimizer to understand the region-annotation intrinsic.
>
> void main() {
> i32* val = alloca i32
> tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(), "QUAL.PRIVATE"(i32* val), "QUAL.NUM_THREADS"(i32 4)]
>
> int i = omp_get_thread_num();
> compute_something_into_val(val, i);
> a[i] = val;
>
> llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
> }
>
> With the above representation, we can do privatization and outlining as below:
>
> void main() {
> i32* val = alloca i32
> i32* %i = alloca i32
> tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(), "QUAL.PRIVATE"(i32* %val, i32 %i), "QUAL.NUM_THREADS"(i32 4)]
>
> %i = omp_get_thread_num();
> compute_something_into_val(%val, %i);
> a[i] = %val;
>
> llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
> }
>

Here we come to the interesting part: the hoisting of "i32* %i = alloca i32" above the intrinsics requires updating the "QUAL.PRIVATE" information of the intrinsics.
This means that the optimizer has to be aware of it; I’m missing the magic here?
I understand that an OpenMP-specific optimization can do it; the question is how an OpenMP-agnostic pass is supposed to behave in the face of llvm.experimental.intrinsic_a?


Mehdi

Tian, Xinmin via llvm-dev

Jan 20, 2017, 2:17:57 PM
to mehdi...@apple.com, llvm...@lists.llvm.org, llvm-dev...@lists.llvm.org, Adve, Vikram Sadanand
>>>>This means that the optimizer has to be aware of it; I’m missing the magic here?

This is one option.

The other option is, as I mentioned about our implementation in our LLVM-HPC paper, to have a "prepare phase for pre-privatization" that can be invoked by both the Clang FE and the Fortran FE right after LLVM IR is generated. In this way, we are able to minimize the optimization impact on the original val and i:

void main() {
i32* val = alloca i32
i32* %i = alloca i32
i32* %priv_val = alloca i32
i32* %priv_i = alloca i32
tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(),"QUAL.PRIVATE"(i32* %priv_val, i32 %priv_i), "QUAL.NUM_THREADS"(i32 4)]

%priv_i = omp_get_thread_num();
compute_something_into_val(%priv_val, %priv_i);
a[priv_i] = %priv_val;

llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
....

%i = ...
%val = ...
foo(%val, %i)
}

"Prepare phase" is our way of minimizing the impact to existing optimizations.

Xinmin

Yonghong Yan via llvm-dev

Jan 20, 2017, 2:21:13 PM
to llvm...@lists.llvm.org, llvm-dev...@lists.llvm.org, Adve, Vikram Sadanand
Xinmin,

Outlining turns a parallel program into a sequential one from the compiler's perspective, and that is why most parallelism-ignorant passes would not mis-optimize it. In your IR description for Sanjoy's example, does the current approach of outlining impact the way the IR should be enhanced for parallelism?

For that specific example (or other SPMD analyses and optimizations) and what is implemented in clang, I am not sure whether we are going to change the frontend so as not to outline the parallel region, or allow certain optimizations such as hoisting that alloca in clang, which I believe is not desired. Or annotate the outlined function together with intrinsic_a so that hoisting can be performed, in which case intrinsic_a would look like this:

tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(), "QUAL.PRIVATE"(i32* %val, i32 %i), "QUAL.NUM_THREADS"(i32 4)] plus info for the OUTLINED call.

Mehdi, 

I think i am asking the same question as you asked me. 

Yonghong

Tian, Xinmin via llvm-dev

Jan 20, 2017, 2:22:44 PM
to Yonghong Yan, Mehdi Amini, llvm...@lists.llvm.org, llvm-dev...@lists.llvm.org, Adve, Vikram Sadanand

Yonghong, in our implementation (not open sourced), we don't do outlining in the front end. See my previous reply to Mehdi's email.

 

Xinmin

 


Sanjoy Das via llvm-dev

Jan 20, 2017, 2:36:01 PM
to Tian, Xinmin, llvm...@lists.llvm.org, llvm-dev...@lists.llvm.org, Yonghong Yan, Adve, Vikram Sadanand
Hi,

I'm going to club together some responses.

I agree that outlining function sub-bodies and passing in the function
pointers to said outlined bodies to OpenMP helpers lets us correctly
implement the semantics we need. However, unless I severely
misunderstood the thread, I thought the key idea was to move *away*
from that representation and towards a representation that _allows_
optimization?

My problem with representing parallel regions with
intrinsic-denoted-regions is that we're lying to the optimizer about
what the code actually does. Calls, even to intrinsics, can "at
worst" do some combination of the following:

- Write to and read from arbitrary memory
- Have UB (but we're allowed to pretend that they don't)
- Throw an exception
- Never return, either by infinite looping or by calling exit(0)
- Have memory synchronization operations, like fences, atomic loads,
stores etc.
- Have side effects like IO, volatile writes

If an intrinsic's behavior can be explained by some subset of the
above, then you should not need to edit any pass to preserve
_correctness_ -- all optimization passes (today) conservatively assume
that calls that they don't understand have all of the behaviors
outlined above.

However, if to preserve *correctness* you have to edit optimization
passes and teach them that certain intrinsic calls have behavior
*outside* the set mentioned above then the instruction really does not
have "call semantics". `call @llvm.experimental.region_begin()` is
really a fundamentally new instruction masquerading as an intrinsic,
and it is probably better to call a spade a spade and represent it as
a new instruction.

The setting for the examples I gave was not that "here is a case we
need to get right". The setting was that "here is a *symptom* that
shows that we've lied to the optimizer". We can go ahead and fix all
the symptoms by adding bailouts to the respective passes, but that
does not make us immune to passes that we don't know about
e.g. downstream passes, passes that will be added later. It also puts
us in a weird spot around semantics of call instructions.

-- Sanjoy

On Fri, Jan 20, 2017 at 11:22 AM, Tian, Xinmin via llvm-dev

Mehdi Amini via llvm-dev

Jan 21, 2017, 4:57:50 PM
to Tian, Xinmin, llvm...@lists.llvm.org, llvm-dev...@lists.llvm.org, Adve, Vikram Sadanand

> On Jan 20, 2017, at 11:17 AM, Tian, Xinmin <xinmi...@intel.com> wrote:
>
>>>>> This means that the optimizer has to be aware of it; I’m missing the magic here?
>
> This is one option.
>
> The other option is, as I mentioned about our implementation in our LLVM-HPC paper, to have a "prepare phase for pre-privatization" that can be invoked by both the Clang FE and the Fortran FE right after LLVM IR is generated. In this way, we are able to minimize the optimization impact on the original val and i

Ok, but this looks like a “workaround" for your specific use case; I don’t see how it can scale as a model-agnostic and general-purpose region semantic.

The fact that you needed this pre-step in the first place seems to confirm what multiple people have expressed in this thread, for example what Daniel wrote here: http://lists.llvm.org/pipermail/llvm-dev/2017-January/108997.html


Mehdi

Mehdi Amini via llvm-dev

Jan 21, 2017, 5:05:02 PM
to Sanjoy Das, llvm...@lists.llvm.org, llvm-dev...@lists.llvm.org, Yonghong Yan, Adve, Vikram Sadanand
I missed this before answering in my last email about how, “as is”, it does not sound like something that can scale non-intrusively in LLVM as a model-agnostic and general-purpose region semantic.
Sanjoy just explains it better :), so right now my understanding is the same as what Sanjoy expressed below.


Mehdi

Tian, Xinmin via llvm-dev

Jan 31, 2017, 8:26:26 PM
to Sanjoy Das, llvm...@lists.llvm.org, llvm-dev...@lists.llvm.org, Yonghong Yan, Adve, Vikram Sadanand
[XT] Back from Biz trips, trying to catch up with the discussion.

>>>>I agree that outlining function sub-bodies and passing in the function pointers to said outlined bodies to OpenMP helpers lets us correctly implement the semantics we need. However, unless I severely misunderstood the thread, I thought the key idea was to move *away* from that representation and towards a representation that _allows_ optimization?

[XT]: Your understanding is correct. But the IR-level region annotation RFC is not just for OpenMP; OpenMP is one of the usage cases.

>>>>My problem with representing parallel regions with intrinsic-denoted-regions is that we're lying to the optimizer about what the code actually does. Calls, even to intrinsics, can "at worst" do some combination of the following:

- Write to and read from arbitrary memory
- Have UB (but we're allowed to pretend that they don't)
- Throw an exception
- Never return, either by infinite looping or by calling exit(0)
- Have memory synchronization operations, like fences, atomic loads,
stores etc.
- Have side effects like IO, volatile writes

[XT] Based on Google's and Xilinx's suggestions, the IR-level region annotation can use tokens and tags with intrinsic functions to model regions and memory dependencies (use/def). The above list is handled based on language rules. E.g., the OpenMP rule says that, in a parallel region, throwing an exception is allowed, but it must be caught within the region, i.e. no control-flow edge is allowed to cross the region boundary. "exit" is one exception that is allowed, as it terminates the program. Our solution is to have the FE and/or one central place in the ME deal with language specifics.

>>>>However, if to preserve *correctness* you have to edit optimization passes and teach them that certain intrinsic calls have behavior
*outside* the set mentioned above then the instruction really does not have "call semantics". `call @llvm.experimental.region_begin()` is really a fundamentally new instruction masquerading as an intrinsic, and it is probably better to call a spade a spade and represent it as a new instruction.

[XT] Yes and no. Yes w.r.t. region scope annotation; no in that it is more than one new instruction, it is more like a sequence of instructions. Assume we have a "fork" instruction: OMP fork and Cilk fork/spawn semantics differ in terms of stack frame allocation and ABI. When we introduce a new instruction, the exact semantics needs to be defined; it can't be altered. Thus, we proposed to start with experimental intrinsics, and that approach is proven to work. We can always convert the intrinsics with token/tag to instructions when we have enough solid cases / justification for the model-agnostic part of the conversion.

>>>>>The setting for the examples I gave was not that "here is a case we need to get right". The setting was that "here is a *symptom* that shows that we've lied to the optimizer". We can go ahead and fix all the symptoms by adding bailouts to the respective passes, but that does not make us immune to passes that we don't know about e.g. downstream passes, passes that will be added later. It also puts us in a weird spot around semantics of call instructions.

[XT] I would say it is a design trade-off between having a central place to deal with specifics and making drastic changes from day one. Our process is to have a central place to get everything working, then turn off the support for some "symptoms" in this central place one by one to trigger downstream failures and fix them. I think our ultimate goals are more or less the same; we are just taking different approaches to get there. The central place / prepare phase for getting the IR into a "canonical" form will help to address the issue you mentioned: "downstream passes, passes that will be added later. It also puts us in a weird spot around semantics of call instructions."

Thanks for all questions, discussions and feedback.

Xinmin

-----Original Message-----
From: Sanjoy Das [mailto:san...@playingwithpointers.com]
Sent: Friday, January 20, 2017 11:36 AM
To: Tian, Xinmin <xinmi...@intel.com>

Tian, Xinmin via llvm-dev

Jan 31, 2017, 8:38:25 PM
to mehdi...@apple.com, llvm...@lists.llvm.org, llvm-dev...@lists.llvm.org, Adve, Vikram Sadanand
>>>>Ok, but this looks like a “workaround" for your specific use case; I don’t see how it can scale as a model-agnostic and general-purpose region semantic.

I would say it is a design trade-off. Regardless of whether it is a new instruction or an intrinsic with token/tag, it will consist of a model-agnostic part and a model-specific part. The package comes with a framework for parsing and using these intrinsics. See the reply I had to Sanjoy's email.

Xinmin

-----Original Message-----
From: mehdi...@apple.com [mailto:mehdi...@apple.com]
Sent: Saturday, January 21, 2017 1:57 PM
To: Tian, Xinmin <xinmi...@intel.com>

Mehdi Amini via llvm-dev

Jan 31, 2017, 8:47:24 PM
to Tian, Xinmin, llvm...@lists.llvm.org, llvm-dev...@lists.llvm.org, Adve, Vikram Sadanand

> On Jan 31, 2017, at 5:38 PM, Tian, Xinmin <xinmi...@intel.com> wrote:
>
>>>>> Ok, but this looks like a “workaround" for your specific use case; I don’t see how it can scale as a model-agnostic and general-purpose region semantic.
>
> I would say it is a design trade-off.

I’m not sure if we’re talking about the same thing here: my understanding at this point is that in the design trade-off you’re making, you gain “simplicity” by sacrificing “correctness”.

Requiring the IR to stay in what you’re calling a “canonical” form in your answer to Sanjoy in order to not miscompile a program is not an approach that seems compatible with how we deal with the IR usually.

> Regardless of whether it is a new instruction or an intrinsic with token/tag, it will consist of a model-agnostic part and a model-specific part. The package comes with a framework for parsing and using these intrinsics. See the reply I had to Sanjoy's email.

The answer to Sanjoy does not really help clarify anything for me. At this point I still don’t understand how this is supposed to be correct in general.

It would be helpful to have a LangRef patch that describes the semantics associated with your region intrinsics. Then we may be able to process some examples through the formalized description.


Mehdi

Tian, Xinmin via llvm-dev

Jan 31, 2017, 9:48:40 PM
to mehdi...@apple.com, llvm...@lists.llvm.org, llvm-dev...@lists.llvm.org, Adve, Vikram Sadanand
Let me try this.

You can simply consider that the prepare phase (e.g. pre-privatization) is done in the FE (actually, a library can be used by multiple FEs at the LLVM IR level), that the region is run with 1 thread, and that the region annotation (a single-entry single-exit scope marked with region intrinsic functions) conservatively acts as a memory barrier for now (instead of checking individual memory dependencies and aliasing via tags, which is what the actual implementation does). What optimization will mess up this region annotation?
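Concretely (illustrative declarations, not a settled design), the conservative-barrier reading falls out of declaring the region intrinsics with no memory attributes at all:

; With no readnone/readonly/argmemonly attributes, every pass must assume
; these calls may read and write arbitrary memory, so loads and stores are
; not reordered across the region boundaries.
declare token @llvm.experimental.intrinsic_a()
declare void @llvm.experimental.intrinsic_b(token)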

Mehdi Amini via llvm-dev

Jan 31, 2017, 10:07:55 PM
to Tian, Xinmin, llvm...@lists.llvm.org, llvm-dev...@lists.llvm.org, Adve, Vikram Sadanand

> On Jan 31, 2017, at 6:48 PM, Tian, Xinmin <xinmi...@intel.com> wrote:
>
> Let me try this.
>
> You can simply consider that the prepare phase (e.g. pre-privatization) is done in the FE (actually, a library can be used by multiple FEs at the LLVM IR level), that the region is run with 1 thread, and that the region annotation (a single-entry single-exit scope marked with region intrinsic functions) conservatively acts as a memory barrier for now (instead of checking individual memory dependencies and aliasing via tags, which is what the actual implementation does). What optimization will mess up this region annotation?

The first thing that comes to my mind is inlining, which can put the IR in a form that breaks the invariant you tried to enforce with your "prepare phase" (for example, by hoisting an alloca).

Tian, Xinmin via llvm-dev

Jan 31, 2017, 10:27:38 PM
to mehdi...@apple.com, llvm...@lists.llvm.org, llvm-dev...@lists.llvm.org, Adve, Vikram Sadanand
Remember that the prepare phase is invoked in the FE or right after the FE, before inlining happens; that is why we don't call it a "pass". Chandler made a good point about this case a long time back.

Hoisting the alloca is totally OK. A new alloca is generated during outlining later on for anything marked as "private" (as long as the "private" information is saved in the tag). I thought we discussed this in an earlier email.

By the way, all the concerns you have are valid; we worked on resolving these issues 10+ years back when we did similar things in our compilers. I wouldn't claim we have perfect solutions, but we do have reasonably good solutions for handling general directives and OpenMP directives.

Mehdi Amini via llvm-dev

Jan 31, 2017, 11:55:22 PM
to Tian, Xinmin, llvm...@lists.llvm.org, llvm-dev...@lists.llvm.org, Adve, Vikram Sadanand



On Jan 31, 2017, at 7:27 PM, Tian, Xinmin <xinmi...@intel.com> wrote:

Remember that the prepare phase is invoked in the FE or right after the FE, before inlining happens; that is why we don't call it a "pass". Chandler made a good point about this case a long time back.


What I was describing is the inlining in the optimizer pipeline.



Hoisting the alloca is totally OK. A new alloca is generated during outlining later on for anything marked as "private" (as long as the "private" information is saved in the tag). I thought we discussed this in an earlier email.


Can you describe how (and at which point) you get the private for "var"  added to the tag?


-- 
Mehdi

Tian, Xinmin via llvm-dev

Jan 31, 2017, 11:56:07 PM
to mehdi...@apple.com, llvm...@lists.llvm.org, llvm-dev...@lists.llvm.org, Adve, Vikram Sadanand

In this case, the inliner is educated to add all local variables to the tag of the enclosing parallel region, if there is an enclosing parallel region.

 

In our icc implementation, it is even simpler: since we have a routine-level symbol table, the inliner adds the ”private” attribute to those local variables w/o checking the enclosing scope, and the parallelizer does the checking and uses it.

 

Xinmin

Mehdi Amini via llvm-dev

Feb 1, 2017, 12:03:34 AM
to Tian, Xinmin, llvm...@lists.llvm.org, llvm-dev...@lists.llvm.org, Adve, Vikram Sadanand
On Jan 31, 2017, at 7:53 PM, Tian, Xinmin <xinmi...@intel.com> wrote:

In this case, the inliner is educated to add all local variables to the tag of the enclosing parallel region, if there is an enclosing parallel region.


So isn’t it a good example that shows that your intrinsic *cannot* be opaque and that IR passes need to be modified to handle not only the IR-region intrinsic but also the specific semantic of the tag?

It seems to me that this contradicts the claim that the tag-specific semantics do not need to be handled by the optimizer and that the intrinsic can integrate seamlessly into LLVM, which invalidates the approach (of a generic intrinsic) entirely IMO.

Maybe you never intended to claim this, but this is a hidden cost in the original RFC, and I suspect this cost has to be carefully evaluated. At this point I’m not sure it is worth discussing anything further without seeing a proper LangRef update.

 
In our icc implementation, it is even simpler: since we have a routine-level symbol table, the inliner adds the ”private” attribute to those local variables w/o checking the enclosing scope, and the parallelizer does the checking and uses it.

Again, you’re trying to address a specific case, while I’m just trying to identify a generic class of problem that your proposal fails to address explicitly.

Best,

— 
Mehdi

Tian, Xinmin via llvm-dev

Feb 1, 2017, 2:21:39 AM
to mehdi...@apple.com, llvm...@lists.llvm.org, llvm-dev...@lists.llvm.org, Adve, Vikram Sadanand

 

 

From: mehdi...@apple.com [mailto:mehdi...@apple.com]
Sent: Tuesday, January 31, 2017 9:03 PM
To: Tian, Xinmin <xinmi...@intel.com>
Cc: Sanjoy Das <san...@playingwithpointers.com>; Adve, Vikram Sadanand <va...@illinois.edu>; llvm...@lists.llvm.org; llvm-dev...@lists.llvm.org
Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations

 

 

On Jan 31, 2017, at 7:53 PM, Tian, Xinmin <xinmi...@intel.com> wrote:

 

In this case, the inliner is educated to add all local variables to the tag of the enclosing parallel region, if there is an enclosing parallel region.

So isn’t it a good example that shows that your intrinsic *cannot* be opaque and that IR passes need to be modified to handle not only the IR-region intrinsic but also the specific semantic of the tag?

 

[XT] I thought we said a number of times that there are small changes to be made. I quoted a ballpark # of 2000 LOC vs. 6000 LOC w.r.t. changes in one of the early emails.

 

It seems to me that this contradicts the claim that the tag-specific semantics do not need to be handled by the optimizer and that the intrinsic can integrate seamlessly into LLVM, which invalidates the approach (of a generic intrinsic) entirely IMO.

 

Maybe you never intended to claim this, but this is a hidden cost in the original RFC, and I suspect this cost has to be carefully evaluated. At this point I’m not sure it is worth discussing anything further without seeing a proper LangRef update.

 

[XT] All we said is to minimize the cost when possible. The intrinsic functions are generic for representing a directive and a region, such as prefetch, unroll, OMP, etc. Each instance of them will have its own semantics, which will be covered in follow-up RFCs.

  

In our icc implementation, it is even simpler: since we have a routine-level symbol table, the inliner adds the ”private” attribute to those local variables w/o checking the enclosing scope, and the parallelizer does the checking and uses it.

 

Again, you’re trying to address a specific case, while I’m just trying to identify a generic class of problem that your proposal fails to address explicitly.

 

[XT] It looks like there is a misunderstanding of the proposal. The proposal is to build up an experimental framework with the proposed interface to evaluate the set of changes (or cost) needed for the usage cases we know of. Hal and I carefully positioned this RFC; even the intrinsics are all named experimental. So far, the cost for handling the cases identified is within the range of our expectations.

Mehdi Amini via llvm-dev

Feb 1, 2017, 2:29:23 AM
to Tian, Xinmin, llvm...@lists.llvm.org, llvm-dev...@lists.llvm.org, Adve, Vikram Sadanand
On Jan 31, 2017, at 10:59 PM, Tian, Xinmin <xinmi...@intel.com> wrote:

 
From: mehdi...@apple.com [mailto:mehdi...@apple.com] 
Sent: Tuesday, January 31, 2017 9:03 PM
To: Tian, Xinmin <xinmi...@intel.com>
Cc: Sanjoy Das <san...@playingwithpointers.com>; Adve, Vikram Sadanand <va...@illinois.edu>; llvm...@lists.llvm.org; llvm-dev...@lists.llvm.org
Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations
 
 
On Jan 31, 2017, at 7:53 PM, Tian, Xinmin <xinmi...@intel.com> wrote:
 
In this case, the inliner is educated to add all local variables to the tag of the enclosing parallel region, if there is an enclosing parallel region.
 
 
So isn’t it a good example that shows that your intrinsic *cannot* be opaque and that IR passes need to be modified to handle not only the IR-region intrinsic but also the specific semantic of the tag?
 
[XT] I thought we said a number of times that there are small changes to be made. I quoted a ballpark # of 2000 LOC vs. 6000 LOC w.r.t. changes in one of the early emails.


This didn’t mean that the changes were meant specifically for OpenMP. My understanding was that this proposal is for a generic "IR-level Region Annotations” mechanism, and that’s what the changes were for. Now it ends up being “let’s support OpenMP semantics without adding openmp to the intrinsic names”.


 
It seems to me that this contradicts the claim that the tag-specific semantics do not need to be handled by the optimizer and that the intrinsic can integrate seamlessly into LLVM, which invalidates the approach (of a generic intrinsic) entirely IMO.
 
Maybe you never intended to claim this, but this is a hidden cost in the original RFC, and I suspect this cost has to be carefully evaluated. At this point I’m not sure it is worth discussing anything further without seeing a proper LangRef update.
 
[XT] All we said is to minimize the cost when possible. The intrinsic functions are generic for representing a directive and a region, such as prefetch, unroll, OMP, etc. Each instance of them will have its own semantics, which will be covered in follow-up RFCs.


At this point I don’t see any advantage in having a “generic intrinsic" that has an opaque tag, since all the semantics are in the tag anyway. I’d have to see what is really “generic” in the handling of it...

Reid identified this very early in the thread (he is a lot more perspicacious than I am) here: http://lists.llvm.org/pipermail/llvm-dev/2017-January/108914.html

— 
Mehdi

Hal Finkel via llvm-dev

Feb 1, 2017, 12:36:03 PM
to Mehdi Amini, Tian, Xinmin, llvm...@lists.llvm.org, llvm-dev...@lists.llvm.org, Adve, Vikram Sadanand


On 02/01/2017 01:29 AM, Mehdi Amini via llvm-dev wrote:

On Jan 31, 2017, at 10:59 PM, Tian, Xinmin <xinmi...@intel.com> wrote:

 
From: mehdi...@apple.com [mailto:mehdi...@apple.com] 
Sent: Tuesday, January 31, 2017 9:03 PM
To: Tian, Xinmin <xinmi...@intel.com>
Cc: Sanjoy Das <san...@playingwithpointers.com>; Adve, Vikram Sadanand <va...@illinois.edu>; llvm...@lists.llvm.org; llvm-dev...@lists.llvm.org
Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations
 
 
On Jan 31, 2017, at 7:53 PM, Tian, Xinmin <xinmi...@intel.com> wrote:
 
In this case, the inliner is educated to add all local variables to the tag of the enclosing parallel region, if there is an enclosing parallel region.
 
 
So isn’t it a good example that shows that your intrinsic *cannot* be opaque and that IR passes need to be modified to handle not only the IR-region intrinsic but also the specific semantic of the tag?
 
[XT] I thought we said a number of times that there are small changes to be made. I quoted a ballpark # of 2000 LOC vs. 6000 LOC w.r.t. changes in one of the early emails.


This didn’t mean that the changes were meant specifically for OpenMP. My understanding was that this proposal is for a generic "IR-level Region Annotations” mechanism, and that’s what the changes were for. Now it ends up being “let’s support OpenMP semantics without adding openmp to the intrinsic names”.

The point here is to abstract the properties about which other passes might need to know by using a set of generic intrinsics. The fact that you can't hoist allocas past one of these intrinsics is nowhere close to saying that the individual optimization passes need to know anything about OpenMP, parallelism, etc. Regardless of how many LOC are in Intel's prototype, we're obviously aiming for minimal impact on the current upstream infrastructure.




 
It seems to me that this contradicts the claim that the tag-specific semantics do not need to be handled by the optimizer and that the intrinsic can integrate seamlessly into LLVM, which invalidates the approach (of a generic intrinsic) entirely IMO.
 
Maybe you never intended to claim this, but this is a hidden cost in the original RFC, and I suspect this cost has to be carefully evaluated. At this point I’m not sure it is worth discussing anything further without seeing a proper LangRef update.
 
[XT] All we said is to minimize the cost when possible. The intrinsic functions are generic for representing a directive and a region, such as prefetch, unroll, OMP, etc. Each instance of them will have its own semantics, which will be covered in follow-up RFCs.


At this point I don’t see any advantage in having a “generic intrinsic" that has an opaque tag, since all the semantics are in the tag anyway. I’d have to see what is really “generic” in the handling of it...

This is the complete opposite of the point. The semantics relevant to the rest of the optimization pipeline should be in the intrinsics themselves. I've yet to see anything to suggest that we can't do that.



Reid identified this very early in the thread (he is a lot more perspicacious than I am) here: http://lists.llvm.org/pipermail/llvm-dev/2017-January/108914.html

There are multiple levels here:
 a) Semantics relevant to the rest of the pipeline
 b) Semantics relevant to parallelism-specific optimizations (e.g. redundant barrier removal)
 c) Semantics relevant to specific programming model / extension (OpenMP, OpenACC, C++ parallel algorithms, whatever)

We'd like to separate these three levels, and I believe the proposed scheme allows us to do that. Obviously, this assumes that we can indeed have a small set of intrinsics that satisfy the needs of (a). Furthermore, if we're going to use intrinsics, we need to decide whether all of the relevant semantics are reasonable to encode in intrinsics (e.g. it is reasonable to have an intrinsic past which you can't hoist an alloca, or would that need to be an instruction, etc.)

 -Hal
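To make that last question concrete, a minimal sketch under the assumption of token-typed region intrinsics (hypothetical names): in the first form, each worker gets its own %val once outlining runs; a hoist that is perfectly legal today yields the second form, where the storage is shared, which is why the entry intrinsic would need semantics that forbid the hoist.

declare token @llvm.experimental.region.entry()
declare void @llvm.experimental.region.exit(token)

define void @before() {
entry:
  %tok = call token @llvm.experimental.region.entry() [ "DIR.PARALLEL"() ]
  %val = alloca i32            ; per-worker storage after outlining
  call void @llvm.experimental.region.exit(token %tok) [ "DIR.END.PARALLEL"() ]
  ret void
}

define void @after() {
entry:
  %val = alloca i32            ; hoisted: now one address shared by all workers
  %tok = call token @llvm.experimental.region.entry() [ "DIR.PARALLEL"() ]
  call void @llvm.experimental.region.exit(token %tok) [ "DIR.END.PARALLEL"() ]
  ret void
}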

Mehdi Amini via llvm-dev

Feb 1, 2017, 12:43:02 PM
to Hal Finkel, llvm...@lists.llvm.org, llvm-dev...@lists.llvm.org, Adve, Vikram Sadanand
On Feb 1, 2017, at 9:34 AM, Hal Finkel <hfi...@anl.gov> wrote:


On 02/01/2017 01:29 AM, Mehdi Amini via llvm-dev wrote:

On Jan 31, 2017, at 10:59 PM, Tian, Xinmin <xinmi...@intel.com> wrote:

 
From: mehdi...@apple.com [mailto:mehdi...@apple.com] 
Sent: Tuesday, January 31, 2017 9:03 PM
To: Tian, Xinmin <xinmi...@intel.com>
Cc: Sanjoy Das <san...@playingwithpointers.com>; Adve, Vikram Sadanand <va...@illinois.edu>; llvm...@lists.llvm.org; llvm-dev...@lists.llvm.org
Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations
 
 
On Jan 31, 2017, at 7:53 PM, Tian, Xinmin <xinmi...@intel.com> wrote:
 
In this case, the inliner is educated to add all local variables to the tag of the enclosing parallel region, if there is an enclosing parallel region.
 
 
So isn’t it a good example that shows that your intrinsic *cannot* be opaque and that IR passes need to be modified to handle not only the IR-region intrinsic but also the specific semantic of the tag?
 
[XT] I thought we said a number of times that there are small changes to be made. I quoted a ballpark # of 2000 LOC vs. 6000 LOC w.r.t. changes in one of the early emails.


This didn’t mean that the changes were meant specifically for OpenMP. My understanding was that this proposal is for a generic "IR-level Region Annotations” mechanism, and that’s what the changes were for. Now it ends up being “let’s support OpenMP semantics without adding openmp to the intrinsic names”.

The point here is to abstract the properties about which other passes might need to know by using a set of generic intrinsics.

The fact is that I’m still not sure what these properties are, which is exactly why I’m asking for a LangRef patch.
If it has been explained in the thread, then sorry, I missed it, and I’d appreciate a pointer to the right post.

The fact that you can't hoist allocas past one of these intrinsics, is nowhere close to saying that the individual optimization passes need to know anything about OpenMP, parallelism, etc. Regardless of how many LOC are in Intel's prototype, we're obviously aiming for minimal impact on the current upstream infrastructure.



 
It seems to me that this contradicts the claim that the tag-specific semantics do not need to be handled by the optimizer and that the intrinsic can integrate seamlessly into LLVM, which invalidates the approach (of a generic intrinsic) entirely IMO.
 
Maybe you never intended to claim this, but this is a hidden cost in the original RFC, and I suspect this cost has to be carefully evaluated. At this point I’m not sure it is worth discussing anything further without seeing a proper LangRef update.
 
[XT] All we said is to minimize the cost when possible. The intrinsic functions are generic for representing a directive and a region, such as prefetch, unroll, OMP, etc. Each instance of them will have its own semantics, which will be covered in follow-up RFCs.


At this point I don’t see any advantage in having a “generic intrinsic" that has an opaque tag, since all the semantics are in the tag anyway. I’d have to see what is really “generic” in the handling of it...

This is the complete opposite of the point. The semantics relevant to the rest of the optimization pipeline should be in the intrinsics themselves. I've yet to see anything to suggest that we can't do that.

I’ve yet to see anything that suggests we can :)

How will it be expressed that we’re not allowed to hoist an alloca? How are other constraints going to be expressed?




Reid identified this very early in the thread (he is a lot more perspicacious than I am) here: http://lists.llvm.org/pipermail/llvm-dev/2017-January/108914.html

There are multiple levels here:
 a) Semantics relevant to the rest of the pipeline
 b) Semantics relevant to parallelism-specific optimizations (e.g. redundant barrier removal)
 c) Semantics relevant to specific programming model / extension (OpenMP, OpenACC, C++ parallel algorithms, whatever)

We'd like to separate these three levels, and I believe the proposed scheme allows us to do that. Obviously, this assumes that we can indeed have a small set of intrinsics that satisfy the needs of (a). Furthermore, if we're going to use intrinsics, we need to decide whether all of the relevant semantics are reasonable to encode in intrinsics (e.g. it is reasonable to have an intrinsic past which you can't hoist an alloca, or would that need to be an instruction, etc.)

I agree with all that, and for now I’m only worried about a), which is why I asked for a LangRef update: I don’t feel I read a proper explanation there.
 

— 
Mehdi

Tian, Xinmin via llvm-dev

Feb 1, 2017, 12:43:33 PM
to Hal Finkel, Mehdi Amini, llvm...@lists.llvm.org, llvm-dev...@lists.llvm.org, Adve, Vikram Sadanand

+1 to Hal’s point.

Hal Finkel via llvm-dev

Feb 1, 2017, 1:19:31 PM
to Mehdi Amini, Tian, Xinmin, llvm...@lists.llvm.org, llvm-dev...@lists.llvm.org, Adve, Vikram Sadanand



On 01/31/2017 09:30 PM, Mehdi Amini via llvm-dev wrote:



On Jan 31, 2017, at 7:27 PM, Tian, Xinmin <xinmi...@intel.com> wrote:

Remember that,  the prepare-phase is invoked in the FE or right after FE, inlining is not happening, that is why we don't call it "pass".  Chandler made a good point for this case a long time back.  


What I was describing is the inlining in the optimizer pipeline.

FWIW, we only do this for allocas in the entry block, and it would be trivial to have the code stop scanning the entry block upon hitting some specifically-tagged thing(s).





Hoisting the alloca is totally OK. A new alloca is generated during outlining later on for anything marked as "private" (as long as the "private" information is saved in the tag). I thought we discussed this in an earlier email.


Can you describe how (and at which point) you get the private for "var"  added to the tag?

I don't think this is necessarily the scheme that we'd use for an upstream implementation.

Thanks again,
Hal

Sanjoy Das via llvm-dev

Feb 1, 2017, 1:36:33 PM
to Tian, Xinmin, llvm...@lists.llvm.org, llvm-dev...@lists.llvm.org, Yonghong Yan, Adve, Vikram Sadanand
Hi Xinmin,

On Tue, Jan 31, 2017 at 5:26 PM, Tian, Xinmin <xinmi...@intel.com> wrote:
> [XT] Back from Biz trips, trying to catch up with the discussion.
>
>>>>>I agree that outlining function sub-bodies and passing in the function pointers to said outlined bodies to OpenMP helpers lets us correctly implement the semantics we need. However, unless I severely misunderstood the thread, I thought the key idea was to move *away* from that representation and towards a representation that _allows_ optimization?
>
> [XT]: Your understanding is correct. But the IR-level region annotation RFC is not just for OpenMP; OpenMP is one of the usage cases.
>
>>>>>My problem with representing parallel regions with intrinsic-denoted-regions is that we're lying to the optimizer about what the code actually does. Calls, even to intrinsics, can "at worst" do some combination of the following:
>
> - Write to and read from arbitrary memory
> - Have UB (but we're allowed to pretend that they don't)
> - Throw an exception
> - Never return, either by infinite looping or by calling exit(0)
> - Have memory synchronization operations, like fences, atomic loads,
> stores etc.
> - Have side effects like IO, volatile writes
>
> [XT] Based on Google's and Xilinx's suggestions, the IR-level region
> annotation can use tokens and tags with intrinsic functions to model
> regions and memory dependencies (use/def). The above list is handled
> based on language rules. E.g., the OpenMP rule says that, in a parallel
> region, throwing an exception is allowed, but it must be caught within
> the region, i.e. no control-flow edge is allowed to cross the region
> boundary. "exit" is one exception that is allowed, as it terminates
> the program. Our solution is to have the FE and/or one central place
> in the ME deal with language specifics.

I should have been clearer: I am not talking about user code here.
I'm trying to say that intrinsic calls are, first and foremost, calls;
and their behavior should be describable within the framework above.
This precludes using intrinsics to represent magic control flow.

For instance, if we go back to one of the problematic transformations
from earlier in the mail:

void main() {
int a[4];
#pragma omp parallel num_threads(4)
{
int i = omp_get_thread_num();
int val = compute_something_into_val(i);
a[i] = val;
}

return a[0] + a[1];
}

to

void main() {
int a[4];
#pragma omp parallel num_threads(4)
{
int i = omp_get_thread_num();
int val = compute_something_into_val(i);
a[i] = val;
}

return undef;
}

or some variant of this. Say the parallel region is demarcated by a
pair of intrinsics. The said intrinsics could have any subset of the
behaviors described above, but the transform would still be valid; and
so to rule out this transform you would need these demarcating (or
region-creating) intrinsics to have some exotic property not in that
list. That's the bit I'm worried about. IOW I'm worried about
changes like this:

if (!isa<TerminatorInst>(I))
V.push_back(I);
else
return false;

to

if (!isa<TerminatorInst>(I) && !llvm::isTerminatorLikeIntrinsic(I))
V.push_back(I);
else
return false;

If you can define the semantics of the intrinsics you want to add as a
subset of the behaviors specified above then I have no problems.

The other option is to define all or a subset of intrinsic calls as
*not* `CallInst` s but as fundamentally different "things" in the LLVM
Instruction hierarchy; at which point we can give them whatever
semantics we want. What I don't want is a `CallInst` doing something
a `CallInst` ought not do.

>>>>>However, if to preserve *correctness* you have to edit optimization passes and teach them that certain intrinsic calls have behavior
> *outside* the set mentioned above then the instruction really does not have "call semantics". `call @llvm.experimental.region_begin()` is really a fundamentally new instruction masquerading as an intrinsic, and it is probably better to call a spade a spade and represent it as a new instruction.
>
> [XT] Yes and no. Yes w.r.t. region scope annotation; no in that it is
> more than one new instruction, it is more like a sequence of
> instructions. Assume we have a "fork" instruction: OMP fork and Cilk
> fork/spawn semantics differ in terms of stack frame allocation and
> ABI. When we introduce a new instruction, the exact semantics needs to
> be defined; it can't be altered. Thus, we proposed to

Can that problem be solved by (this is both a technical and cultural
shift) allowing "experimental" instructions, whose semantics can be
evolved over time with no guarantee for backwards compatibility?

> start with experimental intrinsics, and that approach is proven to
> work. We can always convert the intrinsics with token/tag to
> instructions when we have enough solid cases / justification for the
> model-agnostic part of the conversion.

I don't have any problem with that in principle, and I'd
wholeheartedly agree if the intrinsics had call semantics.

>>>>>>The setting for the examples I gave was not that "here is a case we need to get right". The setting was that "here is a *symptom* that shows that we've lied to the optimizer". We can go ahead and fix all the symptoms by adding bailouts to the respective passes, but that does not make us immune to passes that we don't know about e.g. downstream passes, passes that will be added later. It also puts us in a weird spot around semantics of call instructions.
>
> [XT] I would say it is a design trade-off between having a central

It *is* a design trade-off, but I think you're discounting how much
we're trading off on one side of the equation.

> place to deal with specifics and making drastic changes from day one.
> Our process is to have a central place to get everything working, then
> turn off the support for some "symptoms" in this central place one by
> one to trigger downstream failures and fix them. I

Again, I have no issue with this process; but I'm speculating that the
nature of the fixes will add more technical debt to LLVM than we're
comfortable with absorbing.

(Assuming by "central place" you mean the frontend and by downstream
fixes you meant fixes to LLVM passes?)

> think our ultimate goals are more or less the same; we are just taking
> different approaches to get there. The central place / prepare phase
> for getting the IR into a "canonical" form will help to address the
> issue you mentioned: "downstream passes, passes that will be added
> later. It also puts us in a weird spot around semantics of call instructions."
>
> Thanks for all questions, discussions and feedback.

Thank you too! :)
-- Sanjoy

Tian, Xinmin via llvm-dev

Feb 1, 2017, 3:35:27 PM
to Sanjoy Das, llvm...@lists.llvm.org, llvm-dev...@lists.llvm.org, Yonghong Yan, Adve, Vikram Sadanand
>>>I'm trying to say that intrinsic calls are, first and foremost, calls; and their behavior should be describable within the framework above.
>>>This precludes using intrinsics to represent magic control flow.

Totally agreed; we need to make it clear. These intrinsic calls are calls with properties; they should preclude "magic" control-flow representation.

Johannes Doerfert via llvm-dev

Feb 2, 2017, 5:47:33 AM
to Sanjoy Das, llvm-dev, llvm-dev...@lists.llvm.org, Adve, Vikram Sadanand
On 01/19, Sanjoy Das via llvm-dev wrote:
> Now you could say that the llvm.experimental.intrinsic_a and
> llvm.experimental.intrinsic_b intrinsics are magic, and even such
> "obvious" optimizations are not allowed to happen across them; but then
> calls to these intrinsics is pretty fundamentally different from
> "normal" calls, and are probably best modeled as new instructions.
> You're going to have to do the same kind of auditing of passes either
> way, and the only extra cost of a new instruction is the extra bitcode
> reading / writing code.
It does, and I think people should therefore take a look at the alternative
proposal sent out last week [0]. The "extra bitcode reading/writing
code" [1] is mechanical and should not be an issue.

Cheers,
Johannes

[0] http://lists.llvm.org/pipermail/llvm-dev/2017-January/109615.html
[1] https://reviews.llvm.org/D29250