I don't know much here; maybe this topic should be bridged with polly-dev
(adding it to CC) to bring more attention to it.
Indeed, Polly uses ScopPass, which creates serious limitations in
compatibility with other passes. To my understanding, SCoPs are used
because the ISL loop analysis tool works on SCoPs. In fact, just for handling
OpenMP directives, SCoPs are not required, unless one needs to make sure an
OpenMP directive is only set on a loop with parallel iterations.
Btw, it would be very interesting to know more about your
project/purpose for this!
- D.
2011/9/8 Jimborean Alexandra <xinfi...@yahoo.com>:
> Hi,
>
> I want to execute the iterations of a loop in parallel, by inserting calls
> either to pthreads or to the gomp library at the LLVM IR level. As a first
> step, I inserted an omp pragma in a C file and compiled it with llvm-gcc to
> check the generated LLVM code. If I understand correctly, to parallelize the
> loop in LLVM IR, I have to separate the loop in a new function, put all
> required parameters in a structure, make the call to the gomp library, and
> restore all values from the structure in the original variables.
> Also, I have to compute the number of iterations allocated to each thread
> and insert in the loop body a new condition, such that each thread executes
> only its slice.
> Is that correct?
> As far as I know, both llvm-gcc and Polly already offer support for OpenMP,
> by inserting calls to the gomp library. Can this code be reused? Is there a
> pass that I can call to do all these code transformations?
>
> I had a look at the CodeGeneration from Polly. Is it possible to use it
> without creating the Scops, by transforming it into a LoopPass?
> Could you indicate how this is handled in llvm-gcc?
> Thank you for your help.
> Alexandra.
>
>
On Thu, Sep 8, 2011 at 13:53, Jimborean Alexandra <xinfi...@yahoo.com> wrote:
> I had a look at the CodeGeneration from Polly. Is it possible to use it
> without creating the Scops, by transforming it into a LoopPass?
Yes. If you don't want to use the autopar of Polly and just rely on
the pragmas inserted by the programmer, you don't need SCoPs.
You can have the parallel code generation part of Polly working as
a LoopPass.
If you want to use the auto-parallelization part of Polly, you have to
restrict the form of the code on which data dependences can be
computed: this is done by using SCoPs (Static Control Parts).
Sebastian
Thanks Dimitry for moving this over. I would also have replied on the
LLVM list, but was away this weekend. Now I am back.
> Indeed, Polly uses ScopPass, which creates serious limitations in
> compatibility with other passes. To my understanding, SCoPs are used
> because the ISL loop analysis tool works on SCoPs.
Scops are used, as this is the problem domain we are working on. Scops
are not just loops, but can also be nested conditions without any loops.
Also describing them as loop passes is not what we want, as the idea of
our polyhedral description is actually to abstract away the specific
structure of the loops.
I would be very interested to hear which use cases you believe are
limited because of the use of ScopPass. If you are, e.g., only interested
in dependence information or alias information, it might be possible to
create a LoopPass that proxies the relevant information, such that it is
available to other passes.
> In fact, just for handling
> OpenMP directives, SCoPs are not required, unless one needs to make sure an
> OpenMP directive is only set on a loop with parallel iterations.
Right. Scops are not needed to handle OpenMP directives, and Polly is
actually not meant to handle OpenMP directives. It is just one of several
ways to introduce OpenMP parallelism. Polly does this if the loops it
generates are parallel, clang would do it if the user added OpenMP
directives, and Alex may introduce parallelism in similar cases.
For me handling 'OpenMP directives' could mean two things:
1) clang understands OpenMP pragmas and lowers them to a set of OpenMP
intrinsics/function calls and structures.
2) An LLVM optimization pass understands a set of high level OpenMP
intrinsics that it can optimize and transform to specific libgomp or
mpc.sf.net library calls.
Neither is available in LLVM yet, but both would be highly useful.
Especially 2) would be nice for Polly and probably also for Alexandra.
> Btw, it would be very interesting to know more about your
> project/purpose for this!
Oh yes. I am also highly interested.
> 2011/9/8 Jimborean Alexandra<xinfi...@yahoo.com>:
>> Hi,
>>
>> I want to execute the iterations of a loop in parallel, by inserting calls
>> either to pthreads or to the gomp library at the LLVM IR level. As a first
>> step, I inserted an omp pragma in a C file and compiled it with llvm-gcc to
>> check the generated LLVM code.
This is a good step. When you do this, make sure you use
'schedule(runtime)'. Otherwise gcc will not only use function calls to
set up libgomp, but will also inline instructions. That makes the code a
lot harder to understand.
>> If I understand correctly, to parallelize the
>> loop in LLVM IR, I have to separate the loop in a new function, put all
>> required parameters in a structure, make the call to the gomp library, and
>> restore all values from the structure in the original variables.
>> Also, I have to compute the number of iterations allocated to each thread
>> and insert in the loop body a new condition, such that each thread executes
>> only its slice.
>> Is that correct?
Partially. It also depends on what kind of OpenMP parallel loop you
want to generate. I suggest you generate a schedule(runtime)
OpenMP parallel loop, as this is the easiest one to generate. Here you
basically need to do this:
Host function:
(The function that contains the loop you want to parallelize)
Here you replace the loop with calls to:
GOMP_parallel_loop_runtime_start(subfunction, subfunction_data,
number_of_threads, lower_bound,
upper_bound, stride)
subfunction()
GOMP_parallel_end()
subfunction is the address of a newly created function, called subfunction.
subfunction_data is the address of the structure that contains the data
needed in the subfunction. The remaining arguments should be obvious.
subfunction is now basically:
long lower_bound, upper_bound;
while (GOMP_loop_runtime_next(&lower_bound, &upper_bound)) {
  for (long i = lower_bound; i < upper_bound; i += stride) {
    // Put your loop body here
  }
}
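To make this a bit more concrete, here is a rough source-level sketch of
what the transformed code corresponds to. The data structure, the function
names and the loop body are made up for illustration, and the GOMP
prototypes are written down from memory, so double-check them against the
libgomp headers of your gcc version:

// libgomp entry points (declarations assumed from GCC's libgomp ABI).
extern "C" {
void GOMP_parallel_loop_runtime_start(void (*fn)(void *), void *data,
                                      unsigned num_threads, long start,
                                      long end, long incr);
bool GOMP_loop_runtime_next(long *istart, long *iend);
void GOMP_loop_end();
void GOMP_parallel_end();
}

// Hypothetical structure that carries everything the loop body needs.
struct LoopData {
  double *A;
  double Factor;
};

// The outlined loop body. Each thread repeatedly asks the runtime for a
// chunk of iterations and executes it.
static void subfunction(void *Arg) {
  LoopData *D = static_cast<LoopData *>(Arg);
  long LB, UB;
  while (GOMP_loop_runtime_next(&LB, &UB)) {
    for (long I = LB; I < UB; ++I)
      D->A[I] *= D->Factor;            // original loop body
  }
  GOMP_loop_end();
}

// The host function: the original loop is replaced by these calls.
void hostFunction(double *A, long N, double Factor) {
  LoopData Data = {A, Factor};
  GOMP_parallel_loop_runtime_start(subfunction, &Data, /*num_threads=*/0,
                                   /*start=*/0, /*end=*/N, /*incr=*/1);
  subfunction(&Data);                  // the calling thread works as well
  GOMP_parallel_end();
}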
>> As far as I know, both llvm-gcc and Polly already offer support for OpenMP,
>> by inserting calls to the gomp library.
Polly supports automatic parallelization of the loops it generates. gcc
(and therefore llvm-gcc) supports both user-added OpenMP pragmas and
automatically added OpenMP calls (provided by the -autopar pass).
> Can this code be reused?
Depends on what you plan to do. The code in Polly is currently specific
to Polly: it only creates the OpenMP calls that we need and basically
builds an OpenMP loop from scratch.
What you most probably want is a pass that takes an existing LLVM-IR
loop and translates it into an OpenMP parallel loop.
From Polly you could take the functions that create the definitions of
the relevant libgomp functions, that set up the functions and that
create the new loop structure. To get a pass that translates an LLVM-IR
loop to an LLVM-IR loop, you would still need to implement the actual
transformation.
>> Is there a pass that I can call to do all these code transformations?
No. LLVM has no included passes for OpenMP transformations. Polly just
does code generation for its own specific use case. gcc has some passes
that lower high-level OpenMP calls to more individual OpenMP calls and
calculations.
>> I had a look at the CodeGeneration from Polly. Is it possible to use it
>> without creating the Scops, by transforming it into a LoopPass?
No. It is not a pass that translates from LLVM-IR to LLVM-IR, but it
creates new LLVM-IR from a polyhedral description. So without Scops and
therefore without a polyhedral description it cannot be used.
>> Could you indicate how this is handled in llvm-gcc?
The only pass that generates OpenMP calls in llvm-gcc is the gcc
-autopar pass. It basically detects if a loop is parallel and introduces
calls to libgomp (I am not sure if it creates higher level intrinsics
that are lowered by a subsequent gcc pass to the actual libgomp calls or
equivalent instructions).
Alex, let me know what you are planning to do. Maybe we can work
together on some OpenMP infrastructure in LLVM. I would love to have
both a generic OpenMP builder that can be used by your
transformations and by Polly, and a pass that lowers high-level
OpenMP intrinsics to higher-performance OpenMP code and low-level
function calls.
Cheers
Tobi
> If you want to use the auto-parallelization part of Polly, you have to
> restrict the form of the code on which data dependences can be
> computed: this is done by using SCoPs (Static Control Parts).
In Polly we just generate OpenMP calls if we need them. We do not
provide a generic OpenMP infrastructure (something that would be nice to
have in LLVM and that I would love to contribute to). However, such a
generic infrastructure has a broader scope than the single specific OpenMP
pattern we implemented in Polly.
Cheers
Tobi
I mean you could rewrite that code to work on a per-loop basis, like
Graphite does: flag the parallel loops in the code generation of Polly
and then, independently of Polly, iterate as a LoopPass over all the
loops and code-generate the parallel ones.
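Just to sketch one possible way to do the flagging (the metadata kind
name, the pass and the header paths are my assumptions against a recent
LLVM tree, not existing Polly code):

#include "llvm/Analysis/LoopPass.h"
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Metadata.h"
using namespace llvm;

// In Polly's code generation: attach a (made-up) metadata kind to the
// terminator of the generated loop's latch to flag it as parallel.
static void markLoopParallel(BasicBlock *Latch) {
  LLVMContext &Ctx = Latch->getContext();
  Latch->getTerminator()->setMetadata("polly.loop.parallel",
                                      MDNode::get(Ctx, {}));
}

namespace {
// Later, independently of Polly, a LoopPass walks all loops and
// code-generates the ones that carry the flag.
struct ParallelLoopCodeGen : public LoopPass {
  static char ID;
  ParallelLoopCodeGen() : LoopPass(ID) {}

  bool runOnLoop(Loop *L, LPPassManager &) override {
    BasicBlock *Latch = L->getLoopLatch();
    if (!Latch ||
        !Latch->getTerminator()->getMetadata("polly.loop.parallel"))
      return false;                      // not flagged as parallel
    // ... outline the body and emit the OpenMP runtime calls here ...
    return true;
  }
};
}
char ParallelLoopCodeGen::ID = 0;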
Sebastian
Yes, that's true and probably a good idea. It is more work than generating
it directly in Polly, but it facilitates code reuse. We can probably
reuse quite a bit of the code in the existing OpenMP code generation.
We (or actually someone) may work on three things:
1) The high-level OpenMP intrinsics that can be lowered to actual
library calls.
This can be reused by a clang OpenMP extension as well as by Polly
directly.
2) An OpenMP builder that allows building OpenMP loops from scratch.
3) The actual parallelizer that takes an LLVM-IR loop and parallelizes
it with OpenMP intrinsics.
Cheers
Tobi
With this supported, I think we can get much closer to a natural
unification of OpenMP and the multidimensional compute grids of GPU
programming models.
2011/9/12 Tobias Grosser <tob...@grosser.es>:
> With this supported, I think we can get much closer to a natural
> unification of OpenMP and the multidimensional compute grids of GPU
> programming models.
I would be very interested to see a use case where this improves
performance. If such a use case appears frequently, it would be nice
to see work in this direction. I would be glad to contribute to relevant
discussions.
Cheers
Tobi
> What I am looking for is a pass that can automatically build the
> structure of parameters required for the thread creation and execution,
> and modify the loop code accordingly. Namely, to load the parameters
> necessary for the thread into a structure and replace the use of these
> values inside the loop with the corresponding field in the structure.
> Similarly, once the threads finish execution, the values of the
> variables have to be updated with the correct values from the structure.
> Also, the loop code has to be modified to execute only the slice allocated
> to the thread. I have not yet gone into further details on the private and
> shared variables between the threads and other OpenMP features.
I would also like to have such a pass. It is more complicated than the
code in Polly right now, but most probably worth the effort.
> As far as I know this code is not yet available in LLVM. Do you think
> the parallel code generation of Polly can be adapted to perform this?
Yes and No. I think we should be able to extract the generic functions
to build the new OpenMP loop structure, the parameter structures and the
function calls. With those we could create an OpenMPBuilder, just as we
today have an IRBuilder for LLVM-IR.
Based on this, building a pass that translates a regular LLVM-IR loop
into an OpenMP-parallelized one should not be too difficult (especially
as you perform speculative parallelization and do not need any
dependence analysis).
> My
> work is not focused on adding OpenMP support in LLVM, but I was looking
> for codes that already enable parallelism in the LLVM IR. Surely, if
> necessary I will work on building the code and inserting calls to the
> gomp library, to parallelize regular loops. In that case, we can discuss
> the best approach to stay general and to meet all our requirements.
Yes, I assume there is no finished solution available. Though I am
interested in work towards one.
Cheers
Tobi
It was proposed to implement an OpenMP framework.
Is there any progress in this area? Was the question raised at
Euro-LLVM? Is anybody working on an implementation?
2011/9/12 Tobias Grosser <tob...@grosser.es>:
I am not aware of any work that was done or is planned to be done in public.
At Euro-LLVM I had a short discussion with Alexandra about targeting the
OpenMP runtime. We would both be interested in code that can translate
normal LLVM-IR loops into OpenMP parallel loops. I am not sure
if Alexandra has some code that she can share. I myself have looked into
the problem. I believe the loop extractor is a good start for implementing
such functionality. Yet, I do not have any written code ready.
Do you happen to be interested in working on this? ;-) I can definitely
give you some hints.
Cheers
Tobi
So OpenMP has various constructs such as parallel, barrier, single,
for, etc. And there are at least two libraries to generate OpenMP code
for: libgomp and mpc. We want to be independent of a specific library.
We should create an interface with methods that correspond to manually
inserting OpenMP pragmas in C or Fortran code. These sound like
"this code block should be run with #parallel" or "this for loop
should be run with #parallel for". Constructs like barrier and
atomic can simply be represented by methods 'barrier' and 'atomic'. But
the parallel and for constructs allow many modifiers (schedule, private,
shared, etc.); I cannot devise a good solution for these cases.
I guess this interface is close to IRBuilder; you suggested it would be an
OpenMPBuilder. This means IR would be created from scratch. Another way is
to provide methods to modify existing IR, where the user says "I want this
loop to be parallel, do it for me". What do you think about this?
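To illustrate what I mean, the interface could look roughly like this.
All names are made up, and the clause struct is just one idea, not a
worked-out design:

#include "llvm/ADT/SmallVector.h"

namespace llvm {
class BasicBlock;
class Instruction;
class Loop;
class Value;
}

// Hypothetical builder, analogous to IRBuilder. Nothing of this exists
// in LLVM; it only illustrates the shape of the interface.
class OpenMPBuilder {
public:
  // Modifiers of 'parallel'/'for' collected in one struct, so we do not
  // need a separate method per clause combination.
  struct LoopClauses {
    enum Schedule { Static, Dynamic, Guided, Runtime };
    Schedule Sched = Runtime;
    unsigned NumThreads = 0;                      // 0 = runtime decides
    llvm::SmallVector<llvm::Value *, 4> Private;  // private(...) values
    llvm::SmallVector<llvm::Value *, 4> Shared;   // shared(...) values
  };

  // "#pragma omp parallel" around an existing single-entry/single-exit region.
  void createParallelRegion(llvm::BasicBlock *Entry, llvm::BasicBlock *Exit,
                            const LoopClauses &Clauses);

  // "#pragma omp for" applied to an existing loop.
  void createParallelFor(llvm::Loop *L, const LoopClauses &Clauses);

  // Simple constructs map to simple methods.
  void createBarrier(llvm::Instruction *InsertBefore);
  void createAtomicUpdate(llvm::Instruction *Update);
};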
The implementation would be calls to OpenMP library functions in the way
defined by the library ABI. For example, in mpc:
#pragma omp for schedule(runtime)
for ( i=lb ; i COND b ; i+=incr ) { A ; }

->

if ( __mpcomp_runtime_loop_begin( ... ) ) {
  do {
    for ( ... ) {
      A ;
    }
  } while ( __mpcomp_runtime_loop_next( ... ) ) ;
}
__mpcomp_runtime_loop_end() ;
I think this work is not simple for me, so any comments are welcome.
2012/1/7 Tobias Grosser <tob...@grosser.es>:
Great. ;-)
> So OpenMP has various constructs such as parallel, barrier, single,
> for, etc. And there are at least two libraries to generate OpenMP code
> for: libgomp and mpc. We want to be independent of a specific library.
True.
> We should create an interface with methods that correspond to manually
> inserting OpenMP pragmas in C or Fortran code. These sound like
> "this code block should be run with #parallel" or "this for loop
> should be run with #parallel for". Constructs like barrier and
> atomic can simply be represented by methods 'barrier' and 'atomic'. But
> the parallel and for constructs allow many modifiers (schedule, private,
> shared, etc.); I cannot devise a good solution for these cases.
I think we should distinguish here how to represent the OpenMP
constructs in LLVM-IR and how to create such LLVM-IR.
> I guess this interface is close to IRBuilder; you suggested it would be an
> OpenMPBuilder. This means IR would be created from scratch. Another way is
> to provide methods to modify existing IR, where the user says "I want this
> loop to be parallel, do it for me". What do you think about this?
I think both are valid methods to create LLVM-IR that uses OpenMP
extensions.
Alexandra and other people interested in automatic parallelization are
probably more interested in transforming a plain LLVM-IR loop into
LLVM-IR with OpenMP extensions. To add OpenMP annotation support to
clang, the clang code generation would probably be better off by
directly generating LLVM-IR with OpenMP constructs.
> The implementation would be calls to OpenMP library functions in the way
> defined by the library ABI. For example, in mpc:
>
> #pragma omp for schedule(runtime)
> for ( i=lb ; i COND b ; i+=incr ) { A ; }
> ->
> if ( __mpcomp_runtime_loop_begin( ... ) ) {
>   do {
>     for ( ... ) {
>       A ;
>     }
>   } while ( __mpcomp_runtime_loop_next( ... ) ) ;
> }
> __mpcomp_runtime_loop_end() ;
>
>
> I think this work is not simple for me, so any comments are welcomed.
I have some ideas, but as this is a larger project, we should probably
get a wider audience to review such a proposal. (You may also want to
copy the mpc authors, who have some experience in how to add support for
OpenMP to a compiler IR.)
Some ideas from my side:
How to represent OpenMP in LLVM?
I do not believe OpenMP needs direct LLVM-IR extensions, but should
rather be represented through a set of intrinsics. The intrinsics may be
similar to function calls generated with OpenMP SCHEDULE (runtime),
but should be runtime library independent. They should also provide
enough information to optimize them within LLVM, in case the necessary
information is already available at compile time.
How to translate LLVM OpenMP intrinsics to actual code
Library specific passes can be used to lower the intrinsics to the
specific libgomp or mpc library calls. If already possible at compile
time, fast LLVM-IR sequences can be inlined instead of calling actual
library functions.
#pragma omp for schedule(runtime)
for ( i=lb ; i COND b ; i+=incr ) { A ; }
-> Represented as LLVM-IR:

def original_function() {
  llvm.openmp.loop(lower_bound, upper_bound, body, context, schedule)
}

def body(i, context) {
  A(i);
}
-> Lowering to MPC (for schedule = runtime):

if ( __mpcomp_runtime_loop_begin( ... ) ) {
  do {
    for ( ... ) {
      A ;
    }
  } while ( __mpcomp_runtime_loop_next( ... ) ) ;
}
__mpcomp_runtime_loop_end() ;

-> Lowering to MPC (for schedule = static):

Some fast LLVM-IR implementations of __mpcomp_static_loop_begin() may be
called and inlined.
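As a rough illustration, a library-specific lowering pass could be
structured like this. llvm.openmp.loop is only the hypothetical intrinsic
from the example above, the header paths and helpers assume a reasonably
recent LLVM tree, and the actual construction of the libgomp calls is
elided:

#include "llvm/ADT/STLExtras.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/Module.h"
#include "llvm/Pass.h"
using namespace llvm;

namespace {
struct LowerOpenMPToGOMP : public ModulePass {
  static char ID;
  LowerOpenMPToGOMP() : ModulePass(ID) {}

  bool runOnModule(Module &M) override {
    // The hypothetical intrinsic; nothing to do if it is not used.
    Function *OMPLoop = M.getFunction("llvm.openmp.loop");
    if (!OMPLoop)
      return false;

    bool Changed = false;
    for (User *U : make_early_inc_range(OMPLoop->users())) {
      auto *Call = dyn_cast<CallInst>(U);
      if (!Call)
        continue;
      IRBuilder<> Builder(Call);
      // ... reference the body function, pack the context, and use
      //     Builder to emit GOMP_parallel_loop_runtime_start, the call
      //     to the body, and GOMP_parallel_end ...
      Call->eraseFromParent();
      Changed = true;
    }
    return Changed;
  }
};
}
char LowerOpenMPToGOMP::ID = 0;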
How to generate the OpenMP intrinsics?
There are two ways:
* People implement the OpenMP frontend stuff in clang and clang can
directly emit the relevant intrinsics when generating LLVM-IR.
* An automatic parallelizer generates code by transforming existing
LLVM-IR. Here projects like Polly or Alexandra's tools come in.
Vlad:
If you are interested in working on such a thing, I propose you start by
writing up a small proposal. This may contain the general architecture
of all this, a proposal for how to represent OpenMP at the LLVM-IR level,
and everything else you think is important. You can then put this proposal
up for discussion on the LLVM mailing list. I and hopefully several other
people will be glad to give our opinions.
Cheers
Tobi
FWIW, there is a BSD-licensed OpenMP 3 source-to-source translator
available as part of the ROSE project: http://www.rosecompiler.org/
It might be worth looking at.
>
> Some ideas from my side:
>
> How to represent OpenMP in LLVM?
>
> I do not believe OpenMP needs direct LLVM-IR extensions, but should
> rather be represented through a set of intrinsics. The intrinsics may be
> similar to function calls generated with OpenMP SCHEDULE (runtime),
> but should be runtime library independent. They should also provide
> enough information to optimize them within LLVM, in case the necessary
> information is already available at compile time.
>
> How to translate LLVM OpenMP intrinsics to actual code
>
> Library specific passes can be used to lower the intrinsics to the
> specific libgomp or mpc library calls. If already possible at compile
> time, fast LLVM-IR sequences can be inlined instead of calling actual
> library functions.
I agree, the ability to have target-specific lowering of the
parallelization constructs will be very important for some platforms. We
may need some new attributes in the IR so that we can still safely do
things like LICM.
-Hal
--
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
I'd like to put some effort into this too -- perhaps I can write a
backend for libgomp while someone else works on a libmpc one.
As far as the architecture is concerned, I concur with what has
already been discussed: mapping OpenMP constructs to LLVM intrinsics.
I think it would make sense to leave out intrinsics for things like
"parallel for", "parallel sections", which are essentially syntactic
sugar.
One way to denote a structured block could be by nesting it within two
intrinsics like
llvm.openmp.parallel_begin (llvm.openmp.clause_if(%1) <and so forth>)
body
llvm.openmp.parallel_end()
or we could pass in an end label to parallel_begin (we can then call
it parallel instead of parallel_begin). I'm not sure which one is the
better idea. There will have to be restrictions on the body, of
course. We can't allow control to jump outside body without
encountering the barrier, for instance.
The more difficult problem would be, I think, to lift the structured
block and create a function out of it that takes a closure.
I also think there is potential for optimizing things like the
reduction clause and the atomic construct by lowering them directly
into CAS or other machine instructions. But I'll stop here before I
get ahead of myself. :)
Thanks!
--
Sanjoy Das
http://playingwithpointers.com
> Hi all,
>
> I'd like to put some effort into this too -- perhaps I can write a
> backend for libgomp while someone else works on a libmpc one.
>
> As far as the architecture is concerned, I concur with what has
> already been discussed: mapping OpenMP constructs to LLVM intrinsics.
> I think it would make sense to leave out intrinsics for things like
> "parallel for", "parallel sections", which are essentially syntactic
> sugar.
I agree.
>
> One way to denote a structured block could be by nesting it within two
> intrinsics like
I am not sure how well this will work in practice, but we should come
up with a plan. As far as I can tell, we have two options: intrinsics
and metadata; we may want to use some combination of the two. One key
issue is to decide how optimization passes will interact with the
parallelization constructs. I think that it is particularly important
that using OpenMP does not preclude loop unrolling and LICM, for
example.
As you mention, we'll also want to provide the ability to do
target-specific lowering of certain constructs (like atomics). We also
want to make sure that we can recognize and optimize expressions that
are constant for a given thread within a parallel region (like a
get-thread-id operation).
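As a made-up example: in code like the following, the omp_get_thread_num()
call is invariant in the inner loop, and LICM-style hoisting should remain
legal for it despite the surrounding parallel construct:

#include <omp.h>

void scale_rows(double *A, const double *PerThreadFactor, int N) {
  #pragma omp parallel for
  for (int I = 0; I < N; ++I)
    for (int J = 0; J < N; ++J)
      // omp_get_thread_num() is constant for the executing thread over
      // the whole parallel region, so it can be hoisted out of the J loop.
      A[I * N + J] *= PerThreadFactor[omp_get_thread_num()];
}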
Thoughts?
-Hal
>
> llvm.openmp.parallel_begin (llvm.openmp.clause_if(%1) <and so forth>)
> body
> llvm.openmp.parallel_end()
>
> or we could pass in an end label to parallel_begin (we can then call
> it parallel instead of parallel_begin). I'm not sure which one is the
> better idea. There will have to be restrictions on the body, of
> course. We can't allow control to jump outside body without
> encountering the barrier, for instance.
>
> The more difficult problem would be, I think, to lift the structured
> block and create a function out of it that takes a closure.
>
> I also think there is potential for optimizing things like the
> reduction clause and the atomic construct by lowering them directly
> into CAS or other machine instructions. But I'll stop here before I
> get ahead of myself. :)
>
> Thanks!