Re: question about graphite auto-parallelization


Tobias Grosser

Jun 28, 2011, 7:57:55 AM
to Anthony Falzone, GCC GRAPHITE
On 06/24/2011 02:50 PM, Anthony Falzone wrote:
> hi,
> I have tried asking this on three different forums and got no answer. I
> saw you might be able to answer this.
The best place to ask is the gcc-graphite mailing list.
> On the GCC page it says graphite
> can do auto-parallelization on outer loops now. I have two codes that
> should be able to be auto-parallelized. Definitely one of them for sure.
> But when I try it nothing happens.
> I have tried MinGW, MinGW-w64, and TDM-GCC. My codes along with the
> commands I’m using are located on my SourceForge page.
> https://sourceforge.net/projects/propdesign/
> <https://sourceforge.net/projects/propdesign/>
> The codes may provide a good test case for development of
> auto-parallelization using GCC. My question is, does
> auto-parallelization only work on Linux right now?
> It doesn’t seem to
> work on any of the Windows ports. Has anyone ever tested to see if
> auto-parallelization works on a windows port of GCC? My experience has
> been that it simply does nothing.
I do not know of anything systematic that would block the use of
autoparallelization on Windows or any other platform.
Auto-parallelization should be platform independent. However, I must
admit that we mainly test on Linux 64-bit.

> The code remains single threaded
> whether or not you try to auto-parallelize it. I don’t know if there is
> a way to get some feedback during compilation regarding what graphite is
> doing.

Yes. You can use -fdump-tree-graphite-all to get some debugging output
from Graphite.

If you want anybody to look into this, the best thing would be to open a
bug report with a reduced test case and the exact command lines you use.
Please CC me on the bug report.

Cheers
Tobi

Anthony Falzone

Jul 8, 2011, 2:46:20 AM
to gcc-gr...@googlegroups.com, Anthony Falzone
Thanks Tobias,

My computer was in the shop, just got it back.  I'll give what you said a try and post back here.

Anthony Falzone

Jul 9, 2011, 12:52:26 AM
to GCC GRAPHITE
I added the command you mentioned to get debugging information. Not
sure what it means though. Here is the compilation command I used and
the output from graphite. The code remains single threaded.

For the SP version the commands I'm using are:

x86_64-w64-mingw32-gfortran SP_PROP_DESIGN.f -o SP_PROP_DESIGN.exe -O2 -mtune=generic -funroll-loops -ftree-parallelize-loops=4 -floop-parallelize-all -fdump-tree-graphite-all

The graphite debugging info is:


;; Function MAIN__ (MAIN__)


Pass statistics:
----------------


Global statistics (BBS:454, LOOPS:45, CONDITIONS:222, STMTS:5663)

Global profiling statistics (BBS:0, LOOPS:0, CONDITIONS:0, STMTS:0)

Pass statistics:
----------------


;; Function main (main)


Pass statistics:
----------------


Global statistics (BBS:3, LOOPS:0, CONDITIONS:0, STMTS:4)

Global profiling statistics (BBS:0, LOOPS:0, CONDITIONS:0, STMTS:0)

Pass statistics:
----------------

For the MP version, the commands are:

x86_64-w64-mingw32-gfortran MP_PROP_DESIGN.f -o MP_PROP_DESIGN.exe -O2 -mtune=generic -funroll-loops -ftree-parallelize-loops=4 -floop-parallelize-all -fdump-tree-graphite-all

The graphite debugging info is:


;; Function MAIN__ (MAIN__)


Pass statistics:
----------------


Global statistics (BBS:211, LOOPS:21, CONDITIONS:100, STMTS:1913)

Global profiling statistics (BBS:0, LOOPS:0, CONDITIONS:0, STMTS:0)

Pass statistics:
----------------


;; Function main (main)


Pass statistics:
----------------


Global statistics (BBS:3, LOOPS:0, CONDITIONS:0, STMTS:4)

Global profiling statistics (BBS:0, LOOPS:0, CONDITIONS:0, STMTS:0)

Pass statistics:
----------------

The MP version of PROP_DESIGN essentially takes the SP version and runs
it multiple times. So if you imagine that one day gfortran could
automatically parallelize the code across multiple CPU and GPU cores,
the MP version could run just as fast as the SP version, minus the
overhead. I don't know how many parallelization opportunities might lie
in the SP version. It runs so fast on one thread anyway that it hardly
matters, but any speedup is a good thing regardless. I'm mainly
interested in getting the MP version to use all available threads,
which in my case is four.

Thanks for your time,

Anthony

Anthony Falzone

Jul 11, 2011, 9:49:18 AM
to gcc-gr...@googlegroups.com
I got an e-mail from someone who made some suggestions.  Basically, I found out that Graphite is saying all my loops are inner loops and it can't do anything with them.  In the 4.5 what's new file it says they expanded Graphite to cover outer loops.  Am I missing something?

razya ladelsky

Jul 11, 2011, 11:35:04 AM
to gcc-gr...@googlegroups.com


On Mon, Jul 11, 2011 at 4:49 PM, Anthony Falzone <prop_...@live.com> wrote:
I got an e-mail from someone who made some suggestions.  Basically, I found out that Graphite is saying all my loops are inner loops and it can't do anything with them.  In the 4.5 what's new file it says they expanded Graphite to cover outer loops.  Am I missing something?

Hi
I think that you meant that the automatic parallelization pass was enhanced to support parallelization of outer loops.
Does your code have outer loops?
Can you send the dump file created by autopar pass? (just enable -fdump-tree-parloops-details)
Thanks,
Razya

Anthony Falzone

Jul 11, 2011, 12:23:33 PM
to gcc-gr...@googlegroups.com
Razya, this is a repost of an e-mail, since my others didn't seem to make it to you:

It won't let me upload the batch files, so here are their contents:

x86_64-w64-mingw32-gfortran MP_PROP_DESIGN.f -o MP_PROP_DESIGN.exe -O2 -mtune=generic -funroll-loops -ftree-parallelize-loops=4 -floop-parallelize-all -fdump-tree-graphite-all -fdump-tree-parloops-details -ffast-math

x86_64-w64-mingw32-gfortran SP_PROP_DESIGN.f -o SP_PROP_DESIGN.exe -O2 -mtune=generic -funroll-loops -ftree-parallelize-loops=4 -floop-parallelize-all -fdump-tree-graphite-all -fdump-tree-parloops-details -ffast-math

Hope you get this,

I sent you a few e-mails yesterday. I tried your suggestion about fast math and got a 35% speedup. The results seem pretty much the same, so I'm considering using it. Your debugging suggestion helped me see what is going on. Here is the info you requested.  There are two c.bat files, one for each version of the code; they are in different folders on my computer.

Thanks a lot,

Anthony
MP_PROP_DESIGN.f
MP_PROP_DESIGN.f.104t.graphite
MP_PROP_DESIGN.f.116t.parloops
SP_PROP_DESIGN.f
SP_PROP_DESIGN.f.104t.graphite
SP_PROP_DESIGN.f.116t.parloops

Anthony Falzone

Jul 13, 2011, 4:32:25 PM
to gcc-gr...@googlegroups.com
I got this e-mail from Razya yesterday:

Indeed, graphite finds the loops to be non-parallelizable. Not sure what the reason is.
Could you try to enable autopar without graphite, so we could have an indication of what the lambda framework does in this case?
Just drop the -floop-parallelize-all and send me the dumps again.
thanks,
Razya

I tried to respond but nothing gets through via e-mail.  Can we stick to the forum in the future?  In any event, here is what you wanted.  The compile commands were:

MP_PROP_DESIGN:

x86_64-w64-mingw32-gfortran MP_PROP_DESIGN.f -o MP_PROP_DESIGN.exe -O2 -mtune=generic -funroll-loops -ftree-parallelize-loops=4 -floop-parallelize-all -fdump-tree-parloops-details -ffast-math

SP_PROP_DESIGN:

x86_64-w64-mingw32-gfortran SP_PROP_DESIGN.f -o SP_PROP_DESIGN.exe -O2 -mtune=generic -funroll-loops -ftree-parallelize-loops=4 -fdump-tree-graphite-all -fdump-tree-parloops-details -ffast-math

Attached are the debug files.

Anthony Falzone

Jul 13, 2011, 4:35:31 PM
to gcc-gr...@googlegroups.com
Whoops,

Here are the debug files; they didn't get attached to the previous message.  Also, you can download my code and experiment for yourself if you like.  There is no reason it shouldn't run in parallel.


Anthony
MP_PROP_DESIGN.f.104t.graphite
MP_PROP_DESIGN.f.116t.parloops
SP_PROP_DESIGN.f.104t.graphite
SP_PROP_DESIGN.f.116t.parloops

Razya Ladelsky

Jul 14, 2011, 4:00:55 AM
to Anthony Falzone, gcc-gr...@googlegroups.com
Hi Anthony,
Could you please run:


x86_64-w64-mingw32-gfortran MP_PROP_DESIGN.f -o MP_PROP_DESIGN.exe -O2
-mtune=generic -funroll-loops -ftree-parallelize-loops=4
-fdump-tree-parloops-details -ffast-math
(Note that I removed the -floop-parallelize-all)
and then resend the parloops dump file to me.
Thanks,
Razya



Anthony Falzone

Jul 14, 2011, 4:32:39 AM
to gcc-gr...@googlegroups.com, Anthony Falzone
here ya go
MP_PROP_DESIGN.f.116t.parloops

Razya Ladelsky

Jul 14, 2011, 5:23:45 AM
to gcc-gr...@googlegroups.com, gcc-gr...@googlegroups.com, Anthony Falzone
Hi Anthony,
I see that indeed none of the loops is parallelized.
Do you know which loops were supposed to be parallelized?
Razya


Anthony Falzone

Jul 14, 2011, 6:03:19 AM
to gcc-gr...@googlegroups.com, Anthony Falzone
Yes, that's what it looked like to me as well.  I don't know which loops are supposed to be.  However, there should definitely be some.  The MP version of PROP_DESIGN basically runs the SP version multiple times.  It varies the speed of the aircraft.  So it should definitely be able to run multiple speeds at once, since they are completely independent runs.  If you imagine you have several hundred cores available, which looks like it's not too far away based on the Intel MIC architecture, the MP version would run just as fast as the SP version.  Besides that loop there are probably even more chances for parallelism.  But I'm not an expert in this area.  I don't know OpenMP, so the only way I have to make the code parallel is via automatic parallelization.  People have often said to use OpenMP, but that is not an option for me.  Honestly, the code runs pretty darn fast already.  I'd just like to take full advantage of the CPU and compiler using basic Fortran 77.

Razya Ladelsky

Jul 14, 2011, 8:14:37 AM
to gcc-gr...@googlegroups.com, gcc-gr...@googlegroups.com, Anthony Falzone
gcc-gr...@googlegroups.com wrote on 14/07/2011 13:03:19:


> Yes, that's what it looked like to me as well. I don't know which
> loops are supposed to be. However, there should definitely be some.
> The MP version of PROP_DESIGN basically runs the SP version multiple
> times. It varies the speed of the aircraft. So it should
> definitely be able to run multiple speeds at once, since they are
> completely independent runs.

But it depends on whether the code is written in a way that gives the
compiler a chance to optimize.
If you could point out the loop in the code that you expect to be
parallel, I could try to see why the compiler fails to parallelize it.

Razya

Anthony Falzone

Jul 14, 2011, 8:44:11 AM
to gcc-gr...@googlegroups.com, Anthony Falzone
hi,

i thought you might ask that.

For the MP code:

 - line 240 starts the loops for input file one and two

 - line 500 starts the loops for each of the four pitch cases

 - line 589 starts the loop for each aircraft velocity step

For the SP code:

 - line 343 starts the loops for input files one and two

 - there are a lot of other loops that should be able to be run in parallel, notably the input and output processing.

 - i'm not sure how much of the actual calculation could be made parallel, possibly none.

both the MP and SP codes iterate on two variables.  that is at the core of the calculations, so I don't know if that is part of the issue graphite is having.  this is not a closed form solution, so there is always some unknown amount of looping that is going to transpire.  i have limits set so that it will eventually give up and error out gracefully.  i've never had to use those though; the code always seems to converge.

in the 4.5.3 update for gcc it says outer loops are now supported.  are they only partially supported?  moreover, is there still no code generation part as indicated on the gcc graphite page?

i provided the source and input files in case you want to compile and run the code to see how it works better.  also there is a manual, examples, etc... on my sourceforge page https://sourceforge.net/projects/propdesign/.

thanks for your time.  i'm glad to know it's installed and running.  i didn't even know that prior to you and tobias helping me.  i had been trying to get answers for several months, so i'm grateful for the help.  i'm thinking graphite just needs more work to be able to analyze anyone's fortran code.

feel free to use my code for testing going forward.  it should be capable of really demonstrating the power of automatic parallelization once intel mic is out.  if i could afford the absoft compiler, it would be interesting to see how it handles my code.
MP_PROP_DESIGN.f
MP_INPUT_ONE.TXT
MP_INPUT_TWO.TXT
SP_PROP_DESIGN.f
SP_INPUT_ONE.TXT
SP_INPUT_TWO.TXT

Anthony Falzone

Jul 25, 2011, 5:08:48 PM
to gcc-gr...@googlegroups.com, Anthony Falzone
Razya,

Did you have any luck figuring out why Graphite won't auto-parallelize my code?

Razya Ladelsky

Jul 26, 2011, 5:16:14 AM
to gcc-gr...@googlegroups.com, gcc-gr...@googlegroups.com, Anthony Falzone


Hi Anthony,

I tried compiling with a GCC 4.7 version on x86 (that was the most
available one for me), and I saw these loops getting parallelized for
MP_PROP_DESIGN.f:

loop at MP_PROP_DESIGN.f:1147:
loop at MP_PROP_DESIGN.f:1116:
loop at MP_PROP_DESIGN.f:901:
loop at MP_PROP_DESIGN.f:837:
loop at MP_PROP_DESIGN.f:775:

I used the "old" autopar, which does not use the Graphite-based analysis.
I can give it a try with Graphite as well, but let's make sure you get
the same with the old autopar first.

all I do is:

gcc -O3 MP_PROP_DESIGN.f -c -ftree-parallelize-loops=4
-fdump-tree-parloops-details -fno-tree-vectorize

and then grep for the string "SUCCESS: may be parallelized" in the dump
file that was created.
You should see this string appear 5 times.

Please retry this, and let me know what you get from grepping the
dump file.

Thanks,
Razya
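For anyone following along, the counting step Razya describes can be sketched on a synthetic dump file. The dump contents below are invented for illustration (the real file is produced by compiling with -fdump-tree-parloops-details, which needs the sources and a gfortran toolchain); only the "SUCCESS: may be parallelized" string is the real marker to search for.

```shell
# Create a stand-in for a parloops dump file (contents invented).
cat > MP_PROP_DESIGN.f.116t.parloops <<'EOF'
loop at MP_PROP_DESIGN.f:775: SUCCESS: may be parallelized
loop at MP_PROP_DESIGN.f:837: SUCCESS: may be parallelized
loop at MP_PROP_DESIGN.f:901: FAILED: data dependence
EOF

# Count how many loops autopar reports as parallelizable.
grep -c "SUCCESS: may be parallelized" MP_PROP_DESIGN.f.116t.parloops
```

On Windows without grep installed, `findstr /c:"SUCCESS: may be parallelized" MP_PROP_DESIGN.f.116t.parloops` performs the same literal search.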


Anthony Falzone

Jul 26, 2011, 9:23:22 AM
to gcc-gr...@googlegroups.com, Anthony Falzone
Hi Razya,

I tried compiling the code with what I have installed, using the commands you did.  However, I don't think it's supported.  The compilation happens too fast with no output, and when I try to run the code it gives errors saying unsupported 64-bit code.  I'll try installing some of the other compiler versions and using those commands.  I only have MinGW-w64 installed because that made the fastest executables in my testing.  I'll re-install MinGW and see what happens.

The loops you are getting to parallelize are:

1147 calculates the induced angle of attack
1116 calculates circulation
901 calculates many of the variables needed such as cl, cd
837 uses the biot savart law to calculate wphi
775 uses the biot savart law to calculate wz

These are all interior loops, which I find surprising.  It seems blind to the low-hanging fruit, so to speak.  I would have thought that it would be easy to parallelize all the input and output loops, as well as the loops for case and pitch.  The loops for case would take two cores; the loops for pitch another four.  I wasn't sure if the inner loops could be parallelized, so this is indicating that a lot of parallelization could be possible eventually.  The SP code is similar to the MP code, except it has a lot more output loops and does not have the four loops for pitch.  Both contain the two case loops and all the same interior calculation loops.

Anthony

Anthony Falzone

Jul 26, 2011, 9:59:11 AM
to gcc-gr...@googlegroups.com, Anthony Falzone
Hi Razya,

I may not understand what to do.  I'm getting the same issue regardless of what I try.  I have the latest versions of MinGW and MinGW-w64 installed via Cygwin.  My computer has Windows 7 64-bit Home Premium installed.  It's a laptop with an Intel Core i3 350M (Arrandale) processor.  It is dual core with 9.08 Gflops per core available.  Four threads can be run at one time due to hyperthreading.

I tried four different compiler options.  These are the two closest to what you said to try:

x86_64-w64-mingw32-gcc MP_PROP_DESIGN.f -c -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fno-tree-vectorize

i686-pc-mingw32-gcc MP_PROP_DESIGN.f -c -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fno-tree-vectorize

I only get a file with a .o extension and it seems to be in binary format.  I don't get any dump files that are viewable.  I also tried:

x86_64-w64-mingw32-gfortran MP_PROP_DESIGN.f -o MP_PROP_DESIGN.exe -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fno-tree-vectorize

i686-pc-mingw32-gfortran MP_PROP_DESIGN.f -o MP_PROP_DESIGN.exe -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fno-tree-vectorize

But then I only got an exe file, still no dump files.  So I think I am missing something.

Razya Ladelsky

Jul 26, 2011, 10:08:39 AM
to gcc-gr...@googlegroups.com, gcc-gr...@googlegroups.com, Anthony Falzone
Hi Anthony,
Just add -O3 to the options, and you should get the dump file.
Let me know,
Razya


Anthony Falzone

Jul 26, 2011, 10:30:21 AM
to gcc-gr...@googlegroups.com, Anthony Falzone
Ok,

that worked, here is the debug file.  I don't think I'm getting the same output you did, though.  When I search the file for "success" it doesn't show the same thing you were showing.  When I run the exe file, the task manager still shows 1 thread, and my CPU utilization remains at 25%.

i686-pc-mingw32-gfortran MP_PROP_DESIGN.f -o MP_PROP_DESIGN.exe -O3 -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fno-tree-vectorize

Attached is the debug file I got.

MP_PROP_DESIGN.f.116t.parloops

Anthony Falzone

Aug 1, 2011, 9:14:05 AM
to gcc-gr...@googlegroups.com, Anthony Falzone
Hi Razya,

Did the file I attached in the last e-mail make any sense to you?

razya ladelsky

Aug 1, 2011, 9:16:51 AM
to gcc-gr...@googlegroups.com
Hi Anthony,
I also tried using an older GCC version, such as GCC 4.6, and indeed I see fewer loops getting parallelized.
I haven't had time to explore what is going on yet.
Hoping to look into it in the next few days.
Is it possible for you to use a newer GCC version, by any chance?
Thanks,
Razya

Anthony Falzone

Aug 1, 2011, 9:39:37 AM
to gcc-gr...@googlegroups.com
Hi,

Sorry for the confusion, I was using the latest version available through SourceForge.  I wasn't getting the same output as you, so that was part of my confusion.  You mentioned something about grepping, which I didn't know the meaning of, but I assumed it was the same as a search.  When I searched the document the output wasn't the same as what you had mentioned, but it did say it was able to find some loops.  They were all very interior loops, though.

I have been thinking about this.  The code has two loops which are completely independent.  These are the input file loops.  Both SP and MP will run the code twice with two totally different input files.  So this is at least one loop that should be able to be parallelized.  Then the MP code loops itself with four different pitch settings, so that is another loop that can be parallelized.  So the SP version of the code should definitely be able to use two cores, and the MP version should be able to run on 8 cores.  These are no-brainer loops.  If either of the auto-parallelization codes can't find these loops, then there is a serious problem.  There are absolutely no dependencies whatsoever here.  Everything beyond that is gravy.

It seems autopar is finding a few inner loops, but that isn't going to buy that much.  Graphite seems to be finding nothing at all.  If I knew how to use OpenMP, that would be the way to go here.  But I think this is a good exercise for the auto-parallelization capabilities of GCC.  So far, not a good showing.

FYI, in a previous post I think I wrote that my processor was 9 Gflops per core, but it is half that.  So SP is running in seconds and MP in minutes, currently with one core at 4.5 Gflops.  I only have two cores available.  So even if the auto-parallelization did nothing more than run the input file loop on different cores, that would max out my CPU.

Anthony Falzone

Aug 1, 2011, 10:05:46 AM
to gcc-gr...@googlegroups.com
I forgot the aircraft velocity loop in MP.  That would take the cores from 8 up to a user-defined amount.  Currently it is running around 500 points total, so that would be around 500 cores easily, with no dependencies.  So to me it seems like the auto-parallelization is missing some very basic stuff that would add a lot of speed.  It seems like it is working completely backwards, from the inside out rather than the outside in.  In the case of SP there is one outer loop with no dependencies; with MP there are three outer loops with no dependencies.  The inner loops aren't going to buy much speed, especially when it's missing all the outer loops.  If it could see the outer loops and you had enough cores, SP would run twice as fast as it does now and MP would run just as fast as SP.  The biggest improvement would be taking MP from minutes down to seconds.  I don't have enough cores for that, but in the not too distant future there will be hundreds of cores available in a CPU.

Razya Ladelsky

Aug 4, 2011, 4:26:39 AM
to Anthony Falzone, gcc-gr...@googlegroups.com
gcc-gr...@googlegroups.com wrote on 01/08/2011 05:05:46 PM:


> I forgot the aircraft velocity loop in MP. That would take the
> cores from 8 up to a user defined about.

Hi Anthony,
Could you tell me the line number where this loop begins in the source
code?
Then I can check why it is not getting parallelized.
In any case, I think we need to go for a newer GCC version, as it seems
we get more loops parallelized there.
Thanks,
Razya

Anthony Falzone

Aug 4, 2011, 9:02:39 AM
to gcc-gr...@googlegroups.com, Anthony Falzone
Hi Razya,

I just posted an updated version of PROP_DESIGN on my SourceForge site, https://sourceforge.net/projects/propdesign/.  For the MP code, the loop that selects input files is on line 237; this is loop number 702.  If this loop were parallelized, you would go from one core to two cores.  On line 518 of MP you will find the pitch loop, loop number 701.  This loop runs four different pitch values, so now you would go from 2 cores up to 8 cores.  On line 607 is the aircraft velocity loop, loop number 700.  This loop runs the code at different aircraft velocities.  The number of velocity inputs is determined by the user; however, you can expect going from 8 cores up to hundreds of cores.  Since I only have a two-core processor, parallelizing loop 702, the input file loop, would be sufficient to max out the useful processing power.

For the SP code there is only one main loop.  That is the input file loop.  It is on line 328, loop number 701.

Anthony

P.S.:  For convenience, I attached the updated codes and input files.  I'm going to be out of town a few days for a wedding; I should be able to respond to the forum again on Saturday.  Thanks again for your help.  I was just considering learning how to add OpenMP directives in order to get this solved.
MP_PROP_DESIGN.f
MP_INPUT_ONE.TXT
MP_INPUT_TWO.TXT
SP_PROP_DESIGN.f
SP_INPUT_ONE.TXT
SP_INPUT_TWO.TXT

Anthony Falzone

Aug 4, 2011, 10:38:38 AM
to GCC GRAPHITE
Found something that may explain what is going on. When I tried
OpenMP it flagged every single goto statement. Perhaps this is what
graphite is having a problem with.


Anthony Falzone

Aug 4, 2011, 7:18:25 PM
to gcc-gr...@googlegroups.com
I was able to get rid of all the goto 810 statements that were pissing OpenMP off; I'm using a stop statement instead.  Graphite still isn't parallelizing anything, though, so I guess that wasn't the problem.  I must be missing something with OpenMP, because the exe won't run and gives a weird error.  I'm thinking I don't know how to use OpenMP yet.  I changed the loop number of interest in the SP code to match that of the MP code, so loop number 702 is the one that needs to be parallelized.  This will run two separate cases, and it will double the speed of each code.  Just search each code for 702 and it will take you to the do loop of interest.  I attached the codes again; I haven't uploaded them to SourceForge yet, as this version is still in development.
SP_PROP_DESIGN.f
SP_INPUT_ONE.TXT
SP_INPUT_TWO.TXT
MP_PROP_DESIGN.f
MP_INPUT_ONE.TXT
MP_OUTPUT.TXT

Anthony Falzone

Aug 5, 2011, 10:21:29 AM
to gcc-gr...@googlegroups.com
I see that OpenMP doesn't deal with I/O.  So I would also have to change a lot of that as well.  Perhaps this is something Graphite is having an issue with as well.  Even though the calculations are independent, they are sharing input and output files.  Everything was written as serial code.  I was hoping auto-parallelization would be able to deal with all that.  As far as OpenMP goes it isn't going to be any better than if I wrote two different codes and ran them at the same time with a batch file.  Having to re-code everything defeats the purpose of auto-parallelization.  So hopefully it is going to be better than OpenMP.  The reason I looked into OpenMP is people had suggested I use it when I first looked for help with this issue.  Then when I read about how the Intel compiler's auto-parallelization worked it seemed like it was just introducing OpenMP directives.

Anthony Falzone

Sep 11, 2011, 10:48:58 AM
to gcc-gr...@googlegroups.com
I read through the documentation of a lot of different compilers, and they all have the same limitations.  They only work on loops with a set number of iterations, and they don't allow for read or write statements.  These two things make auto-parallelization useless for PROP_DESIGN.  So my original thought that PROP_DESIGN would be a good candidate for auto-parallelization is not true, not without a drastic change in the technology.

Razya Ladelsky

Sep 11, 2011, 11:07:14 AM
to Anthony Falzone, gcc-gr...@googlegroups.com
gcc-gr...@googlegroups.com wrote on 11/09/2011 05:48:58 PM:

> I read through the documentation of a lot of different compilers and
> they all have the same limitations. They only work on loops with a
> set number of iterations.

The limitation regarding the number of iterations is that we must have a
description of the loop's iteration count before the loop runs.
(It could be a symbol like 'N', but not something depending on the data,
for example.)

> They also don't allow for read or write
> statements.

Do you mean function calls or loads and stores?

Anthony Falzone

Sep 11, 2011, 12:10:29 PM
to gcc-gr...@googlegroups.com
Hi Razya

PROP_DESIGN iterates on two variables.  So there are do loops which have hard limits; however, it's designed to stop once convergence to a specified tolerance is achieved.  There are read and write statements throughout the code, designed to read and write in the sequence specified.  So the fact that none of the auto-parallelization or OpenMP stuff allows for any read or write statements is a problem as well.

What I could use doesn't exist.  A compiler and processor that could utilize hundreds of 64-bit FPUs (whether they be on a CPU or GPU or APU etc...).  The compiler would auto-vectorize the loops across all the available FPUs in the system.  So the code would run in serial but the loops would run in parallel.  Pretty much the same as regular Fortran except it would happen with FPUs outside of one core.  It would be like having auto-vectorization across all available FPUs in the system not just all available FPUs in one single x86 core.

The dumb thing is I can copy the code and run it in parallel using batch files.  The pain in the neck with that is all the files you would have if you scaled it out to hundreds of parallel runs, plus all the headache of processing the output.  The way I use that approach is to run multiple different examples.  Trying to do it for one example wouldn't be practical for me to implement.

I learned Fortran way before parallel processing was around.  So I don't know what I would need to do to make PROP_DESIGN work with OpenMP.  I played around with it some but it never worked.  I even took out the write statements in the section I was playing with (there were no read statements).  It still didn't work.  Then I read somewhere about auto-parallelization requiring fixed loop counts.  I think that was the remaining problem.  It can't parallelize the core of PROP_DESIGN, which contains the two variable iteration stuff, which has loops with variable stops.

Basically, Graphite may be working just as well as any other auto-parallelization software.  But that simply isn't good enough to be useful to me.

Anthony Falzone

Sep 26, 2011, 3:10:15 AM
to gcc-gr...@googlegroups.com, Anthony Falzone
Looks like the Intel compiler is really good.  I have been trying to get MP_PROP_DESIGN included in this benchmark for a while; I just saw it posted the other day.

http://polyhedron.com/pb05-lin64-f90bench_SBhtml

I don't have the Intel compiler.  I'm going to download the trial version at some point and see if the speedup is due to auto-parallelization or not.

Anthony Falzone

unread,
Sep 27, 2011, 4:39:05 PM9/27/11
to gcc-gr...@googlegroups.com, Anthony Falzone
Got some information from Polyhedron.  They said they did see a factor-of-2 improvement from Intel auto-parallelization on MP_PROP_DESIGN.  However, for Windows on Sandy Bridge they turned it off due to an unknown error; the error doesn't exist on Linux.  You can see they had auto-parallelization on for all benchmarks except on Windows.  So perhaps I/O and loops with variable ends are not an issue for all auto-parallelization implementations.  At some point I definitely need to experiment with the Intel compiler, just so I can see how it treats my code in more detail.  Looks promising, though.

Razya Ladelsky

unread,
Oct 2, 2011, 3:40:42 AM10/2/11
to gcc-gr...@googlegroups.com


On Tue, Sep 27, 2011 at 10:39 PM, Anthony Falzone <prop_...@live.com> wrote:
Got some information from Polyhedron.  They said they did see a factor-of-2 improvement from Intel auto-parallelization on MP_PROP_DESIGN.  However, for Windows on Sandy Bridge they turned it off due to an unknown error; the error doesn't exist on Linux.  You can see they had auto-parallelization on for all benchmarks except on Windows.  So perhaps I/O and loops with variable ends are not an issue for all auto-parallelization implementations.  At some point I definitely need to experiment with the Intel compiler, just so I can see how it treats my code in more detail.  Looks promising, though.


Interesting.
I have to dig a bit deeper in order to understand what they can parallelize that GCC cannot.
I could not open the page you sent, but testing autopar on the Polyhedron suite is certainly very interesting for us.
I am now working on some improvements to autopar; maybe they can help with MP_PROP_DESIGN as well.
I will try to take another look at MP_PROP_DESIGN code once I get the chance.
Can you extract the specific loop you expect to be parallelized?
That could be very helpful for me to investigate.
Thanks,
Razya


Anthony Falzone

unread,
Oct 4, 2011, 4:25:39 PM10/4/11
to gcc-gr...@googlegroups.com
Hey Razya,

I moved the program to a new website.  You can find it here:


I made a new version recently just for benchmarking, so you have a lot of codes you can work with if you want.  I wouldn't expect a lot of parallelization in the SP or SP_BENCHMARK versions.  The MP and MP_BENCHMARK versions basically run SP over and over again using different inputs, so each and every run could run on a separate core.  In fact, I can do this with batch files; the issue becomes dealing with all the I/O, so it's not practical.  The MP_BENCHMARK version basically has no output; there are multiple input files, however.  MP_BENCHMARK runs all the example files, so it takes a long time to run.  The loops that should run in parallel are 700, 701, 702, and 703.  I haven't used the Intel compiler, so I can't say what it is or is not running in parallel.  Even when not running in parallel it is beating gfortran by a lot.  The Polyhedron results show the following using MP_PROP_DESIGN:

Intel Fortran is 3.4x faster than gfortran with no parallelization on Windows using an Intel Core i5 2500K processor
Intel Fortran is 9.3x faster than gfortran with no parallelization on Windows using an AMD Phenom II processor

Intel Fortran is 6.0x faster than gfortran with parallelization on Linux using an Intel Core i5 2500K processor
Intel Fortran is 8.2x faster than gfortran with parallelization on Linux using an AMD Phenom II processor

Polyhedron said the reason parallelization was not on for Windows was an error, yet to be figured out, involving the Core i5 2500K processor on Windows.  You would have to experiment with the compiler yourself to get more detailed information.  I realize this isn't a lot of info to go on.

Anthony Falzone

unread,
Oct 5, 2011, 8:48:42 PM10/5/11
to gcc-gr...@googlegroups.com
Looks like they got the bug figured out.  So now Polyhedron is showing the following:

Intel Fortran is 10x faster than gfortran on Windows with an Intel Core i5 2500K CPU
Intel Fortran is 9.3x faster than gfortran on Windows with an AMD Phenom II
Intel Fortran is 6x faster than gfortran on Linux with an Intel Core i5 2500K CPU
Intel Fortran is 8.2x faster than gfortran on Linux with an AMD Phenom II

Razya Ladelsky

unread,
Oct 6, 2011, 5:18:31 AM10/6/11
to Anthony Falzone, gcc-gr...@googlegroups.com
Hi Anthony,
Thanks for the info.
I definitely want to make time to look into the parallelization issue.
I'm curious - is MP_PROP going to be a part of the polyhedron suite?
Thanks,
Razya


gcc-gr...@googlegroups.com wrote on 04/10/2011 10:25:39 PM:

> From: Anthony Falzone <prop_...@live.com>
> To: gcc-gr...@googlegroups.com

> Date: 04/10/2011 10:25 PM
> Subject: Re: question about graphite auto-parallelization
> Sent by: gcc-gr...@googlegroups.com
>

Anthony Falzone

unread,
Oct 6, 2011, 8:01:44 AM10/6/11
to gcc-gr...@googlegroups.com, Anthony Falzone
Yes, it already is.  I had been trying to get it included for about a year; they just put it in last month.  You can see their results here: http://www.polyhedron.com/compare0html.  That is where I got the information I posted.  Before I began posting here I was working with them to get it included, because I believed it would show off auto-parallelization.  I expected there to be a big difference between compilers with and without auto-parallelization.  Then when I started using gfortran I wasn't so sure anymore, especially after I read about OpenMP, which it seems a lot of the auto-parallelization implementations rely on.  Now I'm just confused.