Large time cost with Pluto


Hengjie

May 31, 2020, 10:01:22 PM
to Pluto development
Hello, 

I have installed the latest version of Pluto from GitHub (master, commit 25767405...). I have tested it with several stencils like Jacobi, the heat equation, etc., and it works fine.
When I use it on a 3D Burgers equation, which uses 6 arrays and 3 statements in a perfectly nested loop, Pluto takes about an hour to generate the tiled code.
I passed the "--moredebug" flag to polycc. Based on the output, most of the time seems to be spent on the following step:

[pluto] (Band 1) Solving for hyperplane #4
[pluto] pluto_prog_constraints_lexmin (16 variables, 5398 constraints)
[pluto] pluto_constraints_lexmin_isl (16 variables, 5398 constraints)

Does this mean Pluto is solving a linear programming problem with 5398 equations? 

I am running Pluto via "./polycc --pet --parallel --tile". The original code is also attached. Is there a way to speed up the code generation?
Thank you so much.

Hengjie

burgers.c

Sven Verdoolaege

Jun 1, 2020, 5:12:59 AM
to Hengjie, Pluto development
On Sun, May 31, 2020 at 07:01:22PM -0700, Hengjie wrote:
> I am running Pluto via "./polycc --pet --parallel --tile". The original
> code is also attached. Is there a way to speed up the code generation?

Have you tried PPCG? "ppcg --target=c --openmp --tile" should do something
similar, but may not be exactly what you want.
It's not guaranteed to be faster, but it only seems to take 2m30s here.

> #define is 232324
> #define js 482
> #define ijk i*232324+j*482+k
> double u[2][(NI+2)*(NJ+2)*(NK+2)], v[2][(NI+2)*(NJ+2)*(NK+2)], w[2][(NI+2)*(NJ+2)*(NK+2)];

Avoid such linearizations. That is, treat u as a 4-dimensional array
rather than pretending it's a 2-dimensional array. That will probably
lead to simpler constraints.
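
For instance, here is a minimal sketch of the difference (illustrative only: u_flat, u, step, and the NI/NJ/NK sizes are made-up names, not the attached code):

#define NI 30
#define NJ 30
#define NK 30

/* Linearized view: one flat index hides the 3-d structure. */
double u_flat[2][(NI + 2) * (NJ + 2) * (NK + 2)];

/* 4-dimensional view: each subscript is a simple affine expression in
 * the loop iterators, so the polyhedral tools see much simpler access
 * relations and dependence constraints. */
double u[2][NI + 2][NJ + 2][NK + 2];

void step(int t, int i, int j, int k, double val) {
  /* These two writes touch the same logical element; prefer the second. */
  u_flat[t % 2][i * (NJ + 2) * (NK + 2) + j * (NK + 2) + k] = val;
  u[t % 2][i][j][k] = val;
}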

skimo

Uday R Bondhugula

Jun 1, 2020, 5:41:32 AM
to Hengjie, Pluto development
Hi Hengjie,

On Mon, Jun 1, 2020 at 7:31 AM Hengjie <frank....@gmail.com> wrote:
> Hello,
>
> I have installed the latest version of Pluto from GitHub (master, commit 25767405...). I have tested it with several stencils like Jacobi, the heat equation, etc., and it works fine.
> When I use it on a 3D Burgers equation, which uses 6 arrays and 3 statements in a perfectly nested loop, Pluto takes about an hour to generate the tiled code.
> I passed the "--moredebug" flag to polycc. Based on the output, most of the time seems to be spent on the following step:
>
> [pluto] (Band 1) Solving for hyperplane #4
> [pluto] pluto_prog_constraints_lexmin (16 variables, 5398 constraints)
> [pluto] pluto_constraints_lexmin_isl (16 variables, 5398 constraints)
>
> Does this mean Pluto is solving a linear programming problem with 5398 equations?

Yes, with 5398 constraints.

> I am running Pluto via "./polycc --pet --parallel --tile". The original code is also attached. Is there a way to speed up the code generation?

Yes, we should definitely be able to find ways to speed it up - this is an interesting use case. Most of the time is being spent in the solver. I recommend using --glpk. With the following options, it runs in just 1.76s on my workstation:

$ ./polycc --pet --parallel --tile ~/Downloads/burgers.c  --glpk --nodiamond --flic

However, the output may not be what you desire (it has one large skewing coefficient), but it's close and we can address that. I'll get back on this shortly.

- Uday

 

Uday Reddy B

Jun 1, 2020, 5:43:02 AM
to Uday R Bondhugula, Hengjie, Pluto development
On Mon, Jun 1, 2020 at 3:11 PM Uday R Bondhugula <ud...@iisc.ac.in> wrote:
> [...]
> However, the output may not be what you desire (it has one large skewing coefficient), but it's close and we can address that. I'll get back on this shortly.

Please use the following options; they find the desired transformation (basically, the --coeff-bound option avoids large skews, which are actually spurious). I've pasted the output as well:

Uday Reddy B

Jun 1, 2020, 5:45:25 AM
to Uday R Bondhugula, Hengjie, Pluto development
On Mon, Jun 1, 2020 at 3:11 PM Uday R Bondhugula <ud...@iisc.ac.in> wrote:
> [...]
> However, the output may not be what you desire (it has one large skewing coefficient), but it's close and we can address that. I'll get back on this shortly.

[Previous message was sent unfinished.]

Please use the following options; they find the desired transformation (basically, the --coeff-bound option avoids large skews, which are actually spurious). I've pasted the output as well. It runs in about one second. (This is with diamond tiling disabled; enabling it appears to cause an overflow somewhere. I'll look into that shortly.)

===========================================================================
$ ./polycc --pet --parallel --tile ~/Downloads/burgers.c  --glpk --nodiamond --flic --coeff-bound=10
[pluto] compute_deps (isl)
[pluto] Number of statements: 3
[pluto] Total number of loops: 12
[pluto] Number of deps: 93
[pluto] Maximum domain dimensionality: 4
[pluto] Number of parameters: 0
[pluto] Affine transformations [<iter coeff's> <param> <const>]

T(S1): (t, 2t+i, j, k)
loop types (loop, loop, loop, loop)

T(S2): (t, 2t+i, j, k)
loop types (loop, loop, loop, loop)

T(S3): (t, 2t+i, j, k)
loop types (loop, loop, loop, loop)

[Pluto] After tiling:
T(S1): (t/32, (2t+i)/32, t, 2t+i, j, k)
loop types (loop, loop, loop, loop, loop, loop)

T(S2): (t/32, (2t+i)/32, t, 2t+i, j, k)
loop types (loop, loop, loop, loop, loop, loop)

T(S3): (t/32, (2t+i)/32, t, 2t+i, j, k)
loop types (loop, loop, loop, loop, loop, loop)

t6 {loop with stmts: S1, S2, S3, }
t5 {loop with stmts: S1, S2, S3, }
t4 {loop with stmts: S1, S2, S3, }
t3 {loop with stmts: S1, S2, S3, }
[Pluto] After tile scheduling:
T(S1): (t/32+(2t+i)/32, (2t+i)/32, t, 2t+i, j, k)
loop types (loop, loop, loop, loop, loop, loop)

T(S2): (t/32+(2t+i)/32, (2t+i)/32, t, 2t+i, j, k)
loop types (loop, loop, loop, loop, loop, loop)

T(S3): (t/32+(2t+i)/32, (2t+i)/32, t, 2t+i, j, k)
loop types (loop, loop, loop, loop, loop, loop)

[pluto] using statement-wise -fs/-ls options: S1(3,6), S2(3,6), S3(3,6),
[pluto-unroll-jam] No unroll jam loop candidates found
[Pluto] Output written to burgers.pluto.c

[pluto] Timing statistics
[pluto] SCoP extraction + dependence analysis time: 0.552603s
[pluto] Auto-transformation time: 0.203641s
[pluto] Total constraint solving time (LP/MIP/ILP) time: 0.135321s
[pluto] Code generation time: 0.014887s
[pluto] Other/Misc time: 0.249048s
[pluto] Total time: 1.020179s
[pluto] All times: 0.552603 0.203641 0.014887 0.249048
---------------------------------------------------------------------------------

Uday R Bondhugula

Jun 1, 2020, 6:03:09 AM
to Uday R Bondhugula, Hengjie, Pluto development
For the diamond tiling part, I've created an issue to track this here:
 
- Uday

Hengjie Wang

Jun 1, 2020, 10:57:12 AM
to Uday R Bondhugula, Pluto development
Hello Uday,

Thank you so much! Your reply is super helpful.

I tried passing your recommended options to polycc, but I got this error:

pluto: unrecognized option `--glpk'

Is this option in the master branch? The options "--flic" and "--coeff-bound" are not in the options list of "./polycc -h" in my Pluto build either. I have compiled Pluto based on your latest commit (be7691b9) to the master branch.

I also have two follow-up questions on the options.
1. If diamond tiling is disabled, is Pluto using tiling algorithms that require a pipelined start?
2. Do I need to tune the value of "--coeff-bound" based on the tile sizes or the stencil?

Regards,

--
Hengjie Wang
Ph.D Candidate
Mechanical and Aerospace Engineering
University of California, Irvine
Email: frank....@gmail.com

Uday R Bondhugula

Jun 1, 2020, 11:06:43 AM
to Hengjie Wang, Uday R Bondhugula, Pluto development
On Mon, Jun 1, 2020 at 8:27 PM Hengjie Wang <frank....@gmail.com> wrote:
> Hello Uday,
>
> Thank you so much! Your reply is super helpful.
>
> I tried passing your recommended options to polycc, but I got this error:
>
> pluto: unrecognized option `--glpk'

The --glpk option is only available if you configured Pluto with GLPK, i.e., with

$ ./configure --enable-glpk

Make sure glpk and its dev files (headers) are installed. 
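
For example, on a Debian/Ubuntu-style system that would look something like the following (the package name is a guess for your distribution; adjust as needed):

$ sudo apt-get install libglpk-dev    # GLPK library and headers
$ ./configure --enable-glpk
$ make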
 

> Is this option in the master branch? The options "--flic" and "--coeff-bound" are not in the options list of "./polycc -h" in my

Yes, it's available on the master branch.
 
> Pluto build either. I have compiled Pluto based on your latest commit (be7691b9) to the master branch.

> I also have two follow-up questions on the options.
> 1. If diamond tiling is disabled, is Pluto using tiling algorithms that require a pipelined start?

That's right. The support needed here to avoid the overflow is straightforward to implement and has simply been missing from the start.
 
> 2. Do I need to tune the value of "--coeff-bound" based on the tile sizes or the stencil?

The bound is an upper bound on the absolute values of the transformation coefficients. So it should typically be higher than your dependence distances along the dimensions (in your case, they appear to be +/- 1). Normally, 10 is sufficient for any stencil. It just prevents problem-size-proportional skews from getting into the transformations; those don't model anything useful (they are spurious).
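
To make "dependence distance" concrete, here is a small illustrative sketch (not code from this thread; NT, N, u, and jacobi1d are made-up names) of a 1-d Jacobi-style update whose dependences all have distance -1, 0, or +1 along i, so a bound of 10 on the coefficients is far more than enough:

enum { NT = 100, N = 1000 };
double u[NT][N];

void jacobi1d(void) {
  /* The value written at (t, i) is read at (t+1, i-1), (t+1, i), and
   * (t+1, i+1): every dependence distance along i is -1, 0, or +1. */
  for (int t = 0; t < NT - 1; t++)
    for (int i = 1; i < N - 1; i++)
      u[t + 1][i] = 0.33 * (u[t][i - 1] + u[t][i] + u[t][i + 1]);
}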

- Uday

 

Hengjie

Jun 10, 2020, 12:05:45 AM
to Pluto development
Hello Uday,

Thanks for the detailed explanation.
I went through some trouble installing with GLPK but finally got through. I can now use Pluto with your flags and reproduce your output.
However, tested on a 2-socket, 36-core Broadwell node with the Intel compilers, the tiled code performs 2+ times slower than simply adding an "omp parallel for" pragma as follows:

for (t = 0; t < nt; ++t) {
  #pragma omp parallel for
  for (i = 0; i < ni; ++i)
    for (j = 0; j < nj; ++j)
      for (k = 0; k < nk; ++k)
        /* compute */;
}

BTW, I played with a few tile sizes and chose the one leading to the minimum running time.
Am I missing anything here? Is there a way to further improve the performance with Pluto?

I can think of two reasons that make tiling the Burgers equation challenging:
* Burgers has an arithmetic intensity (AI) of about 1.6, which is still memory-bound on Broadwell, but not as much as typical temporal tiling benchmarks like the heat equation.
* The pipelined start of the tiling is not very efficient.
I think your comments on this can help me understand the performance better. I appreciate it. Thanks.
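
(A back-of-the-envelope reading of that AI figure, with illustrative numbers rather than measurements: if each of the 6 double arrays is streamed once per grid point, the memory traffic is 6 x 8 = 48 bytes per point, so AI = 1.6 corresponds to roughly 1.6 * 48 ≈ 77 flops per point.)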

Uday R Bondhugula

Jun 10, 2020, 3:32:42 AM
to Hengjie, Pluto development
On Wed, Jun 10, 2020 at 9:35 AM Hengjie <frank....@gmail.com> wrote:
> [...]
> BTW, I played with a few tile sizes and chose the one leading to the minimum running time.
> Am I missing anything here? Is there a way to further improve the performance with Pluto?

How did your run times change when going through 1, 2, 4, 8, 16, 24, 32 cores? What tile sizes and problem sizes did you use? It's hard to say anything without this information.

- Uday
 


Uday Reddy B

Jun 10, 2020, 3:34:44 AM
to Uday R Bondhugula, Hengjie, Pluto development
On Wed, Jun 10, 2020 at 1:02 PM Uday R Bondhugula <ud...@iisc.ac.in> wrote:


> [...]
> How did your run times change when going through 1, 2, 4, 8, 16, 24, 32 cores? What tile sizes and problem sizes did you use? It's hard to say anything without this information.

How did your run times change when going through 1, 2, 4, 8, 16, 24, 32 cores? What tile sizes did you choose and what were your problem sizes? It's hard to say anything without this information, but it's likely the pipelined startup flattened your performance pretty early. A simple space loop parallelization in such cases can give you good performance if you have the memory bandwidth and your data sets are within a certain size.

- Uday