Hello,

I have installed the latest version of Pluto from GitHub (master, commit 25767405...). I have tested it with several stencils such as Jacobi and the heat equation, and it works fine. When I use it on a 3D Burgers equation, which uses 6 arrays and 3 statements in a perfectly nested loop, Pluto takes about 1 hour to generate the tiled code.

I pass the "--moredebug" flag to polycc. Based on the output, most of the time seems to be spent on the following step:

[pluto] (Band 1) Solving for hyperplane #4
[pluto] pluto_prog_constraints_lexmin (16 variables, 5398 constraints)
[pluto] pluto_constraints_lexmin_isl (16 variables, 5398 constraints)

Does this mean Pluto is solving a linear programming problem with 5398 equations?

I am running Pluto via "./polycc --pet --parallel --tile". The original code is also attached. Is there a way to speed up the code generation?

Thank you so much.
Hengjie
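For readers without the attachment, a minimal sketch of the kind of kernel being described follows: six arrays (three velocity components plus their updates) and three statements inside one perfectly nested t/i/j/k loop. All names, sizes, and the exact stencil expressions here are placeholders; the attached burgers.c may differ.

/* Hypothetical sketch only: 6 arrays, 3 statements, one perfect loop nest. */
#define N 130

void burgers_like_step(int nt,
                       double u[N][N][N], double v[N][N][N], double w[N][N][N],
                       double un[N][N][N], double vn[N][N][N], double wn[N][N][N],
                       double dtdx, double nu)
{
  for (int t = 0; t < nt; ++t)
    for (int i = 1; i < N - 1; ++i)
      for (int j = 1; j < N - 1; ++j)
        for (int k = 1; k < N - 1; ++k) {
          /* statement 1: upwind convection + central diffusion for u */
          un[i][j][k] = u[i][j][k]
              - dtdx * (u[i][j][k] * (u[i][j][k] - u[i-1][j][k])
                      + v[i][j][k] * (u[i][j][k] - u[i][j-1][k])
                      + w[i][j][k] * (u[i][j][k] - u[i][j][k-1]))
              + nu * (u[i+1][j][k] + u[i-1][j][k] + u[i][j+1][k] + u[i][j-1][k]
                    + u[i][j][k+1] + u[i][j][k-1] - 6.0 * u[i][j][k]);
          /* statement 2: same form for v */
          vn[i][j][k] = v[i][j][k]
              - dtdx * (u[i][j][k] * (v[i][j][k] - v[i-1][j][k])
                      + v[i][j][k] * (v[i][j][k] - v[i][j-1][k])
                      + w[i][j][k] * (v[i][j][k] - v[i][j][k-1]))
              + nu * (v[i+1][j][k] + v[i-1][j][k] + v[i][j+1][k] + v[i][j-1][k]
                    + v[i][j][k+1] + v[i][j][k-1] - 6.0 * v[i][j][k]);
          /* statement 3: same form for w */
          wn[i][j][k] = w[i][j][k]
              - dtdx * (u[i][j][k] * (w[i][j][k] - w[i-1][j][k])
                      + v[i][j][k] * (w[i][j][k] - w[i][j-1][k])
                      + w[i][j][k] * (w[i][j][k] - w[i][j][k-1]))
              + nu * (w[i+1][j][k] + w[i-1][j][k] + w[i][j+1][k] + w[i][j-1][k]
                    + w[i][j][k+1] + w[i][j][k-1] - 6.0 * w[i][j][k]);
        }
  /* (Copy-back or pointer swap of un/vn/wn into u/v/w between time steps
     is omitted here for brevity.) */
}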
Hi Hengjie,

On Mon, Jun 1, 2020 at 7:31 AM Hengjie <frank....@gmail.com> wrote:
> Does this mean Pluto is solving a linear programming problem with 5398 equations?

Yes, with 5398 constraints.

> I am running Pluto via "./polycc --pet --parallel --tile". The original code is also attached. Is there a way to speed up the code generation?

Yes, we should definitely be able to find ways to speed it up - this is an interesting use case. Most of the time is being spent in the solver. I recommend using --glpk. With the following options, it runs in just 1.76s on my workstation:

$ ./polycc --pet --parallel --tile ~/Downloads/burgers.c --glpk --nodiamond --flic

However, the output may not be what you desire (it has one large skewing coefficient), but it's close, and we can address that. I'll get back on this shortly.
Hello Uday,

Thank you so much! Your reply is super helpful.

When I try to pass your recommended options to polycc, I get an error:

pluto: unrecognized option `--glpk'

Is this option in the master branch? The options "--flic" and "--coeff-bound" are not in the options list of "./polycc -h" in my Pluto build either. I have compiled Pluto from your latest commit (be7691b9) to the master branch.

I also have two follow-up questions on the options:
1. If diamond tiling is disabled, does Pluto use tiling algorithms that require a pipelined start?
2. Do I need to tune the value of "--coeff-bound" based on the tile sizes or the stencil?
Hello Uday,

Thanks for the detailed explanation.

I went through some trouble installing with GLPK and finally got through it. I can now use Pluto with your flags and reproduce your output.

However, tested on a 2-socket, 36-core Broadwell node with the Intel compilers, the tiled code performs 2+ times slower than simply adding an "omp parallel for" pragma as follows:

for (t = 0; t < nt; ++t)
  #pragma omp parallel for
  for (i = 0; i < ni; ++i)
    for (j = 0; j < nj; ++j)
      for (k = 0; k < nk; ++k)
        // compute

BTW, I played with a few tile sizes and chose the one leading to the minimum running time. Am I missing anything here? Is there a way to further improve the performance with Pluto?

I can think of two reasons that make tiling the Burgers equation challenging:
* Burgers has an arithmetic intensity (AI) of about 1.6, which is still memory-bound on Broadwell, but not as strongly as typical temporal tiling benchmarks like the heat equation.
* The pipelined start of the tiling is not very efficient.

I think your comments on this can help me understand the performance better. I appreciate it. Thanks.
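As a purely illustrative reading of that AI figure (the exact counts depend on the attached burgers.c): arithmetic intensity is flops per byte of memory traffic, so if each grid-point update streams the 6 double-precision arrays once (6 x 8 = 48 bytes, assuming neighbor accesses hit in cache), an AI of 1.6 corresponds to roughly 1.6 x 48 ≈ 77 flops per point.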
On Wed, Jun 10, 2020 at 9:35 AM Hengjie <frank....@gmail.com> wrote:
> BTW, I played with a few tile sizes and chose the one leading to the minimum running time. Am I missing anything here? Is there a way to further improve the performance with Pluto?

How did your run times change when going through 1, 2, 4, 8, 16, 24, and 32 cores? What tile sizes and problem sizes did you use? It's hard to say anything without this information.
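One possible way to collect those per-core-count timings is sketched below (a hypothetical harness, not code from this thread; run_kernel() is a placeholder for either the Pluto-tiled version or the plain "omp parallel for" version):

#include <omp.h>
#include <stdio.h>

/* Placeholder for the actual stencil sweep being measured. */
static void run_kernel(void)
{
  /* ... call the Burgers time-step loop here ... */
}

int main(void)
{
  /* Thread counts to sweep, matching the question above. */
  const int thread_counts[] = {1, 2, 4, 8, 16, 24, 32};
  const int n = sizeof(thread_counts) / sizeof(thread_counts[0]);

  for (int c = 0; c < n; ++c) {
    omp_set_num_threads(thread_counts[c]);
    double start = omp_get_wtime();
    run_kernel();
    double elapsed = omp_get_wtime() - start;
    printf("%2d threads: %.3f s\n", thread_counts[c], elapsed);
  }
  return 0;
}

On a 2-socket node, thread placement also matters, so it is worth fixing it (e.g. via OMP_PLACES/OMP_PROC_BIND or the Intel KMP affinity settings) when comparing the two versions.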