Milonga profiling - some bad news


Vitor Vasconcelos

Apr 4, 2018, 12:41:41 PM
to was...@seamplex.com
Hi fellows,

I've been playing with meshes to try to have something meaningful for
profiling milonga.

I have two meshes: a tet one and a hex one. Both meshes are run with the
combinations of --elements/--volumes and --s2/--diffusion, and they have
about 670 thousand elements each (a minimum to profile milonga under
conditions similar to the meshes used by other software).

Below is my very first output:

---------------------------------------------------------------------------------------------------------------------------------------
[cfx@caprara-lx profiling]$ ./test-profiling.sh | tee run-1-output.file
# ------------------------------------------------------------------------
# cylinder-tet-elements-s2"
# ------------------------------------------------------------------------
# keff = 0.97349609 ( -2722.5 pcm )
# nodes = 117815
# elements = 704457
# CPU usage:
# init = 0.069 seconds
# build = 241.803 seconds
# solve = 1650.471 seconds
# total = 1892.343 seconds
# ------------------------------------------------------------------------
Done with cylinder-tet-elements-s2 in 1913 seconds.
# ------------------------------------------------------------------------
# cylinder-tet-elements-diffusion"
# ------------------------------------------------------------------------
# keff = 0.91092155 ( -9778.9 pcm )
# nodes = 121597
# elements = 728305
# CPU usage:
# init = 0.020 seconds
# build = 23.869 seconds
# solve = 38.996 seconds
# total = 62.885 seconds
# ------------------------------------------------------------------------
Done with cylinder-tet-elements-diffusion in 81 seconds.
# ------------------------------------------------------------------------
# cylinder-tet-volumes-s2"
# ------------------------------------------------------------------------
# keff = 0.96724567 ( -3386.4 pcm )
# nodes = 117815
# elements = 704457
# CPU usage:
# init = 32.202 seconds
# build = 59.991 seconds
# solve = 1252.752 seconds
# total = 1344.945 seconds
# ------------------------------------------------------------------------
Done with cylinder-tet-volumes-s2 in 1376 seconds.
# ------------------------------------------------------------------------
# cylinder-tet-volumes-diffusion"
# ------------------------------------------------------------------------
# keff = 0.91948739 ( -8756.2 pcm )
# nodes = 121597
# elements = 728305
# CPU usage:
# init = 32.335 seconds
# build = 3.906 seconds
# solve = 186.635 seconds
# total = 222.877 seconds
# ------------------------------------------------------------------------
Done with cylinder-tet-volumes-diffusion in 242 seconds.
./test-profiling.sh: line 6: 22283 Killed milonga ../profiling.mil --$j --$k cylinder-$i-$j-$k
Done with cylinder-hex-elements-s2 in 179 seconds.
./test-profiling.sh: line 6: 22457 Killed milonga ../profiling.mil --$j --$k cylinder-$i-$j-$k
Done with cylinder-hex-elements-diffusion in 391 seconds.
error: mesh inconsistency, element 544295 has less neighbors (3) than faces (5)
Done with cylinder-hex-volumes-s2 in 69 seconds.
error: mesh inconsistency, element 544295 has less neighbors (3) than faces (5)
Done with cylinder-hex-volumes-diffusion in 63 seconds.
--------------------------------------------------------------------------------------------------------------------------------

From this output we can draw some conclusions:

- Tetrahedral meshes are easily handled by milonga.
- Hexahedral meshes are still troublesome:
(1) hex meshes under finite elements get killed by the OS (CentOS) due to
memory consumption. I guess there is some memory leak that I'll try to
track down using valgrind.
(2) hex meshes under finite volumes give the well-known error message
about ill-formed elements. I have struggled with this since my first use
of milonga, and the solution is probably a combination of learning to
generate better meshes in gmsh and maybe making milonga more robust to
these elements (if that is feasible given the effort involved).

I guess some of you are interested in giving the .geo files and the
milonga script a try. They're available at:
https://github.com/vitorvas/pp-milonga/tree/master/profiling

I think it is a good time to stop trying to profile milonga, at least
until I have a better idea of what is going on in case (1). I'll do my
best to keep you updated on the matter (I'm not sure when I'll have
time to attack this issue).

Greetings from Belo Horizonte,

Vitor

jeremy theler

Apr 4, 2018, 1:39:09 PM
to was...@seamplex.com

great! still, I would do the first iteration with coarser grids.
also, if you are going to use valgrind or callgrind (google how to profile with kcachegrind), running times are 10x the original ones or more.

to address (1), try a parametric run increasing the number of nodes and plotting the consumed memory vs. the number of nodes or elements; we might see something there.
also, try using pure hexes, like in a bare cube. Probably mixing hexes and quads is leaking memory.

to address (2), we need a more general way of testing whether two elements share a face or not. I will add this task to the backlog; I might have some time later in the year to address this subject.

greetings from my mobile phone while my kids take a nap in our family vacations at the beach


--
jeremy theler
www.seamplex.com

Vitor Vasconcelos

Apr 4, 2018, 1:57:38 PM
to was...@seamplex.com
Hi Germán!

> great! still, I would do the first iteration with coarser grids
> also if you are going to use valgrind or callgrind (google for how to
> profile with kcachegrind) then running times are 10x or more the original
> ones

Yes, sure. I'm just making the first run with the setup I have now,
with the "big" meshes. I already generated a coarser one to make things
easier. Anyway, you mentioned kcachegrind once; time to give it a try.

> to address (1), try a parametric run increasing the number of nodes and
> plotting the consumed memory vs the number of nodes or elements, we might
> see something there
> also, try using pure hexs like in a bare cube. Probably mixing hexs and
> quads is leaking memory.

The parametric run is an interesting idea... A bare cube can also be
enlightening, but I'll think about it as soon as I beat my laziness
with .geo files... hehehe (I'll certainly use your cube example).

> to address (2), we need a more general way of testing whether two elements
> share a face or not. I will add this task to the backlog; I might have some
> time later in the year to address this subject.

Cool. I agree this can wait a bit. One thing at a time.

> greetings from my mobile phone while my kids take a nap in our family
> vacations at the beach

You make me feel bad for disturbing your vacation. :-)
Enjoy,

Vitor

Vitor Vasconcelos

Apr 5, 2018, 5:21:57 PM
to was...@seamplex.com
Gentlemen,

Found a memory leak, fixed it, and made a pull request.
Later I'll try to invest some time in the parametric simulations.

Cheers,

Vitor

jeremy theler

Apr 5, 2018, 7:09:27 PM
to was...@seamplex.com

great job! pull request accepted and merged (I did not have time to look at it thoroughly; next week, back at home, I will)

now I want to know what the difference between tets and hexes in s2 fem is



jeremy theler

Apr 5, 2018, 7:51:32 PM
to was...@seamplex.com

I just took a quick look, and even though it is memory that is allocated and never freed, I do not think it qualifies as a memory leak.
a leak is memory that is allocated in a loop and never freed, so it takes up memory in a way that scales with the number of unknowns, or something like that.
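in toy C terms (not milonga code, just to illustrate the distinction I mean):

  #include <stdlib.h>

  static double *global_buf;   /* kept for the whole run; the os reclaims it at exit */

  /* allocated once and never freed: not pretty, but the footprint is bounded */
  void allocate_once(size_t n) {
    global_buf = malloc(n * sizeof(double));
  }

  /* allocated inside a loop and never freed: the footprint grows with the
     number of iterations, which is what usually deserves the name "leak"   */
  void leak_per_step(size_t n_steps, size_t n) {
    for (size_t i = 0; i < n_steps; i++) {
      double *tmp = malloc(n * sizeof(double));
      tmp[0] = (double)i;   /* use it and forget it */
    }
  }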

did this commit fix the problem of running out of memory?

we had an interesting discussion regarding a similar issue with ramiro; perhaps he remembers what it was about

Vitor Vasconcelos

Apr 5, 2018, 8:01:16 PM
to was...@seamplex.com
Well, I must say I never thought that much about the terminology. Maybe
you're right. Anyway, the problem I found (at least to my knowledge of
milonga's code, which is not deep) does not scale: it is purely related
to the mesh size (correct me if I'm wrong). However, once you try to run
a bigger mesh, more memory gets allocated, eventually up to the point
where the OS kills milonga. So it can be seen as, let's say, a kind of
memory leak, in the absence of a better term.

Back to practicalities: I have it running on one machine at the office.
Tomorrow I'll check the results and give you feedback.

Meanwhile, I'll take a look at the discussion with Ramiro. At the time
I didn't pay proper attention.

Vitor

jeremy theler

Apr 5, 2018, 8:22:31 PM
to was...@seamplex.com

the thing is that if the os kills the process before reaching the free() you added, the problem is not fixed

calling free() just before the program ends is polite but useless, as the os frees the memory after a process ends anyway

it might be needed in the case of milonga for parametric or pseudo-transient problems though, so it was not work in vain :-)

the discussion with ramiro was not public, I think


Jeremy Theler

Apr 10, 2018, 10:55:28 AM
to was...@seamplex.com
Vitor,

was the memory error fixed by your pull request?

btw, can you check if this commit
fixes the "less neighbors than faces" problem?

I checked with the example geometry at

and it works for me.

I cloned your pp-milonga repo from github. I have some minor comments, but one that might be preventing you from moving forward: you do not need domain decomposition to parallelize milonga. For sure it is a desired feature, but you can have all the processes read the whole mesh and process only their share. So the element-building routines can be parallelized by following the PETSc examples, where the for loops run only over the local row indices even though each process knows the whole mesh. Then we can see how to implement domain decomposition to be more resource-efficient.
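the idea, as a rough sketch with made-up names (mesh_t, build_row() and MAX_ROW_NONZEROS are placeholders, this is not actual milonga or PETSc example code):

  #include <petscmat.h>

  /* every rank knows the whole mesh but assembles only the rows it owns */
  PetscErrorCode assemble_local_rows(Mat A, mesh_t *mesh) {
    PetscErrorCode ierr;
    PetscInt first, last;
    ierr = MatGetOwnershipRange(A, &first, &last); CHKERRQ(ierr);
    for (PetscInt row = first; row < last; row++) {
      PetscInt    cols[MAX_ROW_NONZEROS];
      PetscScalar vals[MAX_ROW_NONZEROS];
      PetscInt    n = build_row(mesh, row, cols, vals);   /* local contribution of this row */
      ierr = MatSetValues(A, 1, &row, n, cols, vals, INSERT_VALUES); CHKERRQ(ierr);
    }
    ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
    ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
    return 0;
  }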

one more thing: I added your name (and Ramiro's and Pablo's) to the file AUTHORS. It contains your email in plain text... would you prefer to have, say, a LinkedIn profile page in the file instead? I know that once it is committed it is written in ink, but...

Vitor Vasconcelos

Apr 10, 2018, 1:04:39 PM
to was...@seamplex.com
Hi Germán!

The first tests I made show no memory errors in milonga functions
(there are some "leaks" due to PETSc, minor amounts of bytes; if you want
I can share the file). However, the errors were not solved for the
examples I ran (specifically: the hex mesh with the s2 method).

I'm now running rather simple profiling tests, non-parametric, just to
have a clue about what to look for when thinking about parallelization.

I'll give the new version a try asap while I prepare the parametric tests.

> I cloned your pp-milonga repo from github. I have some minor comments, but
> one that might be preventing you from moving forward: you do not need domain
> decomposition to parallelize milonga. For sure it is a desired feature, but
> you can have all the processes read the whole mesh and process only their
> share. So the element-building routines can be parallelized by following
> the PETSc examples, where the for loops run only over the local row indices
> even though each process knows the whole mesh. Then we can see how to
> implement domain decomposition to be more resource-efficient.

Oh, you're absolutely right! Domain decomposition was only my first
idea, and it is one of the most difficult options to implement. There are
other options. I didn't mention it, but I was (within my time allowance)
studying the use of OpenMP to parallelize loops at the thread level. I
guess this can be promising, but I can say nothing before having at least
an idea of the costly functions that would be candidates to have their
loops parallelized.

I'm interested in learning more about PETSc and I'll check its examples
as you said. Thanks for pointing them out.

> one more thing: I added your name (and Ramiro's and Pablo's) to the file
> AUTHORS. It contains your email in plain text... would you prefer to have,
> say, a LinkedIn profile page in the file instead? I know that once it is
> committed it is written in ink, but...

No problem having my e-mail in plain text. Actually, I have no linkedin
profile... hehehe.
My pleasure to collaborate with the team, thanks. I hope to be able to
keep having a slice of time to commit to milonga development. It is
working now... ;-)

Regards,

Vitor

Jeremy Theler

Apr 10, 2018, 2:19:40 PM
to was...@seamplex.com
On Tue, 2018-04-10 at 14:04 -0300, Vitor Vasconcelos wrote:
> The first tests I made show no memory errors in milonga functions
> (there are some "leaks" due to PETSc, minor amounts of bytes; if you want
> I can share the file). However, the errors were not solved for the
> examples I ran (specifically: the hex mesh with the s2 method).

that's what I thought. The problem of running out of memory is not due to a memory leak but probably to a poorly-designed (by me) scheme of memory allocation. Another possibility is that the resulting set of equations needs more memory on the SLEPc side, which we cannot control. But maybe a combination of spectral shifts or preconditioners would improve convergence.

> I'm now running rather simple profiling tests, non-parametric, just to
> have a clue about what to look for when thinking about parallelization.

Good. Do you use a GUI to understand results? I only used kcachegrind but there are others.

> Oh, you're absolutely right! Domain decomposition was only my first
> idea, and it is one of the most difficult options to implement. There are
> other options. I didn't mention it, but I was (within my time allowance)
> studying the use of OpenMP to parallelize loops at the thread level. I
> guess this can be promising, but I can say nothing before having at least
> an idea of the costly functions that would be candidates to have their
> loops parallelized.

I would not mix OpenMP with MPI. Given that PETSc gives a fairly straightforward way of parallelizing the assembly of the big matrices using MPI (which is the same scheme that is used in a potential parallelization of the linear solvers), I would go that way. The PETSc examples show what the principle is, mainly that each process only builds the rows that it "owns" and borrows information from the other processes as needed. BTW, in the "every-process-knows-the-whole-mesh" approach it is not necessary to ask for information from the other siblings, although it is less efficient overall.

jeremy

Ramiro Vignolo

Apr 10, 2018, 4:57:13 PM
to was...@seamplex.com
Hey guys,

Sorry for my absence, although I have been reading everything.

Vitor, if you want to parallelize something within milonga using OpenMP, please look at https://bitbucket.org/rvignolo/milonga

There you will see a different discretization (check out the moc branch) which can be naturally parallelized with OpenMP.

Ray tracing (or tracking) as well as the solver routines would be the ones to inspect.

This is only if you really want to play with OpenMP. If your intention is to work with other discretizations such as SN, please do as Jeremy says.

Thanks!


Vitor Vasconcelos

Apr 11, 2018, 10:44:53 AM
to was...@seamplex.com
> that's what I thought. The problem of running out of memory is not due to a
> memory leak but probably to a poorly-designed (by me) scheme of memory
> allocation. Another possibility is that the resulting set of equations needs
> more memory on the SLEPc side, which we cannot control. But maybe a
> combination of spectral shifts or preconditioners would improve convergence.

I haven't finished an organized set of tests/profiling yet, but using the
latest version of milonga I got at least three errors from PETSc:

- cylinder-hex-elements-s2: error: PETSc error 55-0 'Memory requested
14962484804' in /home/cfx/libs/petsc-3.8.4/src/mat/impls/aij/seq/aij.c
MatSeqAIJSetPreallocation_SeqAIJ:3630
- cylinder-tet-volumes-s2: error: PETSc error 55-0 'Memory requested
714002692' in /home/cfx/libs/petsc-3.8.4/src/vec/vec/impls/seq/bvec3.c
VecCreate_Seq:35
- cylinder-hex-volumes-s2: error: PETSc error 55-0 'Memory requested
708028484' in /home/cfx/libs/petsc-3.8.4/src/vec/vec/impls/seq/bvec3.c
VecCreate_Seq:35

I don't know if what you said about SLEPc also applies to PETSc; I have
no experience with either.
The HEX mesh has about 650,000 elements. The TET mesh has 680,000 elements.

By the way, I'm using "big" meshes to be able to really see any
bottleneck functions. Otherwise, I might not see the costly functions.

> Good. Do you use a GUI to understand results? I only used kcachegrind but
> there are others.

I had previous experience with gprof, so only text files. But I'm
giving kcachegrind a try; I like the graphics but I still need to get
used to the way it shows the information. I don't find it very
intuitive, but that's just a matter of getting used to it.
I have spent a lot of time away from the fun of software development
since my graduation, so it is nice to "re-learn" things I haven't used
for a long time, or to learn new options. Thanks for the suggestion.

> I would not mix OpenMP with MPI.

It is a good point. Theoretically, if you have a thread-safe
implementation of MPI this wouldn't be an issue. But I am also a bit
conservative on this matter.
The good thing about OpenMP is that it is based on pragmas (you
probably know that, but just for the sake of clarity), so if we don't
want to use it, the compiler can simply ignore them.
Moreover, since it is quite simple to use, I was thinking about trying
it, especially because it is possible to focus on a specific set of
functions. A target could be the functions which build the matrices
from the mesh, which are, as far as I could check, exactly the ones
spending the most time running.

> Given that PETSc gives a fairly
> straightforward way of parallelizing the assembly of the big matrices using
> MPI (which is the same scheme that is used in a potential parallelization
> of the linear solvers), I would go that way. The PETSc examples show what
> the principle is, mainly that each process only builds the rows that it
> "owns" and borrows information from the other processes as needed. BTW, in
> the "every-process-knows-the-whole-mesh" approach it is not necessary to ask
> for information from the other siblings, although it is less efficient overall.

I agree this is a better choice, especially because PETSc is
already built on top of MPI.
Time to stop and study the PETSc examples.

Vitor


Vitor Vasconcelos

Apr 11, 2018, 11:35:27 AM
to was...@seamplex.com
Hello Ramiro,

Thanks for pointing out the repository. I'll certainly take a look at your work.

Well, to be honest, I don't know what my intentions are. :-) I am just
using some time to learn more about nuclear engineering and numerical
methods, and milonga seems like a great tool to both learn and help the
community at the same time.

Regards,

Vitor

Jeremy Theler

Apr 11, 2018, 2:24:41 PM
to was...@seamplex.com


On Wed, 2018-04-11 at 11:44 -0300, Vitor Vasconcelos wrote:
> I haven't finished an organized set of tests/profiling yet, but using the
> latest version of milonga I got at least three errors from PETSc:
>
> - cylinder-hex-elements-s2: error: PETSc error 55-0 'Memory requested
> 14962484804' in /home/cfx/libs/petsc-3.8.4/src/mat/impls/aij/seq/aij.c
> MatSeqAIJSetPreallocation_SeqAIJ:3630
> - cylinder-tet-volumes-s2: error: PETSc error 55-0 'Memory requested
> 714002692' in /home/cfx/libs/petsc-3.8.4/src/vec/vec/impls/seq/bvec3.c
> VecCreate_Seq:35
> - cylinder-hex-volumes-s2: error: PETSc error 55-0 'Memory requested
> 708028484' in /home/cfx/libs/petsc-3.8.4/src/vec/vec/impls/seq/bvec3.c
> VecCreate_Seq:35

These errors look like PETSc wanted to allocate virtual memory and there was not enough.
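the first one comes from the preallocation call, so one thing that might be worth double-checking is how many nonzeros per row are being preallocated. just as a hypothetical sketch of what an exact per-row preallocation looks like (nnz_for_row() is a made-up placeholder; I am not saying this is where the problem is):

  #include <petscmat.h>

  /* sketch: give the sequential AIJ matrix an exact per-row nonzero count
     instead of a single pessimistic worst-case value                      */
  PetscErrorCode preallocate_exact(Mat A, PetscInt n_rows) {
    PetscErrorCode ierr;
    PetscInt *nnz;
    ierr = PetscMalloc1(n_rows, &nnz); CHKERRQ(ierr);
    for (PetscInt i = 0; i < n_rows; i++) {
      nnz[i] = nnz_for_row(i);   /* e.g. groups * (number of neighbors + 1) */
    }
    ierr = MatSeqAIJSetPreallocation(A, 0, nnz); CHKERRQ(ierr);   /* nz is ignored when nnz is given */
    ierr = PetscFree(nnz); CHKERRQ(ierr);
    return 0;
  }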


> I don't know if what you said about SLEPc also applies to PETSc; I have
> no experience with either.

What did I say about SLEPc?

> The HEX mesh has about 650,000 elements. The TET mesh has 680,000 elements.
>
> By the way, I'm using "big" meshes to be able to really see any
> bottleneck functions. Otherwise, I might not see the costly functions.

You can see them with far fewer elements. And parametric runs with, say, 10k, 20k, and 30k nodes would show us how the CPU time (and memory!) consumption depends on the mesh size.


> > I would not mix OpenMP with MPI.
>
> It is a good point. Theoretically, if you have a thread-safe
> implementation of MPI this wouldn't be an issue. But I am also a bit
> conservative on this matter.

true

> The good thing about OpenMP is that it is based on pragmas (you
> probably know that, but just for the sake of clarity), so if we don't
> want to use it, the compiler can simply ignore them.

yes, and also, if you don't run with mpirun, then a serial version is run

> Moreover, since it is quite simple to use, I was thinking about trying
> it, especially because it is possible to focus on a specific set of
> functions. A target could be the functions which build the matrices
> from the mesh, which are, as far as I could check, exactly the ones
> spending the most time running.

sure, my point was that if, after all, we want to parallelize also the solution of the problem with PETSc, then using MPI for the parallelization of the rest of the code would be killing two birds with one shot

nevertheless, I agree that OpenMP is far simpler to implement, so it is worth trying as well. The candidates are those for loops that range over cells in FVM and over nodes or elements in FEM. The FVM part is simpler because there is a one-to-one (actually, one-to-number-of-groups) correspondence between cells and matrix rows. In FEM the correspondence is with nodes, although there are many loops over elements.
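for instance, something along these lines, just as a rough sketch (all the names are made up, they are not milonga's; and the final insertion into the shared global matrices would have to be protected or done per-thread, which is the tricky part):

  #include <omp.h>

  /* sketch of an OpenMP-parallelized loop over elements: each iteration
     does independent per-element work                                   */
  void build_all_elements(element_t *element, int n_elements) {
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < n_elements; i++) {
      compute_elemental_matrices(&element[i]);
    }
  }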

