HPC simulation of MSRE - Exit code 139, Segmentation fault (signal 11)


mateusz...@gmail.com

unread,
May 7, 2019, 8:40:34 AM5/7/19
to moltres-users
Hello everyone,

I've started using a supercomputer to perform simulations that are getting heavier and heavier the deeper I go into recreating the MSRE in Moltres. After successfully installing the software on the cluster today, it turned out that the simulation I am trying to run is terminated just after the first time step begins. I am getting the following error message:

Bad termination of one of your application processes
PID 16267
Exit code: 139
......
Your application terminated with the exit string: Segmentation fault (signal 11)
This typically refers to a problem with your application.

The thing is that the same input file works perfectly on my personal computer. Moreover, another input file that uses the same input data and external files, but is only a short eigenvalue problem, ran on the cluster without any errors. So I am asking for your advice: I do not really know what could be wrong, because the input files are correct and the application itself seems to work properly. Could it be a lack of memory? Or does this error mean something else?

Thank you for your help,
Mateusz Pater

andrewryh

unread,
May 7, 2019, 8:59:55 AM5/7/19
to moltres-users
Hello,

To figure it out, please post your PBS file.

Andrei

Mateusz Pater

unread,
May 7, 2019, 9:11:21 AM5/7/19
to moltre...@googlegroups.com
Hi,
I actually used a standard mpirun. I'm sorry, I am new to cluster computing and did not use any PBS script; I believe it is okay to do it this way.
Mateusz



Gavin Ridley

unread,
May 7, 2019, 9:13:26 AM5/7/19
to moltre...@googlegroups.com
How many processes are you trying to run? And how many degrees of freedom does your mesh have?

Gavin Ridley

Mateusz Pater

unread,
May 7, 2019, 10:00:32 AM5/7/19
to moltre...@googlegroups.com
I tried with 60, 30, and 4 processes. All failed.
The mesh is the 2d_lattice_structured.msh made from the .geo file of the same name, which you may be familiar with. There are 1437 nodes on lines, 43218 nodes on surfaces, and 43904 quadrangles.




--
Mateusz Pater
Université Paris-Saclay, CEA-INSTN
InnoEnergy Master's School
European Master's in Nuclear Energy

mateusz...@gmail.com

unread,
May 7, 2019, 10:03:13 AM5/7/19
to moltres-users
The terminal says:

Mesh:
  Parallel Type:           replicated
  Mesh Dimension:          2
  Spatial Dimension:       2
  Nodes:
    Total:                 44655
    Local:                 11213
  Elems:
    Total:                 43904
    Local:                 10976
  Num Subdomains:          2
  Num Partitions:          4
  Partitioner:             metis

Nonlinear System:
  Num DOFs:                282411
  Num Local DOFs:          68737
  Variables:               "temp" { "pre1" "pre2" "pre3" "pre4" "pre5" "pre6" } { "group1" "group2" "group3"
                             "group4" }
  Finite Element Types:    "LAGRANGE" "MONOMIAL" "LAGRANGE"
  Approximation Orders:    "FIRST" "CONSTANT" "FIRST"

Auxiliary System:
  Num DOFs:                43904
  Num Local DOFs:          10976
  Variables:               "power_density"
  Finite Element Types:    "MONOMIAL"
  Approximation Orders:    "CONSTANT"

Gavin Ridley

unread,
May 7, 2019, 10:03:26 AM5/7/19
to moltre...@googlegroups.com
Well, as a rule of thumb, between 10,000 and 50,000 degrees of freedom (roughly, number of nodes times number of scalar variables) per processor will give you reasonable efficiency. Otherwise the calculation won't speed up, because communication between processors overtakes computation.
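Applying that rule of thumb to the Num DOFs figure posted above (282411 nonlinear degrees of freedom) gives a quick back-of-the-envelope check:

  282411 DOFs / 60 processes ≈  4,700 DOFs per process   (well below the ~10,000 lower bound)
  282411 DOFs / 30 processes ≈  9,400 DOFs per process
  282411 DOFs /  4 processes ≈ 70,600 DOFs per process

So something on the order of 6 to 28 processes would be a more sensible fit for this problem size, although an inefficient process count by itself should not cause a segmentation fault.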

Could you maybe compile a debug binary? You may also want to check that you're using MPICH and not OpenMPI, which may cause PETSc to fail.

Gavin Ridley

Alexander Lindsay

unread,
May 7, 2019, 10:53:00 AM5/7/19
to moltres-users
On Tue, May 7, 2019 at 6:40 AM <mateusz...@gmail.com> wrote:
The thing is that the same input file works perfectly on my personal computer.

It works in parallel on your personal machine?


andrewryh

unread,
May 7, 2019, 11:31:36 AM5/7/19
to moltres-users
It might happen if you try to use more resources than are available.
How are you requesting resources on your cluster? What is the node specification? And please post the exact mpirun command you used.

M. Pater

unread,
May 8, 2019, 7:19:38 AM5/8/19
to moltres-users
Yes, it does work in parallel on 4 cores on my laptop.

M. Pater

unread,
May 8, 2019, 7:25:17 AM5/8/19
to moltres-users
I had used: mpirun -np 60 ../DIR/moltres-opt -i 4group.i
Today I tried running it on only one core, and also with the --n-threads=60 option instead. So far there are no errors, but it still hasn't reached the first time step calculation, so I can't say yet whether that resolves the issue.
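For reference, since resource requests were asked about above: if the cluster runs a PBS/Torque scheduler, a minimal submission script wrapping that same command might look roughly like the sketch below (job name, node count, cores per node, and walltime are hypothetical placeholders, not values from this thread):

  #!/bin/bash
  #PBS -N msre_4group
  #PBS -l nodes=2:ppn=30
  #PBS -l walltime=24:00:00
  cd $PBS_O_WORKDIR
  mpirun -np 60 ../DIR/moltres-opt -i 4group.i

Such a script would be submitted with qsub, which reserves the requested nodes; running mpirun directly on a shared login node instead may bump into that node's memory and core limits.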

Alexander Lindsay

unread,
May 8, 2019, 10:24:35 AM5/8/19
to moltres-users
What preconditioner are you using? E.g. have you specified a `-pc_type` in your `petsc_options`?


M. Pater

unread,
May 10, 2019, 6:41:50 AM5/10/19
to moltres-users
Since I am no expert, I left the input file from the Moltres GitHub repository unchanged, and the pc_type was lu. Nevertheless, replacing it with jacobi and adding sub_pc_type or other options did not solve the issue. PETSc was configured properly and passed its tests successfully. The moose_profile file got a few additional lines from the MOOSE installation guide website, so I really don't know what else could be preventing the parallel runs. Would you have any advice on other preconditioner options that might make it work?
Thank you in advance,
Matt
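For illustration only (this is not a fix for the segmentation fault, just an example of where these options live in a MOOSE input file), a commonly used parallel-friendly preconditioner setup in the Executioner block looks like this:

  [Executioner]
    # ... existing settings unchanged ...
    petsc_options_iname = '-pc_type -sub_pc_type -ksp_gmres_restart'
    petsc_options_value = 'asm      lu           100'
  []

Here -pc_type asm with -sub_pc_type lu applies an LU factorization per subdomain, which, unlike PETSc's default serial LU, can run with the matrix distributed over several processes.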



Alexander Lindsay

unread,
May 10, 2019, 10:04:07 AM5/10/19
to moltres-users
Well, if you are manually running mpirun and not submitting through a scheduler, then debugging a segmentation fault is pretty straightforward. I highly recommend you read the section in this link titled "Parallel Debugging". Before you do that, make sure you have gdb available to you; if not, contact your cluster administrators about how you can obtain it, or you can even install it yourself locally in your home directory. Your "--start-in-debugger" command will then read "--start-in-debugger=gdb".

I don't think this has been asked yet...do you get the segmentation faults with a smaller problem? I highly recommend using a smaller problem for debugging. I would even try running a 2x2 mesh with two processes and see whether you can replicate the segmentation fault.
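A minimal mesh block along those lines could use MOOSE's built-in GeneratedMesh (standard syntax; note that the MSRE input file references named blocks and boundaries, so this would go into a stripped-down test input rather than being a drop-in replacement):

  [Mesh]
    type = GeneratedMesh
    dim = 2
    nx = 2
    ny = 2
  []

That gives a tiny 2x2 quadrilateral mesh you can run with mpirun -np 2 to check whether the segmentation fault still appears.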


Alexander Lindsay

unread,
May 10, 2019, 10:05:15 AM5/10/19
to moltres-users
Actually, have you tried running this problem in serial?

Mateusz Pater

unread,
May 13, 2019, 4:14:12 AM5/13/19
to moltre...@googlegroups.com
The problem does run properly on one processor, in serial.
And no, smaller problems do not produce segmentation faults when I run them on multiple cores.
I am running the original problem in dbg mode now; I will let you know what the results are.

M. Pater

unread,
May 13, 2019, 4:57:41 AM5/13/19
to moltres-users
After running the simulation in dbg mode, I got a myriad of similar messages:

[45] /home/....../libmesh/installed/include/libmesh/petsc_vector.h, line 1056, compiled May 13 2019 at 09:58:53
No index 3506 in ghosted vector.
Vector contains [1752,2618)
And empty ghost array.
.....
.....
We caught a libMesh error
Nonlinear solve did not converge due to DIVERGED_FUNCTION_DOMAIN iterations 0
Solve did NOT Converge!

Could you please tell me what it means? As I said, exactly the same problem runs in parallel on my personal machine.

Alexander Lindsay

unread,
May 13, 2019, 8:51:17 AM5/13/19
to moltre...@googlegroups.com
What is your MOOSE version (git hash) on both the cluster and your other machine? Are they the same? There have been some relatively recent changes to how MOOSE ghosts elements, so I'm curious whether that may be the problem...

M. Pater

unread,
May 13, 2019, 9:16:29 AM5/13/19
to moltres-users
If you mean the package_version, the repo-hash is:
on my laptop 56178942a6f4cf47dfc6bb8b678d26bb6c5a32ed
on the cluster 68b130eadb09ec50c57b87f39947ab9dceedc735

Alexander Lindsay

unread,
May 13, 2019, 9:21:10 AM5/13/19
to moltres-users
Sorry, no, not the package version. `cd` into your MOOSE repos and type:

`git rev-parse HEAD` and then please share the results.

It's also printed at the top of the MOOSE header whenever you run an input file.


M. Pater

unread,
May 13, 2019, 9:57:43 AM5/13/19
to moltres-users
All right, that is 68fefed05677dec7f417ad0ce5c4b6a6f17db200 on my laptop and e5abd5ff24525421086c3028c409e3f6eb47d5cd on the cluster

Alexander Lindsay

unread,
May 13, 2019, 12:30:20 PM5/13/19
to moltres-users
Ok, the cluster commit is after our ghosting changes and the laptop commit is from before. Can you try changing your MOOSE hash on your laptop to the same hash as the cluster and see whether you get the same issue? Could you also share your input file? Then I could also test it.
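One possible way to do that on the laptop, assuming MOOSE was cloned with git and built from source (the path below is hypothetical, and the rebuild steps may differ depending on how MOOSE and libMesh were installed):

  cd ~/projects/moose            # wherever the MOOSE repository lives
  git fetch origin
  git checkout e5abd5ff24525421086c3028c409e3f6eb47d5cd
  git submodule update --init
  ./scripts/update_and_rebuild_libmesh.sh
  cd ../moltres && make -j4      # rebuild Moltres against the updated MOOSE

Rerunning the same 4group.i input in parallel afterwards should show whether the two machines now behave the same way.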


M. Pater

unread,
May 14, 2019, 8:17:22 AM5/14/19
to moltres-users
So I updated my personal MOOSE version to the commit that is installed on the cluster, and the simulation crashed there too. That means we have found the source of the problem. I'm going to change the cluster's MOOSE checkout to another commit. Thank you for tracking this down!

Alexander Lindsay

unread,
May 14, 2019, 12:21:41 PM5/14/19
to moltres-users
I'm glad that we have reproducibility now between the two systems... but we need to fix the actual problem, or else it will be dangerous to update Moltres to new versions of MOOSE. So it would be very helpful if you could provide your input file so I can track down what's causing the bad ghosting.


Mateusz Pater

unread,
May 14, 2019, 1:02:36 PM5/14/19
to moltre...@googlegroups.com
Oh, of course, sorry, I forgot. The case I am running can actually be found in the Moltres GitHub repository under problems/2017_annals..., and it's the 4group.i file. I hope that helps.



Alexander Lindsay

unread,
May 28, 2019, 11:37:37 AM5/28/19
to moltres-users
Ok Mateusz, with the merge of https://github.com/arfc/moltres/pull/89, you should now be able to run Moltres input files again in parallel with a current MOOSE checkout. Also note that in that PR I updated the problems/j033117_nts_temp_pre_parsed_mat/3d_auto_diff_rho.i and all the input files in problems/2017_annals_pub_msre_compare so that they will run with current Moltres.

Please continue to notify us if/when you encounter an input file that doesn't run. We want to achieve 100% coverage of our input files.

