Simulation occasionally crashes on cluster (EXIT CODE: 9)


Tim Struppi

Jul 3, 2017, 11:56:41 AM
to moose-users
Hi there,
I ran into a problem with the MOOSE app that I am using on a Linux cluster.

Background information:
- I am simulating deep geothermal reservoirs: Darcy flow coupled with heat transport (fault zones as discrete features, different permeability zones, production and injection wells)
- MOOSE app: Golem, developed at GFZ Potsdam, Germany
- I generated my mesh with MeshIT (also from GFZ)
- I tested and simulated several different scenarios with this mesh file and Golem successfully on my workstation
- I managed to install MOOSE and Golem on a Linux cluster (with some help from the support staff there)

So I took a simulation that I had successfully finished on my workstation and copied it over to the cluster to test the scaling.
I ran about twenty jobs with exactly the same simulation, changing only the number of compute nodes/cores.

Here I discovered that from time to time the simulation failed with the following output:

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 5334 RUNNING AT mpp2r04c05s02
=   EXIT CODE: 9
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

I noticed that when I simply resubmitted the job without changing any settings, the simulation would often finish without the error one or two tries later.
I read that this error could indicate that I am running out of memory, but I am wondering why I get it only sometimes. The error also happened regardless of the number of nodes (28 cores each), and it occurred at two different points in time: at the beginning of the simulation, or at the end (after the solve converged) before the output was written to the terminal.

I also attached two logs from my cluster (where you can see more information on mesh size and so on). In one the simulation succeeds (339757) and in the other it fails (339807); both have the same settings (8 nodes = 224 cores). My simulation consists of a steady-state calculation at the beginning, followed by a restart from this steady state to calculate the transient. I guess it would also be possible to calculate the steady state just once, save the output, and then always start the transient from there, but I was lazy and kept this setup because I needed it for the scenarios that I calculated on my workstation.

So my question basically is: how can I get rid of this error? It really slows down my work on the cluster.
Also, if this is a memory problem, is there something I could do with the mesh type? I'm unsure what options I have here; my mesh file is an Exodus file that I get from MeshIT. I read that there is a parallel mesh type, and I guess I am not using it. Does this make sense for my case, and where can I find information on how it works? Are there different types of mesh parallelisation that I could use?
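From what I could piece together from the docs (so take this with a grain of salt), switching to a distributed mesh might look something like this in my input file:

```
[Mesh]
  type = FileMesh
  file = my_mesh.e              # placeholder name for my MeshIT Exodus file
  parallel_type = distributed   # if I understand correctly, this partitions the mesh across MPI ranks
[]
```

I also read that there is a `--distributed-mesh` command-line option, but I have not tried either yet.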

If you need any more information from me please let me know.

Greetings from Munich
Florian
myjob.339757.mpp2r02c01s01.out
myjob.339807.mpp2r02c02s07.out

Peterson, JW

Jul 3, 2017, 12:13:56 PM
to moose-users
On Mon, Jul 3, 2017 at 9:56 AM, Tim Struppi <heroes....@gmail.com> wrote:
I noticed that when I simply resubmitted the job without changing any settings, the simulation would often finish without the error one or two tries later.
I read that this error could indicate that I am running out of memory, but I am wondering why I get it only sometimes. The error also happened regardless of the number of nodes (28 cores each), and it occurred at two different points in time: at the beginning of the simulation, or at the end (after the solve converged) before the output was written to the terminal.

"Code 9" is simply SIGKILL in Linux/UNIX, and it just means that *something* killed the process, possibly the OS itself due to an out of memory condition as you suspect.
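You can see the same mechanics from any shell (a throwaway demo, nothing MOOSE-specific): a process killed by signal 9 gets no chance to clean up or print anything, and the shell reports exit status 128 + 9 = 137, while MPI launchers typically report the raw signal number instead:

```shell
# Start a long-running process in the background, then kill it with SIGKILL (signal 9).
sleep 30 &
pid=$!
kill -9 "$pid"
wait "$pid"              # wait's status for a signal-killed child is 128 + signal number
echo "exit status: $?"   # prints "exit status: 137" (128 + 9)
```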

As far as why the job is sometimes killed and sometimes not, it could be a lot of things:

.) Do you know if the nodes on your cluster are assigned exclusively? That is, while you are running on a particular node is it possible that someone else is also running on it?
.) Are all the nodes of your cluster exactly the same, or do some have different amounts of memory?

The most common approach to reducing the amount of memory a simulation uses on a cluster is to run fewer processes (cores) per node. You should be able to consult your queuing system documentation to find out more about how to do this. It's also possible you are exceeding a per-core memory limit (say, 2GB) even though you are not exceeding the total amount of system memory, so keep that in mind as well.
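For example, with SLURM (a sketch only; your scheduler and flags may well differ, and the executable/input names below are placeholders), halving the ranks per node roughly doubles the memory available to each rank:

```shell
#!/bin/bash
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=14   # half of the 28 cores per node -> ~2x memory per rank
#SBATCH --time=04:00:00
# (job name, partition, etc. omitted -- these are site-specific)

mpirun -np "$SLURM_NTASKS" ./golem-opt -i my_input.i
```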
 
--
John

Tim Struppi

Jul 6, 2017, 9:53:38 AM
to moose-users
Hi John and thanks for your answer.

I contacted my cluster support:
1.) Yes, I get nodes exclusively.
2.) Yes, all nodes should be the same. (If you like, you can check out the information about the cluster I'm using (CoolMUC2 mpp2) here: https://www.lrz.de/services/compute/linux-cluster/overview/ and here: http://ark.intel.com/products/81059/Intel-Xeon-Processor-E5-2697-v3-35M-Cache-2_60-GHz)

I'm now going to test what you suggested by running with fewer processes per node.
I'll report back after that.
Thanks a lot!

Andrew....@csiro.au

Oct 13, 2017, 7:15:50 PM
to moose...@googlegroups.com
Can you simply specify "block = whatever"?

To me this sounds like quite a serious and annoying bug; I added a note to the GitHub issue.

a


From: moose...@googlegroups.com <moose...@googlegroups.com> on behalf of Tim Struppi <heroes....@gmail.com>
Sent: Friday, 13 October 2017 10:24 PM
To: moose-users
Subject: Re: Simulation occasionally crashes on cluster (EXIT CODE: 9)
 

Hi there, I was able to figure out that my error comes from having a postprocessor in my input file.

I was playing around with a new, pretty simple model (one that just solves for Darcy flow in a porous medium). Here I am sure that not much memory is needed, yet this model failed every time I ran it on my cluster, independent of the number of cores/nodes I was using. After removing my postprocessor completely, it worked nicely even while utilizing many nodes on the cluster.

Here is my postprocessor block:

[Postprocessors]
  [./FoerderDruck]
    type = PointValue
    point = '14200 15000 0'
    variable = pore_pressure
    enable = true
  [../]
[]

I found the following report on GitHub, where it looks like this might be a bug related to the fact that no block number is specified in this postprocessor: https://github.com/idaholab/moose/issues/9889
My problem is that I heavily rely on this postprocessor: I need the pressure evolution at this point as a final result in the form of a .csv file.
So my question is whether there is a workaround, or whether someone has an idea how to fix this.
If needed, I can provide my input file and my mesh.
Thanks for any help!
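One workaround I am considering (untested, and assuming the crash really is specific to PointValue's parallel point search) would be to ask for the value at the mesh node closest to my well instead, via the NodalVariableValue postprocessor; the node id below is hypothetical and I would still have to look it up from my mesh:

```
[Postprocessors]
  [./FoerderDruck]
    type = NodalVariableValue
    nodeid = 12345            # hypothetical: id of the node nearest (14200 15000 0)
    variable = pore_pressure
  [../]
[]
```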


Greetings from Munich
Florian

--
You received this message because you are subscribed to the Google Groups "moose-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to moose-users...@googlegroups.com.
Visit this group at https://groups.google.com/group/moose-users.
To view this discussion on the web visit https://groups.google.com/d/msgid/moose-users/fadcb735-e1f2-42c8-adeb-ac693718fb1d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Flori K

Oct 14, 2017, 7:34:29 AM
to moose-users
My posts here keep getting deleted, not sure why, so I switched Google accounts.

Yes, I tried specifying a block number in my postprocessor, without success.

Cheers, Florian

Cody Permann

Jan 12, 2018, 2:46:29 PM
to moose...@googlegroups.com
Exit Code 9 == Out of Memory

We run into this occasionally, and the best fix is to reduce the number of CPUs you use per node. Depending on your queuing system, you can usually accomplish this by specifying the number of "chunks" you want, where a chunk is an atomic grouping of CPU and memory requirements. If you are right on the cusp, just dropping one or two CPUs from each chunk should take care of it. Alternatively, you could try running on more processors to spread your problem out even further.
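If you want to confirm that memory really is the culprit before changing your job layout, MOOSE has a MemoryUsage postprocessor you can drop into your input. A sketch (check the documentation for the exact parameters available in your MOOSE version):

```
[Postprocessors]
  [./mem]
    type = MemoryUsage
    mem_type = physical_memory     # track resident memory
    value_type = max_process       # report the worst-off MPI rank
    execute_on = 'INITIAL TIMESTEP_END'
  [../]
[]
```

That should print the peak per-process memory at each time step alongside your other postprocessor output, so you can watch it approach the per-node limit.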

On Fri, Jan 12, 2018 at 12:43 PM Tim Struppi <heroes....@gmail.com> wrote:
Is there any news on this? I am still struggling with the problem... On GitHub I noticed that something happened, but I'm not sure whether this problem got fixed. Do I need to compile MOOSE from the devel branch?

On Monday, July 3, 2017 at 17:56:41 UTC+2, Tim Struppi wrote:
