model stops at writing restart file


Kemal

Jul 23, 2018, 3:12:36 PM
to MPAS-Atmosphere Help
Hi,

I am running version 6.0. It runs and writes output files fine. However, when it comes time to write the restart file, the model stops without any error message in the log files. The only message comes from the mpirun command:

-------------------------------------------------------
Running MPAS locally on compute node
[prun] Master compute host = n36
[prun] Resource manager = slurm
[prun] Launch cmd = mpirun ./atmosphere_model (family=openmpi3)
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
mpirun noticed that process rank 5 with PID 13963 on node n36 exited on signal 9 (Killed).
-------------------------------------------------------

I tried various output formats (netcdf, netcdf4, pnetcdf, cdf5), but the result is the same. I have attached the log, namelist, and streams files for further information. Do you have any suggestions as to what the problem might be? Thank you for your help.

Kemal.
log.atmosphere.0000.out
namelist.atmosphere
streams.atmosphere
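
Note for readers who arrive with the same symptom: "exited on signal 9 (Killed)" in the mpirun output above means the process received SIGKILL. A process cannot catch that signal, so the model has no chance to write an error to its own log. On Linux, one common source of SIGKILL is the kernel or the resource manager ending a process that exceeded its memory allowance, though the output above does not by itself prove that. When a program is run directly from a shell, the same event shows up as an exit status above 128. The sketch below decodes such a status. describe_status is a hypothetical helper name, not part of MPAS or Open MPI, and whether mpirun passes this status through unchanged is an assumption to check on your system.

```shell
#!/bin/sh
# Decode a shell exit status into a short description.
# By shell convention, a status above 128 means the process was
# ended by signal number (status - 128).
# describe_status is a hypothetical helper name used for illustration.
describe_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        sig=$((status - 128))
        echo "killed by signal $sig"
    elif [ "$status" -eq 0 ]; then
        echo "exited normally"
    else
        echo "exited with error code $status"
    fi
}

# 137 = 128 + 9, i.e. SIGKILL
describe_status 137
```

If you have permission to read the kernel log on the compute node, running dmesg | grep -i "out of memory" may show whether the kernel's out-of-memory handling was involved.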

Kemal

Jul 23, 2018, 3:25:00 PM
to MPAS-Atmosphere Help
Hi again,

I forgot to add the following information, which may or may not be helpful:

I use:

 - PIO2 libraries,
 - Intel compiler version 18.0.3.222,
 - mpirun version 3.0.0, and
 - a 64-bit Linux blade with 16 cores (32 CPUs in total) and 198 GB of RAM.

Kemal.

CarlP...@yahoo.com

Jul 25, 2018, 11:38:59 AM
to MPAS-Atmosphere Help
Kemal -- did you run the regression-tests when you built PIO2?
I'm interested in knowing if you saw any failures.
I don't know that this would connect directly with the problem you're seeing, though.

Kemal

Jul 30, 2018, 12:40:30 PM
to MPAS-Atmosphere Help
Hi Carl,

Thank you very much for your suggestion. I have attached the script that I used to compile and install PIO2, the Makefile that cmake creates, and the output of the compilation and tests. I cannot find the regression test that you mentioned in either the test run or the Makefile. From the output files, I do not see any failed tests (regression tests in particular) or any compilation failures. I saw your earlier correspondence in this forum about the regression tests when you wanted to turn off the pnetcdf option. Since I did not come across that error message, I assume that all tests ran successfully, and I am not sure what else could be wrong.

What I don't understand is why the model writes both the history and diagnostic output files but stops when it tries to write the restart file, with none of the error information I would expect to see if I had done something wrong. I am checking the routines that write these three output files, but I cannot see a reason why the restart file should not be written. Any help is appreciated. Thank you all.

Kemal.
Makefile
out.pio2.gptl
out.pio2.tests
set.and.compile.pio2

CarlP...@yahoo.com

Jul 30, 2018, 4:09:01 PM
to MPAS-Atmosphere Help
Kemal -- I run these commands inside the PIO2 build directory:

export MPIEXEC_TIMEOUT=3600 # In case something hangs.
export CTEST_OUTPUT_ON_FAILURE=1
make -i -k check

They will run all the regression-tests.
It *might* expose problems with your PIO.

Kemal

Jul 31, 2018, 11:08:35 AM
to MPAS-Atmosphere Help
Hi Carl,

Thank you very much for your guidance. I appreciate that. I ran the tests that you mentioned and have attached the output of all steps here (in order: out.cmake, out.make, and out.check). For the record, I have also included the output of the installation (out.install), to show whether all the files that are expected to be installed were installed.

Kemal.
out.check
out.cmake
out.install
out.make

Kemal

Jul 31, 2018, 11:15:05 AM
to MPAS-Atmosphere Help
Hi Carl,

I also recorded the output of the commands that I executed in a typescript file and attached it here. It may give additional information, since there is an error during the tests that may be important and is not seen in the out.check file.

Kemal.
typescript.out.pio2.install

CarlP...@yahoo.com

Jul 31, 2018, 11:21:51 AM
to MPAS-Atmosphere Help
Kemal -- from the output of your PIO regression-tests, it looks like PIO is not working properly.
I don't know if this is causing the MPAS-A problem or not.
You might want to post the results here so the MPAS-A support people can look.

Kemal

Jul 31, 2018, 11:35:45 AM
to MPAS-Atmosphere Help
Hi Carl,

Thank you for looking into this. I thought I posted all the results here. Could you let me know where else and/or what other results I should post?

Kemal.

Kemal

Aug 1, 2018, 4:23:09 PM
to MPAS-Atmosphere Help
Hello,

I just want to let you know that the problem of writing restart files is solved. For those of you who may have faced this or a similar issue, the culprit was the amount of memory available on the machine (!!). I used to run the model on a machine with 128 GB of RAM without any problems. We now have new machines with 196, 384, and 512 GB of RAM, but they are accessed through queuing software. I don't expect that the software or its handshaking with the compute node is the problem; I am just stating the difference between the previous and current systems. The model runs and writes model output files on all of these machines, but it writes the restart file only on the one with 512 GB of RAM. I cannot explain the reason for this; I can only suspect that it has to do with the queuing software and how memory is allocated and used. For my part, all I need is to be able to run the model with all the necessary output files.

The PIO2 library appears to be installed just fine, and the cmake, make, and test steps do not give any error messages. The only error message that I got was:

-----
Errors while running CTest
make[3]: [CMakeFiles/check] Error 8 (ignored)
-----

I don't know what that means, but the build treats it as ignored and carries on. A Google search suggests that an Error 8 message associated with cmake is a junk (?) error message that should be removed from the build process. When I checked the relevant directory where CTest says something is wrong, I did not see anything wrong with the files, at least to my untrained eye.

In short, thank you all who have looked into this issue and contributed with their own suggestions to resolve this issue to the best of their ability and time.

Kemal.
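
Note for readers on Slurm systems: one possible explanation for the behaviour described above, which nobody in this thread confirmed, is that the job was given less memory than the node physically has, so the memory needed while writing the restart file pushed a rank over the limit. Slurm's accounting can show what a finished job requested and what it used. The sketch below wraps a sacct query. report_job_memory is a hypothetical helper name and JOBID is a placeholder; MaxRSS is only populated if the site has job accounting enabled.

```shell
#!/bin/sh
# Summarise the memory a finished Slurm job requested and used.
# report_job_memory is a hypothetical helper name; JOBID is a placeholder
# for a real job ID passed as the first argument to this script.
report_job_memory() {
    jobid=$1
    if ! command -v sacct >/dev/null 2>&1; then
        echo "sacct not found: run this on a machine with Slurm installed"
        return 0
    fi
    # ReqMem = memory requested, MaxRSS = peak resident memory per task.
    sacct -j "$jobid" --format=JobID,JobName,State,ExitCode,ReqMem,MaxRSS
}

report_job_memory "${1:-JOBID}"
```

If ReqMem turns out to be much smaller than the node's RAM, requesting more in the batch script would be one thing to try, for example with #SBATCH --mem=0, which in Slurm asks for all the memory on each node where the site configuration allows it.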

CarlP...@yahoo.com

Aug 3, 2018, 8:13:09 AM
to MPAS-Atmosphere Help
One question to the developers: can you give some specific error-messages when there are internal array-allocation failures, and failures in calls to the I/O libraries? It seems like MPAS-A just keeps running and then gives a segfault when things get too corrupted.

Kemal

Aug 6, 2018, 12:35:21 PM
to MPAS-Atmosphere Help
Hi all,

This is a follow-up on Carl's request for more I/O error information. I have run into the same issue several times myself. Namely, the model runs fine at first and then, at erratic times (sometimes a few hours into the run, sometimes after several days), it stops with a segmentation fault, with no CFL violation or any similar error. If we can improve the dump of messages by turning on some additional flags that we may not be aware of, please let us know. Turning on "config_print_detailed_minmax_vel" is very useful but not enough by itself. I assume that when the model stops because of these kinds of memory issues, some allocatable array dimensions need to be set larger to accommodate the larger arrays required for the calculation of, say, very deep convection that was not there before because no storm existed at an earlier time. If this is the case, then I can understand why the model blows up.

Kemal.
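
Note for readers chasing similar erratic segmentation faults: one thing worth ruling out, which this thread does not confirm as the cause, is a small stack limit in the shell that launches the model. Fortran programs built with the Intel compiler can place automatic arrays and temporaries on the stack by default, so a low limit can produce segmentation faults that look unrelated to the physics. The sketch below prints the limits the model would inherit and then tries to raise the stack limit. show_limits is a hypothetical helper name, and the assumption is a bash shell on a single node.

```shell
#!/bin/bash
# Print the resource limits that a program started from this shell inherits.
# show_limits is a hypothetical helper name used for illustration.
show_limits() {
    echo "stack size (kB): $(ulimit -s)"
    echo "virtual memory (kB): $(ulimit -v)"
}

show_limits

# Try to raise the soft stack limit for this shell and its children.
# This can fail if the hard limit set by the administrator is lower,
# so a failure is reported rather than treated as fatal.
ulimit -s unlimited 2>/dev/null || echo "could not raise stack limit"
```

Placing the ulimit line in the job script before the mpirun command is the usual way to apply it. On multi-node runs, whether the limit reaches the remote ranks depends on the MPI launcher and the site setup.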