CMS on an HPC issues


bexeross

Nov 15, 2017, 11:22:48 AM
to connectivity-modeling-system-club
Hi All,

Just a note to say I've been having some issues with the CMS on my HPC (high performance computing cluster), and I wanted to post an overview here in case others are having similar issues, or have overcome them and can offer advice.

The last 30 attempted full-size CMS jobs I have run have resulted in:
  •  1 success (not the latest run!)
  •  3 runs that completed but left all or some of the traj/con files in the SCRATCH directory [sometimes completely empty (0 bytes)]
  •  26 runs that failed with one of the following errors:

...various WARNINGs from CMS that don't worry me, repeated by each processor (16 repeats in the test shown here, only 2 shown below), followed by an MPI error whose process name and exit code change from run to run (e.g. below):


WARNING: The fill value you have entered in nest_x.nml is not the one used by cms (1.2676506E30), is that what you intend?
WARNING: The fill value you have entered in nest_x.nml is not the one used by cms (1.2676506E30), is that what you intend?
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[17771,1],4]
  Exit code:    2

...or a complete failure (sometimes terminating quickly, and one time hanging for 3 days) saying:

  compute-1-10.26319ipath_userinit: assign_context command failed: Network is down
--------------------------------------------------------------------------
PSM was unable to open an endpoint. Please make sure that the network link is
active on the node and the hardware is functioning.

  Error: Could not detect network connectivity
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
--> Returned "Error" (-1) instead of "Success" (0)

My HPC guys have said they think this is down to some issue in the CMS itself, but they don't really have a clue. As a test I have kept the job scripts and inputs identical between attempted runs, and they still come out differently (one job script and input file set has produced the single success as well as both of the errors reported above).

I had previously interpreted the first error above as a sign that the CMS is using too much memory, but the fact that the same set-up has succeeded on another occasion may mean this interpretation is incorrect?

It may be that this is mostly a hardware issue; I am having them transfer me to the second cluster, which may help.
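
One thing I am going to try before accepting that the CMS is at fault is a trivial MPI job that does not touch the CMS at all, to see whether the PSM/MPI_INIT failures show up on their own. A minimal sketch is below (the SLURM directives and module name are placeholders for whatever your scheduler and MPI module are actually called); if even this fails, the problem is the cluster's MPI stack or interconnect rather than the CMS:

#!/bin/bash
#SBATCH --job-name=mpi_sanity
#SBATCH --ntasks=16
#SBATCH --time=00:05:00
# Minimal MPI sanity check that never touches the CMS. If this also fails
# with PSM / "Network is down" / MPI_INIT errors, the problem is the
# cluster's MPI stack or interconnect, not the CMS.

module load openmpi   # placeholder: whichever MPI module the CMS was built with

# Tiny MPI "hello" so that MPI_Init really has to open the interconnect endpoints.
cat > hello_mpi.c <<'EOF'
#include <mpi.h>
#include <stdio.h>
int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);                /* this is where PSM endpoints get opened */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("rank %d of %d ok\n", rank, size);
    MPI_Finalize();
    return 0;
}
EOF

mpicc hello_mpi.c -o hello_mpi
mpirun -np 16 ./hello_mpi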

In the meantime I wondered if anyone else has been having issues: either with similar MPI errors, or with the output files being left in the SCRATCH folder when the run says it has completed?
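
For the SCRATCH problem, my workaround for now (just a sketch; the experiment directory name and file patterns below are from my own setup, so adjust them to yours) is to move any non-empty traj/con files out of SCRATCH by hand after a run that claims to have finished:

# Rescue non-empty traj_*/con_* files that a "completed" run left in SCRATCH.
# EXPT is a hypothetical experiment directory name - use your own layout.
EXPT=expt_strata_test
cd "$EXPT"
# Move only the files that actually contain data (skip the 0-byte ones).
find SCRATCH -maxdepth 1 -type f \( -name 'traj_*' -o -name 'con_*' \) -size +0c \
     -exec mv -v {} . \;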

I am getting very frustrated battling this multitude of problems, and I have a lot of runs to get done in the next couple of months (I am working to a deadline in February that seemed tight even without wasting weeks on these issues!), so any help is gratefully appreciated.

Things I have tried:

  • smaller release files
  • more cores
  • fewer cores
  • newest CMS version
  • traj-off
  • rewriting jobfiles
  • remaking release files
  • recompiling CMS
  • a complete reinstall of CMS
  • asking my HPC guys to check the hardware/MPI installation
  • polygon file remaking
For info, at the moment I am trying to run only 1 year of daily nestfiles, with a 12 MB release file (of around 258,000 lines, each releasing 100 particles) and strata polygons. I am allowed to use up to 80 processors at present (and have tried lots of variations).
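
(For the "smaller release files" test in the list above, I have just been chopping the big file up by line count, roughly as below - this assumes, as in my file, that every line is an independent release:)

# Split a large release file into smaller ones by line count.
# 26000-line chunks give roughly 10 files from a ~258,000-line release file.
split -l 26000 -d releaseFile releaseFile_part_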

All the best,
Bex

Felipe Torquato

May 9, 2019, 6:51:11 AM
to connectivity-modeling-system-club
Hi Bex,

do you remember how you fixed it?

Best,
Felipe

bexeross

May 9, 2019, 7:14:33 AM
to connectivity-modeling-system-club
Hi Felipe,

I am sorry to say my fix was being moved to their new cluster! I just reinstalled everything from scratch on a new system and had it working again. 

It is interesting that you are having similar problems now, though. I presume you are not at the University of Plymouth (so not using the same cluster as I was). Can you tell us a little about your issues, and which versions of CMS and its dependencies you are using?

If you can't figure it out, I would advocate reinstalling everything with the latest dependencies (if possible) - but that is probably not what you were hoping to hear, sorry! If it helps, you can try using and modifying this shell script to semi-automate the reinstallation: https://groups.google.com/d/msg/connectivity-modeling-system-club/67kpxJTIHHo/6GruB75dAAAJ
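
The general shape of a from-scratch rebuild on a module-based cluster is something like the sketch below, but treat every name in it as a placeholder (module names, archive name and the exact CMS build step all vary between systems and CMS versions - check the linked script and your own system's documentation for the real details):

#!/bin/bash
# Rough outline only: rebuild CMS and its environment from scratch.
# All module and file names here are placeholders/assumptions.
set -e

module purge
module load gcc openmpi netcdf-fortran   # whatever your cluster calls these

# Unpack a fresh copy of the CMS source (archive name is just an example).
tar xzf cms-2.0.tar.gz
cd cms-*/src

# Point the build at the right NetCDF/MPI first (edit the Makefile or export
# the variables it expects), then rebuild cleanly.
make clean
make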

All the best,
Bex

Felipe Torquato

May 9, 2019, 7:44:08 AM
to connectivity-modeling-system-club
Hi Bex,

thanks so much for the quick reply.

I've been running CMS v2.0 on the University of Copenhagen server since July 2017. Only now am I having this kind of issue.

This is the error that I am getting:

--------------------------------------------------------------------------
[[47159,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: node925

Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------



Like you, I find the vast majority of jobs stop before ending; only one or two of them have finished successfully.
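
From what I can tell, that particular message is only a warning from Open MPI that it could not use the InfiniBand (openib) transport on node925 and fell back to a slower one, so it may not be what is actually killing the jobs. If those nodes really have no working InfiniBand, one way to silence the warning (assuming Open MPI's mpirun; the CMS command at the end is only an example invocation) is to exclude that transport explicitly:

# Tell Open MPI not to attempt the openib (InfiniBand) transport at all;
# TCP/shared memory will be used instead. Only sensible if the nodes
# genuinely have no working InfiniBand.
mpirun --mca btl ^openib -np 16 ./cms expt_example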

I'll try to reinstall CMS.

Thanks.
Felipe 