Segmentation fault on cluster

Gemma

Apr 13, 2015, 1:07:52 PM
to migrate...@googlegroups.com
Dear all,

I am a relatively new user and would appreciate some help with diagnosing a problem with my analysis. 

I am running a dataset of 500 DNA sequence loci with 2 populations (for now, while I'm still exploring options) on a cluster with migrate 3.6.4. The run completed on my laptop under 3.6.4 and gave results; however, on the cluster it stopped with an error when it reached the 500th locus, so no outfile.pdf was generated. The infile and parmfile were identical for the two runs, apart from the paths specifying where the files should be created. The outfile, logfile, bayesfile etc. were all created in the folder the job was submitted from.

I ran the job across 5 nodes (I didn't specify how many processes per node, as I'm new to HPC and didn't realise I should) and saw the same run speed as when it was running single-threaded on my MacBook. So I have two problems: the job terminating with an error before completion, and no increase in speed from the parallel run.
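(For context, on a Torque/PBS cluster the node count and the processes per node are normally requested in the submission script itself. The sketch below is only an illustration with placeholder resource values and module name; it is not the attached 2pops_500loci_1.sh.)

#!/bin/bash
#PBS -N migrate_2pops
#PBS -l nodes=5:ppn=16         # 5 nodes, 16 MPI processes per node (placeholder values)
#PBS -l walltime=48:00:00

cd $PBS_O_WORKDIR              # run in the directory the job was submitted from
module load openmpi            # placeholder; load whatever MPI module the cluster provides

# Torque writes one line per allocated processor slot to $PBS_NODEFILE
mpirun -np $(wc -l < $PBS_NODEFILE) -hostfile $PBS_NODEFILE \
       migrate-n-mpi parmfile -nomenu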

I have attached my parmfile (I added .txt to allow me to upload it) and my job submission script. These are the contents of the error file from the cluster:

SEVERE ERROR: Segmentation fault
              this results in an non recoverable crash.
              But check the datatype and your infile for errors, too.
              Please report error with as much detail as possible to
              Peter Beerli <bee...@fsu.edu>

And the end of the run terminated like this:

04:00:14   Burn-in 99.0% complete
04:00:14   Sampling ...

04:00:21   [NODE:0, Locus: 500] 
           (prognosed end of run is 04:38 April 09 2015 [0.990099 done])

           Parameter     Acceptance Current      AutoCorr ESS
           ------------  ---------- ------------ -------- --------
           Theta_1           0.000      0.00153    0.908  4809.87
           Theta_2           0.000      0.00320    0.906  4949.29
           M_2->1            0.000    769.28647    0.525  31183.60
           M_1->2            0.000    886.30428    0.530  30676.64
           Genealogies       1.000   -171.54785    0.906  4920.56
Begin reading the bayesallfile back into the system
bayesallfile.gz opened
WARNING: above upper bound: 681469437.000000
WARNING: above upper bound: 789.810000
WARNING: above upper bound: 609275495252575.000000
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------

04:00:25   [NODE:0, Locus: 499] 
           (prognosed end of run is 04:52 April 09 2015 [0.986733 done])

           Parameter     Acceptance Current      AutoCorr ESS
           ------------  ---------- ------------ -------- --------
           Theta_1           0.000      0.00214    0.908  1452.05
           Theta_2           0.000      0.00103    0.905  1499.54
           M_2->1            0.000    656.00923    0.532  9159.87
           M_1->2            0.000    248.49182    0.531  9197.89
           Genealogies       1.000   -179.13129    0.902  1553.16
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[57848,1],1]
  Exit code:    11
--------------------------------------------------------------------------


Any help would be greatly appreciated, and let me know if you need more info about the cluster; I'll do my best!
Gemma

2pops_500loci_1.sh
parmfile.txt

Peter Beerli

Apr 13, 2015, 7:23:18 PM
to migrate...@googlegroups.com
Gemma,
how did you start your job?
Given what you say, I suspect that you ran the single-CPU version in an MPI framework: this will fail.
In particular, I would like to see the logfile fragment that was created in your failed run; on batch systems you usually get a logfile that captures the stdout.
For example, on my machine (the HPC at FSU) the first few lines look like this:

Reading parmfile "parmfile3_1t2a3"....
  =============================================
  MIGRATION RATE AND POPULATION SIZE ESTIMATION
  using Markov Chain Monte Carlo simulation
  =============================================
  Compiled for a PARALLEL COMPUTER ARCHITECTURE
  One master and 79 compute nodes are available.
  PDF output enabled [Letter-size]
  Version 3.6.4   [2177]
  Program started at   Sun Aug 24 11:33:12 2014


The outfile fragment unfortunately does not show the same header as the logfile.
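(One quick way to check which binary mpirun is actually launching, assuming a dynamically linked Linux build; a statically linked binary won't reveal its MPI library this way. Note that when mpirun is handed a serial binary, each rank simply runs its own independent copy, which is also why such a run is no faster than a single-CPU run.)

which migrate-n-mpi                         # is the MPI build the one found on your PATH?
ldd $(which migrate-n-mpi) | grep -i mpi    # a dynamically linked MPI build should list an MPI library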


Peter




Gemma

Apr 14, 2015, 5:54:30 AM
to migrate...@googlegroups.com
Dear Peter,

Thanks for getting back to me so fast! I have put the stdout file and the file named "logfile" in this folder, as the files were large:


I started the run with:

mpirun $MPI_HOSTS migrate-n-mpi parmfile -nomenu


Many thanks for your help,
Gemma


Gemma

Apr 16, 2015, 6:50:07 AM
to migrate...@googlegroups.com
Hi again,

Looking at the logfile from the stdout of my run, it does not begin with:

Compiled for a PARALLEL COMPUTER ARCHITECTURE

Instead it just says:

MPI_HOSTS: -np 5 -hostfile /var/spool/torque/aux//305565.headnode1.arcus.osc.local
Reading parmfile "parmfile"....
Reading parmfile "parmfile"....
  =============================================
  MIGRATION RATE AND POPULATION SIZE ESTIMATION
  using Markov Chain Monte Carlo simulation
  =============================================
  PDF output enabled [A4-size]
  Version 3.6.4 
Reading parmfile "parmfile"....
  =============================================
  MIGRATION RATE AND POPULATION SIZE ESTIMATION
  using Markov Chain Monte Carlo simulation
  =============================================
  PDF output enabled [A4-size]
  Version 3.6.4 
  Program started at   Mon Apr  6 11:35:39 2015


  Program started at   Mon Apr  6 11:35:39 2015


Reading parmfile "parmfile"....


So does that mean that it's been compiled incorrectly on the cluster?

Many thanks for any help,
Gemma


Peter Beerli

Apr 16, 2015, 9:00:24 AM
to migrate...@googlegroups.com
Yes! Or you ran the wrong binary.
On the cluster you should do this:

./configure
make
# this will generate the single-CPU binary
# now do this
make clean
make mpis
# this will generate a clean MPI binary which you can call in your batch file

mpirun -np #nodes migrate-n-mpi parmfile -nomenu

On some systems the -np #nodes is not needed because the number of nodes is already defined in the batch script.
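(A minimal sketch of the corresponding batch-script call, assuming the cluster's Open MPI was built with Torque support so that mpirun reads the allocation itself; the binary path is a placeholder.)

cd $PBS_O_WORKDIR
mpirun ./migrate-n-mpi parmfile -nomenu    # -np and -hostfile are taken from the Torque allocation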

Peter



Gemma

Apr 17, 2015, 10:50:47 AM
to migrate...@googlegroups.com
Thanks Peter, compiling in the way you suggested has fixed the problem. 

Best wishes,
Gemma


Ashley Murphy

Apr 26, 2017, 8:58:45 AM
to migrate-support

Hi Gemma, Peter and MIGRATE users,

I am a new user of MIGRATE and was struggling to get the program to run properly on my university cluster (although it ran successfully on my PC): it would always stop with the “SEVERE ERROR: Segmentation fault” message, just as it did for Gemma here.

After searching this Google group I found this thread, and I am very happy to report that recompiling as suggested solved the problem for me too!

So thanks Gemma and Peter! :)

Cheers,

Ash.
