Segmentation fault on cluster

Gemma

Apr 13, 2015, 1:07:52 PM
to migrate...@googlegroups.com
Dear all,

I am a relatively new user and would appreciate some help with diagnosing a problem with my analysis. 

I am running a dataset of 500 DNA sequence loci with 2 populations (for now, while I'm still exploring options) on a cluster with migrate 3.6.4. The run completed on my laptop under 3.6.4 and gave results; however, on the cluster it stopped with an error when it reached the 500th locus, so no outfile.pdf was generated. The infile and parmfile were identical for the two runs, apart from the paths specifying where the files should be created. The outfile, logfile, bayesfile etc. were all created in the folder the job was submitted from.

I ran the job across 5 nodes (I didn't specify how many processes per node, as I'm new to HPC and didn't realise I should) and saw the same run speed as when it was running single-threaded on my MacBook. So I have two problems: the job terminating with an error before completion, and no increase in speed from the parallel run.
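(For context, on a Torque/PBS cluster the node count and the processes per node are normally requested in the submission script itself. The sketch below is only an illustration with placeholder resource values and module name; it is not the attached 2pops_500loci_1.sh.)

#!/bin/bash
#PBS -N migrate_2pops
#PBS -l nodes=5:ppn=16         # 5 nodes, 16 MPI processes per node (placeholder values)
#PBS -l walltime=48:00:00

cd $PBS_O_WORKDIR              # run in the directory the job was submitted from
module load openmpi            # placeholder; load whatever MPI module the cluster provides

# Torque writes one line per allocated processor slot to $PBS_NODEFILE
mpirun -np $(wc -l < $PBS_NODEFILE) -hostfile $PBS_NODEFILE \
       migrate-n-mpi parmfile -nomenu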

I have attached my parmfile (I added .txt to allow me to upload it) and my job submission script. These are the contents of the error file from the cluster:

SEVERE ERROR: Segmentation fault
              this results in an non recoverable crash.
              But check the datatype and your infile for errors, too.
              Please report error with as much detail as possible to
              Peter Beerli <bee...@fsu.edu>

And the end of the run terminated like this:

04:00:14   Burn-in 99.0% complete
04:00:14   Sampling ...

04:00:21   [NODE:0, Locus: 500] 
           (prognosed end of run is 04:38 April 09 2015 [0.990099 done])

           Parameter     Acceptance Current      AutoCorr ESS
           ------------  ---------- ------------ -------- --------
           Theta_1           0.000      0.00153    0.908  4809.87
           Theta_2           0.000      0.00320    0.906  4949.29
           M_2->1            0.000    769.28647    0.525  31183.60
           M_1->2            0.000    886.30428    0.530  30676.64
           Genealogies       1.000   -171.54785    0.906  4920.56
Begin reading the bayesallfile back into the system
bayesallfile.gz opened
WARNING: above upper bound: 681469437.000000
WARNING: above upper bound: 789.810000
WARNING: above upper bound: 609275495252575.000000
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------

04:00:25   [NODE:0, Locus: 499] 
           (prognosed end of run is 04:52 April 09 2015 [0.986733 done])

           Parameter     Acceptance Current      AutoCorr ESS
           ------------  ---------- ------------ -------- --------
           Theta_1           0.000      0.00214    0.908  1452.05
           Theta_2           0.000      0.00103    0.905  1499.54
           M_2->1            0.000    656.00923    0.532  9159.87
           M_1->2            0.000    248.49182    0.531  9197.89
           Genealogies       1.000   -179.13129    0.902  1553.16
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[57848,1],1]
  Exit code:    11
--------------------------------------------------------------------------


Any help would be greatly appreciated, and let me know if you need more info about the cluster; I'll do my best!
Gemma

2pops_500loci_1.sh
parmfile.txt

Peter Beerli

Apr 13, 2015, 7:23:18 PM
to migrate...@googlegroups.com
Gemma,
how did you start your job?
Given what you say, I suspect that you ran the single-CPU version in an MPI framework: this will fail.
In particular, I would like to see the logfile fragment that was created in your failed run; on batch systems you usually get a logfile that captures the stdout.
For example, on my machine (the HPC at FSU) the first few lines look like this:

Reading parmfile "parmfile3_1t2a3"....
  =============================================
  MIGRATION RATE AND POPULATION SIZE ESTIMATION
  using Markov Chain Monte Carlo simulation
  =============================================
  Compiled for a PARALLEL COMPUTER ARCHITECTURE
  One master and 79 compute nodes are available.
  PDF output enabled [Letter-size]
  Version 3.6.4   [2177]
  Program started at   Sun Aug 24 11:33:12 2014


The outfile fragment unfortunately does not show the same header as the logfile.
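(One quick way to check which binary mpirun is actually launching, assuming a dynamically linked Linux build; a statically linked binary won't reveal its MPI library this way. Note that when mpirun is handed a serial binary, each rank simply runs its own independent copy, which is also why such a run is no faster than a single-CPU run.)

which migrate-n-mpi                         # is the MPI build the one found on your PATH?
ldd $(which migrate-n-mpi) | grep -i mpi    # a dynamically linked MPI build should list an MPI library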


Peter




Gemma

Apr 14, 2015, 5:54:30 AM
to migrate...@googlegroups.com
Dear Peter,

Thanks for getting back to me so fast! I have put the stdout file and the file named "logfile" in this folder, as the files were large:


I started the run with:

mpirun $MPI_HOSTS migrate-n-mpi parmfile -nomenu


Many thanks for your help,
Gemma


Gemma

Apr 16, 2015, 6:50:07 AM
to migrate...@googlegroups.com
Hi again,

Looking at the logfile from the stdout of my run, it does not begin with:

Compiled for a PARALLEL COMPUTER ARCHITECTURE

Instead it just says:

MPI_HOSTS: -np 5 -hostfile /var/spool/torque/aux//305565.headnode1.arcus.osc.local
Reading parmfile "parmfile"....
Reading parmfile "parmfile"....
  =============================================
  MIGRATION RATE AND POPULATION SIZE ESTIMATION
  using Markov Chain Monte Carlo simulation
  =============================================
  PDF output enabled [A4-size]
  Version 3.6.4 
Reading parmfile "parmfile"....
  =============================================
  MIGRATION RATE AND POPULATION SIZE ESTIMATION
  using Markov Chain Monte Carlo simulation
  =============================================
  PDF output enabled [A4-size]
  Version 3.6.4 
  Program started at   Mon Apr  6 11:35:39 2015


  Program started at   Mon Apr  6 11:35:39 2015


Reading parmfile "parmfile"....


So does that mean that it's been compiled incorrectly on the cluster?

Many thanks for any help,
Gemma


Peter Beerli

Apr 16, 2015, 9:00:24 AM
to migrate...@googlegroups.com
Yes! Or you ran the wrong binary.
On the cluster you should do this:

./configure
make
# this will generate the single-CPU binary
# now do this
make clean
make mpis
# this will generate a clean MPI binary which you can call in your batch file

mpirun -np #nodes migrate-n-mpi parmfile -nomenu

On some systems the -np #nodes is not needed because the number of nodes is already defined in the batch script.
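(A minimal sketch of the corresponding batch-script call, assuming the cluster's Open MPI was built with Torque support so that mpirun reads the allocation itself; the binary path is a placeholder.)

cd $PBS_O_WORKDIR
mpirun ./migrate-n-mpi parmfile -nomenu    # -np and -hostfile are taken from the Torque allocation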

Peter



Gemma

Apr 17, 2015, 10:50:47 AM
to migrate...@googlegroups.com
Thanks Peter, compiling in the way you suggested has fixed the problem. 

Best wishes,
Gemma


Ashley Murphy

Apr 26, 2017, 8:58:45 AM
to migrate-support

Hi Gemma, Peter and MIGRATE users,

I am a new user of MIGRATE and was struggling to get the program to run properly on my university cluster (although it ran successfully on my PC): it would always stop with the “SEVERE ERROR: Segmentation fault” message, just as it did for Gemma here.

After searching this Google group I found this thread, and I am very happy to report that recompiling as suggested solved the problem for me too!

So thanks Gemma and Peter! :)

Cheers,

Ash.
