MrBayes Restart Problems

klp...@columbia.edu

Feb 27, 2017, 3:53:32 PM
to CIPRES Science Gateway Users
Good afternoon,

I've been trying to run MrBayes Restart. This is my third restart of this run, and I cannot get it to work. It keeps giving me this error:

intel/2013.1.117(49):ERROR:105: Unable to locate a modulefile for 'mkl'
[gcn-7-65.sdsc.edu:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 8. MPI process died?
[gcn-7-65.sdsc.edu:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
[gcn-7-65.sdsc.edu:mpispawn_0][child_handler] MPI process (rank: 4, pid: 51514) terminated with signal 9 -> abort job
[gcn-7-65.sdsc.edu:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node gcn-7-65 aborted: MPI process error (1)

I checked this forum for suggestions and stopped running Beagle. Here is my MrBayes block:

begin mrbayes;
set precision = 15;
set usebeagle=no;
set scientific=Yes;
set autoclose=yes;
set nowarnings=yes;
execute infile.nex;
lset coding = all;
lset Nst= 6;
lset Nucmodel= 4by4;
lset Rates= gamma;
lset Nbetacat= 5;
prset pinvarpr = uniform(0.0,1.0);
prset ratepr = fixed;
prset statefreqpr = dirichlet(1.0);
prset revmatpr = dirichlet(1.0,1.0,1.0,1.0,1.0,1.0);
prset ratecorrpr = uniform(-1.0,1.0);
prset covswitchpr = uniform(0.0,100.0);
prset Tratiopr = beta(1.0, 1.0);
prset brlenspr = unconstrained:exponential(10.0);
report Siterates=No;
report revmat=dirichlet;
mcmc append=yes ngen=12000000 nruns=2 nchains=4 temp=0.200 swapfreq=1 nswaps=1 samplefreq=1000 mcmcdiagn=Yes minpartfreq=0.1 allchains=Yes relburnin=Yes burnin=0 burninfrac=0.25 stoprule=Yes starttree=random stopval=0.01 Savebrlens=Yes nperts=0 Ordertaxa=Yes;
sump burnin=10 relburnin=Yes burninfrac=0.25 nruns=2 outputname=sumpoutput.out;
sumt burnin=10 relburnin=Yes burninfrac=0.25 nruns=2 ntrees=1 minpartfreq=0.05 contype=Halfcompat conformat=Figtree;
quit
end;


And lastly here is my _JobInfo.txt:

Task\ label=Parrot Bayesian Run 4.3 (nobeag)_2
Task\ ID=1115576
Tool=MRBAYES_321RESTARTBETA
created\ on=2017-02-26 15:30:26.0
JobHandle=NGBW-JOB-MRBAYES_321RESTARTBETA-174EF6FA5B1048B481E9357F87917FEF
resource=gordon
User\ ID=94385
User\ Name=kprovost
email=kpro...@amnh.org

Output=(ALL_FILES,*,UNKNOWN,UNKNOWN,UNKNOWN)
ChargeFactor=1.000000
cores=8
JOBID=2876537.gordon-fe2.local

Any help would be appreciated!

Kaiya

Mark Miller

Feb 27, 2017, 4:55:05 PM
to CIPRES Science Gateway Users
Hey Kaiya,

This is just a guess, but it is dying suddenly here, without an error message:

 

      Setting default partition (does not divide up characters)
      Setting model default
      Seed (for generating default start values) = 539544633
      Setting output file names to "infile.nex.run<i>.<p|t>"
   Exiting data block
   Reading mrbayes block
      Setting Precision to 15
      Setting usebeagle to no

 

Here is your command block (is this the same command block you used to create the first run?).

 

begin mrbayes;
set precision = 15;
set usebeagle=no;
set scientific=Yes;

 

I would try changing the order, so that set scientific=Yes; comes before set precision = 15;

Or just strike set scientific=Yes; from the file, because “Yes” is the default.


If that doesn't help, please let me know.


Mark

klp...@columbia.edu

Feb 27, 2017, 9:12:49 PM
to CIPRES Science Gateway Users
Hi Mark,

I made the changes you suggested and ran the files again (for only 0.5 hours), but it did not seem to help, although this time I got a different error:

[gcn-18-46.sdsc.edu:mpirun_rsh][signal_processor] Caught signal 15, killing job
[gcn-18-46.sdsc.edu:mpispawn_0][report_error] connect() failed: Connection refused (111)

Here is the jobinfo.txt for the new file:

Task\ label=Parrot Bayesian Run 4.3 (nb TEST)
Task\ ID=1116477
Tool=MRBAYES_321RESTARTBETA
created\ on=2017-02-27 17:25:32.0
JobHandle=NGBW-JOB-MRBAYES_321RESTARTBETA-D99D9732D58F4B3698DE4FAF4168C099
resource=gordon
User\ ID=94385
User\ Name=kprovost
email=kpro...@amnh.org

Output=(ALL_FILES,*,UNKNOWN,UNKNOWN,UNKNOWN)
ChargeFactor=1.000000
cores=8
JOBID=2877395.gordon-fe2.local


Mark Miller

Mar 2, 2017, 12:46:45 PM
to CIPRES Science Gateway Users
Hey Kaiya,

Sorry for my slow response, I was out yesterday.
This time the job was running fine, but the max time limit was hit before the job completed.
The way you tell is to look at the scheduler_stderr.txt file. It has this message:

 

=>> PBS: job killed: walltime 1837 exceeded limit 1800

kill -8786: No such process


So just clone that job, set a longer max run time, submit and it should work fine.
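
The check described above (looking in scheduler_stderr.txt for the PBS walltime message) can be sketched in Python. This is only a minimal sketch: the function name `find_walltime_kill` is hypothetical, the pattern is based solely on the one message quoted in this thread, and the values are assumed to be seconds because the limit of 1800 matches the 0.5-hour run time.

```python
import re
from pathlib import Path

# Pattern modeled on the single message quoted in this thread:
#   =>> PBS: job killed: walltime 1837 exceeded limit 1800
# Other schedulers or PBS versions may word it differently (assumption).
WALLTIME_RE = re.compile(r"PBS: job killed: walltime (\d+) exceeded limit (\d+)")


def find_walltime_kill(text):
    """Return (used, limit) as integers if the walltime message is present, else None."""
    match = WALLTIME_RE.search(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    # File name as mentioned in the thread; run this in the job's output folder.
    path = Path("scheduler_stderr.txt")
    if path.exists():
        result = find_walltime_kill(path.read_text(errors="replace"))
        if result is None:
            print("No walltime message found; the job died for another reason.")
        else:
            used, limit = result
            print(f"Job hit the time limit: used {used}, limit {limit}.")
```

If the message is found, the job itself was probably healthy and only needs a longer max run time; if it is not found, the cause lies elsewhere.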

Let me know if you have further issues.


Best,

Mark





klp...@columbia.edu

Mar 5, 2017, 1:41:16 PM
to CIPRES Science Gateway Users
Hello Mark,

As you requested, I resubmitted the changed job to let it run for the full time possible (168 hours), but it is still giving me the modulefile error:

intel/2013.1.117(49):ERROR:105: Unable to locate a modulefile for 'mkl'
[gcn-18-31.sdsc.edu:mpispawn_0][child_handler] MPI process (rank: 1, pid: 79672) terminated with signal 9 -> abort job
[gcn-18-31.sdsc.edu:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 10. MPI process died?
[gcn-18-31.sdsc.edu:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
[gcn-18-31.sdsc.edu:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node gcn-18-31 aborted: Error while reading a PMI socket (4)

Job info once again:

Task\ label=Parrot Bayesian Run 4.3 (nb TEST)_2
Task\ ID=1119083
Tool=MRBAYES_321RESTARTBETA
created\ on=2017-03-02 15:26:23.0
JobHandle=NGBW-JOB-MRBAYES_321RESTARTBETA-36833136526942F88249647B9B33AABC
resource=gordon
User\ ID=94385
User\ Name=kprovost
email=kpro...@amnh.org

Output=(ALL_FILES,*,UNKNOWN,UNKNOWN,UNKNOWN)
ChargeFactor=1.000000
cores=8
JOBID=2880232.gordon-fe2.local

I am a bit stumped! Do you have any other suggestions?

Kaiya

Mark Miller

Mar 6, 2017, 12:29:17 AM
to CIPRES Science Gateway Users
Hey Kaiya,

Something funny is going on here: the input file is being read over and over again, but nothing is happening.
I do not understand it. Try taking the

execute infile.nex;

statement out of your input file.

Also, I believe this will cause problems:

sumt burnin=10 relburnin=Yes burninfrac=0.25 nruns=2 ntrees=1 minpartfreq=0.05 contype=Halfcompat conformat=Figtree;
quit
end;


I don't think the quit statement belongs there.
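
For anyone who wants to apply the two edits above to a command block automatically, here is a rough Python sketch. The helper name `strip_execute_and_quit` is made up for illustration, and it assumes each statement sits on its own line, as in the block posted in this thread; it is not a general NEXUS parser.

```python
import re


def strip_execute_and_quit(block, data_file="infile.nex"):
    """Drop the 'execute <data_file>;' line and any bare 'quit' line from a MrBayes block.

    Assumes one statement per line (as in the block posted in this thread).
    """
    execute_re = re.compile(r"execute\s+" + re.escape(data_file.lower()) + r"\s*;")
    quit_re = re.compile(r"quit\s*;?")
    kept = []
    for line in block.splitlines():
        stripped = line.strip().lower()
        if execute_re.fullmatch(stripped):
            continue  # the statement that re-reads the input file
        if quit_re.fullmatch(stripped):
            continue  # the quit statement before 'end;'
        kept.append(line)
    return "\n".join(kept)


if __name__ == "__main__":
    example = "\n".join([
        "begin mrbayes;",
        "set autoclose=yes;",
        "execute infile.nex;",
        "sumt burnin=10;",
        "quit",
        "end;",
    ])
    print(strip_execute_and_quit(example))
```

All other lines, including the mcmc, sump, and sumt commands, pass through unchanged.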


Best,

mark


klp...@columbia.edu

Mar 14, 2017, 1:53:46 PM
to CIPRES Science Gateway Users
Hi Mark,

It is working great now! Thank you for all the help!

Cheers,

Kaiya