mpirun error

Spyros Theodoridis

Jul 8, 2014, 9:41:55 AM
to exab...@googlegroups.com
Hi,

I am trying to run exabayes on a virtual machine in the cloud. The VM runs Ubuntu with 16 CPUs and 64 GB of RAM. My dataset consists of 157 tips (individuals), and the alignment is 180540 bp long with 26545 distinct alignment patterns. Parameter estimation is conducted for 2006 partitions (concatenated 90 bp Illumina reads).

When I execute a dry run everything seems to work fine. Below is the command used and the log produced during the dry run:

The program was called as follows:

mpirun -np 16 ./exabayes -f Concat_60_1.phy -q DNAPartition_2006.txt -n test -s $RANDOM -c config.nex -R 2 -C 4 -S -d
.
.
.

Will execute 2 runs in parallel.
Will execute 4 chains in parallel.

initialized diagnostics file 
initialized file ExaBayes_topologies.test.0
initialized file ExaBayes_parameters.test.0
initialized file ExaBayes_topologies.test.1
initialized file ExaBayes_parameters.test.1
Command line, input data and config file is okay. Exiting gracefully.
Command line, input data and config file is okay. Exiting gracefully.
Command line, input data and config file is okay. Exiting gracefully.
Command line, input data and config file is okay. Exiting gracefully.
Command line, input data and config file is okay. Exiting gracefully.
Command line, input data and config file is okay. Exiting gracefully.
Command line, input data and config file is okay. Exiting gracefully.
Command line, input data and config file is okay. Exiting gracefully.
Command line, input data and config file is okay. Exiting gracefully.
Command line, input data and config file is okay. Exiting gracefully.
Command line, input data and config file is okay. Exiting gracefully.

initial state: 
================================================================
[run=0,heat=0,gen=0] Lnl: -4357936.99 LnPr: 2974.66 RNG(key={515655797,3572140441},ctr={0,0})
[run=0,heat=1,gen=0] Lnl: -4364657.69 LnPr: 2974.66 RNG(key={1548868209,620774698},ctr={0,0})
[run=0,heat=2,gen=0] Lnl: -4365141.33 LnPr: 2974.66 RNG(key={965252863,3735499560},ctr={0,0})
[run=0,heat=3,gen=0] Lnl: -4354019.28 LnPr: 2974.66 RNG(key={2090256741,1388714654},ctr={0,0})
================================================================
[run=1,heat=0,gen=0] Lnl: -4362631.50 LnPr: 2974.66 RNG(key={1591443081,2980482111},ctr={0,0})
[run=1,heat=1,gen=0] Lnl: -4376482.80 LnPr: 2974.66 RNG(key={1022808228,3130032066},ctr={0,0})
[run=1,heat=2,gen=0] Lnl: -4363703.96 LnPr: 2974.66 RNG(key={1155358002,4165128489},ctr={0,0})
[run=1,heat=3,gen=0] Lnl: -4367416.40 LnPr: 2974.66 RNG(key={2544924441,1578712497},ctr={0,0})
================================================================

load distribution (rank,coords,#numParts,#numPatterns,chainsPerRun):
[ 0 ] [0,0,0] 1004 13284 (0,0)
[ 1 ] [0,0,1] 1003 13283 (0,0)
[ 2 ] [0,1,0] 1004 13284 (0,1)
[ 3 ] [0,1,1] 1003 13283 (0,1)
[ 4 ] [0,2,0] 1004 13284 (0,2)
[ 5 ] [0,2,1] 1003 13283 (0,2)
[ 6 ] [0,3,0] 1004 13284 (0,3)
[ 7 ] [0,3,1] 1003 13283 (0,3)
[ 8 ] [1,0,0] 1004 13284 (1,0)
[ 9 ] [1,0,1] 1003 13283 (1,0)
[ 10 ] [1,1,0] 1004 13284 (1,1)
[ 11 ] [1,1,1] 1003 13283 (1,1)
[ 12 ] [1,2,0] 1004 13284 (1,2)
[ 13 ] [1,2,1] 1003 13283 (1,2)
[ 14 ] [1,3,0] 1004 13284 (1,3)
[ 15 ] [1,3,1] 1003 13283 (1,3)

Command line, input data and config file is okay. Exiting gracefully.
Command line, input data and config file is okay. Exiting gracefully.
Command line, input data and config file is okay. Exiting gracefully.
Command line, input data and config file is okay. Exiting gracefully.
Command line, input data and config file is okay. Exiting gracefully.



However, when I execute the actual run (without the -d option), I get the following error right after the files are initialised:

initialized diagnostics file ExaBayes_diagnostics.test
initialized file ExaBayes_topologies.test.0
initialized file ExaBayes_parameters.test.0
initialized file ExaBayes_topologies.test.1
initialized file ExaBayes_parameters.test.1
[spy-raxml:29589] *** Process received signal ***
[spy-raxml:29589] Signal: Segmentation fault (11)
[spy-raxml:29589] Signal code: Address not mapped (1)
[spy-raxml:29589] Failing at address: (nil)
[spy-raxml:29589] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10340) [0x7f2a4a926340]
[spy-raxml:29589] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x92ad1) [0x7f2a4a5e2ad1]
[spy-raxml:29589] [ 2] /lib/x86_64-linux-gnu/libc.so.6(+0x8d355) [0x7f2a4a5dd355]
[spy-raxml:29589] [ 3] ./exabayes() [0x56c8ff]
[spy-raxml:29589] [ 4] ./exabayes() [0x532704]
[spy-raxml:29589] [ 5] ./exabayes() [0x436ea8]
[spy-raxml:29589] [ 6] ./exabayes() [0x43fc7a]
[spy-raxml:29589] [ 7] ./exabayes() [0x441267]
[spy-raxml:29589] [ 8] ./exabayes() [0x44f5a2]
[spy-raxml:29589] [ 9] ./exabayes() [0x4114ba]
[spy-raxml:29589] [10] ./exabayes() [0x411708]
[spy-raxml:29589] [11] ./exabayes() [0x40f467]
[spy-raxml:29589] [12] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f2a4a571ec5]
[spy-raxml:29589] [13] ./exabayes() [0x4107df]
[spy-raxml:29589] *** End of error message ***
[spy-raxml:29590] *** Process received signal ***
[spy-raxml:29590] Signal: Segmentation fault (11)
[spy-raxml:29590] Signal code: Address not mapped (1)
[spy-raxml:29590] Failing at address: (nil)
[spy-raxml:29590] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10340) [0x7fbfcb865340]
[spy-raxml:29590] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x92ad1) [0x7fbfcb521ad1]
[spy-raxml:29590] [ 2] /lib/x86_64-linux-gnu/libc.so.6(+0x8d355) [0x7fbfcb51c355]
[spy-raxml:29590] [ 3] ./exabayes() [0x56c8ff]
[spy-raxml:29590] [ 4] ./exabayes() [0x532704]
[spy-raxml:29590] [ 5] ./exabayes() [0x436ea8]
[spy-raxml:29590] [ 6] ./exabayes() [0x43fc7a]
[spy-raxml:29590] [ 7] ./exabayes() [0x441267]
[spy-raxml:29590] [ 8] ./exabayes() [0x44f5a2]
[spy-raxml:29590] [ 9] ./exabayes() [0x4114ba]
[spy-raxml:29590] [10] ./exabayes() [0x411708]
[spy-raxml:29590] [11] ./exabayes() [0x40f467]
[spy-raxml:29590] [12] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7fbfcb4b0ec5]
[spy-raxml:29590] [13] ./exabayes() [0x4107df]
[spy-raxml:29590] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 14 with PID 29589 on node spy-raxml exited on signal 11 (Segmentation fault).

The same error occurs regardless of the parallelisation combinations (-R and -C).

For example when I run:
mpirun -np 16 ./exabayes -f Concat_60_1.phy -q DNAPartition_2006.txt -n test -s $RANDOM -c config.nex -R 2 -C 2 -S

I get:
mpirun noticed that process rank 7 with PID 29914 on node spy-raxml exited on signal 11 (Segmentation fault).


Any ideas on how to solve that?

Thanks!

Spyros

Andre J. Aberer

Jul 8, 2014, 10:24:29 AM
to exab...@googlegroups.com
Hi Spyros,

thanks for the detailed problem description.

Phew, this looks like an exabayes bug, but unfortunately I do not know the
answer right away. Let's go through it:

1. For a brief time, a build of version 1.3 was online in which the
-S option was broken. In the current version it should work. The
version 1.3 that is online right now prints:

> This software has been released in 2014-05-06 by

Could you verify this? Does the run work without -S?

2. Anything suspicious with the alignment? Those are quite a few
partitions...does an unpartitioned run work?

--
Best regards,
Andre Aberer

PreDoc (Bioinformatics) in the Exelixis Lab, Heidelberg Institute for Theoretical Studies

Spyros Theodoridis

Jul 8, 2014, 11:37:55 AM
to exab...@googlegroups.com, andre....@googlemail.com
Hi Andre,

thanks for the prompt reply.

1. The version I am using is the latest I guess:

This is the multi-threaded MPI hybrid variant of ExaBayes (version 1.3),
a tool for Bayesian MCMC sampling of phylogenetic trees, build with the
Phylogenetic Likelihood Library (version 1.0.0, September 2013).

This software has been released in 2014-05-06

I tried without the -S option (although I have about 40% missing data) and I still get the same error, although this time it occurs just after the likelihood estimation of generation 0.


2. The unpartitioned run doesn't work either. However, this time it gets through a few thousand generations (see below):

mpirun -np 16 ./exabayes -f Concat_60_1.phy -m DNA -n test -s $RANDOM -c config.nex -R 2 -C 4



Starting MCMC sampling using the SSE implementation for likelihood computations.
ExaBayes will run until topological convergence is achieved
(ASDSF < 5.00%, at least 1000000 generations).
ExaBayes will print log-likelihoods of all chains, grouped by
run id (separated by '=') and sorted by heat (starting with the
cold chain). First column indicates generation number (completed
by all chains) and the time elapsed for this increment.

[0,7.41s]        -4,360,415.60 -4,367,472.98 -4,363,412.94 -4,364,470.43 === -4,365,502.03 -4,369,422.25 -4,362,260.88 -4,368,295.28
[500,33.76s]     -1,151,037.48 -1,208,896.53 -1,331,534.13 -1,543,483.49 === -1,146,955.68 -1,235,166.26 -1,324,500.66 -1,936,702.45
[1000,32.40s]    -1,088,093.45 -1,110,348.33 -1,130,863.36 -1,131,951.79 === -1,081,765.99 -1,122,744.92 -1,125,467.15 -1,133,411.40
[1500,30.90s]    -1,059,747.14 -1,084,811.55 -1,086,339.30 -1,094,096.96 === -1,044,075.11 -1,063,301.72 -1,108,344.08 -1,109,425.33
[2000,29.40s]    -1,028,334.48 -1,057,273.56 -1,057,675.39 -1,059,409.52 === -976,960.79 -1,009,821.59 -1,042,092.48 -1,089,979.03
[2500,30.87s]    -922,440.32 -950,795.42 -991,499.71 -1,006,070.83 === -943,979.01 -967,261.59 -1,013,502.93 -1,029,186.41
[3000,32.94s]    -905,447.36 -906,551.33 -907,023.38 -936,268.28 === -907,370.91 -911,918.63 -916,184.69 -941,591.48
[3500,31.42s]    -903,210.74 -903,538.12 -904,232.08 -907,989.11 === -903,960.24 -904,029.54 -905,352.15 -905,615.88
[4000,33.69s]    -900,595.50 -901,339.50 -901,880.09 -902,491.94 === -900,916.27 -901,630.11 -902,636.53 -903,131.58
[4500,32.71s]    -899,055.14 -899,233.53 -900,437.62 -900,600.48 === -899,684.25 -899,787.30 -900,684.98 -901,324.01

standard deviation of split frequencies for trees 2-11 (avg/max):       14.56%  70.71%

[5000,33.06s]    -897,565.45 -898,113.42 -898,775.80 -899,595.96 === -897,888.90 -898,717.63 -899,663.04 -899,804.53
[5500,32.33s]    -896,846.24 -897,261.69 -898,086.73 -898,634.97 === -896,862.16 -897,739.35 -898,711.49 -898,812.74
--------------------------------------------------------------------------
mpirun noticed that process rank 8 with PID 30678 on node spy-raxml exited on signal 9 (Killed).


I guess there's nothing suspicious with the alignment apart from the fact that it's in sequential phylip format. I could send it to you if you would like to give it a try.

Thanks for the support,

Spyros

Andre J. Aberer

Jul 8, 2014, 11:48:57 AM
to Spyros Theodoridis, exab...@googlegroups.com
Hi Spyros,

> I guess there's nothing suspicious with the alignment apart from the fact
> that it's in sequential phylip format. I could sent it to you if you would
> like to give it a try)

thanks, that would be the easiest thing. Just send it to my
Gmail address; I think I will be able to sort this out rather quickly.

--
Best regards,
Andre

Andre J. Aberer

Jul 9, 2014, 1:52:59 PM
to Spyros Theodoridis, exab...@googlegroups.com
Hi Spyros,

there are/were a few issues:

* Firstly, a bug in ExaBayes caused a high memory load for your runs
with 2000 partitions. This and a few other minor things have been
fixed and released as version 1.3.1. Please give it a try.

It will appear on the exabayes website by tomorrow. Please do not
expect maximum code stability yet; furthermore, Apple support is not
fully established for 1.3.1.

* Secondly, the fully partitioned run should now work, but it is
problematic: initially, exabayes writes the number of distinct site
patterns for each partition, and most partitions contained fewer than
10-20 sites, which is not enough data to justify a distinct GTR matrix.
So I would recommend doing unpartitioned runs or clustering partitions.

* Thirdly, I ran your dataset a bit and I fear it is quite tough as
is. Maybe Metropolis-Coupling (numCoupledChains) helps, but from my
initial try the topological convergence statistic (asdsf) was among
the worst that I have seen so far.

* This probably is not an issue with your alignment (and your 64 GB
RAM), but I'll still write it: The program still may crash, if the
memory requirements are simply too high (adding more RAM to the
virtual node or increasing the number of nodes could help).

You can determine the memory requirements with the RAxML memory tool:

http://sco.h-its.org/exelixis/web/software/raxml/index.html

(keep in mind that the actual number of distinct site patterns is much
lower than the number of characters in your alignment)

If RAxML requires n bytes, then ExaBayes will require k * (m + p) * n
bytes, where m is the number of coupled chains, p is the number of
chains run in parallel and k is the number of runs executed in
parallel. So for -R 2 -C 2 and 4 coupled chains (one of your
configurations), you need 12 times the memory RAxML would need.

The +p factor gets reduced for increasing values of -M x (vanishes for
x = 3).
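
The rule of thumb above can be written down as a short calculation. This is only a sketch of the formula as stated in this post: the function name exabayes_memory_factor and its parameters are made up for illustration, and intermediate -M values (where the +p term is only partly reduced) are not modelled.

```python
def exabayes_memory_factor(runs_parallel, chains_parallel, coupled_chains,
                           p_term_vanishes=False):
    """Multiple of the RAxML memory estimate n, following k * (m + p) * n.

    runs_parallel   -- k, runs executed in parallel (-R)
    chains_parallel -- p, chains executed in parallel (-C)
    coupled_chains  -- m, number of coupled chains (numCoupledChains)
    p_term_vanishes -- True for -M 3, where the +p term drops out
    """
    p = 0 if p_term_vanishes else chains_parallel
    return runs_parallel * (coupled_chains + p)


# -R 2 -C 2 with 4 coupled chains, the configuration discussed above
factor = exabayes_memory_factor(2, 2, 4)
print(f"ExaBayes needs about {factor} x the RAxML estimate")
```

Multiplying the factor by the RAxML estimate for the dataset gives the total memory that all processes together would need.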

--
Best regards,
Andre


Spyros Theodoridis writes:

> Hi Andre,
>
> here’s the files (alignment, partitions)
>
> Thanks for the help!
>
> Spyros

Santiago Sánchez

Aug 4, 2014, 5:38:49 PM
to exab...@googlegroups.com, spyrosth...@gmail.com, andre....@googlemail.com
Hi Andre,

I'm getting the same kind of error as in this thread: <mpirun noticed that process rank 0 with PID 31379 on node gpc-f145n033 exited on signal 0 (Unknown signal 0)>. I noticed that this only happens when I use Metropolis-coupled chains. I should add that my dataset is huge (183163 patterns of amino acid sequences, 339 genes). Initially I was using this test line:

$ mpirun -np 8 exabayes -f 339_ST_0_CONCAT.phylip -m PROT -c config_ortho.nex -n myRun -s $RANDOM -M 3 -S -R 2 -C 4

But I was getting the mpirun error at the end of the first generation:

[0,37.03s] -19,854,542.21 -19,853,934.65 -19,864,216.21 -19,850,402.68 === -19,848,470.13 -19,857,122.94 -19,849,676.65 -19,857,630.85
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 31413 on node gpc-f145n033 exited on signal 9 (Killed).
--------------------------------------------------------------------------

I also tried without the -R and -C options, as in:

$ mpirun -np 8 exabayes -f 339_ST_0_CONCAT.phylip -m PROT -c config_ortho.nex -n myRun -s $RANDOM -M 3 -S

But I got the same error. Right now it's running with the same line but with one chain per run (i.e. no Metropolis-coupling):

[60,67.52s] -19,722,164.82 === -19,780,129.30 === -19,802,940.31 === -19,803,708.55 === -19,794,125.74 === -19,831,820.62 === -19,704,679.55 === -19,850,188.43

Is this another bug, or am I doing something wrong here? Thanks,

Santiago

Andre J. Aberer

Aug 4, 2014, 6:13:44 PM
to Santiago Sánchez, exab...@googlegroups.com, spyrosth...@gmail.com
Hi Santiago,

wow, quite a dataset. It is very likely that the fourth point of my
previous reply applies to you: your dataset requires too much memory.

Brief answer:

* reduce metropolis coupling (maybe 2? does your dataset maybe even
converge without it?)

* OR increase the computational resources (e.g., use more computing
nodes of your cluster and thus increase the total memory available for
your run)

You can compute the exact memory requirements as described earlier (sum
up all "#patterns" from the exabayes output, feed the total into the
raxml webpage, etc. -- see point 4 of my earlier reply in this thread).

But given the size of your dataset, I strongly recommend running it on
a cluster: if you have the computational resources, you could easily use
300-500 cores efficiently, just for a single chain without
Metropolis-coupling.

--
Best regards,
Andre


Santiago Sánchez

Aug 4, 2014, 10:10:36 PM
to exab...@googlegroups.com, santiag...@gmail.com, spyrosth...@gmail.com
Thanks for the swift reply Andre.

> * reduce metropolis coupling (maybe 2? does your dataset maybe even
>   converge without it?)

This still results in an error. I noticed that if I use the -R and -C flags, the run crashes right after the "initialized file ..." printout, whereas if I don't use these flags, the run crashes after printing the first chain.

> * OR increase the computational resources (e.g., use more computing
>   nodes of your cluster and thus increase the total memory available for
>   your run)
>
> You can compute the exact memory requirements as described below (sum up
> all "#patterns" from the exabayes-output, feed it into the raxml
> webpage, etc. -- see point 4 down below).
>
> But given the size of your dataset, I strongly recommend to execute this
> dataset on a cluster: if you have the computational resources, you could
> easily use 300-500 cores efficiently, just for a single chain without
> Metropolis-coupling.

According to the raxml memory calculator, exabayes requires 9949.7 MB (10432964480 bytes) to run my dataset. The cluster that I use supports 1.7 GB of memory per MPI process per node (8 processes per node). So I think I'm safe on the memory side (or maybe not). I'll try to run it on another cluster with more memory (64 GB).

Is there another reason besides memory that explains why running M-coupled chains doesn't work?

Cheers,
Santiago

Santiago Sánchez

Aug 4, 2014, 11:15:08 PM
to exab...@googlegroups.com, Santiago Sánchez, spyrosth...@gmail.com
A quick update:

As suggested by Andre, increasing the computational resources helped to avoid the error. I used:

mpirun -np 16 exabayes -f 339_ST_0_CONCAT.phylip -m PROT -c config_ortho.nex -n myRun $RANDOM -M 3 -S

With 8 independent runs and 2 MC chains per run.

It's running extremely slowly (~1.5 min per 10 generations), though.

Andre, do you think that if I increase the resources the computation time will improve?

Cheers,
Santiago
--
Santiago Sánchez-Ramírez
Department of Ecology and Evolutionary Biology, University of Toronto
Department of Natural History (Mycology), Royal Ontario Museum
100 Queen's Park
Toronto, ON
M5S 2C6
Canada

Andre J. Aberer

Aug 5, 2014, 3:07:55 AM
to exab...@googlegroups.com, Santiago Sánchez
Hi again Santiago,

> According to the raxml memory calculator exabayes requires 9949.7MB
> (10432964480 bytes) to run my dataset. The cluster that I use supports
> 1.7GB of memory per mpi process per node (8 processes per node). So I think
> I'm safe on the memory side (or maybe not). I'll try to run it in an other
> cluster with more memory (64GB).

Sorry, calculating the exact memory requirements is really a bit
complicated (maybe we'll provide a little tool for that with the next
release).

If RAxML requires 9 GB, then ExaBayes requires 9 GB for 1 chain without
coupling and with -M 3.

Please go through chapter 8 of the manual page again:
http://sco.h-its.org/exelixis/web/software/exabayes/manual/index.html#cluster

Quick examples:

* w/o -M 3 we have 18 GB

* -C 4 -M 3 => 36 GB

* -R 2 -C 4 => (9 * 5 ) * 4 * 2 = 360 GB

The upshot is: with that many patterns, you can easily use up to 300 or
more processors *without* -R x or -C y and you'll still have full
parallel efficiency and less memory consumption. You only need to use -R
x -C y to go beyond that in number of processes.
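
For anyone who wants to redo this check with their own numbers, here is a rough sketch. It applies the k * (m + p) * n rule from earlier in the thread to the first three cases above (9, 18 and 36 GB) and compares them with the 1.7 GB per process mentioned for the cluster. The -R 2 -C 4 line is left out, since its arithmetic follows a different accounting than that rule. The names required_gb and fits are made up for illustration, and the assumption that memory is spread evenly over all MPI processes is a simplification.

```python
RAXML_ESTIMATE_GB = 9.0  # rounded RAxML figure used in the examples above
GB_PER_PROCESS = 1.7     # per-process limit reported for the cluster


def required_gb(runs, coupled_chains, parallel_chains, m3):
    """k * (m + p) * n in GB; the +p term is dropped when -M 3 is used."""
    p = 0 if m3 else parallel_chains
    return runs * (coupled_chains + p) * RAXML_ESTIMATE_GB


def fits(required, n_processes):
    """Assumption: memory is spread evenly over all MPI processes."""
    return required <= n_processes * GB_PER_PROCESS


# (label, k, m, p, -M 3 used?)
setups = [
    ("1 chain, -M 3", 1, 1, 1, True),
    ("1 chain, w/o -M 3", 1, 1, 1, False),
    ("-C 4 -M 3", 1, 4, 4, True),
]

for label, k, m, p, m3 in setups:
    need = required_gb(k, m, p, m3)
    verdict = "fits" if fits(need, 8) else "too large"
    print(f"{label:<18} {need:5.1f} GB  ({verdict} for 8 processes)")
```

With 8 processes at 1.7 GB each (13.6 GB in total), only the first setup fits under these assumptions.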


> A quick update:
>
> As suggested by Andre, increasing the computational resources helped to
> avoid the error. I used:
>
> mpirun -np 16 exabayes -f 339_ST_0_CONCAT.phylip -m PROT -c
> config_ortho.nex -n myRun $RANDOM -M 3 -S
>
> With 8 independent runs and 2 MC chains per run.
>
> Its running extremely slow (~1.5 min per 10 generations), though.
>
> Andre, do you think that if I increase the resources the computation time
> will improve?
>

Yes, definitely. I ran a similar dataset recently and used 192 cores. I
recommend checking how long 100 or 500 generations take for a few setups
(with 1000 cores, it probably will not get any faster any more, so you
will have to use -R (first choice) and then -C) before starting your
actual run.
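
To turn such a short timing test into a rough wall-clock estimate, something like the following could be used. It is only a back-of-the-envelope sketch: projected_days is a made-up helper, it assumes the time per generation stays constant, and the 1,000,000-generation target is an assumption taken from the minimum shown in the log earlier in this thread, not from this dataset's config.

```python
def projected_days(seconds_per_block, generations_per_block, total_generations):
    """Extrapolate total run time (in days) from a short timing test."""
    seconds_per_generation = seconds_per_block / generations_per_block
    return seconds_per_generation * total_generations / 86400.0


# ~1.5 min per 10 generations, as reported in the quoted update;
# the 1,000,000-generation target is an assumption (see above).
days = projected_days(90.0, 10, 1_000_000)
print(f"about {days:.0f} days at the current speed")
```

Running the same calculation for each tested setup makes it easy to see which one brings the projected time into a tolerable range.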

If you just want to run 4 independent chains (with or without
M-coupling), then you could also submit 4 separate jobs to the cluster
and use the sdsf tool manually to check whether topological convergence
has occurred.
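
For readers unfamiliar with the statistic, here is a minimal sketch of how an average standard deviation of split frequencies can be computed across runs. This is not the sdsf tool's code: the function asdsf, the dictionary input format and the 0.1 frequency cutoff are all assumptions made for illustration. The sample standard deviation is used because it reproduces the 70.71% maximum shown in the log earlier in this thread, which is what a split with frequency 1 in one run and 0 in the other yields.

```python
from statistics import mean, stdev


def asdsf(run_split_freqs, min_freq=0.1):
    """Return (average, maximum) standard deviation of split frequencies.

    run_split_freqs -- one dict per run (at least two runs), mapping a
                       split label to its frequency in [0, 1]
    min_freq        -- assumed cutoff: ignore splits that stay below this
                       frequency in every run
    """
    all_splits = set().union(*run_split_freqs)
    deviations = []
    for split in all_splits:
        # A split missing from a run counts as frequency 0 in that run.
        freqs = [run.get(split, 0.0) for run in run_split_freqs]
        if max(freqs) >= min_freq:
            deviations.append(stdev(freqs))
    if not deviations:
        return 0.0, 0.0
    return mean(deviations), max(deviations)


# Two hypothetical runs that agree on one split and disagree on another
run_a = {"AB|CDE": 1.0, "ABC|DE": 1.0}
run_b = {"AB|CDE": 1.0, "ABD|CE": 1.0}
avg, worst = asdsf([run_a, run_b])
print(f"ASDSF avg/max: {avg:.2%} / {worst:.2%}")
```

Values close to zero mean the separate runs sample splits at similar frequencies; the 5% threshold in the log earlier in this thread is the kind of cutoff such a value would be compared against.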

[...]

--
Best regards,
Andre