MPI and Checkpoint error

Omar Daniel Leon Alvarado

Jul 29, 2020, 9:52:25 PM
to revbayes-users
MPI works perfectly for me when I run: mpirun -np 4 rb-mpi mcmc_JC.Rev. However, when I try to restart the analysis from the checkpoint, it only works without parallel processes, i.e. mpirun -np 1 rb-mpi mcmc_JC.Rev runs normally, but mpirun -np 4 rb-mpi mcmc_JC.Rev gives the following error:

> source("mcmc_JC.Rev")
   Processing file "mcmc_JC.Rev"
   Successfully read one character matrix from file 'primates_and_galeopterus_cytb.nex'
[omar-Lenovo-Z40-70:17477] *** Process received signal ***
[omar-Lenovo-Z40-70:17477] Signal: Segmentation fault (11)
[omar-Lenovo-Z40-70:17477] Signal code: Address not mapped (1)
[omar-Lenovo-Z40-70:17477] Failing at address: (nil)
[omar-Lenovo-Z40-70:17478] *** Process received signal ***
[omar-Lenovo-Z40-70:17478] Signal: Segmentation fault (11)
[omar-Lenovo-Z40-70:17478] Signal code: Address not mapped (1)
[omar-Lenovo-Z40-70:17478] Failing at address: (nil)
[omar-Lenovo-Z40-70:17477] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x128a0)[0x7fb0506f98a0]
[omar-Lenovo-Z40-70:17477] [ 1] [omar-Lenovo-Z40-70:17475] *** Process received signal ***
[omar-Lenovo-Z40-70:17478] [omar-Lenovo-Z40-70:17475] Signal: Segmentation fault (11)
[omar-Lenovo-Z40-70:17475] Signal code: Address not mapped (1)
[omar-Lenovo-Z40-70:17475] Failing at address: (nil)
[omar-Lenovo-Z40-70:17476] *** Process received signal ***
[omar-Lenovo-Z40-70:17476] Signal: Segmentation fault (11)
[omar-Lenovo-Z40-70:17476] Signal code: Address not mapped (1)
[omar-Lenovo-Z40-70:17476] Failing at address: (nil)
[ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x128a0)[0x7f780a60d8a0]
[omar-Lenovo-Z40-70:17478] [ 1] [omar-Lenovo-Z40-70:17476] [omar-Lenovo-Z40-70:17475] [ 0] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0/lib/x86_64-linux-gnu/libpthread.so.0(+0x128a0)[0x7fbc910368a0(+0x]
[omar-Lenovo-Z40-70:17476] [ 1] 128a0)[0x7f39d24468a0]
[omar-Lenovo-Z40-70:17475] [ 1] rb-mpi(_ZN12RevBayesCore18MonteCarloAnalysis24initializeFromCheckpointERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x83a)[0x558e1006055a]
[omar-Lenovo-Z40-70:17476] [ 2] rb-mpi(_ZN12RevBayesCore18MonteCarloAnalysis24initializeFromCheckpointERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x83a)[0x559a260da55a]
[omar-Lenovo-Z40-70:17477] [ 2] rb-mpi(_ZN12RevBayesCore18MonteCarloAnalysis24initializeFromCheckpointERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x83a)[0x5594969bb55a]
[omar-Lenovo-Z40-70:17478] [ 2] rb-mpi(_ZN12RevBayesCore18MonteCarloAnalysis24initializeFromCheckpointERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x83a)[0x5557e3eb255a]
[omar-Lenovo-Z40-70:17475] [ 2] rb-mpi(_ZN11RevLanguage18MonteCarloAnalysis13executeMethodERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKSt6vectorINS_8ArgumentESaISA_EERb+0x944)[0xrb-mpi(_ZN11RevLanguage18MonteCarloAnalysis13executeMethodERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKSt6vectorINS_8ArgumentESaISA_EERb+0x944)[0x559a25c8cf14]
[omar-Lenovo-Z40-70:17477] rb-mpi(_ZN11RevLanguage18MonteCarloAnalysis13executeMethodERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKSt6vectorINS_8ArgumentESaISA_EERb+0x944)[0x5557e3a64f14]
[ 3] 558e0fc12f14]
[omar-Lenovo-Z40-70:17476] [omar-Lenovo-Z40-70:17475] [ 3] rb-mpi[ 3] (_ZN11RevLanguage18MonteCarloAnalysis13executeMethodERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKSt6vectorINS_8ArgumentESaISA_EERb+0x944)[0x55949656df14]
[omar-Lenovo-Z40-70:17478] [ 3] rb-mpi(_ZN11RevLanguage15MemberProcedure7executeEv+0xd4)[0x5557e346fb74]
[omar-Lenovo-Z40-70:17475] [ 4] rb-mpirb-mpi(_ZN11RevLanguage15MemberProcedure7executeEv+0xd4)[0x559495f78b74]
[omar-Lenovo-Z40-70:17478] [ 4] (_ZN11RevLanguage15MemberProcedure7executeEv+0xd4)[0x558e0f61db74]
[omar-Lenovo-Z40-70:17476] [ 4] rb-mpi(_ZN11RevLanguage15MemberProcedure7executeEv+0xd4)[0x559a25697b74]
[omar-Lenovo-Z40-70:17477] [ 4] rb-mpi(_ZN11RevLanguage18SyntaxFunctionCall15evaluateContentERNS_11EnvironmentEb+0x3c1)[0x55949663f5e1]
[omar-Lenovo-Z40-70:17478] [ 5] rb-mpi(_ZN11RevLanguage18SyntaxFunctionCall15evaluateContentERNS_11EnvironmentEb+0x3c1)[0x5557e3b365e1]
[omar-Lenovo-Z40-70:17475] [ 5] rb-mpi(_ZN11RevLanguage18SyntaxFunctionCall15evaluateContentERNS_11EnvironmentEb+0x3c1)[0x558e0fce45e1]
[omar-Lenovo-Z40-70:17476] [ 5] rb-mpi(_ZN11RevLanguage18SyntaxFunctionCall15evaluateContentERNS_11EnvironmentEb+0x3c1)[0x559a25d5e5e1]
[omar-Lenovo-Z40-70:17477] [ 5] rb-mpi(_ZNK11RevLanguage6Parser7executeEPNS_13SyntaxElementERNS_11EnvironmentE+0x98)[0x558e0f61f898]
[omar-Lenovo-Z40-70:17476] [ 6] rb-mpi(_ZNK11RevLanguage6Parser7executeEPNS_13SyntaxElementERNS_11EnvironmentE+0x98)[0x559495f7a898]
[omar-Lenovo-Z40-70:17478] [ 6] rb-mpi(_ZNK11RevLanguage6Parser7executeEPNS_13SyntaxElementERNS_11EnvironmentE+0x98)[0x5557e3471898]
[omar-Lenovo-Z40-70:17475] [ 6] rb-mpi(_ZNK11RevLanguage6Parser7executeEPNS_13SyntaxElementERNS_11EnvironmentE+0x98)[0x559a25699898]
[omar-Lenovo-Z40-70:17477] [ 6] rb-mpi(_Z7yyparsev+0xc02)[0x558e0fcd8a42]
[omar-Lenovo-Z40-70:17476] [ 7] rb-mpi(_Z7yyparsev+0xc02)[0x559496633a42]
[omar-Lenovo-Z40-70:17478] [ 7] rb-mpirb-mpi(_Z7yyparsev+0xc02)[0x5557e3b2aa42]
[omar-Lenovo-Z40-70:17475] [ 7] (_Z7yyparsev+0xc02)[0x559a25d52a42]
[omar-Lenovo-Z40-70:17477] [ 7] rb-mpirb-mpirb-mpi(_ZN11RevLanguage6Parser14processCommandERNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEPNS_11EnvironmentE+0x1be)[0x558e0f621ede]
[omar-Lenovo-Z40-70:17476] rb-mpi(_ZN11RevLanguage6Parser14processCommandERNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEPNS_11EnvironmentE+0x1be)[0x559a2569bede]
[omar-Lenovo-Z40-70:17477] [ 8] [ 8] (_ZN11RevLanguage6Parser14processCommandERNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEPNS_11EnvironmentE+0x1be)[0x5557e3473ede]
[omar-Lenovo-Z40-70:17475] [ 8] (_ZN11RevLanguage6Parser14processCommandERNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEPNS_11EnvironmentE+0x1be)[0x559495f7cede]
[omar-Lenovo-Z40-70:17478] [ 8] rb-mpi(_ZN11RevLanguage11Func_source7executeEv+0x4dc)[0x558e0fc9250c]
[omar-Lenovo-Z40-70:17476] [ 9] rb-mpirb-mpi(_ZN11RevLanguage11Func_source7executeEv+0x4dc)[0x5594965ed50c]
[omar-Lenovo-Z40-70:17478] [ 9] (_ZN11RevLanguage11Func_source7executeEv+0x4dc)[0x5557e3ae450c]
[omar-Lenovo-Z40-70:17475] [ 9] rb-mpi(_ZN11RevLanguage18SyntaxFunctionCall15evaluateContentERNS_11EnvironmentEb+0x3c1)[0x558e0fce45e1]
[omar-Lenovo-Z40-70:17476] [10] rb-mpi(_ZN11RevLanguage18SyntaxFunctionCall15evaluateContentERNS_11EnvironmentEb+0x3c1)[0x55949663f5e1]
[omar-Lenovo-Z40-70:17478] [10] rb-mpi(_ZNK11RevLanguage6Parser7executeEPNS_13SyntaxElementERNS_11EnvironmentE+0x98)[0x558e0f61f898]
[omar-Lenovo-Z40-70:17476] [11] rb-mpi(_ZNK11RevLanguage6Parser7executeEPNS_13SyntaxElementERNS_11EnvironmentE+0x98)[0x559495f7a898]
[omar-Lenovo-Z40-70:17478] [11] rb-mpi(_ZN11RevLanguage18SyntaxFunctionCall15evaluateContentERNS_11EnvironmentEb+0x3c1)[0x5557e3b365e1]
[omar-Lenovo-Z40-70:17475] [10] rb-mpi(_Z7yyparsevrb-mpi(_ZNK11RevLanguage6Parser7executeEPNS_13SyntaxElementERNS_11EnvironmentE+0x98)[0x5557e3471898]
[omar-Lenovo-Z40-70:17475] [11] +0xrb-mpi(_Z7yyparsev+0xc02)[0x559496633a42]
[omar-Lenovo-Z40-70:17478] [12] c02)[0x558e0fcd8a42]
[omar-Lenovo-Z40-70:17476] [12] rb-mpi(_ZN11RevLanguage6Parser14processCommandERNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEPNS_11EnvironmentE+0x1be)[0x559495f7cede]
[omar-Lenovo-Z40-70:17478] [13] rb-mpi(_Z7yyparsev+0xc02)[0x5557e3b2aa42]
[omar-Lenovo-Z40-70:17475] [12] rb-mpi(_ZN15RevLanguageMain27startRevLanguageEnvironmentERKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS6_EESA_+0x12be)[0x559495e039ce]
[omar-Lenovo-Z40-70:17478] [14] rb-mpi(rb-mpi(_ZN11RevLanguage6Parser14processCommandERNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEPNS_11EnvironmentE+0x1be)[0x5557e3473ede]
[omar-Lenovo-Z40-70:17475] [13] _ZN11RevLanguage6Parser14processCommandERNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEPNS_11EnvironmentE+0xrb-mpi(main+0x653)[0x559495da2533]
[omar-Lenovo-Z40-70:17478] [15] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f78093e0b97]
[omar-Lenovo-Z40-70:17478] [16] 1be)[0x558e0f621ede]
[omar-Lenovo-Z40-70:17476] rb-mpi(_ZN15RevLanguageMain27startRevLanguageEnvironmentERKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS6_EESA_+0x12be)[0x5557e32fa9ce]
[omar-Lenovo-Z40-70:17475] [14] [13] rb-mpi(_ZN11RevLanguage11Func_source7executeEv+0x4dc)[0x559a25d0c50c]
[omar-Lenovo-Z40-70:17477] [ 9] rb-mpi(main+0x653)[0x5557e3299533]
[omar-Lenovo-Z40-70:17475] [15] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f39d1219b97]
[omar-Lenovo-Z40-70:17475] [16] rb-mpirb-mpi(_ZN11RevLanguage18SyntaxFunctionCall15evaluateContentERNS_11EnvironmentEb+0x3c1)[0x559a25d5e5e1]
[omar-Lenovo-Z40-70:17477] [10] (_ZN15RevLanguageMain27startRevLanguageEnvironmentERKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS6_EESA_+0x12be)[0x558e0f4a89ce]
[omar-Lenovo-Z40-70:17476] [14] rb-mpi(_ZNK11RevLanguage6Parser7executeEPNS_13SyntaxElementERNS_11EnvironmentE+0x98)[0x559a25699898]
[omar-Lenovo-Z40-70:17477] [11] rb-mpi(main+0x653)[0x558e0f447533]
[omar-Lenovo-Z40-70:17476] [15] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7fbc8fe09b97]
[omar-Lenovo-Z40-70:17476] [16] rb-mpi(_Z7yyparsev+0xc02)[0x559a25d52a42]
[omar-Lenovo-Z40-70:17477] [12] rb-mpi(_ZN11RevLanguage6Parser14processCommandERNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEPNS_11EnvironmentE+0x1be)[0x559a2569bede]
[omar-Lenovo-Z40-70:17477] [13] rb-mpi(_ZN15RevLanguageMain27startRevLanguageEnvironmentERKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS6_EESA_+0x12be)[0x559a255229ce]
[omar-Lenovo-Z40-70:17477] [14] rb-mpi(main+0x653)[0x559a254c1533]
[omar-Lenovo-Z40-70:17477] [15] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7fb04f4ccb97]
[omar-Lenovo-Z40-70:17477] [16] rb-mpi(_start+0x2arb-mpi(_start+0x2a)[0x558e0f490b6a]
[omar-Lenovo-Z40-70:17476] *** End of error message ***
rb-mpi(_start+0x2a)[0x559a2550ab6a]
[omar-Lenovo-Z40-70:17477] *** End of error message ***
)[0x5557e32e2b6a]
[omar-Lenovo-Z40-70:17475] *** End of error message ***
rb-mpi(_start+0x2a)[0x559495debb6a]
[omar-Lenovo-Z40-70:17478] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 0 on node omar-Lenovo-Z40-70 exited on signal 11 (Segmentation fault).

RevBayes version (1.1.0)
Build from development (rapture-309-g7c6f35) on dom jun 21 00:02:49 -03 2020
Compiled on Ubuntu 18.04


Error_MPI_Checkpoint.zip

Alexandre Pedro

Dec 13, 2021, 6:52:08 PM
to revbayes-users
I'm getting the same error message when running the biogeography tutorials with rb-mpi. They work fine with the single-process rb.

Mac OS 10.14.16 (Mojave)

RevBayes version (1.1.1) compiled with cmake version 3.21.2

Apple clang version 11.0.0 (clang-1100.0.33.17)

mpirun (Open MPI) 4.1.1


Any help will be much appreciated.


Alex

Jasmine Mah

Mar 7, 2022, 2:21:42 PM
to revbayes-users
Hi All,

I too get an error when I try to use MPI and initializeFromCheckpoint(). The error I get:
```
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 8 with PID 0 on node revbayes-c2-standard-30vcpu-50gb-9 exited on signal 11 (Segmentation fault).
```
Has anyone found a solution? Or has anyone filed a bug report on the RevBayes GitHub?

Cheers,
Jasmine

Jasmine Mah

Mar 7, 2022, 2:26:16 PM
to revbayes-users
And to be clear, like the above users, this error goes away with `--np 1`.

I did find this Tweet by April Wright on using mnStochasticVariable, but there is so little documentation out there that I haven't been able to use it successfully.


Cheers,
Jasmine

Raquel Pereira

Mar 14, 2022, 5:28:02 AM
to revbayes-users
I'm having exactly the same error.

Raquel Pereira

Mar 14, 2022, 10:12:27 AM
to revbayes-users
I'm having trouble understanding how to set up mnStochasticVariable. I don't get any file with the following code:
gen=100
mnStochasticVariable("28S_checkpoint", printgen=1, separator=":",append=FALSE,version=TRUE)
mymc3.run(generations=gen)

Jasmine Mah

Mar 14, 2022, 1:20:34 PM
to Raquel Pereira, revbayes-users
Hi Raquel,

I asked April Wright about the checkpointing issue with MPI and it sounds like she was able to pinpoint the technical problem, which is that "there is a precision disconnect between how numbers are written to file and how they're read in". They're working on correcting that right now and if I hear back I will post here again.
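
As a general illustration of what a precision disconnect of that kind can look like (a minimal Python sketch with a made-up value and made-up format widths; it is not RevBayes code and not necessarily what happens inside initializeFromCheckpoint): a number written to text with too few digits does not read back as the same number, while 17 significant digits are enough to round-trip a double.

```python
# Hypothetical illustration only: the value and the format widths are
# assumptions chosen for the example, not taken from RevBayes.
value = 0.1234567890123456  # a parameter value held in memory as a double

# Write the value as text with only 6 significant digits.
written = f"{value:.6g}"

# Read the text back in as a double.
restored = float(written)

print(written)            # 0.123457
print(restored == value)  # False: the restored state differs from the saved one

# Writing 17 significant digits is enough to recover any double exactly.
exact = float(f"{value:.17g}")
print(exact == value)     # True
```

If a restarted run expects to recover exactly the state it saved, a mismatch like the first one is the sort of thing that could cause trouble.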

In terms of mnStochasticVariable, I've used it with fewer options than you have, keeping the defaults for most of them (my line: `mnStochasticVariable(filename=file_name, printgen=100)`). I noticed that you set the separator to ":" and I wonder if that's a misreading of the manual (link here). From what I understand, the default is an empty space, and the ":" is just the part of the manual that says "Default:". A ":" is not commonly used as a delimiter, so it might be worth trying the default separator (i.e. don't set a separator value) or something more common, like a tab or comma. I'm not sure that's in fact your problem, but it might be worth a try!

Cheers,
Jasmine

Raquel Pereira

Mar 18, 2022, 7:05:09 AM
to revbayes-users
Hi again,

I tried the following:

> mnStochasticVariable(filename="please.txt", printgen=100, separator=",")

and I get this response:

   Mntr_StochasticVariable

No file is created.

Version 1.1.1

Orlando Schwery

Jun 4, 2022, 6:13:19 PM
to revbayes-users
(I realise I had sent this post only as a direct answer, so nobody else who's searching for this will ever find this, so I'm just reposting it for the sake of closure:)

Not sure if this question has been solved yet, but just in case:

The line you ran sets up a monitor (as all functions starting with mn do in Rev), which will create and write to a file once the MCMC runs. But for that to work, the monitor has to be saved to the list of monitors that the MCMC uses. Otherwise, you get the response you got, which essentially just tells you that a monitor for stochastic variables was made, but that monitor is now lost in the ether of Rev.

So assuming your monitors are saved under an object called "monitors", you would want to do:

monitors.append( mnStochasticVariable(filename="please.txt", printgen=100, separator=",") )

or, if you use a counter variable that keeps track of the number of monitors already saved, e.g. "mni":

monitors[mni++] = mnStochasticVariable(filename="please.txt", printgen=100, separator=",")

The file will be created as soon as the MCMC starts (i.e. after the burn-in/tuning round, if you have one).

I hope this is of some help.

Cheers,
Orlando