Large data set

290 views
Skip to first unread message

chris blair

unread,
Jul 7, 2015, 7:21:40 PM7/7/15
to bioge...@googlegroups.com
Hi Nick, 

I am new to your program and am trying to analyze a relatively large data set consisting of 1277 species classified into 21 areas. The maximum number of areas occupied by a species is 16. I was hoping that you could provide some advice regarding parameter settings in BioGeoBEARS. I ran a preliminary unconstrained DEC analysis in RASP that finished in less than a day. However, I had to constrain the maximum number of areas to a very low value (e.g. 2). I am pretty sure that no analysis will finish if I specify a max areas value of 16. An obvious solution is to reduce the number of areas. However, we are a bit hesitant to do this as we would lose power. There is obviously a trade-off between power and resolution and being able to run a meaningful analysis to completion. Any thoughts are welcome.

Chris

Nick Matzke

unread,
Jul 7, 2015, 9:52:36 PM7/7/15
to bioge...@googlegroups.com
Hi -- I'm traveling to Australia, so I'm only partially able to access things., But briefly, the key thing is the number of states. The most states I've ever done is about 2000. This took a few hours for a 12-taxon tree, on ~12 cores. A 1200-taxon tree would take about 100 times longer.

You can check the number of states with:

numstates_from_numareas(numareas=21, maxareas=16, include_null_range=TRUE)

...you can see that's 2,089,605 states. That means the transition matrix is 2089605x 2089605 -- that's too big to hold in memory, let alone exponentiate.

So, your only choices are:

1. Reduce areas or max number of areas.  Focus on what you think are the "most discrete" features in the geography, i.e. geographical features that are clearly conserved on the phylogeny. IMHO there isn't a huge point in doing ancestral state inference except on things that are pretty conserved on the phylogeny anyway.

1a. One option is to just model the centroid point of each species, thus each species occupies 1 area. You could do up to 1000 or 2000 areas this way, but you are sacrificing whatever is useful about modeling species as widespread (e.g., information about the cladogenetic process). Whether this is important depends on how likely you think it is you will successfully get such information anyway...

2. Use BayArea (the C++ program, not the BAYAREALIKE model in BioGeoBEARS). This means you are assuming BayArea's cladogenesis model (ancestral range always copied exact to both descendants, e.g. assuming widespread sympatry). RevBayes can combine BayArea Bayesian history sampling with cladogenesis models, but my understanding is that sampling cladogenetic histories effectively is very difficult and not really working.

3. Get sparse matrix exponentiation working in BioGeoBEARS or your own program (I can probably get sparse matrix stuff working, it would take me about a solid week which I won't have for awhile. And it might not solve your scale of problem anyway).

4. Be a genius and solve the computational problems in some new way.

4a. One option is subdividing the problem into subregions and doing each separately.

Sorry I don't have an easier answer!

Cheers, Nick



--
You received this message because you are subscribed to the Google Groups "BioGeoBEARS" group.
To unsubscribe from this group and stop receiving emails from it, send an email to biogeobears...@googlegroups.com.
To post to this group, send email to bioge...@googlegroups.com.
Visit this group at http://groups.google.com/group/biogeobears.
To view this discussion on the web visit https://groups.google.com/d/msgid/biogeobears/b4bba74d-d390-4f9e-b977-49cae25b912d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

chris blair

unread,
Jul 7, 2015, 10:51:45 PM7/7/15
to bioge...@googlegroups.com, mat...@nimbios.org
Hi Nick, 

Thanks a bunch for the suggestions. I will discuss options with my co-authors.

All the best and safe travels, 

Chris

Oscar Pérez

unread,
Jul 13, 2015, 4:09:37 AM7/13/15
to bioge...@googlegroups.com
Hi Nick,


I am also running BioGeoBEARS on a large dataset (900 terminals and 11 areas - maximum numer of occupied areas=9). 

On a linux cluster (120 cores - 64 GB memory) I manage to run part of DEC analysis. However, before it completes the analysis and loads results on .Rdata, I get this error message:

Uppass starting for marginal ancestral states estimation!
Error in if (sum(relprob_each_split_scenario) > 0) { : 
  missing value where TRUE/FALSE needed
Calls: bears_optim_run ... calc_uppass_probs_new2 -> calc_uppass_scenario_probs_new2
In addition: Warning message:
In (function (npt = min(n + 2L, 2L * n), rhobeg = NA, rhoend = NA,  :
  unused control arguments ignored
Execution halted  

Could you please tell me what does this exactly mean and how could I fix it?

Thank you in advance for your help!



Oscar

Nick Matzke

unread,
Jul 13, 2015, 6:15:20 AM7/13/15
to bioge...@googlegroups.com
(I just realized I replied off-list, might as well reply on-list)

Hi Oscar!  I'm sorry about the error, that sounds frustrating!  It is likely due to not sourcing all of the updated functions (I did some updates back in May/early June).  Just make sure the *current*, *complete* list of source commands is being used. They are listed here:


The list of source() commands expanded in the last update, so an older script won't have them all.

You will have to run these source command *every* time after "library(BioGeoBEARS)" is being used. Alternatively, you can download each file and source them locally -- again, every time after "library(BioGeoBEARS)" is being used.

Yes, I should do a CRAN update or at least github, but I've been moving to Australia and enough updates have been made now that just doing the documentation is a bit of a project. It will happen eventually!

I just ran the Psychotria example, both stratified and non-stratified, and both worked.

If this does not work, please tell me! 

Thanks!
Nick

On Mon, Jul 13, 2015 at 6:09 PM, Oscar Pérez <oapere...@googlemail.com> wrote:
Hi Nick,


I am also running BioGeoBEARS on a large dataset (900 terminals and 11 areas - maximum numer of occupied areas=9). 

On a linux cluster (120 cores - 64 GB memory) I manage to run part of DEC analysis. However, before it completes the analysis and loads results on .Rdata, I get this error message:

Uppass starting for marginal ancestral states estimation!
Error in if (sum(relprob_each_split_scenario) > 0) { : 
  missing value where TRUE/FALSE needed
Calls: bears_optim_run ... calc_uppass_probs_new2 -> calc_uppass_scenario_probs_new2
In addition: Warning message:
In (function (npt = min(n + 2L, 2L * n), rhobeg = NA, rhoend = NA,  :
  unused control arguments ignored
Execution halted  

Could you please tell me what does this exactly mean and how could I fix it?

Thank you in advance for your help!



Oscar

On Wednesday, July 8, 2015 at 1:21:40 AM UTC+2, chris blair wrote:
Hi Nick, 

I am new to your program and am trying to analyze a relatively large data set consisting of 1277 species classified into 21 areas. The maximum number of areas occupied by a species is 16. I was hoping that you could provide some advice regarding parameter settings in BioGeoBEARS. I ran a preliminary unconstrained DEC analysis in RASP that finished in less than a day. However, I had to constrain the maximum number of areas to a very low value (e.g. 2). I am pretty sure that no analysis will finish if I specify a max areas value of 16. An obvious solution is to reduce the number of areas. However, we are a bit hesitant to do this as we would lose power. There is obviously a trade-off between power and resolution and being able to run a meaningful analysis to completion. Any thoughts are welcome.

Chris

--
You received this message because you are subscribed to the Google Groups "BioGeoBEARS" group.
To unsubscribe from this group and stop receiving emails from it, send an email to biogeobears...@googlegroups.com.
To post to this group, send email to bioge...@googlegroups.com.
Visit this group at http://groups.google.com/group/biogeobears.

Oscar Pérez

unread,
Jul 15, 2015, 4:00:39 AM7/15/15
to bioge...@googlegroups.com, mat...@nimbios.org
Hi Nick!


Thanks for your answer. However, I have bad news. I updated my script, and so the codes for updating the functions look like this:

library(BioGeoBEARS)
calc_loglike_sp = compiler::cmpfun(calc_loglike_sp_prebyte)    # crucial to fix bug in uppass calculations
calc_independent_likelihoods_on_each_branch = compiler::cmpfun(calc_independent_likelihoods_on_each_branch_prebyte)
# slight speedup hopefully

#or try...16/V/2015
calc_loglike_sp = compiler::cmpfun(calc_loglike_sp_prebyte)    # crucial to fix bug in uppass calculations
calc_independent_likelihoods_on_each_branch = compiler::cmpfun(calc_independent_likelihoods_on_each_branch_prebyte)

calc_loglike_sp = compiler::cmpfun(calc_loglike_sp_prebyte)    # crucial to fix bug in uppass calculations
calc_independent_likelihoods_on_each_branch = compiler::cmpfun(calc_independent_likelihoods_on_each_branch_prebyte)

#or

library(BioGeoBEARS)
calc_loglike_sp = compiler::cmpfun(calc_loglike_sp_prebyte)    # crucial to fix bug in uppass calculations
calc_independent_likelihoods_on_each_branch = compiler::cmpfun(calc_independent_likelihoods_on_each_branch_prebyte)

calc_loglike_sp = compiler::cmpfun(calc_loglike_sp_prebyte)    # crucial to fix bug in uppass calculations
calc_independent_likelihoods_on_each_branch = compiler::cmpfun(calc_independent_likelihoods_on_each_branch_prebyte)

Like I said, I duplicate the code to call for calc_loglike function. 

When I ran my code, I get the same error message: 

                                                                       desc
d                         anagenesis: rate of 'dispersal' (range expansion)
e                      anagenesis: rate of 'extinction' (range contraction)
a           anagenesis: rate of range-switching (i.e. for a standard char.)
b                                    anagenesis: exponent on branch lengths
x                                   exponent on distance (modifies d, j, a)
n                     exponent on environmental distance (modifies d, j, a)
w               exponent on manual dispersal multipliers (modifies d, j, a)
u            anagenesis: exponent on extinction risk with area (modifies e)
j                 cladogenesis: relative per-event weight of jump dispersal
ysv                                                     cladogenesis: y+s+v
ys                                                        cladogenesis: y+s
y       cladogenesis: relative per-event weight of sympatry (range-copying)
s              cladogenesis: relative per-event weight of subset speciation
v           cladogenesis: relative per-event weight of vicariant speciation
mx01                  cladogenesis: controls range size of smaller daughter
mx01j                 cladogenesis: controls range size of smaller daughter
mx01y                 cladogenesis: controls range size of smaller daughter
mx01s                 cladogenesis: controls range size of smaller daughter
mx01v                 cladogenesis: controls range size of smaller daughter
mx01r                       root: controls range size probabilities of root
mf                         mean frequency of truly sampling OTU of interest
dp                 detection probability per true sample of OTU of interest
fdp   false detection of OTU probability per true taphonomic control sample



...successful.


Running calc_loglike_sp():
This run of calc_loglike_sp() has a min_branchlength of: 1e-06
Branches shorter than this will be assumed to be connected to the tree with
sympatric events (i.e., members of fossil lineages on ~0 length branches.)
This tree has 1 branches < 1e-06.


Uppass starting for marginal ancestral states estimation!
Error in if (sum(relprob_each_split_scenario) > 0) { : 
  missing value where TRUE/FALSE needed
Calls: bears_optim_run ... calc_uppass_probs_new2 -> calc_uppass_scenario_probs_new2
In addition: Warning message:
In (function (npt = min(n + 2L, 2L * n), rhobeg = NA, rhoend = NA,  :
  unused control arguments ignored
Execution halted

Maybe is a silly question, but, do you think that this might be caused by very small branches on the tree?


Thank you for your help!





Oscar

Nick Matzke

unread,
Jul 15, 2015, 5:45:00 AM7/15/15
to bioge...@googlegroups.com, mat...@nimbios.org
Yeah, try fixing this branch: 

This tree has 1 branches < 1e-06.

Nick Matzke

unread,
Jul 15, 2015, 5:51:24 AM7/15/15
to bioge...@googlegroups.com
Sorry, that was brief.  It could be that some update I did is incompatible with the special case of very short branches (which get treated as direct ancestors). I will look into it when I get a real computer in my office!

Cheers!
Nick

Oscar Pérez

unread,
Jul 15, 2015, 11:54:30 AM7/15/15
to bioge...@googlegroups.com, mat...@nimbios.org
Hi Nick,


I will run one analysis excluding this very small branch and I will let you know how it goes.

There is also another issue I encounter when trying to run BioGeoBEARS on my big dataset. As I explained before, I am running the analyses on a linux cluster. It has more than 1000 cores (don't remember how much memory, but it is a lot). I am allowed to use up to 512 cores for my analysis. Right now I am running my analysis using 120 cores and 64 GB, and it takes almost two days to run only DEC, or DIVA, etc. The thing is that when I try to use more cores (e.g. 200, 240, 512) the analysis stops right at the beginning of the initial likelihood calculation, and the following error message is printed:

our computer has 1920 cores. You have chosen to use:
num_cores_to_use = 240 cores for the matrix exponentiations in the likelihood calculations.
Error in socketConnection(port = port, server = TRUE, blocking = TRUE,  :
  all connections are in use
Calls: bears_optim_run ... makeCluster -> <Anonymous> -> newSOCKnode -> socketConnection
Execution halted

I have consulted the cluster manager, and he says the number of cores should not be a problem, but rather something with the script or BioGeoBEARS. Do you know perhaps why this error is happening?

Thank you once more!



Oscar

Nick Matzke

unread,
Jul 15, 2015, 1:20:43 PM7/15/15
to bioge...@googlegroups.com
Hi -- I've never seen this before -- you are pushing well past what most people have tried!  The error is not occurring inside BioGeoBEARS but with the parallelization function makeCluster, which is part of base R now I think (formerly package parallel or snow, I forget).  So you could just try to makeCluster a whole bunch of cores without running BioGeoBEARS, and see if that works.

I also recall that once you makeCluster you have to stopCluster (or something like that) at the end, or they will remain open and "in use" as far as R is concerned. I would think this would reset if you close the R session, but maybe not. If not, figure out how to close them.

There is a way to open the cluster manually and pass that to the BioGeoBEARS_run_object I think, look at the list of stuff in BioGeoBEARS_run_object.

Cheers,
Nick


Oscar Pérez

unread,
Jul 16, 2015, 7:21:06 AM7/16/15
to bioge...@googlegroups.com
Hi Nick!


Thank you for the many advices! I will see what I can do. I am just an amateur in bioinformatics. 

I will run today the analysis without very small branches. I will let you know how it goes.


Cheers,



On Wednesday, July 8, 2015 at 1:21:40 AM UTC+2, chris blair wrote:

Nick Matzke

unread,
Feb 5, 2016, 1:37:52 AM2/5/16
to BioGeoBEARS, mat...@nimbios.org


Uppass starting for marginal ancestral states estimation!
Error in if (sum(relprob_each_split_scenario) > 0) { : 
  missing value where TRUE/FALSE needed
Calls: bears_optim_run ... calc_uppass_probs_new2 -> calc_uppass_scenario_probs_new2
In addition: Warning message:
In (function (npt = min(n + 2L, 2L * n), rhobeg = NA, rhoend = NA,  :
  unused control arguments ignored
Execution halted

Maybe is a silly question, but, do you think that this might be caused by very small branches on the tree?



Finally fixed this error -- it would only occur when there were super-short branches, that were being treated as direct ancestors.  I believe I've fixed this issue -- it was an indexing problem in the specific case of the super-short branches, so most people didn't see it.  I've uploaded the fix -- along with a variety of changes (re-do the source commands, be sure to use the new list of sources() on the PhyloWiki BioGeoBEARS webpage), and this problem should go away.

Cheers, Nick


 
Reply all
Reply to author
Forward
0 new messages