Asking for suggestions on analyzing two huge data sets

Shujun Ou

May 26, 2017, 6:28:22 PM
to raxml
Dear raxml community,

I have two large data sets, one binary and one DNA, each with ~3,400 taxa.
The DNA data has ~440,000 unique characters with ~7% ambiguous data.
The binary data has ~245,000 unique characters with ~10% ambiguous data.

We have a 56ppn 128G server, and also have access to 28ppn 128G HPC clusters.

My questions are:
 1) Is it advisable to perform bootstraps, or should I just search for the best ML tree?
 2) If bootstrapping is feasible in a reasonably short time (i.e. <168hr on the HPC cluster or <1 month on the private server), what configuration should I use?
     Currently, for the DNA set, I am running on the private server with:

raxmlHPC-PTHREADS-AVX2 -s SNP.fa -n SNP.GTRCAT -m GTRCAT -b 12315 -p 12315 -# 50 -o out1,out2,out3,out4 -T 48

     123,260 CPU minutes (with 33GB mem) have passed and nothing has been generated yet.

     For the binary data, I plan to run on the HPC cluster with:

#PBS -l walltime=160:00:00,nodes=50:ppn=20
#PBS -l mem=60000mba
mpirun -n 50 raxmlHPC-HYBRID-AVX -s bin.fa -n BIN.BINCAT -m BINCAT -b 12315 -p 12315 -# 50 -o out1,out2,out3,out4 -T 20

     I am not sure how many bootstraps can be completed, given that the queue time is quite long.

 3) If bootstrapping is too expensive, what commands should I use to obtain the best ML trees with comparable likelihood scores?
Can I simply do something like this?
raxmlHPC-PTHREADS-AVX2 -s SNP.fa -n SNP.GTRGAMMA -m GTRGAMMA  -o out1,out2,out3,out4 -T 48
raxmlHPC-PTHREADS-AVX2 -s bin.fa -n BIN.GTRGAMMA -m BINGAMMA  -o out1,out2,out3,out4 -T 48

Thank you very much!

Shujun

Alexey Kozlov

Jun 1, 2017, 12:37:25 PM
to ra...@googlegroups.com
Hi Shujun,

> I have two large data sets, one binary, one DNA, both of them have ~3,400 taxa
> The DNA data has ~440,000 unique characters with ~7% ambiguous data
> The binary data has ~245,000 unique characters with ~10% ambiguous data

Those are pretty big datasets indeed! First, some general recommendations:

- it seems you are working with SNP datasets; if so, you should consider using ascertainment bias
correction (https://www.ncbi.nlm.nih.gov/pubmed/26227865)

- since the DNA dataset will take much longer to compute, I'd suggest running it on the HPC cluster and the binary one
on your private server.

- you might consider using ExaML, at least for the DNA dataset on the cluster. It will allow you to parallelize a single
tree search across multiple nodes. I think 3 nodes/84 cores or 4 nodes/112 cores will be optimal given your dataset and
hardware. Unfortunately, ExaML doesn't support asc. bias models, and bootstrapping is a bit more laborious
(described in the manual).

- alternatively, you can give RAxML-NG a try; it seems to be quite stable by now. It supports both asc. bias correction
and automatic bootstrapping, but not the GTRCAT model.

https://github.com/amkozlov/raxml-ng/releases/tag/0.4.0

- if you stick to the standard RAxML, you could try rapid bootstrapping mode (-f a)

Please find further comments between the lines:

> We have a 56ppn 128G server, and also accessible to 28ppn 128G HPC clusters

Does "56ppn" mean 56 physical cores, or 28 cores with hyper-threading?
(same for 28ppn)

> My questions are:
> 1) Is it suggestive to perform bootstraps or just search for the best ML tree?

this boils down to the following two questions:
1) do you need a tree with bootstrap support values?
2) will bootstrapping finish in a reasonable time?

> 2) If bootstrapping is possible in a considerable short time (i.e. <168hr in HPC or <1 month in the private server),
> what configurations should I use?
> Currently, for the DNA set, I am running in the private server with:
>
> raxmlHPC-PTHREADS-AVX2 -s SNP.fa -n SNP.GTRCAT -m GTRCAT -b 12315 -p 12315 -# 50 -o out1,out2,out3,out4 -T 48

So why are you using 48 threads, if your private server has 56 cores?

> 123,260 CPU minutes (with 33GB mem) has passed and there is nothing generated yet.

Not even a RAxML_log file? What about RAxML_info file, any signs of progress?

> For the binary data, I plan to run in HPC with:
>
> #PBS -l walltime=160:00:00,nodes=50:ppn=20
> #PBS -l mem=60000mba
> mpirun -n 50 raxmlHPC-HYBRID-AVX -s bin.fa -n BIN.BINCAT -m BINCAT -b 12315 -p 12315 -# 50 -o out1,out2,out3,out4 -T 20

Once again, why aren't you using all 28 threads here?

> I am not sure how many bootstraps could be done given the queue time is quite long.

I guess the only way to know is to measure the execution time for 1 bootstrap and then extrapolate.

> 3) If bootstrap is too much, what commands should I use to achieve the best ML trees with comparable likelihood scores?
> Can I simply do something like this?
> raxmlHPC-PTHREADS-AVX2 -s SNP.fa -n SNP.GTRGAMMA -m GTRGAMMA -o out1,out2,out3,out4 -T 48
> raxmlHPC-PTHREADS-AVX2 -s bin.fa -n BIN.GTRGAMMA -m BINGAMMA -o out1,out2,out3,out4 -T 48

Yes, but you should preferably use multiple starting trees, either by adding e.g. "-N 20", or by carrying out multiple
runs with different random seeds (-p option) and including some random starting trees (-d).
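
As a rough illustration of the second option (several independent runs with different -p seeds, some of them started from random trees via -d), the Python sketch below only assembles the command lines and does not run RAxML. The seed values, the run-name scheme and the helper name are made up for illustration; the flags are the ones already used in this thread.

```python
def build_search_commands(alignment, model, outgroups, threads,
                          seeds, random_start_seeds=()):
    """Return one RAxML command line (as a string) per random seed.

    Seeds listed in random_start_seeds additionally get -d, i.e. the
    search starts from a random tree instead of a parsimony tree.
    """
    commands = []
    for seed in seeds:
        parts = [
            "raxmlHPC-PTHREADS-AVX2",
            "-s", alignment,
            "-n", f"{model}.seed{seed}",  # hypothetical run name
            "-m", model,
            "-p", str(seed),
            "-o", outgroups,
            "-T", str(threads),
        ]
        if seed in random_start_seeds:
            parts.append("-d")
        commands.append(" ".join(parts))
    return commands


# Illustrative values only: three runs, the last from a random starting tree.
cmds = build_search_commands(
    "SNP.fa", "GTRGAMMA", "out1,out2,out3,out4", 48,
    seeds=[101, 202, 303], random_start_seeds={303},
)
for cmd in cmds:
    print(cmd)
```

Each printed line can then be submitted as its own job, and the run with the best final likelihood is kept.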

Another option would be to run all tree inferences under the GTRCAT model, and then re-evaluate the resulting topologies
under GAMMA (-f e -m GTRGAMMA) to get comparable likelihoods and branch lengths. This should accelerate the process
quite a bit.
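
A sketch of this two-step workflow, again only assembling command lines in Python. Two details are assumptions that do not come from the messages in this thread: that the input tree is passed with -t, and that the search writes its result to RAxML_bestTree.<run name>. Both match the standard RAxML manual as far as I know, but please check them against `raxmlHPC -h` for your build. The run names and the seed are placeholders.

```python
def cat_then_gamma_commands(alignment, run_name, seed, outgroups, threads,
                            binary="raxmlHPC-PTHREADS-AVX2"):
    """Return (search_cmd, evaluate_cmd) for a GTRCAT search followed by
    re-evaluation of the resulting topology under GTRGAMMA."""
    # Step 1: fast tree search under GTRCAT.
    search = (
        f"{binary} -s {alignment} -n {run_name}.CAT -m GTRCAT "
        f"-p {seed} -o {outgroups} -T {threads}"
    )
    # Step 2: keep the topology fixed, optimize model parameters and branch
    # lengths under GAMMA (-f e). The tree file name is an assumption.
    evaluate = (
        f"{binary} -f e -s {alignment} -t RAxML_bestTree.{run_name}.CAT "
        f"-n {run_name}.GAMMA -m GTRGAMMA -T {threads}"
    )
    return search, evaluate


search_cmd, evaluate_cmd = cat_then_gamma_commands(
    "SNP.fa", "SNP", 12345, "out1,out2,out3,out4", 48
)
print(search_cmd)
print(evaluate_cmd)
```

Repeating step 2 for every candidate topology gives likelihoods that can be compared directly, since all of them are then computed under the same GAMMA model.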

Finally, SNP datasets often have very low rate heterogeneity, which is indicated by a high alpha parameter estimate
(RAxML will issue a warning about this). In this lucky case, you can use "-m GTRCAT -V" to disable rate heterogeneity
and get the best of both worlds (better runtimes and comparable likelihoods).

Hope this helps,
Alexey

Shujun Ou

Jun 2, 2017, 11:35:19 AM
to raxml
Hi Alexey,


On Thursday, June 1, 2017 at 12:37:25 PM UTC-4, Alexey Kozlov wrote:
Hi Shujun,

> I have two large data sets, one binary, one DNA, both of them have ~3,400 taxa
> The DNA data has ~440,000 unique characters with ~7% ambiguous data
> The binary data has ~245,000 unique characters with ~10% ambiguous data

That's a pretty big datasets indeed! First, some general recommendations:

- it seems like you are working with SNP datasets, if this is the case, you should consider using ascertainment bias
correction (https://www.ncbi.nlm.nih.gov/pubmed/26227865)

Yes, one of the datasets is an SNP set from whole-genome shotgun sequencing. Since this is not RADseq, but all invariant sites were cut out, would it also be affected by ascertainment bias?
Do I also need to correct for ascertainment bias in the binary dataset, since invariable binary loci were excluded there as well?
 
- since DNA dataset will take much longer to compute, I'd suggest to run it on the HPC cluster and the binary one on
your private server.

Thanks for the suggestion!
 
 
- you might consider using ExaML at least for DNA datasets on the cluster. It will allow you to parallelize a single
tree search across multiple nodes. I think, 3nodes/84core or 4nodes/112 cores will be optimal given your dataset and
hardware. Unfortunately, ExaML doesn't support asc. bias models and doing bootstrapping is a bit more laborious
(described in the manual).

- alternatively, you can give RAxML-NG a try, it seems to be quite stable by now. It supports both asc. bias correction
and automatic bootstrapping, but not the GTRCAT model.

https://github.com/amkozlov/raxml-ng/releases/tag/0.4.0

- if you stick to the standard RAxML, you could try rapid bootstrapping mode (-f a)

All three options sound good. I will try them.


Please find further comments between the lines:

> We have a 56ppn 128G server, and also accessible to 28ppn 128G HPC clusters

Does "56ppn" mean 56 physical cores, or 28 cores with hyper-threading?
(same for 28ppn)

It's 28 physical cores hyper-threaded to 56 threads. If I specify 28 threads, RAxML runs at only 2800% CPU, with 50% of the processors occupied. But if I specify 48, it runs at 4800% CPU. Is there anything I should be aware of?


> My questions are:
>  1) Is it suggestive to perform bootstraps or just search for the best ML tree?

this boils down to the following two questions:
1) do you need a tree with bootstrap support values?

Yes, that would be ideal.
 
2) will bootstrapping finish in a reasonable time?
For the SNP dataset, one bootstrap replicate took 3.4 days on 48 threads. I should run it on the HPC cluster if I want it done quicker.
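
A back-of-the-envelope extrapolation from that single measurement, as a small Python sketch. It assumes every replicate takes about as long as the first one and that the HPC nodes are no faster than the private server, neither of which is guaranteed; the function names are made up for illustration.

```python
import math

DAYS_PER_REPLICATE = 3.4  # one replicate on 48 threads, as reported above


def replicates_within(budget_days, days_per_replicate=DAYS_PER_REPLICATE):
    """Number of replicates that finish within the budget when run one after another."""
    return math.floor(budget_days / days_per_replicate)


def days_needed(n_replicates, days_per_replicate=DAYS_PER_REPLICATE,
                concurrent_runs=1):
    """Wall-clock days for n_replicates when concurrent_runs replicates run side by side."""
    waves = math.ceil(n_replicates / concurrent_runs)
    return waves * days_per_replicate


print(replicates_within(30))                # one-month budget on the private server
print(replicates_within(168 / 24))          # 168-hour HPC walltime
print(days_needed(50))                      # all 50 replicates, one at a time
print(days_needed(50, concurrent_runs=10))  # ten independent jobs in parallel
```

Under these assumptions, about 8 replicates fit into one month and 2 into a 168-hour walltime, while 50 replicates run back to back need roughly 170 days. So the replicates would have to be spread over several independent jobs.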


>  2) If bootstrapping is possible in a considerable short time (i.e. <168hr in HPC or <1 month in the private server),
> what configurations should I use?
>      Currently, for the DNA set, I am running in the private server with:
>
> raxmlHPC-PTHREADS-AVX2 -s SNP.fa -n SNP.GTRCAT -m GTRCAT -b 12315 -p 12315 -# 50 -o out1,out2,out3,out4 -T 48

So why are you using 48 threads, if your private server has 56 cores?
I reserved 8 threads for other day-to-day use; otherwise the server would be fully occupied by one task.

>                    123,260 CPU minutes (with 33GB mem) has passed and there is nothing generated yet.

Not even a RAxML_log file? What about RAxML_info file, any signs of progress?
The log and info files were produced quickly. It took 3.4 days on 48 threads to produce one bootstrap replicate, and I didn't have it yet when I posted the question.

>       For the binary data, I plan to run in HPC with:
>
> #PBS -l walltime=160:00:00,nodes=50:ppn=20
> #PBS -l mem=60000mba
> mpirun -n 50 raxmlHPC-HYBRID-AVX -s bin.fa -n BIN.BINCAT -m BINCAT -b 12315 -p 12315 -# 50 -o out1,out2,out3,out4 -T 20

Once again, why aren't you using all 28 threads here?

This is out of concern for queue time. Requesting a full node is harder (longer wait) than requesting part of one.


>                   I am not sure how many bootstraps could be done given the queue time is quite long.

I guess the only way to know is to measure the execution time for 1 bootstrap and then extrapolate.
Good strategy!
 

>  3) If bootstrap is too much, what commands should I use to achieve the best ML trees with comparable likelihood scores?
> Can I simply do something like this?
> raxmlHPC-PTHREADS-AVX2 -s SNP.fa -n SNP.GTRGAMMA -m GTRGAMMA  -o out1,out2,out3,out4 -T 48
> raxmlHPC-PTHREADS-AVX2 -s bin.fa -n BIN.GTRGAMMA -m BINGAMMA  -o out1,out2,out3,out4 -T 48

Yes, but you should preferably use multiple starting trees, either by adding e.g. "-N 20", or by carrying out multiple
runs with different random seeds (-p option) and including some random starting trees (-d).

Thanks for the suggestion! Is 20 starting trees the most practical number, or can I use more or fewer, and based on what considerations?


Another option would be to run all tree inferences under GTRCAT model, and the re-evaluate the resulting topologies
under GAMMA (-f e -m GTRGAMMA) to get comparable likelihoods and branch lengths, this should accelerate the process
quite a bit.

Good strategy!
 

Finally, SNP datasets often have very low rate heterogeneity, which is indicated by a high alpha parameter estimate
(RAxML will issue a warning about this). In this lucky case, you can use "-m GTRCAT -V" to disable rate heterogeneity
and get the best of both worlds (better runtimes and comparable likelihoods).

I didn't receive a high alpha parameter warning in the info file or stdout, only the following message (not sure if it's relevant):
Using BFGS method to optimize GTR rate parameters, to disable this specify "--no-bfgs"
The only warning I received is this one (because I provided ~10 wild species as roots):
WARNING, outgroups are not monophyletic, using first outgroup "W3104"

 
Hope this helps,
Alexey

Thank you so much for the detailed and helpful suggestions!

Shujun 

Alexandros Stamatakis

Jun 8, 2017, 12:06:36 AM
to ra...@googlegroups.com
Dear Shujun,


> Yes, one of the datasets is an SNP set from whole-genome shotgun
> sequencing. Since this is not RADseq but all invariant sites did cut
> out, would it be also affected by ascertainment bias?

If you cut them out, you need to correct for asc. bias; otherwise you can
just leave them in and don't need to correct. The latter is
better, since you actually have the data for the invariant sites.

> Do I also need to correct for ascertainment bias for the binary dataset
> since there are also invariable binary loci being excluded?

Yes.



> Please find further comments between the lines:
>
> > We have a 56ppn 128G server, and also accessible to 28ppn 128G
> HPC clusters
>
> Does "56ppn" mean 56 physical cores, or 28 cores with hyper-threading?
> (same for 28ppn)
>
>
> It's 28 physical cores being hyperthreaded to 56 threads. If I specify
> 28 CPU then raxml will only run in 2800% CPU with 50% processors being
> occupied. But if I specify 48 it ran in 4800% CPU. Is that anything I
> should aware of?

It will probably not change much; hyper-threading does not improve
performance much.

>
>
> > My questions are:
> > 1) Is it suggestive to perform bootstraps or just search for the
> best ML tree?
>
> this boils down to the following two questions:
> 1) do you need a tree with bootstrap support values?
>
>
> Yes, this is optimum.
>
> 2) will bootstrapping finish in a reasonable time?
>
> For the SNP dataset, it took 48 threads 3.4 days for one bootstrapping.
> I should run it in the HPC if I want it quicker.

Yes.

>
>
> > 2) If bootstrapping is possible in a considerable short time
> (i.e. <168hr in HPC or <1 month in the private server),
> > what configurations should I use?
> > Currently, for the DNA set, I am running in the private
> server with:
> >
> > raxmlHPC-PTHREADS-AVX2 -s SNP.fa -n SNP.GTRCAT -m GTRCAT -b 12315
> -p 12315 -# 50 -o out1,out2,out3,out4 -T 48
>
> So why are you using 48 threads, if your private server has 56 cores?
>
> I saved 8 threads for other daily usage purposes, otherwise, the server
> is totally occupied by one task.

This may lead to performance degradation; you should rather use the
system exclusively for RAxML.

>
> > For the binary data, I plan to run in HPC with:
> >
> > #PBS -l walltime=160:00:00,nodes=50:ppn=20
> > #PBS -l mem=60000mba
> > mpirun -n 50 raxmlHPC-HYBRID-AVX -s bin.fa -n BIN.BINCAT -m BINCAT
> -b 12315 -p 12315 -# 50 -o out1,out2,out3,out4 -T 20
>
> Once again, why aren't you using all 28 threads here?
>
>
> This is the concern of queue time. Asking for a full node is more
> difficult (wait longer) than just ask part of it.

You should really use ExaML which allows you to use several nodes.


> > 3) If bootstrap is too much, what commands should I use to
> achieve the best ML trees with comparable likelihood scores?
> > Can I simply do something like this?
> > raxmlHPC-PTHREADS-AVX2 -s SNP.fa -n SNP.GTRGAMMA -m GTRGAMMA -o
> out1,out2,out3,out4 -T 48
> > raxmlHPC-PTHREADS-AVX2 -s bin.fa -n BIN.GTRGAMMA -m BINGAMMA -o
> out1,out2,out3,out4 -T 48
>
> Yes, but you should preferably use multiple starting trees, either
> by adding e.g. "-N 20", or by carrying out multiple
> runs with different random seeds (-p option) and including some
> random starting trees (-d).
>
>
> Thanks for the suggestion! Does 20 start trees the most practical
> number or I can add or minus some given what considerations?

Please see page 53 of the RAxML manual and the following:

https://sco.h-its.org/exelixis/resource/download/NewManual.pdf

Alexis

--
Alexandros (Alexis) Stamatakis

Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology
Adjunct Professor, Dept. of Ecology and Evolutionary Biology, University
of Arizona at Tucson

www.exelixis-lab.org

Shujun Ou

Jun 12, 2017, 11:46:34 AM
to raxml
Dear Alexis,


On Thursday, June 8, 2017 at 12:06:36 AM UTC-4, Alexis wrote:
Dear Shujun,


> Yes, one of the datasets is an SNP set from whole-genome shotgun
> sequencing. Since this is not RADseq but all invariant sites did cut
> out, would it be also affected by ascertainment bias?

If you cut them out you need to correct for asc. bias, otherwise you can
just leave them in there and don't need to correct. The latter is
better, since you actually have the data for the invariant sites.

I think I have to cut them out, otherwise the alignment would be terabytes in size. Computing on a smaller dataset saves time and memory.

I have a somewhat related question:
Do I need to control for minor allele frequency in the alignment? Minor alleles (MAF <1% or 5%) make up the majority of alleles in my dataset. Low-MAF sites could be the result of low-frequency mutations or alignment errors, and I am not sure how to handle this.

 
 
> Do I also need to correct for ascertainment bias for the binary dataset
> since there are also invariable binary loci being excluded?

Yes.



>     Please find further comments between the lines:
>
>      > We have a 56ppn 128G server, and also accessible to 28ppn 128G
>     HPC clusters
>
>     Does "56ppn" mean 56 physical cores, or 28 cores with hyper-threading?
>     (same for 28ppn)
>
>
> It's 28 physical cores being hyperthreaded to 56 threads. If I specify
> 28 CPU then raxml will only run in 2800% CPU with 50% processors being
> occupied. But if I specify 48 it ran in 4800% CPU. Is that anything I
> should aware of?

It will probably not change much, you don't get much improved
performance via hyper-threading.

I see 58 threads available since we deployed CentOS. I'm not sure how to restrict hyper-threading. If there are no risks or significant drawbacks to using hyper-threading, I may just leave it on.
ExaML seems like a much faster option. Does ExaML support ascertainment bias correction?

 


>      >  3) If bootstrap is too much, what commands should I use to
>     achieve the best ML trees with comparable likelihood scores?
>      > Can I simply do something like this?
>      > raxmlHPC-PTHREADS-AVX2 -s SNP.fa -n SNP.GTRGAMMA -m GTRGAMMA  -o
>     out1,out2,out3,out4 -T 48
>      > raxmlHPC-PTHREADS-AVX2 -s bin.fa -n BIN.GTRGAMMA -m BINGAMMA  -o
>     out1,out2,out3,out4 -T 48
>
>     Yes, but you should preferably use multiple starting trees, either
>     by adding e.g. "-N 20", or by carrying out multiple
>     runs with different random seeds (-p option) and including some
>     random starting trees (-d).
>
>
> Thanks for the suggestion!  Does 20 start trees the most practical
> number or I can add or minus some given what considerations?

Please see page 53 of the RAxML manual and the following:

https://sco.h-its.org/exelixis/resource/download/NewManual.pdf


Thank you for pointing out the page number in the very detailed manual. The practical guide is very helpful!

I have run into another problem:

mpirun -n 2 raxmlHPC-HYBRID-AVX -s BIN.fa -n TEST -m ASC_BINCAT -x 12316 -p 12316 -N 50 -o W3104 -T 10 -f a --asc-corr=lewis

This is RAxML MPI Process Number: 0
: illegal option -- -

This is RAxML MPI Process Number: 1
: illegal option -- -

When I removed the --asc-corr=lewis parameter, the error disappeared. But do I need to specify an asc-corr model for binary data?

Thanks!
Shujun

Alexandros Stamatakis

Jun 19, 2017, 12:27:49 AM
to ra...@googlegroups.com


On 12.06.2017 17:46, Shujun Ou wrote:
> Dear Alexis,
>
>
> On Thursday, June 8, 2017 at 12:06:36 AM UTC-4, Alexis wrote:
>
> Dear Shujun,
>
>
> > Yes, one of the datasets is an SNP set from whole-genome shotgun
> > sequencing. Since this is not RADseq but all invariant sites did cut
> > out, would it be also affected by ascertainment bias?
>
> If you cut them out you need to correct for asc. bias, otherwise you
> can
> just leave them in there and don't need to correct. The latter is
> better, since you actually have the data for the invariant sites.
>
>
> I think I have to cut them out otherwise the alignment will be in a TB
> size. It saves time and memory to compute a relatively smaller dataset.

That's only partially true. Invariant sites will be compressed in RAxML,
i.e., you will only have one site pattern each for all As, all Cs, etc., and
each such pattern will be assigned a weight that corresponds to the number of
times it occurs. Hence, the computational cost and memory requirements of
RAxML will increase only insignificantly.
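
To make the idea of weighted site patterns concrete, here is a minimal Python sketch. It is a simplified illustration of column compression in general, not RAxML's actual implementation; the toy alignment and the function name are made up.

```python
from collections import Counter


def compress_site_patterns(alignment):
    """Collapse identical alignment columns into a {pattern: weight} mapping.

    alignment is a list of equal-length sequences (one string per taxon).
    The weight of a pattern is the number of columns that show it.
    """
    if len({len(seq) for seq in alignment}) > 1:
        raise ValueError("all sequences must have the same length")
    return Counter("".join(column) for column in zip(*alignment))


# Toy alignment: 3 taxa, 6 sites, 4 of which are the invariant column "AAA".
alignment = [
    "AAACAG",
    "AAACAG",
    "AAATAG",
]
patterns = compress_site_patterns(alignment)
for pattern, weight in patterns.items():
    print(pattern, weight)
```

The six columns collapse into three patterns, and the invariant all-A column is stored once with weight 4. Adding millions more all-A columns would only raise that weight, not the number of patterns the likelihood has to be computed for.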

> I have a kind-of-related question:
> Do I neet to control minor allele frequency in the alignment? Minor
> alleles (maf<1% or 5%) takes the majority of alleles in my dataset.
> Maf could be the result of low-frequency mutations or alignment errors,
> and I am not sure how to leverage this.

I don't know. The question is how you would actually do this. I guess
you should rather filter these out in a pre-processing step and assess
whether this changes your topologies or not, but I can't give you clear
advice here, except that you need to experiment and explore.

alexis
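
One way such a pre-processing filter could look, as a hedged Python sketch. The definition used here is an assumption, not something prescribed in this thread: the minor allele frequency of a column is taken as the share of the second most common unambiguous base, with N, gaps and other ambiguity codes ignored. The threshold and the toy alignment are illustrative only.

```python
from collections import Counter


def minor_allele_frequency(column, valid="ACGT"):
    """Frequency of the second most common unambiguous base in one column."""
    counts = Counter(ch for ch in column if ch.upper() in valid)
    total = sum(counts.values())
    if total == 0 or len(counts) < 2:
        return 0.0  # no data, or an invariant column
    ranked = counts.most_common()
    return ranked[1][1] / total


def filter_by_maf(alignment, min_maf):
    """Keep only the columns whose minor allele frequency is >= min_maf."""
    columns = list(zip(*alignment))
    kept = [col for col in columns if minor_allele_frequency(col) >= min_maf]
    # Rebuild one sequence per taxon from the surviving columns.
    return ["".join(col[i] for col in kept) for i in range(len(alignment))]


toy = ["AAN", "AAA", "ACA", "TCA"]  # 4 taxa, 3 sites
print(filter_by_maf(toy, 0.3))
```

Running the tree search on both the filtered and the unfiltered alignment and comparing the resulting topologies is the experiment suggested above. Note that removing sites changes which sites were ascertained, so the asc. bias correction settings may need to be revisited.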

Shujun Ou

Jun 19, 2017, 6:24:46 PM
to raxml
Dear Alexey and Alexis,

Thank you for your helpful advice.
I ran 50 rapid bootstraps of the SNP dataset with the Lewis asc. bias correction using the following command:

raxmlHPC-PTHREADS-AVX -s SNP.fa -n SNP.ASCGTRCAT -m ASC_GTRCAT -x 12320 -p 12320 -# 50 -o out1,out2,out3,out4 -T 15 -f a --asc-corr=lewis

The trees were generated but the final likelihood estimation is still running. I did an autoMRE test with these 50 trees:

raxmlHPC -m ASC_GTRCAT -z RAxML_bootstrap.SNP.ASCGTRCAT -I autoMRE -n TEST -p 12315

And got:
# Trees          Avg WRF in %    # Perms: wrf <= 3.00 %
50               6.19                    0
Bootstopping test did not converge after 50 trees

I used the majority rule to draw a tree anyway, and found a very shallow phylogeny.

I understand that it will be very difficult for the tree to converge because these samples are closely related species, and most of them are samples of the same species but from different ecotypes. I am not hoping to obtain a fully bifurcating tree from these data, but is it possible to have all the major clades resolved?

Or, as another route, is it possible to identify phylogenetically close samples (those whose relationships could not be resolved) and exclude them before the RAxML search?

Thank you!
Shujun

Alexandros Stamatakis

Jun 22, 2017, 12:41:11 AM
to ra...@googlegroups.com
Dear Shujun,

> Dear Alexey and Alexis,
>
> Thank you for your helpful advice.
> I did 50 rapid bootstraps of the SNP dataset with lewis asc. correction
> using the following command:
>
> raxmlHPC-PTHREADS-AVX -s SNP.fa -n SNP.ASCGTRCAT -m ASC_GTRCAT -x
> 12320 -p 12320 -# 50 -o out1,out2,out3,out4 -T 15 -f a --asc-corr=lewis
>
> The trees were generated but the final likelihood estimation is still
> running. I did an autoMRE test with these 50 trees:
>
> raxmlHPC -m ASC_GTRCAT -z RAxML_bootstrap.SNP.ASCGTRCAT -I autoMRE -n
> TEST -p 12315
>
> And got:
> # Trees Avg WRF in % # Perms: wrf <= 3.00 %
> 50 6.19 0
> Bootstopping test did not converge after 50 trees

So this indicates that you need to do more replicates, which is in line
with the rather weak support you are getting.


>
> I used the majority rule to draw a tree anyways, and found a very
> shallow phylogeny.
>
> <https://lh3.googleusercontent.com/-1Q6CK_3AKqU/WUhL-Sp54pI/AAAAAAAAFpA/xAeqNk_fRxA9W5x2VmQGHJ5NPRyyrJtAACLcBGAs/s1600/QQ%25E6%2588%25AA%25E5%259B%25BE20170619181014.jpg>
>
> I understand that the tree will be very difficult to converge because
> these samples are closely related species, and most of them are samples
> of the same species but from different ecotypes. I am not hoping to
> obtain a bipartisan tree from these data, but is that possible to have
> all the major clades resolved?

Not necessarily, as it really just depends on your data. I'd definitely do
more bootstrap replicates, but wouldn't expect the resolution to improve
much. What you can do, though, is draw BS support values on the
best-scoring ML tree, which might recover the major clades.

> Or, in another route, is that possible to identify phylogenetically
> close samples (those phylogenies could not be resolved) and exclude them
> before raxml search?

Well, to determine what is phylogenetically close you need to run a
phylogenetic inference first. So I don't think there is a way around
building an initial phylogeny comprising all taxa.

Alexis
