Hi Shujun,
> I have two large data sets, one binary, one DNA, both of them have ~3,400 taxa
> The DNA data has ~440,000 unique characters with ~7% ambiguous data
> The binary data has ~245,000 unique characters with ~10% ambiguous data
Those are pretty big datasets indeed! First, some general recommendations:
- it seems like you are working with SNP datasets. If this is the case, you should consider using ascertainment bias
correction (https://www.ncbi.nlm.nih.gov/pubmed/26227865)
- since the DNA dataset will take much longer to compute, I'd suggest running it on the HPC cluster and the binary one on
your private server.
- you might consider using ExaML at least for DNA datasets on the cluster. It will allow you to parallelize a single
tree search across multiple nodes. I think 3 nodes/84 cores or 4 nodes/112 cores will be optimal given your dataset and
hardware. Unfortunately, ExaML doesn't support asc. bias models and doing bootstrapping is a bit more laborious
(described in the manual).
- alternatively, you can give RAxML-NG a try, it seems to be quite stable by now. It supports both asc. bias correction
and automatic bootstrapping, but not the GTRCAT model.
https://github.com/amkozlov/raxml-ng/releases/tag/0.4.0
- if you stick to the standard RAxML, you could try the rapid bootstrapping mode (-f a)
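For the last point, a rapid bootstrapping run with standard RAxML could look roughly like the sketch below. It reuses the file names and outgroups from your commands; the run names, the seeds, the replicate count of 100 and the thread count are placeholders I picked for illustration. The second command adds the Lewis ascertainment bias correction, which expects an alignment without invariant sites. The script only prints the commands so you can check them before running anything.

```shell
# Sketch only: prints the commands instead of executing them.
RAXML=raxmlHPC-PTHREADS-AVX2
THREADS=56   # assumption: set this to the number of physical cores

# Rapid bootstrapping plus ML search in a single run (-f a needs both -x and -p seeds)
rapid_bs_cmd="$RAXML -f a -s SNP.fa -n SNP.rapidBS -m GTRCAT -x 12315 -p 12315 -# 100 -o out1,out2,out3,out4 -T $THREADS"

# Same run with ascertainment bias correction (Lewis)
asc_cmd="$RAXML -f a -s SNP.fa -n SNP.rapidBS.asc -m ASC_GTRCAT --asc-corr=lewis -x 12315 -p 12315 -# 100 -o out1,out2,out3,out4 -T $THREADS"

echo "$rapid_bs_cmd"
echo "$asc_cmd"
```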
Please find further comments between the lines:
> We have a 56ppn 128G server, and also accessible to 28ppn 128G HPC clusters
Does "56ppn" mean 56 physical cores, or 28 cores with hyper-threading?
(same for 28ppn)
> My questions are:
> 1) Is it suggestive to perform bootstraps or just search for the best ML tree?
This boils down to the following two questions:
1) do you need a tree with bootstrap support values?
2) will bootstrapping finish in a reasonable time?
> 2) If bootstrapping is possible in a considerable short time (i.e. <168hr in HPC or <1 month in the private server),
> what configurations should I use?
> Currently, for the DNA set, I am running in the private server with:
>
> raxmlHPC-PTHREADS-AVX2 -s SNP.fa -n SNP.GTRCAT -m GTRCAT -b 12315 -p 12315 -# 50 -o out1,out2,out3,out4 -T 48
So why are you using 48 threads, if your private server has 56 cores?
> 123,260 CPU minutes (with 33GB mem) has passed and there is nothing generated yet.
Not even a RAxML_log file? What about the RAxML_info file, any signs of progress?
> For the binary data, I plan to run in HPC with:
>
> #PBS -l walltime=160:00:00,nodes=50:ppn=20
> #PBS -l mem=60000mba
> mpirun -n 50 raxmlHPC-HYBRID-AVX -s bin.fa -n BIN.BINCAT -m BINCAT -b 12315 -p 12315 -# 50 -o out1,out2,out3,out4 -T 20
Once again, why aren't you using all 28 threads here?
> I am not sure how many bootstraps could be done given the queue time is quite long.
I guess the only way to know is to measure the execution time for 1 bootstrap and then extrapolate.
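The extrapolation itself is simple arithmetic. In the sketch below, the 12 hours per replicate is a made-up placeholder, not a measurement; plug in the wall time you actually observe for one replicate. The 50 replicates and the 168-hour limit are taken from your message.

```shell
# All timing numbers are placeholders; substitute your own measurement.
hours_per_replicate=12   # hypothetical wall time of one bootstrap replicate
replicates=50            # as requested with -# 50
walltime_hours=168       # HPC walltime limit from your message

# Total time if replicates run one after another
total_hours=$(( hours_per_replicate * replicates ))
# Whole replicates that fit into a single job's walltime
fit_in_walltime=$(( walltime_hours / hours_per_replicate ))

echo "Sequential estimate for $replicates replicates: $total_hours h"
echo "Replicates that fit into $walltime_hours h: $fit_in_walltime"
```

With the hybrid MPI version, each MPI rank works on its own replicates, so the sequential estimate is an upper bound for that setup.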
> 3) If bootstrap is too much, what commands should I use to achieve the best ML trees with comparable likelihood scores?
> Can I simply do something like this?
> raxmlHPC-PTHREADS-AVX2 -s SNP.fa -n SNP.GTRGAMMA -m GTRGAMMA -o out1,out2,out3,out4 -T 48
> raxmlHPC-PTHREADS-AVX2 -s bin.fa -n BIN.GTRGAMMA -m BINGAMMA -o out1,out2,out3,out4 -T 48
Yes, but you should preferably use multiple starting trees, either by adding e.g. "-N 20", or by carrying out multiple
runs with different random seeds (-p option) and including some random starting trees (-d).
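As a rough sketch of both variants for the DNA dataset: the first command runs 20 searches from distinct parsimony starting trees in one go, and the loop launches separate runs with different seeds, once with a parsimony and once with a random (-d) starting tree. The seeds, run names and thread count are arbitrary placeholders. The commands are only printed here, not executed.

```shell
# Sketch only: prints the commands instead of executing them.
RAXML=raxmlHPC-PTHREADS-AVX2
THREADS=56   # assumption: set this to the number of physical cores
n_runs=0

# Print one command and count it
show() { echo "$*"; n_runs=$(( n_runs + 1 )); }

# Variant 1: a single run with 20 distinct parsimony starting trees
show "$RAXML -s SNP.fa -n SNP.GTRGAMMA.N20 -m GTRGAMMA -p 12315 -N 20 -o out1,out2,out3,out4 -T $THREADS"

# Variant 2: independent runs with different seeds, parsimony and random starting trees
for seed in 101 202 303; do
  show "$RAXML -s SNP.fa -n SNP.GTRGAMMA.pars$seed -m GTRGAMMA -p $seed -o out1,out2,out3,out4 -T $THREADS"
  show "$RAXML -s SNP.fa -n SNP.GTRGAMMA.rand$seed -m GTRGAMMA -d -p $seed -o out1,out2,out3,out4 -T $THREADS"
done
```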
Another option would be to run all tree inferences under the GTRCAT model, and then re-evaluate the resulting topologies
under GAMMA (-f e -m GTRGAMMA) to get comparable likelihoods and branch lengths. This should accelerate the process
quite a bit.
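In command form, this two-step approach could look like the sketch below. The run names SNP.CAT and SNP.CAT.eval and the thread count are placeholders of mine; RAxML_bestTree.SNP.CAT is the best-tree file that the first run writes for that run name. The commands are only printed, not executed.

```shell
# Sketch only: prints the commands instead of executing them.
RAXML=raxmlHPC-PTHREADS-AVX2
THREADS=56   # assumption: set this to the number of physical cores

# Step 1: fast tree search under GTRCAT
search_cmd="$RAXML -s SNP.fa -n SNP.CAT -m GTRCAT -p 12315 -o out1,out2,out3,out4 -T $THREADS"

# Step 2: keep the topology fixed, optimize model parameters and branch lengths under GAMMA
eval_cmd="$RAXML -f e -t RAxML_bestTree.SNP.CAT -s SNP.fa -n SNP.CAT.eval -m GTRGAMMA -T $THREADS"

echo "$search_cmd"
echo "$eval_cmd"
```

Repeat step 2 for every topology you want to compare, so that all likelihoods come from the same model.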
Finally, SNP datasets often have very low rate heterogeneity, which is indicated by a high alpha parameter estimate
(RAxML will issue a warning about this). In this lucky case, you can use "-m GTRCAT -V" to disable rate heterogeneity
and get the best of both worlds (better runtimes and comparable likelihoods).
Hope this helps,
Alexey